strange behavior of spark 2.1.0

2017-04-01 Thread Jiang Jacky
Hello guys, I am running Spark Streaming on 2.1.0; the Scala version has been tried on both 2.11.7 and 2.11.4, and it is consuming from JMS. Recently, I have gotten the following error *"ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver"* *This error can occur

bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
When I try to do a groupByKey() in my Spark environment, I get the error described here: http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh In order to attempt to fix the problem, I set up my ipython environment with

pyspark bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
When I try to do a groupByKey() in my Spark environment, I get the error described here: http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh In order to attempt to fix the problem, I set up my ipython environment with the
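
A minimal sketch of the usual workaround, assuming the job runs on Spark 2.x: pin PYTHONHASHSEED on the executors via the spark.executorEnv.* mechanism so string hashing is consistent across Python workers. The app name and seed value below are arbitrary placeholders, and the driver-side export is typically also needed before launching Python.

    from pyspark import SparkConf, SparkContext

    # Pin the hash seed on every executor so Python 3's randomized string
    # hashing is consistent across workers; groupByKey() relies on keys
    # hashing to the same partition everywhere.
    # (Seed value 0 is arbitrary; the driver shell usually needs
    #  "export PYTHONHASHSEED=0" as well before starting Python.)
    conf = (SparkConf()
            .setAppName("groupbykey-hashseed-test")   # hypothetical app name
            .set("spark.executorEnv.PYTHONHASHSEED", "0"))
    sc = SparkContext(conf=conf)

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(pairs.groupByKey().mapValues(list).collect())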

getting error while storing data in Hbase

2017-04-01 Thread Chintan Bhatt
Hello all, I'm running the following command in the HBase shell: create "sample","cf" and getting the following error: ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many

Cuesheet - spark deployment

2017-04-01 Thread Deepu Raj
Hi Team, Trying to use CueSheet for Spark deployment. Getting the following error on the Hortonworks VM: 2017-04-01 23:33:45 WARN DFSClient - DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File

Convert Dataframe to Dataset in pyspark

2017-04-01 Thread Selvam Raman
In Scala: val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]. What is the equivalent code in pyspark? -- Selvam Raman "Shun bribery; stand tall"
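
For reference, a minimal sketch of what is usually the closest PySpark equivalent: PySpark has no typed Dataset API, so the result either stays a DataFrame or becomes an RDD of plain Python strings. The app name below is a placeholder; the path is the one from the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-text-lines").getOrCreate()

    # spark.read.text() returns a DataFrame with a single string column "value".
    df = spark.read.text("/home/spark/1.6/lines")

    # There is no Dataset[String] in PySpark; the nearest thing is pulling the
    # strings out as an RDD of str.
    lines = df.rdd.map(lambda row: row.value)
    print(lines.take(5))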

[Spark Core]: flatMap/reduceByKey seems to be quite slow with Long keys on some distributions

2017-04-01 Thread Richard Tsai
Hi all, I'm using Spark to process some corpora and I need to count the occurrence of each 2-gram. I started with counting tuples (wordID1, wordID2) and it worked fine except for the large memory usage and GC overhead due to the substantial number of small tuple objects. Then I tried to pack a pair
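
A rough sketch of the packing idea described above, written in PySpark for illustration (the original thread concerns the Scala API, and the input RDD and 32-bit word-ID width are assumptions): each (wordID1, wordID2) pair is folded into a single 64-bit integer so the shuffle moves plain numeric keys instead of many small tuple objects.

    from pyspark import SparkContext

    sc = SparkContext(appName="bigram-count-sketch")  # hypothetical app name

    # Hypothetical input: documents already tokenized into integer word IDs.
    docs = sc.parallelize([[1, 2, 3, 2, 3], [2, 3, 4]])

    def packed_bigrams(word_ids):
        # Pack each adjacent (wordID1, wordID2) pair into one 64-bit key,
        # assuming word IDs fit in 32 bits.
        for a, b in zip(word_ids, word_ids[1:]):
            yield (a << 32) | b

    counts = (docs.flatMap(packed_bigrams)
                  .map(lambda k: (k, 1))
                  .reduceByKey(lambda x, y: x + y))

    # Unpack the keys back into (wordID1, wordID2) pairs for readability.
    print(counts.map(lambda kv: ((kv[0] >> 32, kv[0] & 0xFFFFFFFF), kv[1])).collect())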