strange behavior of spark 2.1.0

2017-04-01 Thread Jiang Jacky
Hello, Guys
I am running Spark Streaming 2.1.0 (tried with Scala 2.11.7 and 2.11.4), consuming from JMS. Recently, I have been getting the following error:
*"ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver"*

This error occurs randomly; it might appear after a couple of hours or a couple
of days. Aside from this error, everything is perfect.
When the error happens, my job stops completely, and no other error can be
found.
I am running on top of YARN and tried to look up the error in the YARN and
container logs, but no further information appears there. The job is just
stopped gracefully from the driver. BTW, I have a customized receiver, but I do
not think the problem originates in the receiver: there is no error or exception
from the receiver, and I can also see that the stop command comes in through the
receiver's "onStop" function.

FYI, the driver is not consuming much memory, and there is no RDD "collect"
call in the driver. I have also checked the container log for each executor and
cannot find any further error.




The following is my conf for the Spark context:
val conf = new SparkConf().setAppName(jobName).setMaster(master)
  .set("spark.hadoop.validateOutputSpecs", "false")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("spark.streaming.receiver.maxRate", "500")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
  .set("spark.eventLog.enabled", "true");

If you have any idea or suggestion, please let me know. I would appreciate any
pointers toward a solution.

Thank you so much


bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
When I try to do a groupByKey() in my Spark environment, I get the error
described here:

http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh

In order to attempt to fix the problem, I set up my IPython environment
with the additional line:

PYTHONHASHSEED=1

When I fire up my IPython shell and do:

In [7]: hash("foo")
Out[7]: -2457967226571033580

In [8]: hash("foo")
Out[8]: -2457967226571033580

So my hash function is now seeded and returns consistent values. But when
I do a groupByKey(), I get the same error:


Exception: Randomness of hash of string should be disabled via
PYTHONHASHSEED

Does anyone know how to fix this problem in Python 3.4?
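
A commonly suggested direction (a sketch only, not a verified fix for this setup) is that PYTHONHASHSEED must be set for every executor's Python worker, not just the driver shell; setting it only in the IPython environment leaves the executors unseeded. Spark's spark.executorEnv.* settings propagate an environment variable to the executor processes. The app name and sample data below are illustrative:

from pyspark import SparkConf, SparkContext

# Sketch only: spark.executorEnv.* makes the executor Python workers hash
# strings with the same seed as the driver. App name and data are made up.
conf = (SparkConf()
        .setAppName("hashseed-demo")
        .set("spark.executorEnv.PYTHONHASHSEED", "1"))
sc = SparkContext(conf=conf)

pairs = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])
print(pairs.groupByKey().mapValues(list).collect())

On YARN it may also be necessary to export PYTHONHASHSEED in spark-env.sh (or via spark.yarn.appMasterEnv.PYTHONHASHSEED) so the application master uses the same value.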

Thanks

Henry

-- 
Paul Henry Tremblay
Robert Half Technology


getting error while storing data in Hbase

2017-04-01 Thread Chintan Bhatt
Hello all,
I'm running the following command in the HBase shell:

create "sample","cf"

and I am getting the following error:


ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able
to connect to ZooKeeper but the connection closes immediately. This could
be a sign that the server has too many connections (30 is the default).
Consider inspecting your ZK server logs for that error and then make sure
you are reusing HBaseConfiguration as often as you can. See HTable's
javadoc for more information.




Please suggest a solution for this.

-- 
CHINTAN BHATT 
Assistant Professor,
U & P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of Technology,
Charotar University of Science And Technology (CHARUSAT),
Changa-388421, Gujarat, INDIA.
http://www.charusat.ac.in
*Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/



Cuesheet - spark deployment

2017-04-01 Thread Deepu Raj
Hi Team,


I am trying to use cuesheet for Spark deployment and am getting the following
error on a Hortonworks VM:



2017-04-01 23:33:45 WARN  DFSClient - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/user/hdfs/.cuesheet/lib/0.10.0-scala-2.11-spark-2.1.0/spark-yarn_2.11-2.1.0.jar
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and 1 node(s) are excluded in this operation.



"C:\Program Files\Java\jdk1.8.0_121\bin\java" -Didea.launcher.port=7537 
"-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA 
Community Edition 2016.2.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\charsets.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\deploy.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\access-bridge-64.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\cldrdata.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\dnsns.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\jaccess.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\jfxrt.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\localedata.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\nashorn.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\sunec.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\sunjce_provider.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\sunmscapi.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\sunpkcs11.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\ext\zipfs.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\javaws.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\jce.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\jfr.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\jfxswt.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\jsse.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\management-agent.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\plugin.jar;C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\resources.jar;C:\Program 

Convert Dataframe to Dataset in pyspark

2017-04-01 Thread Selvam Raman
In Scala,
val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]

what is the equivalent code in pyspark?
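
PySpark has no typed Dataset API, so there is no direct equivalent of .as[String]; the closest analogue is to read the text into a DataFrame with a single string column named "value" and, if plain strings are needed, map it to an RDD. A minimal sketch, assuming Spark 2.x and the same path as above (on 1.6 the same read is available as sqlContext.read.text, which likewise returns a DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-lines-demo").getOrCreate()

# DataFrame with one string column named "value" -- the closest thing in
# PySpark to a Dataset[String]
df = spark.read.text("/home/spark/1.6/lines")

# If an RDD of plain Python strings is needed instead:
lines = df.rdd.map(lambda row: row.value)
print(lines.take(5))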

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


[Spark Core]: flatMap/reduceByKey seems to be quite slow with Long keys on some distributions

2017-04-01 Thread Richard Tsai
Hi all,

I'm using Spark to process some corpora and I need to count the occurrences
of each 2-gram. I started by counting tuples (wordID1, wordID2), which worked
fine except for the large memory usage and GC overhead caused by the
substantial number of small tuple objects. Then I tried packing each pair of
Ints into a Long; the GC overhead was indeed greatly reduced, but the run time
also increased several times.

I ran some small experiments with random data drawn from different
distributions. It seems that the performance issue only occurs with
exponentially distributed data. The example code is attached.

The job is split into two stages, flatMap() and count(). When counting
Tuples, flatMap() takes about 6s and count() takes about 2s, while when
counting Longs, flatMap() takes 18s and count() takes 10s.

I haven't looked into Spark's implementation of flatMap/reduceByKey, but I
guess Spark has some specialization for Long keys which happens to perform
poorly on certain distributions. Does anyone have any ideas about this?

Best wishes,
Richard

import scala.util.Random

// lines of word IDs, drawn from an (approximately) exponential distribution
val data = (1 to 5000).par.map { _ =>
  (1 to 1000).map { _ => (-1000 * Math.log(Random.nextDouble)).toInt }
}.seq

// count Tuples, fast
sc.parallelize(data).flatMap { line =>
  val first = line.iterator
  val second = line.iterator.drop(1)
  for (pair <- first.zip(second))
    yield (pair, 1L)
}.reduceByKey(_ + _).count()

// count Longs (each pair packed into a single Long), slow
sc.parallelize(data).flatMap { line =>
  val first = line.iterator
  val second = line.iterator.drop(1)
  for ((a, b) <- first.zip(second))
    yield ((a.toLong << 32) | b, 1L)
}.reduceByKey(_ + _).count()