Workarounds for OOM during serialization

2018-02-02 Thread J. McConnell
It would seem that I have hit SPARK-10787, an OOME during ClosureCleaner#ensureSerializable(). I am trying to run LSH over a SparseVector consisting of ~4M features with no more than 3K non-zero values per vector. I am hitting this OOME before the hashes are even calculated. I know the issue is
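
A minimal sketch of the kind of LSH job described above, using MinHashLSH from Spark ML over high-dimensional sparse vectors; the column names, dimensionality, and parameter values are illustrative assumptions rather than the poster's actual code, and if the OOME is on the driver during closure cleaning, raising spark.driver.memory is the usual first lever:

    import org.apache.spark.ml.feature.MinHashLSH
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lsh-sketch").getOrCreate()
    import spark.implicits._

    // ~4M-dimensional sparse vectors with only a few non-zero entries each
    val df = Seq(
      (0L, Vectors.sparse(4000000, Array(1, 7, 9), Array(1.0, 1.0, 1.0))),
      (1L, Vectors.sparse(4000000, Array(2, 7, 11), Array(1.0, 1.0, 1.0)))
    ).toDF("id", "features")

    val lsh = new MinHashLSH()
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = lsh.fit(df)
    // approximate self-join on Jaccard distance; 0.6 is an arbitrary threshold
    model.approxSimilarityJoin(df, df, 0.6).show()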

can we expect UUID type in Spark 2.3?

2018-02-02 Thread kant kodali
Hi All, can we expect a UUID type in Spark 2.3? It looks like it could help a lot of downstream sources with modeling. Thanks!
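
There is no native UUID type today; one common interim approach, sketched below under the assumption that UUIDs are simply carried as strings, is to generate them with a small UDF (the DataFrame df and the column name "uuid" are placeholders):

    import java.util.UUID
    import org.apache.spark.sql.functions.udf

    // zero-argument UDF that mints a random UUID per row
    val makeUuid = udf(() => UUID.randomUUID().toString)
    val withIds = df.withColumn("uuid", makeUuid())

Note that a value generated this way is not stable if a partition is recomputed, which is one reason a first-class type keeps coming up.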

Running Spark 2.2.1 with extra packages

2018-02-02 Thread Conconscious
Hi list, I have a Spark cluster with 3 nodes. I'm calling spark-shell with some packages to connect to AWS S3 and Cassandra: spark-shell \   --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,datastax:spark-cassandra-connector:2.0.6-s_2.11 \   --conf
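
A hedged sketch of where a command like that usually ends up, expressed as equivalent SparkSession configuration rather than the truncated --conf flags; the credential values, Cassandra host, and bucket path are placeholders, and the exact keys depend on how the cluster is set up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-cassandra")
      // S3A credentials passed through to Hadoop via the spark.hadoop.* prefix
      .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
      .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
      // contact point for the DataStax Cassandra connector
      .config("spark.cassandra.connection.host", "<cassandra-host>")
      .getOrCreate()

    val fromS3 = spark.read.json("s3a://<bucket>/<path>")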

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
Hi Vishnu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is processing time. Vishnu - The Spark documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) does indicate that it can deduplicate using a watermark. So I believe
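
The pattern the programming guide describes for this, sketched below with assumed column names ("id" as the dedup key and "eventTime" as the event-time column) on a streaming DataFrame streamingDf:

    // drop duplicates within the watermark window; late data older than
    // 10 minutes past the max observed event time is no longer tracked
    val deduped = streamingDf
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("id", "eventTime")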

[Spark Core] Limit the task duration (and kill it!)

2018-02-02 Thread Thomas Decaux
Hello, I would like to limit task duration to prevent big tasks such as « SELECT * FROM toto », or limit the CPU time, and then kill the task/job. Is that possible? (A kind of watchdog.) Many thanks, Thomas Decaux
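
There is no built-in per-task time limit in Spark core, but one watchdog-style workaround (a sketch only; the group name, timeout, and query are illustrative, and spark is assumed to be an existing SparkSession) is to run the work inside a job group and cancel the group from a timer:

    import java.util.concurrent.{Executors, TimeUnit}

    val sc = spark.sparkContext
    // interruptOnCancel asks executors to interrupt the running task threads
    sc.setJobGroup("bounded-query", "query with a time limit", interruptOnCancel = true)

    val watchdog = Executors.newSingleThreadScheduledExecutor()
    watchdog.schedule(new Runnable {
      def run(): Unit = sc.cancelJobGroup("bounded-query")
    }, 10, TimeUnit.MINUTES)

    try {
      spark.sql("SELECT * FROM toto").collect()
    } finally {
      watchdog.shutdownNow()
    }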

Re: spark 2.2.1

2018-02-02 Thread Mihai Iacob
Turns out it was the master recovery directory that was messing things up. What was written there was from Spark 2.0.2, and after replacing the master the recovery process would fail with that error, but there were no clues that that was what was happening.

Re: spark 2.2.1

2018-02-02 Thread Bill Schwanitz
What version of Java? On Feb 1, 2018 11:30 AM, "Mihai Iacob" wrote: > I am setting up a spark 2.2.1 cluster; however, when I bring up the master and workers (both on spark 2.2.1) I get this error. I tried spark 2.2.0 and get the same error. It works fine on spark 2.0.2.

Re: Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
I am using spark 2.1.0. On Fri, Feb 2, 2018 at 5:08 PM, Pralabh Kumar wrote: > Hi, I am performing a broadcast join where my small table is 1 GB. I am getting the following error: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required:

Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
Hi, I am performing a broadcast join where my small table is 1 GB. I am getting the following error: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 28869232. To avoid this, increase spark.kryoserializer.buffer.max value. I increased the value to
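
For reference, a sketch of where that setting goes; it has to be in place before the SparkContext is created (for example via --conf on spark-submit or in the builder as below), and the 1g value here is only illustrative, not a recommendation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("broadcast-join")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Kryo buffer ceiling; the default is 64m and the hard maximum is 2047m
      .config("spark.kryoserializer.buffer.max", "1g")
      .getOrCreate()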

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-02-02 Thread Biplob Biswas
Great to hear 2 different viewpoints, and thanks a lot for your input Michael. For now, our application performs an ETL process where it reads data from Kafka, stores it in HBase, and then performs basic enhancement and pushes data out on a Kafka topic. We have a conflict of opinion here as a few
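
For the Kafka-in/Kafka-out portion of such a pipeline, a Structured Streaming sketch follows; the broker address, topic names, checkpoint path, and the toy uppercase transform standing in for the enhancement step are all placeholders, and the HBase write would need its own sink and is omitted:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("kafka-etl").getOrCreate()
    import spark.implicits._

    // read the raw stream from the input topic
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input-topic")
      .load()

    // cast Kafka's binary key/value to strings and apply a placeholder transform
    val enriched = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .withColumn("value", upper($"value"))

    // write the result back out to another Kafka topic
    val query = enriched.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-etl")
      .start()

    query.awaitTermination()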