Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread kant kodali
I am currently using Spark 2.3.0. Will try it with 2.3.1. On Tue, Jul 3, 2018 at 3:12 PM, Shixiong(Ryan) Zhu wrote: > Which version are you using? There is a known issue regarding this and > it should be fixed in 2.3.1. See https://issues.apache.org/jira/browse/SPARK-23623 > for details. > > Best R

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread Shixiong(Ryan) Zhu
Which version are you using? There is a known issue regarding this and it should be fixed in 2.3.1. See https://issues.apache.org/jira/browse/SPARK-23623 for details. Best Regards, Ryan On Mon, Jul 2, 2018 at 3:56 AM, kant kodali wrote: > Hi All, > > I get the below error quite often when I do an
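For context, a minimal sketch of the kind of query being discussed: a stream-stream inner join over two Kafka sources in Structured Streaming. Topic names, the join key, and the watermark settings are assumptions for illustration, not taken from the thread; on Spark 2.3.0 such jobs could hit SPARK-23623, and the advice above is simply to upgrade to 2.3.1.

# Hypothetical sketch of a Kafka-to-console stream-stream inner join.
# Topic names, key column, and watermarks are assumptions, not from the thread.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-stream-join-sketch").getOrCreate()

def read_topic(topic):
    # Each source is a streaming DataFrame backed by a cached KafkaConsumer.
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", topic)
            .load()
            .selectExpr("CAST(key AS STRING) AS id", "timestamp"))

left = read_topic("left_topic").withWatermark("timestamp", "10 minutes")
right = (read_topic("right_topic")
         .withColumnRenamed("timestamp", "r_timestamp")
         .withWatermark("r_timestamp", "10 minutes"))

joined = left.join(right, on="id")  # stream-stream inner join

query = (joined.writeStream
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()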

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread Jungtaek Lim
Could you please describe the version of Spark and how you ran your app? If you don't mind sharing a minimal app which can reproduce this, it would be really great. - Jungtaek Lim (HeartSaVioR) On Mon, 2 Jul 2018 at 7:56 PM kant kodali wrote: > Hi All, > > I get the below error quite often w

Number of records per micro-batch in DStream vs Structured Streaming

2018-07-03 Thread subramgr
Hi, We have 2 Spark streaming jobs, one using DStreams and the other using Structured Streaming. I have observed that the number of records per micro-batch (per trigger in the case of Structured Streaming) is not the same between the two jobs. The Structured Streaming job has higher numbers compared to
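One plausible explanation (an assumption, not stated in the thread) is that the two APIs are rate-limited by different settings, so their per-batch record counts need not match. A hedged sketch of where each knob lives; the topic name and limits are placeholders:

# Sketch only: rate control differs between the two APIs. If neither setting is
# used, Structured Streaming reads everything available per trigger, which can
# make its batches larger than the DStream job's.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rate-control-sketch")
         # DStream direct Kafka: per-partition, per-second cap
         .config("spark.streaming.kafka.maxRatePerPartition", "1000")
         .getOrCreate())

# Structured Streaming Kafka source: absolute cap per trigger
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("maxOffsetsPerTrigger", "100000")
          .load())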

Re: [Structured Streaming] Metrics or logs of events that are ignored due to watermark

2018-07-03 Thread subramgr
Thanks

Re: Building SparkML vectors from long data

2018-07-03 Thread Patrick McCarthy
I'm still validating my results, but my solution for the moment looks like the below. I'm presently dealing with one-hot encoded values, so all the numbers in my array are 1: def udfMaker(feature_len): return F.udf(lambda x: SparseVector(feature_len, sorted(x), [1.0]*len(x)), VectorUDT()) in
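The snippet above is truncated by the archive; a self-contained sketch of the same approach, with hypothetical column names and a made-up feature_len, might look like this:

# Sketch of the approach in the (truncated) snippet above: collect the feature
# indices per id, then build a SparseVector of one-hot values.
# Column names and feature_len are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

spark = SparkSession.builder.appName("long-to-sparse-sketch").getOrCreate()

def udf_maker(feature_len):
    # All values are 1.0 because the features are one-hot encoded.
    return F.udf(lambda idx: SparseVector(feature_len, sorted(idx), [1.0] * len(idx)),
                 VectorUDT())

long_df = spark.createDataFrame(
    [("a", 0), ("a", 3), ("b", 1)], ["id", "feature_index"])

to_vec = udf_maker(feature_len=4)
wide_df = (long_df.groupBy("id")
           .agg(F.collect_set("feature_index").alias("indices"))
           .withColumn("features", to_vec("indices")))
wide_df.show(truncate=False)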

Re: [G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Aakash Basu
Thanks a ton! On Tue, Jul 3, 2018 at 6:26 PM, Vadim Semenov wrote: > As with typical `JAVA_OPTS`, you need to pass them as a single parameter: > > --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:-ResizePLAB" > > Also, you got an extra space in the parameter; there should be no space > after the col

Re: Inferring Data driven Spark parameters

2018-07-03 Thread Vadim Semenov
You can't change the executor/driver cores/memory on the fly once you've already started a Spark Context. On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu wrote: > > We aren't using Oozie or similar, moreover, the end to end job shall be > exactly the same, but the data will be extremely different (num
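A hedged sketch of that distinction: executor sizing is fixed when the SparkContext starts, while per-query settings such as spark.sql.shuffle.partitions can still be tuned afterwards (values below are placeholders):

# Static vs. runtime configuration, as a sketch with assumed values.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("static-vs-runtime-conf")
         .config("spark.executor.memory", "10g")   # effective only at context creation
         .config("spark.executor.cores", "5")      # effective only at context creation
         .getOrCreate())

# Runtime-tunable: can be adjusted between jobs inside the same application.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# There is no runtime equivalent for executor memory/cores; changing those
# means stopping the application and submitting it again with new values.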

Re: [G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Vadim Semenov
As with typical `JAVA_OPTS`, you need to pass them as a single parameter: --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:-ResizePLAB" Also, you got an extra space in the parameter; there should be no space after the colon symbol. On Tue, Jul 3, 2018 at 3:01 AM Aakash Basu wrote: > > Hi, > > I used
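For completeness, the same flags can also be set through SparkSession.builder before the context starts; this is an alternative sketch, not something from the thread itself:

# Same setting expressed programmatically instead of on the spark-submit line.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("g1gc-sketch")
         # Both JVM options go into a single extraJavaOptions value,
         # separated by a space, with no space after "-XX:".
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:-ResizePLAB")
         .getOrCreate())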

Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev, I would like to pass a Python user defined function to a Spark job developed using Scala, and have the return value of that function returned to the DF / Dataset API. Can someone please guide me on the best approach to do this? The Python function would be mostly transfor
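One common approach when the main job is Scala is RDD.pipe(), streaming records through an external Python process. Below is a hedged sketch of the Python side only; the script name, the one-CSV-line-per-record format, and the transformation are assumptions:

# transform.py -- sketch of the Python side of an RDD.pipe() approach.
# On the Scala side this would be driven by something like:
#   val out = df.rdd.map(_.mkString(",")).pipe("python3 transform.py")
import sys

def transform(value: str) -> str:
    # Placeholder for the real Python logic; here we just upper-case the field.
    return value.upper()

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    fields[-1] = transform(fields[-1])
    sys.stdout.write(",".join(fields) + "\n")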

Re: Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
We aren't using Oozie or similar. Moreover, the end-to-end job shall be exactly the same, but the data will be extremely different (number of continuous and categorical columns, vertical size, horizontal size, etc.). Hence, if there were a calculation of the parameters to arrive at a conc

Re: Inferring Data driven Spark parameters

2018-07-03 Thread Jörn Franke
Don't do this in your job. Create different jobs for the different types of workloads and orchestrate them using Oozie or similar. > On 3. Jul 2018, at 09:34, Aakash Basu wrote: > > Hi, > > Cluster - 5 node (1 Driver and 4 workers) > Driver Config: 16 cores, 32 GB RAM > Worker Config: 8 cores, 16 GB RA

Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
Hi, Cluster - 5 node (1 Driver and 4 workers) Driver Config: 16 cores, 32 GB RAM Worker Config: 8 cores, 16 GB RAM I'm using the below parameters from which I know the first chunk is cluster dependent and the second chunk is data/code dependent. --num-executors 4 --executor-cores 5 --executor-me
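A back-of-the-envelope sketch for the cluster-dependent chunk, derived from the node specs in the question using common rules of thumb (leave roughly one core and 1 GB per node for the OS, about 5 cores per executor, and hold back some memory for overhead); the heuristics are assumptions, not an official formula:

# Rule-of-thumb sizing from the worker specs in the question (4 workers,
# 8 cores and 16 GB each). The heuristics are assumptions.
workers = 4
cores_per_node = 8
mem_per_node_gb = 16

usable_cores = cores_per_node - 1                              # leave 1 core for the OS
executor_cores = 5                                             # common per-executor cap
executors_per_node = max(1, usable_cores // executor_cores)    # 1
num_executors = workers * executors_per_node                   # 4

usable_mem_gb = mem_per_node_gb - 1                            # leave 1 GB for the OS
executor_mem_gb = int((usable_mem_gb / executors_per_node) * 0.9)  # keep ~10% for overhead

print(f"--num-executors {num_executors} "
      f"--executor-cores {executor_cores} "
      f"--executor-memory {executor_mem_gb}G")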

[G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Aakash Basu
Hi, I used the below in the Spark Submit for using G1GC - --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" Now, I want to use -XX: -ResizePLAB of G1GC to avoid performance degradation caused by a large number of thread communications. How do I do it? I tried submitting in th