spark job delay when starting

2020-07-21 Thread Bulldog20630405
When running Spark jobs, we see a delay at startup. Running top -H -i -p showed that a single thread labeled "map-output-disp" was running at 99.7% for the majority of the delay period. The delay gets progressively worse as the partition count increases. It seems the delay comes from

Spark Structured Streaming join data results in missing result set

2020-07-21 Thread dong524dong
We are using Spark Structured Streaming to join two data streams. We read the data from Kafka starting from the earliest offset (the sender sends data cyclically, sending only one message at a time). The following are our Kafka configuration parameters: def

Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-21 Thread Gourav Sengupta
Hi, I am not sure about this, but is there any requirement to use S3a at all? Regards, Gourav On Tue, Jul 21, 2020 at 12:07 PM Steve Loughran wrote: > > > On Tue, 7 Jul 2020 at 03:42, Stephen Coy > wrote: > >> Hi Steve, >> >> While I understand your point regarding the mixing of Hadoop

Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-21 Thread Steve Loughran
On Tue, 7 Jul 2020 at 03:42, Stephen Coy wrote: > Hi Steve, > > While I understand your point regarding the mixing of Hadoop jars, this > does not address the java.lang.ClassNotFoundException. > > Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or > Hadoop 3.2. Not Hadoop 3.1.

Re: Future timeout

2020-07-21 Thread Dhaval Patel
Just a suggestion: it looks like it's timing out when you are broadcasting a big object. Generally it's not advisable to do so; if you can get rid of that, the program may behave consistently. On Tue, Jul 21, 2020 at 3:17 AM Piyush Acharya wrote: > spark.conf.set("spark.sql.broadcastTimeout", ##) > >
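The two suggestions in this thread (raising spark.sql.broadcastTimeout, or not broadcasting the large object at all) can also be applied at submit time. The following is only a minimal sketch, not something posted in the thread: the helper spark_submit_command, the script name my_job.py, and the values 600 and -1 are assumptions for illustration. Both configuration keys are standard Spark SQL settings.

```python
# Hypothetical values for illustration; tune them for your own job.
BROADCAST_SETTINGS = {
    # Seconds to wait for a broadcast to finish (Spark's default is 300).
    "spark.sql.broadcastTimeout": "600",
    # -1 disables automatic broadcast joins, so large tables are not broadcast.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}


def spark_submit_command(app, settings):
    """Build a spark-submit argument list with one --conf flag per setting."""
    command = ["spark-submit"]
    for key, value in settings.items():
        command += ["--conf", f"{key}={value}"]
    command.append(app)
    return command


if __name__ == "__main__":
    # my_job.py is a placeholder name for the application script.
    print(" ".join(spark_submit_command("my_job.py", BROADCAST_SETTINGS)))
```

The same keys can be set on a running session with spark.conf.set, as in the quoted message. Raising the timeout only hides a slow broadcast, while disabling automatic broadcast joins avoids it at the cost of a shuffle-based join.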

Refreshing static data with streaming data at regular Intervals

2020-07-21 Thread Debabrata Ghosh
Hi All, We have a static DataFrame as follows.

+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527851|
|  2|1540525602|
|  3|1530529187|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+

We also have a live stream of events, a streaming DataFrame which contains id and updated

Re: Need your help!! (URGENT Code works fine when submitted as java main but part of data missing when running as Spark-Submit)

2020-07-21 Thread Pasha Finkelshteyn
Hi Rachana, Could you please provide us with more details: a minimal repro, Spark version, Java version, and Scala version? On 20/07/21 08:27AM, Rachana Srivastava wrote: > I am unable to identify the root cause of why my code is missing data when I > run as spark-submit but the code works fine when I

Need your help!! (URGENT Code works fine when submitted as java main but part of data missing when running as Spark-Submit)

2020-07-21 Thread Rachana Srivastava
I am unable to identify the root cause of why my code is missing data when I run as spark-submit but the code works fine when I run as java main. Any idea

Re: Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model givens inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-21 Thread Ben Smith
I can also recreate with the very latest master branch (3.1.0-SNAPSHOT) if I compile it locally

Re: Future timeout

2020-07-21 Thread Piyush Acharya
spark.conf.set("spark.sql.broadcastTimeout", ##) On Mon, Jul 20, 2020 at 11:51 PM Amit Sharma wrote: > Please help on this. > > > Thanks > Amit > > On Fri, Jul 17, 2020 at 9:10 AM Amit Sharma wrote: > >> Hi, sometimes my spark streaming job throw this exception Futures timed >> out after