Intermittent Kryo serialization failures in Spark

2019-07-10 Thread Jerry Vinokurov
Hi all, I am experiencing a strange intermittent failure of my Spark job that results from serialization issues in Kryo. Here is the stack trace:

Caused by: java.lang.ClassNotFoundException: com.mycompany.models.MyModel
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382) …
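One way to make such failures surface deterministically is to require Kryo registration up front. A minimal sketch, assuming the root cause is an unregistered model class — only the class name comes from the stack trace; the app name and session setup are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Register the class from the stack trace with Kryo and fail fast on any
// class that is not registered, instead of failing intermittently at runtime.
val spark = SparkSession.builder()
  .appName("kryo-registration-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrationRequired", "true")
  .config("spark.kryo.classesToRegister", "com.mycompany.models.MyModel")
  .getOrCreate()
```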

Problems running TPC-H on Raspberry Pi Cluster

2019-07-10 Thread agg212
We are trying to benchmark TPC-H (scale factor 1) on a 13-node Raspberry Pi 3B+ cluster (1 master, 12 workers). Each node has 1GB of RAM and a quad-core processor, running Ubuntu Server 18.04. The cluster is using the Spark standalone scheduler with the *.tbl files from TPC-H's dbgen tool stored in …
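For reference, dbgen's *.tbl files are pipe-delimited with a trailing delimiter on every row. A minimal sketch of loading one table, assuming an existing SparkSession `spark` — the HDFS path is illustrative; the column names are the lineitem fields from the TPC-H spec:

```scala
// dbgen ends each row with '|', so the CSV reader sees one extra empty
// column; drop it before applying the 16 real lineitem field names.
val raw = spark.read.option("sep", "|").csv("hdfs:///tpch/lineitem.tbl")
val lineitem = raw.drop(raw.columns.last).toDF(
  "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity",
  "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus",
  "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct",
  "l_shipmode", "l_comment")
```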

Re: Parquet 'bucketBy' creates a ton of files

2019-07-10 Thread Silvio Fiorito
It really depends on the use case. Bucketing is storing the data already hash-partitioned. So, if you frequently perform aggregations or joins on the bucketing column(s), then it can save you a shuffle. You need to keep in mind that for joins to completely avoid a shuffle, both tables would need to …
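A minimal sketch of that both-sides condition — table and column names are illustrative; note that bucketBy requires saveAsTable, and both tables must share the same bucket count and key:

```scala
// Write both sides bucketed identically so the later join can skip the shuffle.
ordersDf.write.bucketBy(16, "customer_id").sortBy("customer_id")
  .saveAsTable("orders_bucketed")
customersDf.write.bucketBy(16, "customer_id")
  .saveAsTable("customers_bucketed")

// With matching bucket counts on the join key, no exchange is needed here.
val joined = spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
```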

Re: Set TimeOut and continue with other tasks

2019-07-10 Thread Wei Chen
I am currently trying to use Future/Await to set a timeout inside the map-reduce. However, the tasks now fail instead of getting stuck, even if I have a Try match to catch it. Does anyone have an idea why? The code is like ```Scala files.map { file => Try { def tmpFunc(): Boolean = { FILE CONVER…
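For comparison, a minimal sketch of the pattern being described, with the timeout-throwing Await kept inside the Try — the file list, the duration, and convert() are all illustrative stand-ins:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

// Stand-in for the real per-file conversion work.
def convert(file: String): Boolean = { Thread.sleep(100); true }

val files = Seq("a.dat", "b.dat")
val results = files.map { file =>
  // Await.result throws TimeoutException, so it must run *inside* the Try;
  // a timed-out file then yields false instead of failing the whole task.
  Try(Await.result(Future(convert(file)), 60.seconds)).getOrElse(false)
}
```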

Re: Spark structural streaming sinks output late

2019-07-10 Thread Magnus Nilsson
Well, you should get updates every 10 seconds as long as there are events surviving your quite aggressive watermark condition. Spark will try to drop (not guaranteed) all events with a timestamp more than 500 milliseconds before the current watermark timestamp. Try to increase the watermark timespan …
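A minimal sketch of widening the watermark — the `events` streaming source, the eventTime column, and the window width are illustrative:

```scala
import org.apache.spark.sql.functions.{col, window}

// With a 500 ms watermark, almost every real-world event arrives "too late";
// a wider bound keeps late-but-useful events in the aggregation.
val counts = events
  .withWatermark("eventTime", "10 minutes") // instead of "500 milliseconds"
  .groupBy(window(col("eventTime"), "10 seconds"))
  .count()
```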

RE: Spark structural streaming sinks output late

2019-07-10 Thread Kamalanathan Venkatesan
Hello, Any observations on what I am doing wrong? Thanks, -Kamal From: Kamalanathan Venkatesan Sent: Tuesday, July 09, 2019 7:25 PM To: 'user@spark.apache.org' Subject: Spark structural streaming sinks output late Hello, I have the below Spark structural streaming code and I was expecting the res…

Re: Set TimeOut and continue with other tasks

2019-07-10 Thread Gourav Sengupta
Is there a way you can identify those patterns in a file or in its name and then just tackle them in separate jobs? I use the function input_file_name() to find the name of the input file for each record and then filter out certain files. Regards, Gourav On Wed, Jul 10, 2019 at 6:47 AM Wei Chen wrote …
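A minimal sketch of that approach — the input path and the file-name pattern are illustrative:

```scala
import org.apache.spark.sql.functions.{col, input_file_name}

// Tag every record with the file it came from, then route known-bad files
// to their own job.
val tagged = spark.read.parquet("/data/input").withColumn("src", input_file_name())
val problem = tagged.filter(col("src").contains("legacy_"))
val clean   = tagged.filter(!col("src").contains("legacy_"))
```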

Re: Parquet 'bucketBy' creates a ton of files

2019-07-10 Thread Gourav Sengupta
Yeah, makes sense. Also, is there any massive performance improvement using bucketBy in comparison to sorting? Regards, Gourav On Thu, Jul 4, 2019 at 1:34 PM Silvio Fiorito wrote: > You need to first repartition (at a minimum by bucketColumn1) since each > task will write out the buckets/files. I…
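A minimal sketch of the repartition-first advice, with illustrative names — repartitioning on the bucket column means each bucket's data lands in one task instead of every task emitting a file per bucket:

```scala
import org.apache.spark.sql.functions.col

val numBuckets = 16
// Without this repartition, every task can write a file for every bucket,
// multiplying the file count; with it, each task writes only its own buckets.
df.repartition(numBuckets, col("bucketColumn1"))
  .write
  .bucketBy(numBuckets, "bucketColumn1")
  .sortBy("bucketColumn1")
  .saveAsTable("df_bucketed")
```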

Re: Pass row to UDF and select column based on pattern match

2019-07-10 Thread Gourav Sengupta
Just out of curiosity, why are you trying to use a UDF when the corresponding function is available? Or why not use SQL instead? Regards, Gourav On Tue, Jul 9, 2019 at 7:25 PM Femi Anthony wrote: > How can I achieve the following by passing a row to a UDF? > > val df1 = df.withColumn("co…
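For reference, the usual way to hand a whole row to a UDF is to wrap the columns in struct(). A minimal sketch with a hypothetical pick-the-first-matching-column rule — all column names and the pattern are illustrative:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// The struct() wrapper arrives in the UDF as a Row carrying its schema,
// so fields can be selected by a name pattern at runtime.
val pickByPattern = udf { (row: Row) =>
  row.schema.fieldNames.find(_.startsWith("col_"))
    .map(name => String.valueOf(row.getAs[Any](name)))
    .orNull
}

val df1 = df.withColumn("picked", pickByPattern(struct(df.columns.map(col): _*)))
```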