Hi all,
I am experiencing a strange, intermittent failure in my Spark job caused by
Kryo serialization issues. Here is the stack trace:
Caused by: java.lang.ClassNotFoundException: com.mycompany.models.MyModel
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>
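A common first thing to check for this kind of ClassNotFoundException is whether the class is registered with Kryo and whether the jar containing it actually ships to the executors. A minimal sketch (the jar path is hypothetical):
```Scala
import org.apache.spark.sql.SparkSession

// Register the model class with Kryo and ship the jar that contains it.
val spark = SparkSession.builder()
  .appName("kryo-check")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister", "com.mycompany.models.MyModel")
  .config("spark.jars", "/path/to/my-models.jar") // hypothetical path
  .getOrCreate()
```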
We are trying to benchmark TPC-H (scale factor 1) on a 13-node Raspberry Pi
3B+ cluster (1 master, 12 workers). Each node has 1 GB of RAM and a quad-core
processor, running Ubuntu Server 18.04. The cluster uses the Spark
standalone scheduler, with the *.tbl files from TPC-H's dbgen tool stored in
It really depends on the use case. Bucketing stores the data already
hash-partitioned, so if you frequently perform aggregations or joins on the
bucketing column(s), it can save you a shuffle. Keep in mind that for joins
to completely avoid a shuffle, both tables need to be bucketed on the same
column(s) with the same number of buckets.
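For example, a minimal sketch of the bucketed-join case (table names, the join column, and the bucket count are invented):
```Scala
// Write both sides bucketed identically on the join key.
ordersDF.write.bucketBy(16, "cust_id").sortBy("cust_id").saveAsTable("orders_b")
customersDF.write.bucketBy(16, "cust_id").sortBy("cust_id").saveAsTable("customers_b")

// Same column and same bucket count on both sides lets the join skip the shuffle.
val joined = spark.table("orders_b").join(spark.table("customers_b"), "cust_id")
```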
I am currently trying to use Future/Await to set a timeout inside the
map-reduce.
However, the tasks now fail instead of getting stuck, even though I have a
Try/match to catch it.
Does anyone have an idea why?
The code looks like this:
```Scala
files.map { file =>
  Try {
    def tmpFunc(): Boolean = {
      // FILE CONVER... (rest of the snippet was cut off in the original)
```
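For what it's worth, the usual shape of this pattern (a sketch, not the original code; `doConversion` and the 10-second timeout are invented) wraps `Await.result` itself in the `Try`, since that is the call that throws the `TimeoutException`:
```Scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the truncated tmpFunc body.
def doConversion(file: String): Boolean = true

def convertWithTimeout(file: String): Option[Boolean] =
  Try(Await.result(Future(doConversion(file)), 10.seconds)) match {
    case Success(ok)                  => Some(ok)
    case Failure(_: TimeoutException) => None // timed out: skip instead of failing the task
    case Failure(other)               => throw other
  }
```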
Well, you should get updates every 10 seconds as long as there are events
surviving your quite aggressive watermark condition. Spark will try to drop
(not guaranteed) all events with a timestamp more than 500 milliseconds
before the current watermark timestamp. Try increasing the watermark
timespan.
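For example, relaxing the watermark to something like 10 minutes (the column names and window duration are invented) would look like:
```Scala
import org.apache.spark.sql.functions.{col, window}

// events is a streaming DataFrame with an "eventTime" timestamp column.
val counts = events
  .withWatermark("eventTime", "10 minutes") // far less aggressive than 500 ms
  .groupBy(window(col("eventTime"), "10 seconds"), col("key"))
  .count()
```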
Hello,
Any observations on what I am doing wrong?
Thanks,
-Kamal
From: Kamalanathan Venkatesan
Sent: Tuesday, July 09, 2019 7:25 PM
To: 'user@spark.apache.org'
Subject: Spark Structured Streaming sink outputs late
Hello,
I have the Spark Structured Streaming code below and I was expecting the res
Is there a way you can identify those patterns in a file or in its name and
then just tackle them in separate jobs? I use the function
input_file_name() to find the name of input file of each record and then
filter out certain files.
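For example, something along these lines (the path and the filtered pattern are invented):
```Scala
import org.apache.spark.sql.functions.{col, input_file_name}

val df = spark.read.parquet("/data/input") // hypothetical path
  .withColumn("source_file", input_file_name())
  .filter(!col("source_file").contains("bad_batch")) // hypothetical pattern
```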
Regards,
Gourav
On Wed, Jul 10, 2019 at 6:47 AM Wei Chen wrote:
Yeah, that makes sense. Also, is there any massive performance improvement
from using bucketBy in comparison to sorting?
Regards,
Gourav
On Thu, Jul 4, 2019 at 1:34 PM Silvio Fiorito wrote:
> You need to first repartition (at a minimum by bucketColumn1) since each
> task will write out the buckets/files. I
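A sketch of what that advice looks like in code (the bucket count, column, and table names are placeholders):
```Scala
import org.apache.spark.sql.functions.col

// Repartition on the bucket column first so each task writes complete buckets.
df.repartition(col("bucketColumn1"))
  .write
  .bucketBy(8, "bucketColumn1")
  .sortBy("bucketColumn1")
  .saveAsTable("my_bucketed_table") // bucketBy requires saveAsTable
```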
Just out of curiosity, why are you trying to use a UDF when the
corresponding built-in function is available? Or why not use SQL instead?
Regards,
Gourav
On Tue, Jul 9, 2019 at 7:25 PM Femi Anthony wrote:
> How can I achieve the following by passing a row to a udf?
>
> val df1 = df.withColumn("co
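One common way to do this is to wrap the columns in struct() so the UDF receives them as a Row. A sketch (the column names are invented):
```Scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// The UDF gets the struct as a Row and can read fields by name.
val rowUdf = udf((r: Row) => r.getAs[Int]("a") + r.getAs[Int]("b"))

val df1 = df.withColumn("combined", rowUdf(struct(df.columns.map(col): _*)))
```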