Spark Dataset transformations for time-based events

2018-12-25 Thread Debajyoti Roy
Hope everyone is enjoying their holidays. If anyone here has run into these time-based event transformation patterns or has a strong opinion about the approach, please let me know / reply on SO: 1. Enrich using as-of-time:
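
For context on the first pattern: Spark has no built-in as-of join, so a common approach is a range join followed by a row_number window that keeps, for each event, the latest dimension row effective at or before the event's timestamp. A minimal spark-shell sketch; the column names and toy data are hypothetical, nothing below comes from the original question:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    import spark.implicits._

    // Toy data standing in for the real inputs: events to enrich, and a
    // slowly changing dimension keyed by id with an effective-from time.
    val events = Seq(("a", 100L), ("a", 205L)).toDF("id", "eventTime")
    val dim    = Seq(("a", 90L, "v1"), ("a", 200L, "v2")).toDF("id", "effectiveTime", "attr")

    // Range-join each event to every dimension row already effective at
    // the event's timestamp ...
    val candidates = events.join(dim,
      events("id") === dim("id") && dim("effectiveTime") <= events("eventTime"))

    // ... then keep only the most recent matching row per event.
    val latestFirst = Window
      .partitionBy(events("id"), events("eventTime"))
      .orderBy(dim("effectiveTime").desc)

    val enriched = candidates
      .withColumn("rn", row_number().over(latestFirst))
      .filter($"rn" === 1)
      .drop("rn")
      .drop(dim("id"))

    enriched.show()
    // +---+---------+-------------+----+
    // | id|eventTime|effectiveTime|attr|
    // +---+---------+-------------+----+
    // |  a|      100|           90|  v1|
    // |  a|      205|          200|  v2|
    // +---+---------+-------------+----+

The range join can blow up if a key has many dimension versions; bucketing or restricting the join window is the usual mitigation.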

Re: How to clean up logs-dirs and local-dirs of running spark streaming in yarn cluster mode

2018-12-25 Thread Fawze Abujaber
http://shzhangji.com/blog/2015/05/31/spark-streaming-logging-configuration/ On Wed, Dec 26, 2018 at 1:05 AM shyla deshpande wrote: > Please point me to any documentation if available. Thanks > > On Tue, Dec 18, 2018 at 11:10 AM shyla deshpande > wrote: > >> Is there a way to do this without
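
The linked post boils down to shipping a custom log4j configuration with a rolling appender, so container logs rotate instead of growing without bound. A sketch along those lines; the file name, sizes, and backup count are illustrative:

    # log4j.properties, shipped alongside the job
    log4j.rootCategory=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
    log4j.appender.rolling.maxFileSize=50MB
    log4j.appender.rolling.maxBackupIndex=5
    log4j.appender.rolling.file=${spark.yarn.app.container.log.dir}/spark.log

Ship the file with --files log4j.properties and point both sides at it by adding -Dlog4j.configuration=log4j.properties to spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.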

Re: Tuning G1GC params for aggressive garbage collection?

2018-12-25 Thread Akshay Mendole
Hi, Yes. I did try increasing it to the number of cores; it did not work as expected. I know System.gc() is not the proper way, but since this is a batch application, I'm okay if it spends more time doing GC and uses considerably more CPU. Frequent System.gc() calls with a lower xmx (6 GB)
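
For anyone reproducing this experiment: executor G1 flags are usually passed through spark.executor.extraJavaOptions at submit time. The values below are illustrative, not a recommendation from this thread; ConcGCThreads is possibly the knob the poster refers to raising to the core count, and InitiatingHeapOccupancyPercent makes concurrent cycles start earlier:

    spark-submit \
      --conf spark.executor.memory=6g \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
      ...

The GC print flags make it possible to verify from the executor logs whether the collector is actually keeping up, rather than guessing from CPU usage.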

Re: How to clean up logs-dirs and local-dirs of running spark streaming in yarn cluster mode

2018-12-25 Thread shyla deshpande
Please point me to any documentation if available. Thanks On Tue, Dec 18, 2018 at 11:10 AM shyla deshpande wrote: > Is there a way to do this without stopping the streaming application in > yarn cluster mode? > > On Mon, Dec 17, 2018 at 4:42 PM shyla deshpande > wrote: > >> I get the ERROR >>

Re: Packaging kafka certificates in uber jar

2018-12-25 Thread Anastasios Zouzias
Hi Colin, You can place your certificates under src/main/resources and include them in the uber JAR; see e.g. https://stackoverflow.com/questions/40252652/access-files-in-resources-directory-in-jar-from-apache-spark-streaming-context Best, Anastasios On Mon, Dec 24, 2018 at 10:29 PM Colin
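
One wrinkle worth knowing: Kafka's SSL settings expect filesystem paths, so a resource bundled in the uber JAR has to be copied out to a local file before use. A minimal sketch, assuming a hypothetical truststore name and environment variable, neither of which comes from the thread:

    import java.nio.file.{Files, StandardCopyOption}

    // Copy a classpath resource (e.g. from src/main/resources) out of the
    // uber JAR into a local temp file and return its path.
    def materialize(resource: String): String = {
      val in = getClass.getResourceAsStream(s"/$resource")
      val tmp = Files.createTempFile("kafka-", ".jks")
      tmp.toFile.deleteOnExit()
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
      in.close()
      tmp.toString
    }

    val kafkaParams = Map(
      "security.protocol" -> "SSL",
      "ssl.truststore.location" -> materialize("kafka.truststore.jks"),
      "ssl.truststore.password" -> sys.env("TRUSTSTORE_PASSWORD")
    )

Since this runs on every JVM that touches Kafka, it works the same on executors as on the driver, which is the usual pain point with file-based secrets in cluster mode.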

Re: Spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Jörn Franke
It could be that Spark checks each file after the job, and with many files on HDFS this can take some time. I think this is also format-specific (e.g., for Parquet it does some checks) and does not occur with all formats. This time is not really highlighted in the UI (maybe worth raising an

Re: Spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Jörn Franke
Do you have a lot of small files? Do you use S3 or similar? It could be that Spark does some IO-related tasks. > On 25.12.2018 at 12:51, Akshay Mendole wrote: > > Hi, > As you can see in the picture below, the application's last job finished > at around 13:45 and I could see the output
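
If the delay turns out to be job commit, where with the default v1 FileOutputCommitter the driver serially renames every task's output file into its final location after the last job finishes, one commonly cited mitigation is the v2 committer, which renames during task commit instead. A sketch of the submit-time setting; treat it as something to test, since v2 weakens atomicity and partial output can be visible after a failure:

    spark-submit \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      ...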

Re: Spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Akshay Mendole
Yes. We have a lot of small files (10K files of around 100 MB each) that we read from and write to HDFS. But the timeline shows the jobs completed quite some time ago, and the output directory was also updated at that time. Thanks, Akshay On Tue, Dec 25, 2018 at 5:30 PM Jörn Franke wrote: >
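
If the post-job wait scales with the number of output files, one lever orthogonal to the committer choice is simply writing fewer, larger files. A hedged sketch; `results`, the partition count, and the output path are placeholders, not details from this thread:

    // Fewer output files means fewer per-file checks and renames at
    // job-commit time, at the cost of larger individual files.
    results
      .coalesce(500)   // tune so each file stays near the HDFS block size
      .write
      .mode("overwrite")
      .parquet("hdfs:///path/to/output")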

Spark application takes significant time to succeed even after all jobs are completed

2018-12-25 Thread Akshay Mendole
Hi, As you can see in the picture below, the application's last job finished at around 13:45 and I could see the output directory updated with the results. Yet, the application took about 20 more minutes to change its status. What could be the reason for this? Is this known behavior? The