date:20200501

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah

Thanks Burak! Appreciate it. This makes sense. How do you suggest we make sure the resulting data doesn't produce tiny files if we are not on Databricks yet and cannot leverage Delta Lake features? Also, regarding the checkpointing feature, do you have an active blog/article I can take a look at to try out an

Path style access fs.s3a.path.style.access property is not working in spark code

2020-05-01 Thread Aniruddha P Tekade

Hello Users, I am using on-premise object storage and am able to perform operations on different buckets using aws-cli. However, when I try to use the same path from my Spark code, it fails. Here are the details - Added dependencies in build.sbt - - hadoop-aws-2.7.4.ja -
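For readers landing on this thread, here is a minimal sketch of the settings usually involved in path-style access to an S3-compatible store. It is not taken from the original message: the helper name `s3a_path_style_conf`, the endpoint URL, and the credential strings are all hypothetical placeholders. The property names are standard Hadoop S3A keys, prefixed with `spark.hadoop.` so Spark forwards them to the Hadoop configuration. As far as I know, `fs.s3a.path.style.access` was introduced in Hadoop 2.8, so a 2.7.x hadoop-aws artifact may simply ignore it; treat that as something to verify against your own build.

```python
def s3a_path_style_conf(endpoint, access_key, secret_key):
    """Build Spark settings for an S3-compatible store using path-style URLs.

    All three arguments are caller-supplied; the values used below are
    placeholders, not real endpoints or credentials.
    """
    return {
        # Point S3A at the on-premise endpoint instead of AWS.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Address buckets as <endpoint>/<bucket> rather than <bucket>.<endpoint>.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


# Hypothetical endpoint and credentials, for illustration only.
conf = s3a_path_style_conf(
    "http://objectstore.example.internal:9000", "ACCESS_KEY", "SECRET_KEY"
)
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The printed lines can be appended to a `spark-submit` command, or each pair can be passed to `SparkSession.builder.config(key, value)` before the session is created.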

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz

Hi Rishi, That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details!
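The pattern Burak describes can be sketched roughly as follows in PySpark. This is an illustration under assumptions, not code from the thread: the function name `run_incremental_batch`, the Parquet source and sink, and all directory arguments are hypothetical. The caller passes in an existing `SparkSession` and a schema, since file-based streaming sources need one declared up front.

```python
def run_incremental_batch(spark, schema, source_dir, sink_dir, checkpoint_dir):
    """Process whatever is new under source_dir, then stop.

    `spark` is an existing SparkSession; the paths and schema are
    caller-supplied and purely illustrative.
    """
    # Read the input directory as a stream so Spark tracks which files
    # have already been processed.
    stream = spark.readStream.schema(schema).parquet(source_dir)

    query = (
        stream.writeStream
        .format("parquet")
        # Progress is recorded here; the next run resumes from this point.
        .option("checkpointLocation", checkpoint_dir)
        # Run a single micro-batch over all available data, then terminate.
        .trigger(once=True)
        .start(sink_dir)
    )
    query.awaitTermination()
    return query
```

Scheduling this function from cron or a workflow tool gives batch-style runs, while the checkpoint directory keeps each run from reprocessing files that an earlier run already handled.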

[spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah

Hi All, I recently started playing with Spark Streaming, and the checkpoint location feature looks very promising. I wonder if anyone has an opinion about using Spark Streaming with the checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming

Re: Spark job stuck at s3a-file-system metrics system started

2020-05-01 Thread Gourav Sengupta

Hi, I think that we should stop using S3a, and use S3. Please try reading about EMRFS and how it provides fantastic advantages :) Regards, Gourav Sengupta On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade wrote: > Hello, > > I am trying to run a spark job that is trying to write the data