Re: [spark streaming] checkpoint location feature for batch processing
Thanks Burak! Appreciate it, this makes sense. How do you suggest we make sure the resulting data doesn't end up as tiny files, given that we are not on Databricks yet and cannot leverage Delta Lake features? Also, regarding the checkpointing feature, do you have an active blog/article I can take a look at to try out an example?

On Fri, May 1, 2020 at 7:22 PM Burak Yavuz wrote:

> Hi Rishi,
>
> That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details!
>
> https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
>
> Best,
> Burak
>
> On Fri, May 1, 2020 at 2:55 PM Rishi Shah wrote:
>
>> Hi All,
>>
>> I recently started playing with spark streaming, and the checkpoint location feature looks very promising. I wonder if anyone has an opinion about using spark streaming with the checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming with the checkpoint location feature to achieve fault tolerance in a batch processing application?
>>
>> --
>> Regards,
>>
>> Rishi Shah

--
Regards,
Rishi Shah
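One common way to keep a Trigger.Once run from fragmenting its output into many small files without any Delta Lake features is to compact each micro-batch yourself with foreachBatch. A minimal sketch, in which the source path, schema, output path, and the target of 8 files per batch are all illustrative assumptions rather than anything from this thread:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("compact-batch").getOrCreate()

// File sources need an explicit schema; this one is purely illustrative.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

// Coalesce each micro-batch down to a small, fixed number of output files.
val compactBatch: (DataFrame, Long) => Unit = { (batch, _) =>
  batch.coalesce(8)
    .write
    .mode("append")
    .parquet("/data/compacted")            // placeholder output directory
}

val query = spark.readStream
  .schema(schema)
  .parquet("/data/incoming")               // placeholder input directory
  .writeStream
  .trigger(Trigger.Once())                 // run once over available data, then stop
  .option("checkpointLocation", "/data/checkpoints/compact")
  .foreachBatch(compactBatch)
  .start()

query.awaitTermination()

The coalesce target is a knob to tune against typical batch size; the checkpoint still tracks which input files each run has already processed.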
Path style access fs.s3a.path.style.access property is not working in spark code
Hello Users,

I am using on-premise object storage and am able to perform operations on different buckets using the aws-cli. However, when I try to use the same path from my Spark code, it fails. Here are the details.

Added dependencies in build.sbt:

- hadoop-aws-2.7.4.jar
- aws-java-sdk-1.7.4.jar

Spark Hadoop configuration is set up as:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT);
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY);
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY);
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

And now I try to write data to my custom S3 endpoint as follows:

val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year",
    column("time"),
    column("quality"),
    column("PM25"))
  .writeStream
  .partitionBy("year", "month", "day")
  .format("csv")
  .outputMode("append")
  .option("path", "s3a://test-bucket/")

val streamingQuery: StreamingQuery = dataStreamWriter.start()

However, I am getting an error that AmazonHttpClient is not able to execute the HTTP request, and the bucket name is being prepended to the endpoint URL. It seems the Hadoop configuration is not being picked up here:

20/05/01 16:51:37 INFO AmazonHttpClient: Unable to execute HTTP request: test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com

Is there anything I am missing in the configuration? Even after setting path-style access to true, it is not working.

--
Aniruddha
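Not something confirmed in this thread, but one thing worth ruling out: the same S3A settings can be supplied when the SparkSession is built (or as --conf spark.hadoop.fs.s3a.* flags on spark-submit), which guarantees they reach the Hadoop configuration on the driver and every executor before any filesystem is instantiated. A sketch with placeholder endpoint and credentials:

import org.apache.spark.sql.SparkSession

// Placeholders standing in for the real endpoint and credentials.
val ENDPOINT   = "s3-region0.cloudian.com"
val ACCESS_KEY = sys.env("S3_ACCESS_KEY")
val SECRET_KEY = sys.env("S3_SECRET_KEY")

val spark = SparkSession.builder()
  .appName("s3a-path-style-access")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
  .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
  .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
  .config("spark.hadoop.fs.s3a.path.style.access", "true")  // ask for bucket-in-path URLs
  .getOrCreate()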
Re: [spark streaming] checkpoint location feature for batch processing
Hi Rishi,

That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details!

https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html

Best,
Burak

On Fri, May 1, 2020 at 2:55 PM Rishi Shah wrote:

> Hi All,
>
> I recently started playing with spark streaming, and checkpoint location feature looks very promising. I wonder if anyone has an opinion about using spark streaming with checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming with checkpoint location feature to achieve fault tolerance in batch processing application?
>
> --
> Regards,
>
> Rishi Shah
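To make the pattern concrete, here is a minimal sketch of a Trigger.Once query (the paths, schema, and formats are placeholders, not anything from this thread): each scheduled run processes whatever arrived since the last run and then exits, and the checkpoint location is what lets the next run pick up exactly where this one stopped.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("trigger-once-batch").getOrCreate()

// File sources need an explicit schema; this one is purely illustrative.
val schema = new StructType()
  .add("ts", TimestampType)
  .add("value", DoubleType)

val query = spark.readStream
  .schema(schema)
  .json("/data/landing")                                    // placeholder input directory
  .writeStream
  .format("parquet")
  .option("path", "/data/output")                           // placeholder output directory
  .option("checkpointLocation", "/data/checkpoints/job1")   // resume point across runs
  .trigger(Trigger.Once())                                  // process available data, then stop
  .start()

query.awaitTermination()

Run this from cron or any scheduler; because progress lives in the checkpoint, a failed run can simply be retried and the query resumes from where it left off.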
[spark streaming] checkpoint location feature for batch processing
Hi All, I recently started playing with spark streaming, and checkpoint location feature looks very promising. I wonder if anyone has an opinion about using spark streaming with checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming with checkpoint location feature to achieve fault tolerance in batch processing application? -- Regards, Rishi Shah
Re: Spark job stuck at s3a-file-system metrics system started
Hi,

I think that we should stop using S3A and use S3 instead. Please read about EMRFS and how it provides fantastic advantages :)

Regards,
Gourav Sengupta

On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade wrote:

> Hello,
>
> I am trying to run a spark job that writes data into a custom s3 endpoint bucket. But I am stuck at this line of output and the job is not moving forward at all -
>
> 20/04/29 16:03:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
> 20/04/29 16:03:59 INFO SharedState: Warehouse path is 'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
> 20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
> 20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system started
>
> After a long time of waiting it shows this -
>
> org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection refused)
>
> However, I am able to access this bucket from the aws cli on the same machine. I don't understand why it says it is unable to execute the HTTP request.
>
> I am using -
>
> spark 3.0.0-preview2
> hadoop-aws 3.2.0
> aws-java-sdk-bundle 1.11.375
>
> My spark code has the following properties set in the hadoop configuration -
>
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT);
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY);
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY);
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
>
> Can someone help me understand what is wrong here? Is there anything else I need to configure? The custom s3 endpoint and its keys are valid and working from the aws cli profile. What is wrong with the Scala code here?
>
> val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
>     dayofmonth(current_date()) as "day",
>     month(current_date()) as "month",
>     year(current_date()) as "year")
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
>   .outputMode("append")
>   .trigger(Trigger.ProcessingTime("15 seconds"))
>   .partitionBy("year", "month", "day")
>   .option("path", "s3a://test-bucket")
>
> val streamingQuery: StreamingQuery = dataStreamWriter.start()
>
> Aniruddha
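For what it's worth, one quick way to take the streaming query out of the picture is to hit the endpoint directly with Hadoop's FileSystem API; a "Connection refused" here reproduces the same failure outside Spark. A sketch that assumes the hadoop-aws jars are on the classpath and uses placeholder credentials:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholders standing in for the real endpoint and credentials.
val conf = new Configuration()
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("fs.s3a.endpoint", "s3-region0.mycloud.com")
conf.set("fs.s3a.access.key", sys.env("S3_ACCESS_KEY"))
conf.set("fs.s3a.secret.key", sys.env("S3_SECRET_KEY"))
conf.set("fs.s3a.path.style.access", "true")
conf.set("fs.s3a.connection.ssl.enabled", "true")  // set to "false" if the endpoint serves plain HTTP

// Listing the bucket root exercises the same doesBucketExist / HTTP path the job is stuck on.
val fs = FileSystem.get(new URI("s3a://test-bucket/"), conf)
fs.listStatus(new Path("s3a://test-bucket/")).foreach(status => println(status.getPath))

If this also fails with "Connection refused", the problem is likely network reachability or a port/TLS mismatch between the Spark host and the custom endpoint rather than the Spark code itself.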
The new sock-puppet account sending the last few emails has been banned