Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
Thanks Burak! Appreciate it. This makes sense.

How do you suggest we make sure the resulting data doesn't end up as tiny files
if we are not on Databricks yet and cannot leverage Delta Lake features?
Also, regarding the checkpointing feature, do you have an active blog/article I
can take a look at to try out an example?

On Fri, May 1, 2020 at 7:22 PM Burak Yavuz  wrote:

> Hi Rishi,
>
> That is exactly why Trigger.Once was created for Structured Streaming. The
> way we look at streaming is that it doesn't have to always be real-time or
> 24/7 always-on. We see streaming as a workflow that you have to repeat
> indefinitely. See this blog post for more details!
>
> https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
>
> Best,
> Burak
>
> On Fri, May 1, 2020 at 2:55 PM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> I recently started playing with spark streaming, and the checkpoint location
>> feature looks very promising. I wonder if anyone has an opinion about using
>> spark streaming with the checkpoint location option as a slow batch processing
>> solution. What would be the pros and cons of utilizing streaming with the
>> checkpoint location feature to achieve fault tolerance in a batch processing
>> application?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>

-- 
Regards,

Rishi Shah


Path style access fs.s3a.path.style.access property is not working in spark code

2020-05-01 Thread Aniruddha P Tekade
Hello Users,

I am using on-premise object storage and am able to perform operations on
different buckets using the aws-cli.
However, when I try to use the same path from my spark code, it
fails. Here are the details -

Added dependencies in build.sbt -

   - hadoop-aws-2.7.4.jar
   - aws-java-sdk-1.7.4.jar

Spark Hadoop configuration is set up as -

spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT);
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY);
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY);
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

And now I try to write data into my custom s3 endpoint as follows -

val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
  dayofmonth(current_date()) as "day",
  month(current_date()) as "month",
  year(current_date()) as "year",
  column("time"),
  column("quality"),
  column("PM25"))
  .writeStream
  .partitionBy("year", "month", "day")
  .format("csv")
  .outputMode("append")
  .option("path",  "s3a://test-bucket/")
val streamingQuery: StreamingQuery = dataStreamWriter.start()


However, I am getting an error that AmazonHttpClient is not able to execute
the HTTP request, and it is also prepending the bucket name to the endpoint
hostname. It seems like the hadoop configuration is not being picked up
here -


20/05/01 16:51:37 INFO AmazonHttpClient: Unable to execute HTTP request:
test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com


Is there anything that I am missing here in the configuration? It seems that
even after setting path style access to true,
it's not working.

--
Aniruddha


Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz
Hi Rishi,

That is exactly why Trigger.Once was created for Structured Streaming. The
way we look at streaming is that it doesn't have to always be real-time or
24/7 always-on. We see streaming as a workflow that you have to repeat
indefinitely. See this blog post for more details!
https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
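
For illustration, a minimal sketch of a Trigger.Once job driven by a checkpoint
location (the paths and the schema source below are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-once-example").getOrCreate()

// The file source reads only files the checkpoint has not seen yet. A schema
// must be supplied for a file stream; here it is taken from a placeholder
// batch read of the same directory.
val input = spark.readStream
  .format("parquet")
  .schema(spark.read.parquet("/data/landing").schema)
  .load("/data/landing")

// Trigger.Once processes everything available since the last run and then
// stops, so the query can be scheduled like an ordinary batch job while the
// checkpoint keeps track of progress across runs.
val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/landing")
  .option("path", "/data/output")
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()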

Best,
Burak

On Fri, May 1, 2020 at 2:55 PM Rishi Shah  wrote:

> Hi All,
>
> I recently started playing with spark streaming, and the checkpoint location
> feature looks very promising. I wonder if anyone has an opinion about using
> spark streaming with the checkpoint location option as a slow batch processing
> solution. What would be the pros and cons of utilizing streaming with the
> checkpoint location feature to achieve fault tolerance in a batch processing
> application?
>
> --
> Regards,
>
> Rishi Shah
>


[spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
Hi All,

I recently started playing with spark streaming, and the checkpoint location
feature looks very promising. I wonder if anyone has an opinion about using
spark streaming with the checkpoint location option as a slow batch processing
solution. What would be the pros and cons of utilizing streaming with the
checkpoint location feature to achieve fault tolerance in a batch processing
application?

-- 
Regards,

Rishi Shah


Re: Spark job stuck at s3a-file-system metrics system started

2020-05-01 Thread Gourav Sengupta
Hi,

I think that we should stop using S3a and use S3.

Please refer to EMRFS and the fantastic advantages it provides :)
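
For illustration, a minimal sketch of what the write could look like on an EMR
cluster, where EMRFS serves s3:// URIs natively (the bucket and paths below are
placeholders, and PM25quality is the streaming DataFrame from the quoted
message); none of the fs.s3a.* settings would be needed in that case:

// Assuming an EMR cluster: EMRFS handles s3:// paths directly, so no
// fs.s3a.* endpoint or credential configuration is required here.
val query = PM25quality.writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://example-bucket/checkpoints/pm25/")
  .option("path", "s3://example-bucket/data/pm25/")
  .outputMode("append")
  .start()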


Regards,
Gourav Sengupta

On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade 
wrote:

> Hello,
>
> I am trying to run a spark job that is trying to write the data into a
> custom s3 endpoint bucket. But I am stuck at this line of output and job is
> not moving forward at all -
>
> 20/04/29 16:03:59 INFO SharedState: Setting hive.metastore.warehouse.dir 
> ('null') to the value of spark.sql.warehouse.dir 
> ('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
> 20/04/29 16:03:59 INFO SharedState: Warehouse path is 
> 'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
> 20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 
> 10 second(s).
> 20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system 
> started
>
> After long time of waiting it shows this -
>
> org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on 
> test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP 
> request: Connect to s3-region0.mycloud.com:443 
> [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection 
> refused): Unable to execute HTTP request: Connect to 
> s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed: 
> Connection refused (Connection refused)
>
> However, I am able to access this bucket from aws cli from the same
> machine. I don't understand why it is saying not able to execute the HTTP
> request.
>
> I am using -
>
> spark   3.0.0-preview2
> hadoop-aws  3.2.0
> aws-java-sdk-bundle 1.11.375
>
> My spark code has following properties set for hadoop configuration -
>
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT);
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY);
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY);
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
>
> Can someone help me understand what is wrong here? Is there anything
> else I need to configure? The custom s3 endpoint and its keys are valid and
> working from the aws cli profile. What is wrong with the Scala code here?
>
> val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
>   dayofmonth(current_date()) as "day",
>   month(current_date()) as "month",
>   year(current_date()) as "year")
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
>   .outputMode("append")
>   .trigger(Trigger.ProcessingTime("15 seconds"))
>   .partitionBy("year", "month", "day")
>   .option("path", "s3a://test-bucket")
>
> val streamingQuery: StreamingQuery = dataStreamWriter.start()
>
> Aniruddha
>


The new sock-puppet account sending the last few emails has been banned

2020-05-01 Thread Sean Owen






You shook hands with butchers of Gujarat now you are locked same as kashmir

2020-05-01 Thread Nelson Mandela
With no end in sight.





Would Nelson Mandela work and make money while his people suffered from apartheid. You all do it.

2020-05-01 Thread Nelson Mandela
NO SHAME




Hey crazy natzi Sean Owen do your job you incompetent useless pratt. You wrote you "subscribed " for this

2020-05-01 Thread Nelson Mandela






Have you paid your bug bounty or did you log him off without paying

2020-05-01 Thread Nelson Mandela






TRUMP: clean hindutwa with an injection of DETTOL then grabbed the pussy in the locker room

2020-05-01 Thread Nelson Mandela






Subscribe

2020-05-01 Thread Nelson Mandela


