[apache-spark] [spark-r] 503 Error - Cannot Connect to S3

2020-10-05 Thread Khatri, Faysal
Hello- I am attempting to use SparkR to read in a parquet file from S3. The exact same operation succeeds using PySpark - but I get a 503 error using SparkR. In fact, I get the 503 even if I use a bad endpoint or bad credentials. It's as if Spark isn't even trying to make the HTTP request.
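[If it helps to rule out configuration: below is a minimal Scala sketch of the S3A settings Spark needs before any request is made; the endpoint, bucket, and path are placeholders, and hadoop-aws must be on the driver classpath. The same spark.hadoop.fs.s3a.* properties can be passed to sparkR.session via its sparkConfig argument. If Spark never even attempts the HTTP call, the connector setup is a likely suspect.]

    // Minimal S3A sanity check (Scala sketch; endpoint/bucket are placeholders).
    // The same spark.hadoop.fs.s3a.* properties apply in SparkR via
    // sparkR.session(sparkConfig = list(...)).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-check")
      .master("local[*]")
      .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")   // placeholder
      .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
      .getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/path/to/data.parquet")       // placeholder path
    df.show(5)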

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
Sure. My point was that Delta Lake is also a 3rd-party library, and there's no way for Apache Spark to do that. Delta Lake has its own group, and the request is better directed there. On Mon, Oct 5, 2020 at 9:54 PM Enrico Minack wrote: > Though spark.read. refers to "built-in" data

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
Replied inline. On Tue, Oct 6, 2020 at 6:07 AM Sergey Oboguev wrote: > Hi Jungtaek, > > Thanks for your response. > > *> you'd want to dive inside the checkpoint directory and have separate > numbers per top-subdirectory* > > All the checkpoint store numbers are solely for the subdirectory set

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Sergey Oboguev
Hi Jungtaek, Thanks for your response. *> you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory* All the checkpoint store numbers are solely for the subdirectory set by option("checkpointLocation", .. checkpoint dir for writer ... ) Other
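[For readers following along, a minimal sketch of where that directory is set; paths are placeholders and the rate source stands in for the real input:]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ckpt").getOrCreate()
    val df = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    val query = df.writeStream
      .format("parquet")
      .option("path", "/tmp/out")                       // placeholder output dir
      .option("checkpointLocation", "/tmp/checkpoint")  // the directory the numbers refer to
      .start()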

Re: Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Jainshasha, I need to read each row from the DataFrame and make some changes to it before inserting it into ES. Thanks Siva On Mon, Oct 5, 2020 at 8:06 PM jainshasha wrote: > Hi Siva > > To emit data into ES using a Spark structured streaming job you need to use the > Elasticsearch jar which has
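[A minimal sketch of that pattern: per-row validation and payload changes expressed as ordinary DataFrame transformations before the sink; the rate source and column names are placeholders:]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("row-changes").getOrCreate()
    val stream = spark.readStream.format("rate").load()  // stand-in for the real source

    val transformed = stream
      .withColumn("payload", concat(lit("event-"), col("value").cast("string")))  // illustrative change
      .filter(col("value") % 2 === 0)                                             // illustrative validation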

Re: Spark Streaming ElasticSearch

2020-10-05 Thread jainshasha
Hi Siva To emit data into ES using a Spark structured streaming job you need to use the Elasticsearch jar which has sink support for Spark structured streaming jobs. For this you can use my branch, where we have integrated ES, compatible with Spark 3.0 and Scala 2.12
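[A hedged sketch of such a sink, assuming the elasticsearch-spark (ES-Hadoop) connector is on the classpath; the host, index name, and checkpoint path are placeholders:]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("es-sink").getOrCreate()
    val events = spark.readStream.format("rate").load()  // stand-in for the prepared stream

    val query = events.writeStream
      .format("es")                                  // sink registered by elasticsearch-spark
      .option("es.nodes", "localhost")               // placeholder host
      .option("checkpointLocation", "/tmp/es-ckpt")  // placeholder
      .start("my-index")                             // placeholder target index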

Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Team, I have a Spark streaming job which will read from Kafka and write into Elastic via HTTP request. I want to validate each request from Kafka, change the payload as per business needs, and write into Elasticsearch. I have used ES HTTP requests to push the data into Elasticsearch. Can
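[A minimal sketch of the front half of that pipeline, assuming the spark-sql-kafka connector; broker, topic, and field names are placeholders:]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("kafka-to-es").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()

    // Kafka delivers bytes; cast to string, then validate/reshape the payload
    val payload = raw.selectExpr("CAST(value AS STRING) AS json")
      .filter(col("json").isNotNull)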

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Enrico Minack
Though spark.read. refers to "built-in" data sources, there is nothing that prevents 3rd-party libraries from "extending" spark.read in Scala or Python. As users know the Spark way to read built-in data sources, it feels natural to hook 3rd-party data sources under the same scheme, to give users a
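[A sketch of that idea in Scala: an implicit class adds a delta(...) method to DataFrameReader/DataFrameWriter so callers get the familiar spark.read.delta(...) shape. Names here are illustrative; Delta Lake itself ships a similar shim in io.delta.implicits, if memory serves.]

    import org.apache.spark.sql.{DataFrame, DataFrameReader, DataFrameWriter}

    object DeltaImplicits {
      implicit class RichReader(val reader: DataFrameReader) extends AnyVal {
        def delta(path: String): DataFrame = reader.format("delta").load(path)
      }
      implicit class RichWriter[T](val writer: DataFrameWriter[T]) extends AnyVal {
        def delta(path: String): Unit = writer.format("delta").save(path)
      }
    }

    // usage:
    //   import DeltaImplicits._
    //   val df = spark.read.delta("/path/to/table")
    //   df.write.delta("/path/to/table")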

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
First of all, you'd want to divide these numbers by the number of micro-batches, as file creations in checkpoint directory would occur similarly per micro-batch. Second, you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory. After that we can see
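[One way to get those per-subdirectory numbers, sketched with the Hadoop FileSystem API; the checkpoint path is a placeholder:]

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ckpt-stats").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val root = new Path("/tmp/checkpoint")  // placeholder checkpoint dir

    // Count files under each top-level subdirectory (offsets, commits, state, ...)
    for (sub <- fs.listStatus(root) if sub.isDirectory) {
      val files = fs.listFiles(sub.getPath, true)  // recursive listing
      var n = 0L
      while (files.hasNext) { files.next(); n += 1 }
      println(s"${sub.getPath.getName}: $n files")
    }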

Re: Arbitrary stateful aggregation: updating state without setting timeout

2020-10-05 Thread Jungtaek Lim
Hi, that's not explained in the SS guide doc, but it is explained in the Scala API doc: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html The statement quoted from the Scala API doc answers your question. The timeout is reset every time the function is
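[A minimal sketch of what the doc implies, with illustrative names (Event, SessionState): because the timeout is reset on every invocation, setTimeoutDuration has to be called again each time; state.update() alone does not re-arm it.]

    import org.apache.spark.sql.streaming.GroupState

    case class Event(user: String, value: Long)
    case class SessionState(count: Long)

    // Invoked once per key per trigger (and on timeout)
    def updateState(user: String, events: Iterator[Event],
                    state: GroupState[SessionState]): Iterator[SessionState] = {
      val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.size)
      state.update(updated)
      state.setTimeoutDuration("30 seconds")  // must be re-set on every call
      Iterator(updated)
    }

    // wired up with, e.g.:
    //   ds.groupByKey(_.user).flatMapGroupsWithState(
    //     OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout)(updateState)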

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
Hi, "spark.read." is a "shorthand" for "built-in" data sources, not for external data sources. spark.read.format() is still an official way to use it. Delta Lake is not included in Apache Spark so that is indeed not possible for Spark to refer to. Starting from Spark 3.0, the concept of

[Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Moser, Michael
Hi there, I'm just wondering if there is any incentive to implement read/write methods in the DataFrameReader/DataFrameWriter for delta similar to e.g. parquet? For example, using PySpark, "spark.read.parquet" is available, but "spark.read.delta" is not (same for write). In my opinion,

Reading BigQuery data from Spark in Google Dataproc

2020-10-05 Thread Mich Talebzadeh
Hi, I have tested a few JDBC BigQuery providers like Progress Direct and Simba, but none of them seem to work properly through Spark. The only way I can read and write to BigQuery is through the Spark BigQuery API, using the following scenario: spark-shell
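[The scenario presumably relies on Google's spark-bigquery-connector (preinstalled on Dataproc); a hedged sketch with placeholder project/dataset/table/bucket names:]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bq").getOrCreate()

    // read
    val df = spark.read
      .format("bigquery")
      .option("table", "my_project.my_dataset.my_table")
      .load()

    // write (the indirect method stages data through a GCS bucket)
    df.write
      .format("bigquery")
      .option("table", "my_project.my_dataset.my_output")
      .option("temporaryGcsBucket", "my-temp-bucket")
      .save()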

Arbitrary stateful aggregation: updating state without setting timeout

2020-10-05 Thread יורי אולייניקוב
Hi all, I have the following question: what happens to the state (in terms of expiration) if I'm updating the state without setting a timeout? E.g. in FlatMapGroupsWithStateFunction:

1. first batch: state.update(myObj); state.setTimeoutDuration(timeout)
2. second batch: state.update(myObj)