[apache-spark] [spark-r] 503 Error - Cannot Connect to S3

2020-10-05 Thread Khatri, Faysal
Hello- I am attempting to use SparkR to read in a parquet file from S3. The exact same operation succeeds using PySpark - but I get a 503 error using SparkR. In fact, I get the 503 even if I use a bad endpoint or bad credentials. It's as if Spark isn't even trying to make the HTTP request.
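[If it helps to rule out configuration: below is a minimal Scala sketch of the S3A settings Spark needs before any request is made; the endpoint, bucket, and path are placeholders, and hadoop-aws must be on the driver classpath. The same spark.hadoop.fs.s3a.* properties can be passed to sparkR.session via its sparkConfig argument. If Spark never even attempts the HTTP call, the connector setup is a likely suspect.]

    // Minimal S3A sanity check (Scala sketch; endpoint/bucket are placeholders).
    // The same spark.hadoop.fs.s3a.* properties apply in SparkR via
    // sparkR.session(sparkConfig = list(...)).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-check")
      .master("local[*]")
      .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")   // placeholder
      .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
      .getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/path/to/data.parquet")       // placeholder path
    df.show(5)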

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
Sure. My point was that Delta Lake is also a 3rd-party library, and there's no way for Apache Spark to do that. Delta Lake has its own group, and the request is better directed there. On Mon, Oct 5, 2020 at 9:54 PM Enrico Minack wrote: > Though spark.read. refers to "built-in" data

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
Replied inline. On Tue, Oct 6, 2020 at 6:07 AM Sergey Oboguev wrote: > Hi Jungtaek, > > Thanks for your response. > > *> you'd want to dive inside the checkpoint directory and have separate > numbers per top-subdirectory* > > All the checkpoint store numbers are solely for the subdirectory set

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Sergey Oboguev
Hi Jungtaek, Thanks for your response. *> you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory* All the checkpoint store numbers are solely for the subdirectory set by option("checkpointLocation", .. checkpoint dir for writer ... ) Other
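[For readers following along, a minimal sketch of where that directory is set; paths are placeholders and the rate source stands in for the real input:]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ckpt").getOrCreate()
    val df = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    val query = df.writeStream
      .format("parquet")
      .option("path", "/tmp/out")                       // placeholder output dir
      .option("checkpointLocation", "/tmp/checkpoint")  // the directory the numbers refer to
      .start()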

Re: Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Jainshasha, I need to read each row from the DataFrame and make some changes to it before inserting it into ES. Thanks Siva On Mon, Oct 5, 2020 at 8:06 PM jainshasha wrote: > Hi Siva > > To emit data into ES using a Spark structured streaming job you need to use the > Elasticsearch jar which has
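[A minimal sketch of that pattern: per-row validation and payload changes expressed as ordinary DataFrame transformations before the sink; the rate source and column names are placeholders:]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("row-changes").getOrCreate()
    val stream = spark.readStream.format("rate").load()  // stand-in for the real source

    val transformed = stream
      .withColumn("payload", concat(lit("event-"), col("value").cast("string")))  // illustrative change
      .filter(col("value") % 2 === 0)                                             // illustrative validation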

Re: Spark Streaming ElasticSearch

2020-10-05 Thread jainshasha
Hi Siva To emit data into ES using a Spark structured streaming job you need to use the Elasticsearch jar which has sink support for Spark structured streaming jobs. For this you can use my branch, where we have integrated ES, compatible with Spark 3.0 and Scala 2.12
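[A hedged sketch of such a sink, assuming the elasticsearch-spark (ES-Hadoop) connector is on the classpath; the host, index name, and checkpoint path are placeholders:]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("es-sink").getOrCreate()
    val events = spark.readStream.format("rate").load()  // stand-in for the prepared stream

    val query = events.writeStream
      .format("es")                                  // sink registered by elasticsearch-spark
      .option("es.nodes", "localhost")               // placeholder host
      .option("checkpointLocation", "/tmp/es-ckpt")  // placeholder
      .start("my-index")                             // placeholder target index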

Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Team, I have a Spark streaming job which will read from Kafka and write into Elastic via HTTP request. I want to validate each request from Kafka, change the payload as per business needs, and write into Elasticsearch. I have used ES HTTP requests to push the data into Elasticsearch. Can
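[A minimal sketch of the front half of that pipeline, assuming the spark-sql-kafka connector; broker, topic, and field names are placeholders:]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("kafka-to-es").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()

    // Kafka delivers bytes; cast to string, then validate/reshape the payload
    val payload = raw.selectExpr("CAST(value AS STRING) AS json")
      .filter(col("json").isNotNull)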

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Enrico Minack
Though spark.read. refers to "built-in" data sources, there is nothing that prevents 3rd-party libraries from "extending" spark.read in Scala or Python. As users know the Spark way to read built-in data sources, it feels natural to hook 3rd-party data sources under the same scheme, to give users a
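[A sketch of that idea in Scala: an implicit class adds a delta(...) method to DataFrameReader/DataFrameWriter so callers get the familiar spark.read.delta(...) shape. Names here are illustrative; Delta Lake itself ships a similar shim in io.delta.implicits, if memory serves.]

    import org.apache.spark.sql.{DataFrame, DataFrameReader, DataFrameWriter}

    object DeltaImplicits {
      implicit class RichReader(val reader: DataFrameReader) extends AnyVal {
        def delta(path: String): DataFrame = reader.format("delta").load(path)
      }
      implicit class RichWriter[T](val writer: DataFrameWriter[T]) extends AnyVal {
        def delta(path: String): Unit = writer.format("delta").save(path)
      }
    }

    // usage:
    //   import DeltaImplicits._
    //   val df = spark.read.delta("/path/to/table")
    //   df.write.delta("/path/to/table")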

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
First of all, you'd want to divide these numbers by the number of micro-batches, as file creations in checkpoint directory would occur similarly per micro-batch. Second, you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory. After that we can see
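[One way to get those per-subdirectory numbers, sketched with the Hadoop FileSystem API; the checkpoint path is a placeholder:]

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ckpt-stats").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val root = new Path("/tmp/checkpoint")  // placeholder checkpoint dir

    // Count files under each top-level subdirectory (offsets, commits, state, ...)
    for (sub <- fs.listStatus(root) if sub.isDirectory) {
      val files = fs.listFiles(sub.getPath, true)  // recursive listing
      var n = 0L
      while (files.hasNext) { files.next(); n += 1 }
      println(s"${sub.getPath.getName}: $n files")
    }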

Re: Arbitrary stateful aggregation: updating state without setting timeout

2020-10-05 Thread Jungtaek Lim
Hi, that's not explained in the SS guide doc, but it is explained in the Scala API doc: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html The statement quoted from the Scala API doc answers your question. The timeout is reset every time the function is
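[A minimal sketch of what the doc implies, with illustrative names (Event, SessionState): because the timeout is reset on every invocation, setTimeoutDuration has to be called again each time; state.update() alone does not re-arm it.]

    import org.apache.spark.sql.streaming.GroupState

    case class Event(user: String, value: Long)
    case class SessionState(count: Long)

    // Invoked once per key per trigger (and on timeout)
    def updateState(user: String, events: Iterator[Event],
                    state: GroupState[SessionState]): Iterator[SessionState] = {
      val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.size)
      state.update(updated)
      state.setTimeoutDuration("30 seconds")  // must be re-set on every call
      Iterator(updated)
    }

    // wired up with, e.g.:
    //   ds.groupByKey(_.user).flatMapGroupsWithState(
    //     OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout)(updateState)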

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
Hi, "spark.read." is a "shorthand" for "built-in" data sources, not for external data sources. spark.read.format() is still an official way to use it. Delta Lake is not included in Apache Spark so that is indeed not possible for Spark to refer to. Starting from Spark 3.0, the concept of

[Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Moser, Michael
Hi there, I'm just wondering if there is any incentive to implement read/write methods in the DataFrameReader/DataFrameWriter for delta similar to e.g. parquet? For example, using PySpark, "spark.read.parquet" is available, but "spark.read.delta" is not (same for write). In my opinion,

Reading BigQuery data from Spark in Google Dataproc

2020-10-05 Thread Mich Talebzadeh
Hi, I have tested a few JDBC BigQuery providers like Progress Direct and Simba, but none of them seem to work properly through Spark. The only way I can read and write to BigQuery is through the Spark BigQuery API, using the following scenario: spark-shell
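[The scenario presumably relies on Google's spark-bigquery-connector (preinstalled on Dataproc); a hedged sketch with placeholder project/dataset/table/bucket names:]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bq").getOrCreate()

    // read
    val df = spark.read
      .format("bigquery")
      .option("table", "my_project.my_dataset.my_table")
      .load()

    // write (the indirect method stages data through a GCS bucket)
    df.write
      .format("bigquery")
      .option("table", "my_project.my_dataset.my_output")
      .option("temporaryGcsBucket", "my-temp-bucket")
      .save()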

Arbitrary stateful aggregation: updating state without setting timeout

2020-10-05 Thread יורי אולייניקוב
Hi all, I have the following question: what happens to the state (in terms of expiration) if I'm updating the state without setting a timeout? E.g. in FlatMapGroupsWithStateFunction:

1. first batch: state.update(myObj); state.setTimeoutDuration(timeout)
2. second batch: state.update(myObj)