updateBytesRead()

2019-03-01 Thread swastik mittal
Hi, in the Spark source code (Hadoop.scala, in the RDD package), Spark updates the total bytes read after every 1000 records. Displaying the bytes read alongside the update function, it shows 65536. Even if I change the code to update the bytes read after every record, it still shows 65536 multiple
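A minimal sketch of the pattern being described, in case it helps frame the question. All names here are illustrative, not Spark's actual internals: the bytes-read value comes from a callback that is only polled every N records, and what the callback reports is typically whatever the underlying input stream has buffered, which tends to be a multiple of the read buffer size (e.g. 65536 bytes) rather than per-record byte counts.

    // Illustrative sketch only (hypothetical names, not the HadoopRDD code).
    class InputMetricsSketch(getBytesRead: () => Long, updateInterval: Int = 1000) {
      private var recordsRead = 0L
      private var bytesRead = 0L

      def recordRead(): Unit = {
        recordsRead += 1
        // The metric is refreshed only once per interval, not per record.
        if (recordsRead % updateInterval == 0) updateBytesRead()
      }

      def updateBytesRead(): Unit = {
        // The callback reflects what the input stream has read/buffered so far.
        bytesRead = getBytesRead()
      }

      def currentBytesRead: Long = bytesRead
    }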

SPARK Streaming Graphs

2019-03-01 Thread Gourav Sengupta
Hi, earlier versions of the Spark UI had the streaming stats, but for some reason they were discontinued in later open source releases, though Databricks does provide them internally. I would be grateful if users could kindly share their learnings and approaches for showing these graphs with Structured Streaming.
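One approach that works with Structured Streaming is a StreamingQueryListener that captures per-batch progress, which can then be fed to whatever charting or metrics tool you prefer. A minimal sketch, assuming an existing SparkSession named spark (the println is a stand-in for your metrics sink):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Collects the numbers behind the old streaming UI graphs
    // (input rate, processing rate, batch duration) for external charting.
    val progressListener = new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // Replace println with a write to a time-series store of your choice.
        println(s"batch=${p.batchId} inputRows/s=${p.inputRowsPerSecond} " +
          s"processedRows/s=${p.processedRowsPerSecond} durationMs=${p.durationMs}")
      }
    }

    spark.streams.addListener(progressListener)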

Re: [Spark SQL]: sql.DataFrame.replace to accept regexp

2019-03-01 Thread Gourav Sengupta
Hi, why not just use regexp_replace(), unless there is an attachment to the function replace of course, which is quite understandable :) Regards, Gourav On Fri, Mar 1, 2019 at 2:39 PM Richard Garris wrote: > You can file a feature request at > > https://issues.apache.org/jira/projects/SPARK/
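For reference, a minimal sketch of the regexp_replace() approach; the column name and pattern are just examples, and df is assumed to be an existing DataFrame:

    import org.apache.spark.sql.functions.{col, regexp_replace}

    // Strip anything that is not a digit from a hypothetical "phone" column.
    val cleaned = df.withColumn("phone", regexp_replace(col("phone"), "[^0-9]", ""))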

Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Tomasz Krol
Yeah, it seems like the option of making emptyDir larger is something that we need to consider. Cheers Tomasz Krol On Fri, 1 Mar 2019 at 19:30, Matt Cheah wrote: > Ah I see: We always force the local directory to use emptyDir and it > cannot be configured to use any other volume type. See

Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Matt Cheah
Ah I see: We always force the local directory to use emptyDir, and it cannot be configured to use any other volume type. See here. I am a bit conflicted on this. On one hand, it makes sense to allow users to mount their own volumes to handle spill data. On the other hand, I

Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Tomasz Krol
Hi Matt, Thanks for coming back to me. Yeah, that doesn't work. Basically, in the properties I set the volume and mount point as below: spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
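For context, a sketch of the full set of volume properties being discussed, expressed as SparkConf settings. The claim name "checkpoint-pvc" is a placeholder, and, as noted elsewhere in this thread, Spark still forces its local/spill directories onto an emptyDir volume, so mounting a PVC alone does not redirect spill data:

    // Sketch of the k8s PVC volume mount configuration (placeholder claim name).
    val conf = new org.apache.spark.SparkConf()
      .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path", "/checkpoint")
      .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.options.claimName", "checkpoint-pvc")
      .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path", "/checkpoint")
      .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.options.claimName", "checkpoint-pvc")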

Is there a way to validate the syntax of raw spark sql query?

2019-03-01 Thread kant kodali
Hi All, Is there a way to validate the syntax of a raw Spark SQL query? For example, I would like to know if there is any isValid API call Spark provides? val query = "select * from table"; if (isValid(query)) { sparkSession.sql(query) } else { log.error("Invalid Syntax") } I tried the
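There is no public isValid call that I am aware of, but one commonly suggested workaround is to run the query through the session's SQL parser and treat a parse exception as a syntax failure. A rough sketch, assuming an existing SparkSession named spark (sessionState is an internal/unstable API, so treat this as an assumption rather than a supported interface):

    import scala.util.Try

    // Parse (but do not execute) the query; a parse failure indicates bad syntax.
    // Note: this checks syntax only, not whether the referenced tables or columns exist.
    def isValidSyntax(spark: org.apache.spark.sql.SparkSession, query: String): Boolean =
      Try(spark.sessionState.sqlParser.parsePlan(query)).isSuccess

    val query = "select * from table"
    if (isValidSyntax(spark, query)) spark.sql(query) else println(s"Invalid syntax: $query")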

Re: Spark 2.4 Structured Streaming Kafka assign API polling same offsets

2019-03-01 Thread Kristopher Kane
I figured out why. We are not persisting the data after .load(), so every operation like count() goes back to Kafka for the data again. On Fri, Mar 1, 2019 at 10:10 AM Kristopher Kane wrote: > > We are using the assign API to do batch work with Spark and Kafka. > What I'm seeing
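For anyone hitting the same behaviour, a minimal sketch of the fix being described, assuming df is the DataFrame returned by the Kafka batch read in the original post:

    // Persisting after load() keeps the batch data around, so actions such as
    // count() or a write do not trigger another round of polling the same offsets.
    val cached = df.persist()
    val total = cached.count()   // Kafka is read once; later actions reuse the cached data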

Spark 2.4 Structured Streaming Kafka assign API polling same offsets

2019-03-01 Thread Kristopher Kane
We are using the assign API to do batch work with Spark and Kafka. What I'm seeing is the Spark executor work happening in the background, constantly polling the same data over and over until the main thread commits the offsets. Is the below a blocking operation? Dataset df =
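For reference, a hedged sketch of what a batch read with the assign option typically looks like, assuming an existing SparkSession named spark; broker, topic, and partition numbers are placeholders:

    // Batch (non-streaming) Kafka read using explicit partition assignment.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")     // placeholder broker
      .option("assign", """{"my-topic":[0,1]}""")            // placeholder topic/partitions
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()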

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-03-01 Thread Gabor Somogyi
> I am thinking of writing out the dfKV dataframe to disk and then using Avro APIs to read the data. Ping me if you have something, I'm planning similar things... On Thu, Feb 28, 2019 at 5:27 PM Hien Luu wrote: > Thanks for the answer. > > As far as the next step goes, I am thinking of writing
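Not a full answer, but a sketch of the "write to disk, read back with Avro" idea using the external Avro data source in Spark 2.4; dfKV is the dataframe from the thread and the path is a placeholder:

    // Requires the Avro module, e.g. launched with:
    //   --packages org.apache.spark:spark-avro_2.11:2.4.0
    dfKV.write.format("avro").save("/tmp/dfkv-avro")               // placeholder path
    val readBack = spark.read.format("avro").load("/tmp/dfkv-avro")
    readBack.printSchema()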

Re: [Spark SQL]: sql.DataFrame.replace to accept regexp

2019-03-01 Thread Richard Garris
You can file a feature request at https://issues.apache.org/jira/projects/SPARK/ As a workaround, you can create a user-defined function like so:
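The original code was cut off in the archive; a minimal sketch of such a user-defined function, assuming the goal is regex-based replacement on a string column (column name and pattern are illustrative, df is an existing DataFrame):

    import org.apache.spark.sql.functions.{col, lit, udf}

    // Hypothetical UDF applying a regex replacement, as a stop-gap until
    // DataFrame.replace accepts regular expressions directly.
    val regexReplace = udf { (value: String, pattern: String, replacement: String) =>
      if (value == null) null else value.replaceAll(pattern, replacement)
    }

    val fixed = df.withColumn("col1", regexReplace(col("col1"), lit("[^0-9]"), lit("")))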

Spark Streaming loading kafka source value column type

2019-03-01 Thread oskarryn
Hi, Why is the `value` column in a streaming dataframe obtained from a Kafka topic natively of binary type (see the table at https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) if in fact it holds a string with the message's data and we CAST it to string anyway?
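For reference, the cast in question typically looks like this, assuming df is the DataFrame returned by the Kafka source:

    // The Kafka source exposes key and value as binary because a payload can be
    // any serialization format; casting makes the string interpretation explicit.
    val strings = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")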

spark.python.worker.memory VS spark.executor.pyspark.memory

2019-03-01 Thread Andrey Dudin
Hello all. What is the difference between spark.python.worker.memory and spark.executor.pyspark.memory?

[Spark SQL]: sql.DataFrame.replace to accept regexp

2019-03-01 Thread Nuno Silva
Hi, Not sure if I'm delivering my request through the right channel: would it be possible for sql.DataFrame.replace to accept regular expressions, like pandas.DataFrame.replace does? Thank you, Nuno