Re: Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Jungtaek Lim
As Spark uses micro-batches for streaming, it's unavoidable to tune the batch size properly to achieve your desired trade-off of throughput vs. latency. Especially since Spark uses a global watermark, which doesn't propagate (change) during a micro-batch, you'd want to make the batch relatively small to make
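The tuning knob under discussion can be sketched with a Kafka source in Structured Streaming; the broker address and topic name here are hypothetical placeholders, not from the thread:

```scala
// Sketch: cap each micro-batch at 1M offsets so batches stay small and the
// global watermark gets a chance to advance more often between batches.
// Broker and topic names are placeholders.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", 1000000) // per-trigger cap across all partitions
  .load()
```

A smaller cap trades peak throughput for lower end-to-end latency, since each batch (and its watermark update) completes sooner.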

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Let me share the IPython notebook. On Tue, Jun 30, 2020 at 11:18 AM Gourav Sengupta wrote: > Hi, > > I think that the notebook clearly demonstrates that setting the > inferTimestamp option to False does not really help. > > Is it really impossible for you to show how your own data can be

Re: unsubscribe

2020-06-30 Thread Jeff Evans
That is not how you unsubscribe. See here for instructions: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Tue, Jun 30, 2020 at 1:31 PM Bartłomiej Niemienionek < b.niemienio...@gmail.com> wrote: >

unsubscribe

2020-06-30 Thread Bartłomiej Niemienionek

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi, I think that the notebook clearly demonstrates that setting the inferTimestamp option to False does not really help. Is it really impossible for you to show how your own data can be loaded? It should be simple, just open the notebook and see why the exact code you have given does not work,

spark on kubernetes client mode

2020-06-30 Thread Pradeepta Choudhury
Hi team, I am working on Spark on Kubernetes and have a scenario where I need to use Spark on Kubernetes in client mode from a Jupyter notebook across two different Kubernetes clusters. Is it possible in client mode to spin up the driver in one k8s cluster and the executors in another k8s cluster?
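For context, a minimal client-mode submission against a single cluster looks roughly like the sketch below (master URL, image, namespace, and app file are placeholders). In client mode the executors must be able to reach the driver at `spark.driver.host`, which is the main obstacle to splitting driver and executors across two clusters:

```shell
# Sketch: client-mode launch against one k8s cluster; all values are placeholders.
spark-submit \
  --master k8s://https://my-cluster-apiserver:6443 \
  --deploy-mode client \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=my-repo/spark:3.0.0 \
  --conf spark.driver.host=driver-svc.spark.svc.cluster.local \
  --conf spark.executor.instances=2 \
  my_app.py
```

Spanning two clusters would require cross-cluster network routing so that executor pods in cluster B can connect back to the driver's host/port in cluster A.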

Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Eric Beabes
While running my Spark (Stateful) Structured Streaming job I am setting 'maxOffsetsPerTrigger' value to 10 Million. I've noticed that messages are processed faster if I use a large value for this property. What I am also noticing is that until the batch is completely processed, no messages are

Re: Metrics Problem

2020-06-30 Thread Srinivas V
Then it should be a permission issue. What kind of cluster is it, and which user is running it? Does that user have HDFS permissions to access the folder where the jar file is? On Mon, Jun 29, 2020 at 1:17 AM Bryan Jeffrey wrote: > Srinivas, > > Interestingly, I did have the metrics jar packaged as
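Checking whether the submitting user can actually read the metrics jar can be sketched with plain HDFS commands; the path below is a hypothetical placeholder:

```shell
# Sketch: inspect ownership and permissions on the jar's directory (placeholder path).
hdfs dfs -ls /user/spark/jars/

# If the submitting user lacks read access, widen permissions
# (or chown the path to that user instead):
hdfs dfs -chmod -R o+rx /user/spark/jars/
```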

Re: XmlReader not Parsing the Nested elements in XML properly

2020-06-30 Thread Sean Owen
This is more a question about spark-xml, which is not part of Spark. You can ask at https://github.com/databricks/spark-xml/ but if you do please show some example of the XML input and schema and output. On Tue, Jun 30, 2020 at 11:39 AM mars76 wrote: > > Hi, > > I am trying to read XML data

XmlReader not Parsing the Nested elements in XML properly

2020-06-30 Thread mars76
Hi, I am trying to read XML data from a Kafka topic and am using XmlReader to convert the RDD[String] into a DataFrame conforming to a predefined schema. One issue I am running into is that after saving the final DataFrame to AVRO format, most of the elements' data is showing up in the Avro files. However
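For reference, the RDD[String]-to-DataFrame path with spark-xml looks roughly like the sketch below. The schema and `rowTag` are hypothetical, and the exact `XmlReader` API can differ between spark-xml versions:

```scala
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.types._

// Hypothetical schema; nested elements map to nested StructType/ArrayType fields.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("items", ArrayType(StructType(Seq(
    StructField("name", StringType)))))))

// xmlRdd is an RDD[String] with one XML document per element;
// the rowTag must match the element that delimits one record.
val df = new XmlReader()
  .withSchema(schema)
  .withRowTag("record")
  .xmlRdd(spark, xmlRdd)
```

A schema field whose name or nesting doesn't match the XML structure will silently come back null, which is a common cause of "missing" nested data in the saved output.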

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Hi Gourav, Please check the comments on the ticket; it looks like the performance degradation is attributed to the inferTimestamp option, which is true by default (I have no idea why) in Spark 3.0. This forces Spark to scan the entire text, hence the poor performance. Regards Sanjeev > On Jun 30, 2020,
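The workaround being debated in this thread can be sketched in spark-shell (the path is the one from the thread; whether it helps is exactly what the thread disputes):

```scala
// Sketch: read JSON with timestamp inference disabled, so Spark should not
// attempt to parse every string value as a timestamp during schema inference.
val df = spark.read
  .option("inferTimestamp", "false")
  .json("/data/small-anon/")
```

Supplying an explicit schema via `.schema(...)` avoids the inference pass entirely, which sidesteps the issue regardless of the option's behavior.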

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev, I think that I did precisely that. Can you please download my IPython notebook, have a look, and let me know where I am going wrong? It's attached to the JIRA ticket. Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra wrote: > There are total 11 files as

Re: File Not Found: /tmp/spark-events in Spark 3.0

2020-06-30 Thread Jeff Evans
This should only be needed if the spark.eventLog.enabled property was set to true. Is it possible the job configuration is different between your two environments? On Mon, Jun 29, 2020 at 9:21 AM ArtemisDev wrote: > While launching a spark job from Zeppelin against a standalone spark > cluster
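A minimal sketch of the relevant settings: the event-log directory defaults to /tmp/spark-events and must exist before the job starts, which is a common source of this FileNotFound error. The app file below is a placeholder:

```shell
# Sketch: create the default event-log directory, then enable logging explicitly.
mkdir -p /tmp/spark-events

spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///tmp/spark-events \
  my_app.py
```

The same two properties can instead be set once in spark-defaults.conf; setting spark.eventLog.enabled=false makes the directory unnecessary.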

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There are 11 files in total as part of the tar. You will have to untar it to get to the actual files (.json.gz). No, I am getting Count: 33447 spark.time(spark.read.json("/data/small-anon/")) Time taken: 431 ms res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more fields]

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev, that just gives 11 records from the sample that you have loaded to the JIRA ticket, is that correct? Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra wrote: > There is not much code, I am just using spark-shell and reading the data > like so > >

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There is not much code, I am just using spark-shell and reading the data like so spark.time(spark.read.json("/data/small-anon/")) > On Jun 30, 2020, at 3:53 AM, Gourav Sengupta > wrote: > > Hi Sanjeev, > > can you share the exact code that you are using to read the JSON files? > Currently I

Apache Spark Meetup - Wednesday 1st July

2020-06-30 Thread Joe Davies
Good morning, I hope this email finds you well. I am the host for an ongoing series of live webinars/virtual meetups, and the next two weeks are focused on Apache Spark. I was wondering if you could share within your group? It's free to sign up and there will be live Q&A throughout the