Re: Different execution results with wholestage codegen on and off

2020-05-27 Thread Xiao Li
Thanks for reporting it. Please open a JIRA with a test case. Cheers, Xiao On Wed, May 27, 2020 at 1:42 PM Pasha Finkelshteyn < pavel.finkelsht...@gmail.com> wrote: > Hi folks, > > I'm implementing Kotlin bindings for Spark and faced a strange problem. In > one corner case Spark works differently

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Randy, Yes, I'm using parquet on both S3 and hdfs. On Thu, 28 May, 2020, 2:38 am randy clinton, wrote: > Is the file Parquet on S3 or is it some other file format? > > In general I would assume that HDFS read/writes are more performant for > spark jobs. > > For instance, consider how well

Re: Using Spark Accumulators with Structured Streaming

2020-05-27 Thread Something Something
Yes, that's exactly how I am creating them. Question... Are you using 'Stateful Structured Streaming' in which you've something like this? .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())( updateAcrossEvents ) And updating the Accumulator inside 'updateAcrossEvents'?

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread randy clinton
Is the file Parquet on S3 or is it some other file format? In general I would assume that HDFS read/writes are more performant for spark jobs. For instance, consider how well partitioned your HDFS file is vs the S3 file. On Wed, May 27, 2020 at 1:51 PM Dark Crusader wrote: > Hi Jörn, > >

Different execution results with wholestage codegen on and off

2020-05-27 Thread Pasha Finkelshteyn
Hi folks, I'm implementing Kotlin bindings for Spark and faced a strange problem. In one corner case Spark works differently when wholestage codegen is on or off. Does it look like a bug or expected behavior? -- Regards, Pasha Big Data Tools @ JetBrains

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Jörn, Thanks for the reply. I will try to create an easier example to reproduce the issue. I will also try your suggestion to look into the UI. Can you guide me on what I should be looking for? I was already using the s3a protocol to compare the times. My hunch is that multiple reads from S3

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Jörn Franke
Have you looked in the Spark UI to see why this is the case? Reading from S3 can take more time; it also depends on which S3 URL scheme you are using: s3a vs s3n vs s3. It could help to persist in memory or on HDFS after some calculation. You can also initially load from S3, store on HDFS, and work from there.
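
A minimal sketch of the "load from S3, store on HDFS, and work from there" suggestion, in PySpark. The function name and both paths are hypothetical, and `spark` is assumed to be an existing SparkSession; only the standard `read.parquet`, `write.mode(...).parquet`, and `cache` calls are used. Whether this actually helps depends on how often the data is re-read.

```python
def stage_s3_to_hdfs(spark, s3_path, hdfs_path):
    """Copy a Parquet dataset from S3 to HDFS once, then read it back from HDFS.

    `spark` is assumed to be an existing SparkSession; both paths are
    placeholders supplied by the caller (hypothetical, not from the thread).
    """
    # One-time read over the network from S3 (an s3a:// path is assumed).
    source_df = spark.read.parquet(s3_path)

    # Write a copy to HDFS so that later passes read from cluster storage.
    source_df.write.mode("overwrite").parquet(hdfs_path)

    # Re-read from HDFS and mark the result for caching, since an
    # iterative ML algorithm is likely to scan the data several times.
    return spark.read.parquet(hdfs_path).cache()
```

A call might look like `df = stage_s3_to_hdfs(spark, "s3a://some-bucket/data", "hdfs:///tmp/data")`, with both locations replaced by real ones.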

Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi all, I am reading data from hdfs in the form of parquet files (around 3 GB) and running an algorithm from the spark ml library. If I create the same spark dataframe by reading data from S3, the same algorithm takes considerably more time. I don't understand why this is happening. Is this a

Re: Using Spark Accumulators with Structured Streaming

2020-05-27 Thread Srinivas V
Yes, I am talking about application-specific Accumulators. Actually I am getting the values printed in my driver log as well as sent to Grafana. Not sure where and when I saw 0 before. My deploy mode is “client” on a YARN cluster (not local Mac), where I submit from the master node. It should work the

Re: Regarding Spark 3.0 GA

2020-05-27 Thread Sean Owen
No firm dates; it always depends on RC voting. Another RC is coming soon. It is however looking pretty close to done. On Wed, May 27, 2020 at 3:54 AM ARNAV NEGI SOFTWARE ARCHITECT < negi.ar...@gmail.com> wrote: > Hi, > > I am working on Spark 3.0 preview release for large Spark jobs on >

Re: Regarding Spark 3.0 GA

2020-05-27 Thread Gaetano Fabiano
I have no idea. I compiled a docker image that you can find on docker hub and you can do some experiments with it composing a cluster. https://hub.docker.com/r/gaetanofabiano/spark Let me know if you have news about the release. Regards Sent from iPhone > On 27 May 2020, at

Regarding Spark 3.0 GA

2020-05-27 Thread ARNAV NEGI SOFTWARE ARCHITECT
Hi, I am working with the Spark 3.0 preview release for large Spark jobs on Kubernetes, and the preview looks promising. Could you tell me when Spark 3.0 GA is expected? Definitive dates will help us plan our roadmap with Spark 3.0. Arnav Negi / Technical Architect | Web Technology Enthusiast

Spark on kubernetes memory spike and spark.kubernetes.memoryOverheadFactor not working

2020-05-27 Thread Maiti, Mousam
Hi Team, We are using spark on Kubernetes, through spark-on-k8s-operator. Our application deals with multiple updateStateByKey operations. Upon investigation, we found that the spark application consumes a higher volume of memory. As spark-on-k8s-operator doesn't give the option to segregate

Re: How to enable hive support on an existing Spark session?

2020-05-27 Thread HARSH TAKKAR
Hi Kun, You can set the following Spark property when launching the app instead of manually enabling it in the code: spark.sql.catalogImplementation=hive Kind Regards Harsh On Tue, May 26, 2020 at 9:55 PM Kun Huang (COSMOS) wrote: > > Hi Spark experts, > > I am seeking for an approach
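
As a rough illustration of setting that property at launch time, the sketch below assembles a `spark-submit` command line that passes it through `--conf`. The helper name and the application path are hypothetical; only the `--conf key=value` form of `spark-submit` and the property named in the message are assumed.

```python
def spark_submit_command(app_path, extra_conf=None):
    """Build a spark-submit argument list that enables the Hive catalog.

    `app_path` and `extra_conf` are placeholders supplied by the caller;
    this helper is illustrative and not part of Spark itself.
    """
    # The property from the message above; other settings may be merged in.
    conf = {"spark.sql.catalogImplementation": "hive"}
    conf.update(extra_conf or {})

    cmd = ["spark-submit"]
    # Sort keys so the generated command is stable across runs.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


if __name__ == "__main__":
    # "my_app.py" is a hypothetical application file.
    print(" ".join(spark_submit_command("my_app.py")))
```

The property is read when the session is created, so passing it at launch avoids having to change it on a session that already exists.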