Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread ZHANG Wei
I can't reproduce the issue with my simple code: ```scala spark.streams.addListener(new StreamingQueryListener { override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = { println(event.progress.id + " is on progress") println(s"My accu is
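For reference, a self-contained version of the kind of repro described above might look like the sketch below; the accumulator name `myAcc`, the `rate` source, and the console sink are assumptions for illustration, not the poster's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

object ListenerRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("acc-listener").getOrCreate()
    import spark.implicits._

    // Driver-side accumulator, incremented by the executors in the map below.
    val myAcc = spark.sparkContext.longAccumulator("myAcc")

    // Listener callbacks run on the driver, so reading myAcc.value here is safe.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"${event.progress.id} is in progress, my acc: ${myAcc.value}")
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    })

    val query = spark.readStream.format("rate").load()
      .map { r => myAcc.add(1); r.getLong(1) }   // executor-side increment
      .writeStream.format("console").start()
    query.awaitTermination()
  }
}
```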

[Apache Spark][Streaming Job][Checkpoint] Spark job failed on checkpoint recovery with "Batch not found" error

2020-05-28 Thread taylorwu
Hi, we have a Spark 2.4 job that fails on checkpoint recovery every few hours with the following errors (from the driver log): driver spark-kubernetes-driver ERROR 20:38:51 ERROR MicroBatchExecution: Query impressionUpdate [id = 54614900-4145-4d60-8156-9746ffc13d1f, runId =
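For context, recovery in Structured Streaming is driven by the checkpoint directory the query was started with, so a hedged sketch of how that location is wired up is shown below; the paths and sink are illustrative, not taken from the failing job:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointedQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpointed-query").getOrCreate()

    val input = spark.readStream.format("rate").load()

    val query = input.writeStream
      .queryName("impressionUpdate")
      .format("parquet")
      .option("path", "/data/impressions")                              // hypothetical output path
      .option("checkpointLocation", "/checkpoints/impressionUpdate")    // must stay stable across restarts
      .start()

    // On restart, the query replays from the offsets/ and commits/ logs under the
    // checkpoint location; an inconsistent or missing entry there is the kind of
    // situation where batch-lookup errors surface during recovery.
    query.awaitTermination()
  }
}
```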

External hive metastore (remote) managed tables

2020-05-28 Thread Debajyoti Roy
Hi, does anyone know the behavior of dropping managed tables when using an external Hive metastore: does deletion of the data (e.g. from the object store) happen from Spark SQL or from the external Hive metastore? Confused by the local mode and remote mode code paths.
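Not an answer to the metastore question itself, but a small sketch for narrowing it down: check whether the table is MANAGED or EXTERNAL and where its data lives, since that decides whether DROP TABLE is expected to remove the files at all. The table name and Hive setup are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object InspectTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-table")
      .enableHiveSupport()   // assumes the external Hive metastore is configured via hive-site.xml
      .getOrCreate()

    // "Type" shows MANAGED or EXTERNAL; "Location" shows the backing path in the object store.
    spark.sql("DESCRIBE TABLE EXTENDED mydb.mytable").show(100, truncate = false)

    // For a MANAGED table, dropping it is expected to remove the data as well:
    // spark.sql("DROP TABLE mydb.mytable")
  }
}
```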

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread Something Something
I am assuming StateUpdateTask is your application-specific class. Does it have an 'updateState' method or something? I googled but couldn't find any documentation about doing it this way. Can you please direct me to some documentation? Thanks. On Thu, May 28, 2020 at 4:43 AM Srinivas V wrote: >

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread Kanwaljit Singh
You can’t do much if it is a streaming job. But in the case of batch jobs, teams will sometimes copy their S3 data to HDFS in prep for the next run :D From: randy clinton Date: Thursday, May 28, 2020 at 5:50 AM To: Dark Crusader Cc: Jörn Franke , user Subject: Re: Spark dataframe hdfs vs s3
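As a hedged illustration of the staging pattern mentioned above, the copy can be done with Spark itself (Hadoop distcp is another common option); the bucket, paths, and Parquet format are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object StageToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-to-hdfs").getOrCreate()

    // One-time copy before the batch run: read from S3, write to HDFS,
    // and point the subsequent jobs at the HDFS copy.
    spark.read.parquet("s3a://my-bucket/events/date=2020-05-28")
      .write.mode("overwrite")
      .parquet("hdfs:///staging/events/date=2020-05-28")

    spark.stop()
  }
}
```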

Re: CSV parsing issue

2020-05-28 Thread Sean Owen
I don't think so; that data is inherently ambiguous and incorrectly formatted. If you know something about the structure, maybe you can rewrite the middle column manually to escape the inner quotes and then reparse. On Thu, May 28, 2020 at 10:25 AM elango vaidyanathan wrote: > Is there any way
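A rough sketch of that suggestion, assuming the file has exactly three columns `id,<json>,label`, the JSON sits unquoted in the middle column, and only that column can contain commas or quotes (the path and column layout are assumptions, not Elango's actual data):

```scala
import org.apache.spark.sql.SparkSession

object FixCsvQuotes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fix-csv").getOrCreate()
    import spark.implicits._

    val raw = spark.read.textFile("/tmp/input.csv")   // hypothetical path

    // Re-quote the middle column: double any inner quotes and wrap the whole JSON
    // value in quotes, which is standard CSV escaping. The header line survives this
    // transformation, only gaining quotes around the middle column name.
    val fixed = raw.map { line =>
      val first = line.indexOf(',')
      val last  = line.lastIndexOf(',')
      val json  = line.substring(first + 1, last)
      val quoted = "\"" + json.replace("\"", "\"\"") + "\""
      line.substring(0, first + 1) + quoted + line.substring(last)
    }

    // Reparse the repaired lines with doubled-quote escaping.
    val df = spark.read
      .option("header", "true")
      .option("quote", "\"")
      .option("escape", "\"")
      .csv(fixed)
    df.show(false)
  }
}
```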

Re: CSV parsing issue

2020-05-28 Thread elango vaidyanathan
Is there any way I can handle it in code? Thanks, Elango On Thu, May 28, 2020, 8:52 PM Sean Owen wrote: > Your data doesn't escape double-quotes. > > On Thu, May 28, 2020 at 10:21 AM elango vaidyanathan > wrote: > >> >> Hi team, >> >> I am loading a CSV. One column contains a JSON value. I

Re: CSV parsing issue

2020-05-28 Thread Sean Owen
Your data doesn't escape double-quotes. On Thu, May 28, 2020 at 10:21 AM elango vaidyanathan wrote: > > Hi team, > > I am loading a CSV. One column contains a JSON value. I am unable to > parse that column properly. Below are the details. Can you please take a look? > > > > val

CSV parsing issue

2020-05-28 Thread elango vaidyanathan
Hi team, I am loading a CSV. One column contains a JSON value. I am unable to parse that column properly. Below are the details. Can you please take a look? val df1=spark.read.option("inferSchema","true"). option("header","true").option("quote", "\"") .option("escape",

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread Srinivas V
Sharing the code below: // accumulators is a class-level variable in the driver. sparkSession.streams().addListener(new StreamingQueryListener() { @Override public void onQueryStarted(QueryStartedEvent queryStarted) { logger.info("Query started: " +

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread ZHANG Wei
May I ask how the accumulator is accessed in the method `onQueryProgress()`? AFAICT, the accumulator is incremented correctly. There is a way to verify that in a cluster, like this: ``` // Add the following while loop before invoking awaitTermination while (true) { println("My acc: " +
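Filled out into a runnable shape, that suggestion could look like the sketch below; the accumulator name, the `rate` source, and the 10-second interval are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object PollAccumulator {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("poll-acc").getOrCreate()
    import spark.implicits._

    val myAcc = spark.sparkContext.longAccumulator("myAcc")

    val query = spark.readStream.format("rate").load()
      .map { r => myAcc.add(1); r.getLong(1) }   // incremented on the executors
      .writeStream.format("console").start()

    // Poll the accumulator on the driver instead of blocking on awaitTermination() right away.
    while (query.isActive) {
      println("My acc: " + myAcc.value)
      Thread.sleep(10000)
    }
    query.awaitTermination()
  }
}
```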

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread randy clinton
See if this helps: "That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, given that the S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar."

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread Srinivas V
Yes, I am using stateful structured streaming, similar to what you do. This is in Java; I do it this way: Dataset productUpdates = watermarkedDS .groupByKey( (MapFunction) event -> event.getId(), Encoders.STRING()) .mapGroupsWithState(
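For readers following along in Scala, a compact sketch of the same pattern (groupByKey plus mapGroupsWithState, with an accumulator bumped inside the state-update function, which runs on the executors) is below; the `Event`/`Count` case classes, field names, and `rate` source are illustrative assumptions, not the poster's actual classes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: String, ts: java.sql.Timestamp)
case class Count(id: String, count: Long)

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("stateful-counts").getOrCreate()
    import spark.implicits._

    val statefulRows = spark.sparkContext.longAccumulator("statefulRows")

    val events = spark.readStream.format("rate").load()
      .select($"value".cast("string").as("id"), $"timestamp".as("ts"))
      .as[Event]

    val counts = events
      .withWatermark("ts", "10 minutes")
      .groupByKey(_.id)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
        (id: String, rows: Iterator[Event], state: GroupState[Long]) =>
          val seen = rows.size
          statefulRows.add(seen)                        // executor-side increment
          val total = state.getOption.getOrElse(0L) + seen
          state.update(total)
          Count(id, total)
      }

    counts.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}
```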

Re: Regarding Spark 3.0 GA

2020-05-28 Thread ARNAV NEGI SOFTWARE ARCHITECT
Thanks Fabiano. I am building one myself; I will surely use yours as a quick starter. On Wed, 27 May 2020, 18:00 Gaetano Fabiano, wrote: > I have no idea. > > I compiled a docker image that you can find on Docker Hub and you can do > some experiments with it, composing a cluster. > >

Re: Regarding Spark 3.0 GA

2020-05-28 Thread ARNAV NEGI SOFTWARE ARCHITECT
Ok, thanks for the update, Sean. Can I also track the RC vote? On Wed, 27 May 2020, 18:12 Sean Owen, wrote: > No firm dates; it always depends on RC voting. Another RC is coming soon. > It is however looking pretty close to done. > > On Wed, May 27, 2020 at 3:54 AM ARNAV NEGI SOFTWARE ARCHITECT <