[pyspark 2.3+] how to dynamically determine DataFrame partitions while writing

2019-05-21 Thread Rishi Shah
Hi All, What is the best way to determine the number of partitions of a dataframe dynamically before writing to disk? 1) statically determine based on data and use coalesce or repartition while writing 2) somehow determine the count of records for the entire dataframe and divide that number to determine partitions -
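
A minimal Scala sketch of option 2, not taken from the thread: count the records, derive a partition count from an assumed rows-per-partition target (rowsPerPartition is a hypothetical knob), and repartition before writing. The count() costs an extra full pass over the data; the same approach works from PySpark.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: derive the partition count from the row count before writing.
// rowsPerPartition is an assumed tuning value, not a Spark setting.
def writeWithDynamicPartitions(df: DataFrame, path: String,
                               rowsPerPartition: Long = 1000000L): Unit = {
  val totalRows = df.count()  // extra full pass over the data
  val numPartitions = math.max(1, math.ceil(totalRows.toDouble / rowsPerPartition).toInt)
  df.repartition(numPartitions)
    .write
    .mode(SaveMode.Overwrite)
    .parquet(path)
}
```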

Re: [spark on yarn] spark on yarn without DFS

2019-05-21 Thread Huizhe Wang
Hi Hari, Thanks :) I tried to do it as you said. It works ;) Hariharan wrote on Mon, May 20, 2019 at 3:54 PM: > Hi Huizhe, > > You can set the "fs.defaultFS" field in core-site.xml to some path on S3. > That way your Spark job will use S3 for all operations that need HDFS. > Intermediate data will still be
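
For reference, a spark-shell style sketch of the same idea set programmatically rather than in core-site.xml; the bucket name is a placeholder, and the s3a connector jars plus credentials are assumed to be configured already. On a real cluster the core-site.xml route described above remains the authoritative place for this setting.

```scala
import org.apache.spark.sql.SparkSession

// spark.hadoop.* properties are forwarded to the Hadoop Configuration,
// so this mirrors the fs.defaultFS entry in core-site.xml.
val spark = SparkSession.builder()
  .appName("s3-default-fs")
  .config("spark.hadoop.fs.defaultFS", "s3a://my-bucket")  // placeholder bucket
  .getOrCreate()

// Unqualified paths now resolve against the S3 bucket instead of HDFS.
spark.range(10).write.mode("overwrite").parquet("/tmp/defaultfs-demo")
```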

Offsets out of order - Spark Datasource V2

2019-05-21 Thread Cressy, Taylor
Hello Spark community, Please let me know if this is the appropriate place to ask this question – will happily move it. I haven’t been able to find the answer going to the usual outlets. I am currently implementing two custom readers for our projects (JMS / SQS) and am experiencing a problem

Structured Streaming Error

2019-05-21 Thread KhajaAsmath Mohammed
Hi, I am getting the below error when running a sample streaming app. Does anyone have a resolution for this? JSON OFFSET {"test":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0}} - Herreee root |-- key: string (nullable = true) |-- value: string (nullable = true) |-- topic: string (nullable = true) |--

Re: run new spark version on old spark cluster ?

2019-05-21 Thread Nicolas Paris
Hi, I finally got everything working. Here are the steps (for information, I am on HDP 2.6.5): - copy the old hive-site.xml into the new spark conf folder - (optional?) download the jersey-bundle-1.8.jar and put it into the jars folder - build a tar.gz from all the jars and copy that archive to hdfs

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok great. I understood the rationale, thanks.

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Yes. If the job fails repeatedly (4 times in this case), Spark assumes that there is a problem in the Job and notifies the user. In exchange for this, the engine can go on to serve other jobs with its available resources. I would try the following until things improve: 1. Figure out what's
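
The "4 times" here presumably maps to the task-retry limit spark.task.maxFailures (default 4). A minimal sketch of where that knob lives; the value shown is illustrative, not a recommendation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative only: raise the per-task retry limit before Spark gives up on the stage.
// On a cluster, a failing task is re-attempted this many times before the job is
// aborted with "Job aborted due to stage failure: Task ... failed N times".
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")  // default is 4

val spark = SparkSession.builder()
  .appName("retry-knob-demo")
  .config(conf)
  .getOrCreate()
```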

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
umm, I am not sure if I got this fully. Is it a design decision to not have context.stop() right after awaitTermination throws an exception? So, the idea is that if a task fails after n tries (default 4), Spark should fail fast and let the user know? Is this correct? As you mentioned there

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Correction: The Driver manages the Tasks; the resource manager serves up resources to the Driver or Task. On Tue, May 21, 2019 at 9:11 AM Jason Nerothin wrote: > The behavior is a deliberate design decision by the Spark team. > > If Spark were to "fail fast", it would prevent the system from

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
The behavior is a deliberate design decision by the Spark team. If Spark were to "fail fast", it would prevent the system from recovering from many classes of errors that are in principle recoverable (for example if two otherwise unrelated jobs cause a garbage collection spike on the same node).

Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why re-creating the dataframe doesn't work: Spark finds out the logical plan is cached and picks the cached data. You need to uncache the dataframe, or go back to the SQL way: df.createTempView("abc") spark.table("abc").cache()
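
For reference, a runnable sketch (local mode, placeholder data) of the two options from this reply; the view name "abc" comes from the reply itself.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "v")

// Option 1: explicitly drop the cached plan so later scans read live data again.
df.cache()
df.count()      // materialize the cache
df.unpersist()

// Option 2: the SQL way from the reply - register a view and cache it through the catalog.
df.createTempView("abc")
spark.table("abc").cache()
spark.table("abc").count()
```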

Re: double quote is automatically added when sinking as CSV

2019-05-21 Thread Akshay Bhardwaj
Hi, Add writeStream.option("quoteMode", "NONE"). By default the Spark Dataset API assumes that all values MUST BE enclosed in the quote character (default: ") while writing to CSV files. Akshay Bhardwaj On Tue, May 21, 2019 at 5:34 PM 杨浩 wrote: > We use struct streaming 2.2, when

double quote is automatically added when sinking as CSV

2019-05-21 Thread 杨浩
We use Structured Streaming 2.2. When sinking as CSV, a JSON string automatically gets double quotes added around it. For example, an element like > {"hello": "world"} ends up in the filesystem as > "{\"hello\": \"world\"}" How do we avoid the added quotes? We only want > {"hello": "world"} Code like > resultDS. > writeStream. >
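
Not something suggested in the thread, but because each record here is already a complete JSON string, one way to sidestep CSV quoting entirely is the text file sink, which writes a single string column verbatim. A sketch assuming resultDS is (or can be cast to) a Dataset[String]; the paths are placeholders.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.StreamingQuery

// Writes each string exactly as-is: no delimiter handling, no quoting, no escaping.
def sinkAsRawText(resultDS: Dataset[String]): StreamingQuery = {
  resultDS.writeStream
    .format("text")
    .option("path", "/tmp/json-out")                // placeholder
    .option("checkpointLocation", "/tmp/json-chk")  // placeholder
    .start()
}
```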

NoClassDefFoundError

2019-05-21 Thread Sachit Murarka
Hi All, I have simply added exception handling to my code in Scala and I am now getting a NoClassDefFoundError. Any leads would be appreciated. Thanks. Kind Regards, Sachit Murarka

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok, I found the reason. In my QueueStream example, I have a while(true) loop which keeps on adding the RDDs, and my awaitTermination call is after the while loop. Since the while loop never exits, awaitTermination never gets called and the exceptions never get reported. The above was just the problem
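
For anyone hitting the same shape of problem, a sketch of the fix being described: push RDDs into the queue from a background thread so the main thread actually reaches awaitTermination, where executor-side failures can then be caught. Spark 2.2-era DStream API; names and intervals are illustrative.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("queue-stream-fixed").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val rddQueue = new mutable.Queue[RDD[Int]]()
ssc.queueStream(rddQueue).map(_ / 0).print()  // deliberately fails on the executor side

ssc.start()

// Feed the queue from a background thread instead of a while(true) on the main thread,
// so the main thread can fall through to awaitTermination below.
new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 10) }
      Thread.sleep(1000)
    }
  }
}).start()

try {
  ssc.awaitTermination()  // now reachable, so streaming failures surface here
} catch {
  case e: Exception =>
    println(s"streaming job failed: ${e.getMessage}")
    ssc.stop(stopSparkContext = true, stopGracefully = false)
}
```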

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Just to add to my previous message. I am using Spark 2.2.2 standalone cluster manager and deploying the jobs in cluster mode.

How does dynamic allocation decide spark executor cores?

2019-05-21 Thread Pooja Agrawal
Hi All, I am a newbie to Spark and have a quick question. I am running Spark 2.3.2 on YARN using Hadoop 2.8.5 on EMR 5.19.0. Since the EMR version is 5.19, dynamic allocation is set to true by default. I haven't set the min and max executors but saw that by default it starts with max( initial
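
For reference, a hedged, spark-shell style sketch of the knobs involved (illustrative values only): dynamic allocation scales the number of executors between the min/max bounds, while the cores per executor come from spark.executor.cores (or the platform default) and are not decided dynamically.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only - EMR defaults may differ.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.initialExecutors", "4") // defaults to minExecutors if unset
  .config("spark.executor.cores", "4")          // fixed per executor; not chosen dynamically
  .config("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
  .getOrCreate()
```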

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
I was able to reproduce the problem. In the below repository, I have 2 sample jobs. Both execute 1/0 (ArithmeticException) on the executor side, but in the case of the NetworkWordCount job, awaitTermination throws the same exceptions (Job aborted due to stage failure .) that I can see in the

spark checkpoint between 2 jobs and HDFS ramfs with storage policy

2019-05-21 Thread Julien Laurenceau
Hi, I am looking for a setup that would allow me to split a single Spark processing pipeline into 2 jobs (operational constraints) without wasting too much time persisting the data between the two jobs during Spark checkpoint/writes. I have a config with a lot of RAM and I'm willing to configure a
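
Not from the thread, but a spark-shell style sketch of the hand-off pattern being described: job 1 writes its intermediate result to an HDFS directory that the operator has placed on the ramfs-backed storage tier, and job 2 reads it back. All paths are placeholders; the storage policy itself is configured on the HDFS side, not from Spark.

```scala
import org.apache.spark.sql.SparkSession

// Job 1: persist the intermediate result to a hand-off directory whose HDFS
// storage policy points at the fast (ramfs-backed) tier.
val spark = SparkSession.builder().appName("job-1-producer").getOrCreate()

val intermediate = spark.read.parquet("hdfs:///data/input")
// ... job 1 transformations go here ...

intermediate.write
  .mode("overwrite")
  .parquet("hdfs:///fast-tier/handoff/intermediate")

// Job 2, a separate application, simply starts from the hand-off location:
//   spark.read.parquet("hdfs:///fast-tier/handoff/intermediate")
```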