[pyspark 2.3+] how to dynamically determine DataFrame partitions while writing

2019-05-21 Thread Rishi Shah
Hi All, What is the best way to determine the number of partitions of a dataframe dynamically before writing to disk? 1) statically determine based on data and use coalesce or repartition while writing 2) somehow determine the count of records for the entire dataframe and divide that number to determine partitions -
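
A minimal Scala sketch of option 2, not taken from the thread: count the records, derive a partition count from an assumed rows-per-partition target (rowsPerPartition is a hypothetical knob), and repartition before writing. The count() costs an extra full pass over the data; the same approach works from PySpark.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: derive the partition count from the row count before writing.
// rowsPerPartition is an assumed tuning value, not a Spark setting.
def writeWithDynamicPartitions(df: DataFrame, path: String,
                               rowsPerPartition: Long = 1000000L): Unit = {
  val totalRows = df.count()  // extra full pass over the data
  val numPartitions = math.max(1, math.ceil(totalRows.toDouble / rowsPerPartition).toInt)
  df.repartition(numPartitions)
    .write
    .mode(SaveMode.Overwrite)
    .parquet(path)
}
```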

Re: [spark on yarn] spark on yarn without DFS

2019-05-21 Thread Huizhe Wang
Hi Hari, Thanks :) I tried to do it as you said. It works ;) Hariharan wrote on Mon, May 20, 2019 at 3:54 PM: > Hi Huizhe, > > You can set the "fs.defaultFS" field in core-site.xml to some path on S3. > That way your Spark job will use S3 for all operations that need HDFS. > Intermediate data will still be
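
For reference, a spark-shell style sketch of the same idea set programmatically rather than in core-site.xml; the bucket name is a placeholder, and the s3a connector jars plus credentials are assumed to be configured already. On a real cluster the core-site.xml route described above remains the authoritative place for this setting.

```scala
import org.apache.spark.sql.SparkSession

// spark.hadoop.* properties are forwarded to the Hadoop Configuration,
// so this mirrors the fs.defaultFS entry in core-site.xml.
val spark = SparkSession.builder()
  .appName("s3-default-fs")
  .config("spark.hadoop.fs.defaultFS", "s3a://my-bucket")  // placeholder bucket
  .getOrCreate()

// Unqualified paths now resolve against the S3 bucket instead of HDFS.
spark.range(10).write.mode("overwrite").parquet("/tmp/defaultfs-demo")
```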

Offsets out of order - Spark Datasource V2

2019-05-21 Thread Cressy, Taylor
Hello Spark community, Please let me know if this is the appropriate place to ask this question – will happily move it. I haven’t been able to find the answer going to the usual outlets. I am currently implementing two custom readers for our projects (JMS / SQS) and am experiencing a problem

Structured Streaming Error

2019-05-21 Thread KhajaAsmath Mohammed
Hi, I am getting the below error when running a sample streaming app. Does anyone have a resolution for this? JSON OFFSET {"test":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0}} - Herreee root |-- key: string (nullable = true) |-- value: string (nullable = true) |-- topic: string (nullable = true) |--

Re: run new spark version on old spark cluster ?

2019-05-21 Thread Nicolas Paris
Hi, I finally got everything working. Here are the steps (for information, I am on HDP 2.6.5): - copy the old hive-site.xml into the new spark conf folder - (optional?) download the jersey-bundle-1.8.jar and put it into the jars folder - build a tar.gz from all the jars and copy that archive to hdfs

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok great. I understood the rationale, thanks.

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Yes. If the job fails repeatedly (4 times in this case), Spark assumes that there is a problem in the Job and notifies the user. In exchange for this, the engine can go on to serve other jobs with its available resources. I would try the following until things improve: 1. Figure out what's
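
The "4 times" here presumably maps to the task-retry limit spark.task.maxFailures (default 4). A minimal sketch of where that knob lives; the value shown is illustrative, not a recommendation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative only: raise the per-task retry limit before Spark gives up on the stage.
// On a cluster, a failing task is re-attempted this many times before the job is
// aborted with "Job aborted due to stage failure: Task ... failed N times".
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")  // default is 4

val spark = SparkSession.builder()
  .appName("retry-knob-demo")
  .config(conf)
  .getOrCreate()
```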

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
umm, I am not sure if I got this fully. Is it a design decision to not have context.stop() right after awaitTermination throws an exception? So, the idea is that if a task fails after n tries (default 4), Spark should fail fast and let the user know? Is this correct? As you mentioned there

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Correction: The Driver manages the Tasks; the resource manager serves up resources to the Driver or Task. On Tue, May 21, 2019 at 9:11 AM Jason Nerothin wrote: > The behavior is a deliberate design decision by the Spark team. > > If Spark were to "fail fast", it would prevent the system from

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
The behavior is a deliberate design decision by the Spark team. If Spark were to "fail fast", it would prevent the system from recovering from many classes of errors that are in principle recoverable (for example if two otherwise unrelated jobs cause a garbage collection spike on the same node).

Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why re-creating the dataframe doesn't work: Spark finds out the logical plan is cached and picks the cached data. You need to uncache the dataframe, or go back to the SQL way: df.createTempView("abc") spark.table("abc").cache()
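
For reference, a runnable sketch (local mode, placeholder data) of the two options from this reply; the view name "abc" comes from the reply itself.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "v")

// Option 1: explicitly drop the cached plan so later scans read live data again.
df.cache()
df.count()      // materialize the cache
df.unpersist()

// Option 2: the SQL way from the reply - register a view and cache it through the catalog.
df.createTempView("abc")
spark.table("abc").cache()
spark.table("abc").count()
```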

Re: double quote is automatically added when sinking as CSV

2019-05-21 Thread Akshay Bhardwaj
Hi, Add writeStream.option("quoteMode", "NONE"). By default the Spark Dataset API assumes that all values MUST BE enclosed in the quote character (default: ") while writing to CSV files. Akshay Bhardwaj On Tue, May 21, 2019 at 5:34 PM 杨浩 wrote: > We use struct streaming 2.2, when

double quote is automatically added when sinking as CSV

2019-05-21 Thread 杨浩
We use Structured Streaming 2.2. When sinking as CSV, a JSON string automatically gets double quotes added around it. For example, an element like > {"hello": "world"} ends up in the filesystem as > "{\"hello\": \"world\"}" How do we avoid the added quotes? We only want > {"hello": "world"} Code like > resultDS. > writeStream. >
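
Not something suggested in the thread, but because each record here is already a complete JSON string, one way to sidestep CSV quoting entirely is the text file sink, which writes a single string column verbatim. A sketch assuming resultDS is (or can be cast to) a Dataset[String]; the paths are placeholders.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.StreamingQuery

// Writes each string exactly as-is: no delimiter handling, no quoting, no escaping.
def sinkAsRawText(resultDS: Dataset[String]): StreamingQuery = {
  resultDS.writeStream
    .format("text")
    .option("path", "/tmp/json-out")                // placeholder
    .option("checkpointLocation", "/tmp/json-chk")  // placeholder
    .start()
}
```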

NoClassDefFoundError

2019-05-21 Thread Sachit Murarka
Hi All, I have simply added exception handling to my code in Scala and I am now getting a NoClassDefFoundError. Any leads would be appreciated. Thanks. Kind Regards, Sachit Murarka

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok, I found the reason. In my QueueStream example, I have a while(true) loop which keeps on adding the RDDs, and my awaitTermination call is after the while loop. Since the while loop never exits, awaitTermination never gets called and the exceptions never get reported. The above was just the problem
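
For anyone hitting the same shape of problem, a sketch of the fix being described: push RDDs into the queue from a background thread so the main thread actually reaches awaitTermination, where executor-side failures can then be caught. Spark 2.2-era DStream API; names and intervals are illustrative.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("queue-stream-fixed").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val rddQueue = new mutable.Queue[RDD[Int]]()
ssc.queueStream(rddQueue).map(_ / 0).print()  // deliberately fails on the executor side

ssc.start()

// Feed the queue from a background thread instead of a while(true) on the main thread,
// so the main thread can fall through to awaitTermination below.
new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 10) }
      Thread.sleep(1000)
    }
  }
}).start()

try {
  ssc.awaitTermination()  // now reachable, so streaming failures surface here
} catch {
  case e: Exception =>
    println(s"streaming job failed: ${e.getMessage}")
    ssc.stop(stopSparkContext = true, stopGracefully = false)
}
```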

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Just to add to my previous message. I am using Spark 2.2.2 standalone cluster manager and deploying the jobs in cluster mode.

How does dynamic allocation decide spark executor cores?

2019-05-21 Thread Pooja Agrawal
Hi All, I am a newbie to Spark and have a quick question. I am running Spark 2.3.2 on YARN using Hadoop 2.8.5 on EMR 5.19.0. Since the EMR version is 5.19, dynamic allocation is set to true by default. I haven't set the min and max executors but saw that by default it starts with max( initial
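
For reference, a hedged, spark-shell style sketch of the knobs involved (illustrative values only): dynamic allocation scales the number of executors between the min/max bounds, while the cores per executor come from spark.executor.cores (or the platform default) and are not decided dynamically.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only - EMR defaults may differ.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.initialExecutors", "4") // defaults to minExecutors if unset
  .config("spark.executor.cores", "4")          // fixed per executor; not chosen dynamically
  .config("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
  .getOrCreate()
```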

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
I was able to reproduce the problem. In the below repository, I have 2 sample jobs. Both execute 1/0 (ArithmeticException) on the executor side, but in the case of the NetworkWordCount job, awaitTermination throws the same exceptions (Job aborted due to stage failure .) that I can see in the

spark checkpoint between 2 jobs and HDFS ramfs with storage policy

2019-05-21 Thread Julien Laurenceau
Hi, I am looking for a setup that would allow me to split a single Spark processing pipeline into 2 jobs (operational constraints) without wasting too much time persisting the data between the two jobs during Spark checkpoint/writes. I have a config with a lot of RAM and I'm willing to configure a
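
Not from the thread, but a spark-shell style sketch of the hand-off pattern being described: job 1 writes its intermediate result to an HDFS directory that the operator has placed on the ramfs-backed storage tier, and job 2 reads it back. All paths are placeholders; the storage policy itself is configured on the HDFS side, not from Spark.

```scala
import org.apache.spark.sql.SparkSession

// Job 1: persist the intermediate result to a hand-off directory whose HDFS
// storage policy points at the fast (ramfs-backed) tier.
val spark = SparkSession.builder().appName("job-1-producer").getOrCreate()

val intermediate = spark.read.parquet("hdfs:///data/input")
// ... job 1 transformations go here ...

intermediate.write
  .mode("overwrite")
  .parquet("hdfs:///fast-tier/handoff/intermediate")

// Job 2, a separate application, simply starts from the hand-off location:
//   spark.read.parquet("hdfs:///fast-tier/handoff/intermediate")
```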