Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-22 Thread Elkhan Dadashov
I found the answer regarding logging in the JavaDoc of SparkLauncher: "Currently, all applications are launched as child processes. The child's stdout and stderr are merged and written to a logger (see java.util.logging)." One last question: sparkAppHandle.getAppId() - does this function return
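The merged-stream behavior the JavaDoc describes can be illustrated with a plain-Python sketch (this is an analogy, not Spark code): launch a child process with stderr redirected into stdout and forward every line to a logger, which is roughly what SparkLauncher does for the launched application. All names here are illustrative.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(name)s: %(message)s")
log = logging.getLogger("child-process")

# Merge the child's stderr into its stdout, then forward each line to a
# logger, analogous to how SparkLauncher handles the child's output.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello from child')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
lines = []
for line in proc.stdout:
    line = line.rstrip("\n")
    lines.append(line)
    log.info(line)
proc.wait()
```

The key point is that once stdout and stderr are merged and consumed by a logger, there is no separate stream for the caller to read, which is why the old getInputStream/getErrorStream style no longer applies.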

Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-22 Thread Elkhan Dadashov
Thanks, Marcelo. One more question regarding getting logs. In the previous implementation of SparkLauncher we could read logs from sparkLauncher.getInputStream() and sparkLauncher.getErrorStream(). What is the recommended way of getting logs and logging of Spark execution while using

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-22 Thread Efe Selcuk
Ah, looks similar. Next opportunity I get, I'm going to do a printSchema on the two datasets and see if they don't match up. I assume that unioning the underlying RDDs doesn't run into this problem because of less type checking or something along those lines? On Fri, Oct 21, 2016 at 3:39 PM
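A likely culprit: Dataset.union in Spark 2.0 resolves columns by position, not by name, so two datasets with the same columns in different orders (or different types at the same position) can fail to union. A toy plain-Python sketch of that positional check, with made-up schemas:

```python
# Toy schemas as ordered (name, type) pairs; Spark's union matches columns
# positionally, so types must line up slot by slot.
schema_a = [("id", "bigint"), ("amount", "double")]
schema_b = [("amount", "double"), ("id", "bigint")]  # same names, swapped order

def union_compatible(left, right):
    """Positional compatibility: same arity and same type at each position."""
    return len(left) == len(right) and all(
        lt == rt for (_, lt), (_, rt) in zip(left, right)
    )

compatible = union_compatible(schema_a, schema_b)  # False: types don't align
```

Printing both schemas with printSchema, as suggested above, is exactly the way to spot this kind of mismatch.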

Re: RDD groupBy() then random sort each group ?

2016-10-22 Thread Koert Kuipers
groupBy always materializes the entire group (on disk or in memory) which is why you should avoid it for large groups. The key is to never materialize the grouped and shuffled data. To see one approach to do this take a look at https://github.com/tresata/spark-sorted It's basically a
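One way to get randomly ordered groups without materializing each group is to sort records by a composite key of (group key, random tag): records stay contiguous by key, but within each key the order is random, and the result can be streamed group by group. A minimal plain-Python sketch of the idea (illustrative only, not the spark-sorted API):

```python
import random
from itertools import groupby
from operator import itemgetter

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Tag each record with a random sort key, sort by (key, tag), then stream
# over the result; each key's records are contiguous but randomly ordered,
# and no group is ever collected into its own list.
rng = random.Random(42)
tagged = [(key, rng.random(), value) for key, value in records]
tagged.sort(key=itemgetter(0, 1))

grouped_keys = [key for key, _ in groupby(tagged, key=itemgetter(0))]
```

In Spark terms this corresponds to shuffling by key and sorting within partitions, so the grouped data is consumed as an iterator rather than held in memory.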

Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Nkechi Achara
I do not use rename, and the files are written to, and then moved to, a directory on HDFS in gz format. On 22 October 2016 at 15:14, Steve Loughran wrote:
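The reason the move matters: textFileStream only reliably picks up files that appear atomically in the watched directory, so the usual pattern is to write the file fully in a staging location and then rename it into place. A plain-Python sketch of that write-then-rename pattern (paths and file names are made up):

```python
import os
import tempfile

watched_dir = tempfile.mkdtemp(prefix="watched-")
staging_dir = tempfile.mkdtemp(prefix="staging-")

# Write the whole file in a staging directory first, then rename it into
# the watched directory; a rename within one filesystem is atomic, so the
# stream never observes a half-written file.
staged = os.path.join(staging_dir, "events.gz")
with open(staged, "wb") as f:
    f.write(b"compressed payload placeholder")

final = os.path.join(watched_dir, "events.gz")
os.rename(staged, final)
```

On HDFS the equivalent is writing under a temporary path and issuing a rename into the monitored directory once the write is complete.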

Re: How does Spark determine in-memory partition count when reading Parquet ~files?

2016-10-22 Thread shea.parkes
Thank you for the reply, saurabh85. We do tune and adjust our shuffle partition count, but that was not influencing the reading of the parquet files (as I understand it, the data is not shuffled as it is read). I apologize that I actually received an answer, but it was not caught on the mailing list
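That understanding matches how Spark 2.0 file sources work: the read-side partition count is driven by file splitting, not by spark.sql.shuffle.partitions. Roughly, files are cut into chunks of spark.sql.files.maxPartitionBytes (default 128 MB), with spark.sql.files.openCostInBytes padding small files. A simplified sketch of that arithmetic (the real planner also bin-packs splits and considers available parallelism, so treat this as an estimate):

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes
OPEN_COST_IN_BYTES = 4 * 1024 * 1024     # spark.sql.files.openCostInBytes

def estimate_read_partitions(file_sizes):
    """Rough estimate: pad each file by the open cost, then divide the
    padded total by the max partition size."""
    padded = sum(size + OPEN_COST_IN_BYTES for size in file_sizes)
    return max(1, math.ceil(padded / MAX_PARTITION_BYTES))

# Ten 256 MB parquet files yield ~21 read partitions, regardless of the
# shuffle-partition setting.
partitions = estimate_read_partitions([256 * 1024 * 1024] * 10)
```

Lowering maxPartitionBytes increases the in-memory partition count when reading; raising it does the opposite.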

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-22 Thread Steve Loughran
On 22 Oct 2016, at 00:48, Chetan Khatri wrote: Hello Cheng, Thank you for the response. I am using Spark 1.6.1, and I am writing around 350 gz parquet part files for a single table. Processed around 180 GB of data using Spark. Are you writing

Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Steve Loughran
On 21 Oct 2016, at 15:53, Nkechi Achara wrote: Hi, I am using Spark 1.5.0 to read gz files with textFileStream, but when new files are dropped in the specified directory. I know this is only the case with gz files as when I extract the file into the

why is that two stages in apache spark are computing same thing?

2016-10-22 Thread maitraythaker
I have a Spark optimization query that I have posted on Stack Overflow; any guidance on this would be appreciated. Please follow the link below, where I have explained the problem in depth with code.

Fwd: Spark optimization problem

2016-10-22 Thread Maitray Thaker
Hi, I have a query regarding Spark stage optimization. I have asked the question in more detail on Stack Overflow; please find the following link: http://stackoverflow.com/questions/40192302/why-is-that-two-stages-in-apache-spark-are-computing-same-thing