Re: Spark SQL DataFrame: Nullable column and filtering

2015-08-01 Thread Martin Senne
Dear all, after some fiddling I have arrived at this solution: /** * Customized left outer join on common column. */ def leftOuterJoinWithRemovalOfEqualColumn(leftDF: DataFrame, rightDF: DataFrame, commonColumnName: String): DataFrame = { val joinedDF =
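The code in the preview above is cut off, so the author's actual implementation is not shown. As a rough sketch of the general idea only (a left outer join that keeps a single copy of the common column), one possible approach is to rename the right-hand key before joining and drop it afterwards. The helper name and the temporary column suffix below are hypothetical and are not taken from the post.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper; NOT the truncated code from the post above.
def leftOuterJoinDroppingRightKey(left: DataFrame, right: DataFrame, key: String): DataFrame = {
  // Assumption: this temporary name does not clash with an existing column.
  val tmpKey = key + "_right_tmp"
  // Rename the right-hand key so the join condition is unambiguous.
  val renamedRight = right.withColumnRenamed(key, tmpKey)
  left
    .join(renamedRight, left(key) === renamedRight(tmpKey), "left_outer")
    .drop(tmpKey) // keep only the left-hand copy of the key column
}
```

Rows from the left side without a match keep their key value and get nulls in the right-hand columns, which is the case the thread title refers to.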

No event logs in yarn-cluster mode

2015-08-01 Thread Akmal Abbasov
Hi, I am trying to configure a history server for an application. When I run locally (./run-example SparkPi), the event logs are created, and I can start the history server. But when I try ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster
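The preview does not show which settings were used, so the following is only a sketch of the two properties that control event logging, `spark.eventLog.enabled` and `spark.eventLog.dir`. The HDFS path is a hypothetical placeholder. One thing that may be worth checking (an assumption, not confirmed in the thread) is whether the log directory is on a shared filesystem: in yarn-cluster mode the driver runs on a cluster node, so a local directory would be written there and not on the submitting machine.

```scala
import org.apache.spark.SparkConf

// Sketch only: the directory below is a hypothetical placeholder and
// must already exist on the shared filesystem.
val conf = new SparkConf()
  .setAppName("SparkPi")
  .set("spark.eventLog.enabled", "true")                    // write event logs
  .set("spark.eventLog.dir", "hdfs:///shared/spark-events") // where the driver writes them
```

The history server would then need `spark.history.fs.logDirectory` pointing at the same location in order to find the logs.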

About memory leak in spark 1.4.1

2015-08-01 Thread Sea
Hi all, I upgraded Spark to 1.4.1 and many applications failed... The heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and its usage keeps growing over time until it exceeds the server's maximum limit and the worker dies.

Re: How does the # of tasks affect # of threads?

2015-08-01 Thread Fabrice Sznajderman
Hello, I am not an expert with Spark, but the error thrown by Spark seems to indicate that there is not enough memory to launch the job. By default, Spark allocates 1GB of memory; maybe you should increase it? Best regards, Fabrice On Sat, Aug 1, 2015 at 10:51 PM, Connor Zanin cnnr...@udel.edu wrote:

Re: TCP/IP speedup

2015-08-01 Thread Mark Hamstra
https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com wrote: Hi All! How important would a significant performance improvement to TCP/IP itself be, in terms of overall job performance improvement? Which part

Re: TCP/IP speedup

2015-08-01 Thread Simon Edelhaus
H 2% huh. -- ttfn Simon Edelhaus California 2015 On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra m...@clearstorydata.com wrote: https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com wrote: Hi All!

Re: No event logs in yarn-cluster mode

2015-08-01 Thread Marcelo Vanzin
On Sat, Aug 1, 2015 at 9:25 AM, Akmal Abbasov akmal.abba...@icloud.com wrote: When I run locally (./run-example SparkPi), the event logs are created, and I can start the history server. But when I try ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster

Re: Does anyone have experience with using Hadoop InputFormats?

2015-08-01 Thread Antsy.Rao
Sent from my iPad On 2014-9-24, at 8:13 AM, Steve Lewis lordjoe2...@gmail.com wrote: When I experimented with using an InputFormat I had used in Hadoop for a long time, I found 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated class not

Re: How does the # of tasks affect # of threads?

2015-08-01 Thread Connor Zanin
1. I believe that the default memory (per executor) is 512m (from the documentation) 2. I have increased the memory used by spark on workers in my launch script when submitting the job (--executor-memory 124g) 3. The job completes successfully, it is the road bumps in the middle I am

Re: Spark Number of Partitions Recommendations

2015-08-01 Thread Ruslan Dautkhanov
You should also take into account the amount of memory that you plan to use. It's advised not to give too much memory to each executor; otherwise GC overhead will go up. By the way, why prime numbers? -- Ruslan Dautkhanov On Wed, Jul 29, 2015 at 3:31 AM, ponkin alexey.pon...@ya.ru wrote: Hi Rahul,

TCP/IP speedup

2015-08-01 Thread Simon Edelhaus
Hi All! How important would a significant performance improvement to TCP/IP itself be, in terms of overall job performance improvement? Which part would be most significantly accelerated? Would it be HDFS? -- ttfn Simon Edelhaus California 2015

Re: TCP/IP speedup

2015-08-01 Thread Ruslan Dautkhanov
If your network is bandwidth-bound, enabling jumbo frames (MTU 9000) may increase bandwidth by up to ~20%. http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm "Enabling Jumbo Frames across the cluster improves bandwidth" If the Spark workload is not network

Re: flatMap output on disk / flatMap memory overhead

2015-08-01 Thread Puneet Kapoor
Hi Octavian, Just out of curiosity, did you try persisting your RDD in a serialized format, MEMORY_AND_DISK_SER or MEMORY_ONLY_SER? That is, changing rdd.persist(MEMORY_AND_DISK) to rdd.persist(MEMORY_ONLY_SER). Regards On Wed, Jun 10, 2015 at 7:27 AM, Imran Rashid iras...@cloudera.com
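For illustration, a minimal sketch of the suggestion above, assuming a local master and a small made-up RDD (the application name, master URL, and data are placeholders, not from the thread). The serialized storage levels keep each partition as a byte array, which usually takes less memory at the cost of extra CPU for deserialization.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Placeholder app name and local master, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))

// Made-up data standing in for the flatMap output discussed in the thread.
val rdd = sc.parallelize(1 to 1000).flatMap(i => Seq(i, i * 2))

// Serialized storage; partitions that do not fit in memory spill to disk.
// With MEMORY_ONLY_SER they would be recomputed when needed instead.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

val n = rdd.count() // the first action materializes the persisted RDD
sc.stop()
```

A storage level can only be assigned once per RDD, so switching levels on an already persisted RDD requires calling `unpersist()` first.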