Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, This is an interesting point of view. I thought the HashPartitioner worked completely differently. Here's my understanding: the HashPartitioner defines how keys are distributed between the different partitions of a dataset, but plays no role in assigning each partition for processing by a particular executor.
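
A minimal Scala sketch of that understanding (the names and the 8-partition count are illustrative, not from the job in question): the partitioner only maps keys to partition indices, while partition-to-executor placement is left to the task scheduler.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object HashPartitionerSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hash-partitioner-sketch"))

        // HashPartitioner only answers "which partition does this key go to?" --
        // roughly nonNegativeMod(key.hashCode, numPartitions).
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))
        val partitioned = pairs.partitionBy(new HashPartitioner(8))

        // Which executor runs the task for a given partition is decided by the
        // scheduler at task-launch time, not by the partitioner.
        partitioned
          .mapPartitionsWithIndex((idx, it) => it.map { case (k, v) => s"partition=$idx key=$k value=$v" })
          .collect()
          .foreach(println)

        sc.stop()
      }
    }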

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, Yes, I'm running the executors with 8 cores each. I also have executor memory, driver memory, num execs and so on properly configured in the submit cmd. I'm a long-time Spark user, so please let's skip the basic cmd configuration stuff and dive into the interesting stuff :) Another strange thing I've
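
For reference, a submit command of the shape described. Only the 8 cores (this message) and the 25 executors (the original post) come from the thread; the memory values and the job class are placeholders.

    # Memory values and the job class below are placeholders, not from the thread.
    spark-submit \
      --master yarn \
      --num-executors 25 \
      --executor-cores 8 \
      --executor-memory 16g \
      --driver-memory 4g \
      --class com.example.JsonToParquet \
      json-to-parquet.jar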

Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster. I observe a very strange issue. I run a simple job that reads about 1TB of JSON logs from a remote HDFS cluster, converts them to Parquet, and saves them to the local HDFS of the Hadoop cluster. I run it with 25 executors
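
A sketch of the kind of job described, against the Spark 1.6 DataFrame API (the cluster hostnames and paths are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    object JsonToParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
        val sqlContext = new SQLContext(sc)

        // Read the JSON logs from the remote HDFS cluster...
        val logs = sqlContext.read.json("hdfs://remote-nn:8020/logs/2016/03/15")

        // ...and write them as Parquet to the local HDFS of the Hadoop cluster.
        logs.write.mode(SaveMode.Overwrite).parquet("hdfs://local-nn:8020/warehouse/logs")

        sc.stop()
      }
    }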

Re: DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Additionally, if I delete the parquet and recreate it using the same generic save function with 1000 partitions and overwrite, the size is again correct.
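
A spark-shell-style sketch of that recreate step against the Spark 1.3-era API, assuming "with 1000 partitions" means a repartition(1000) before the save (paths are placeholders, and the exact save overload may differ):

    import org.apache.spark.sql.SaveMode

    // Rebuild the Parquet output from the source JSON with an explicit
    // partition count, overwriting the previous (oversized) output.
    val src = sqlContext.jsonFile("/data/source-logs.json")   // placeholder path
    src.repartition(1000).save("/data/logs.parquet", "parquet", SaveMode.Overwrite)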

DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Hi, Kudos on Spark 1.3.x, it's a great release - loving data frames! One thing I noticed after upgrading is that if I use the generic save DataFrame function with Overwrite mode and a Parquet source, it produces a much larger output Parquet file. Source JSON data: ~500GB. Originally saved Parquet:
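
For context, a spark-shell-style sketch of the generic save call being described, as I understand the Spark 1.3 API (paths are placeholders; the exact overload may differ):

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.jsonFile("/data/source-logs.json")   // placeholder path

    // Generic save: source format "parquet", SaveMode.Overwrite. This is the
    // call reported to produce roughly 3x larger Parquet output than the
    // originally saved file.
    df.save("/data/logs.parquet", "parquet", SaveMode.Overwrite)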