Hi,
This is an interesting point of view. I thought the HashPartitioner works
completely differently.
Here's my understanding: the HashPartitioner defines how keys are
distributed among the different partitions within a dataset, but plays no
role in assigning each partition for processing by
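To make that distinction concrete, here is a minimal pure-Python model of what a hash partitioner does: it maps a key to a partition index via the key's hash modulo the partition count (forced non-negative, as Spark's `Utils.nonNegativeMod` does, with null keys going to partition 0). This is an illustrative sketch using Python's built-in `hash`, not Spark's actual JVM `hashCode`; function names are mine.

```python
# Illustrative model of HashPartitioner-style key placement:
# partition = nonNegativeMod(hash(key), numPartitions).
# Note it only decides WHICH partition a key lands in; it says nothing
# about which executor later processes that partition.

def non_negative_mod(x: int, mod: int) -> int:
    """Force the modulo result into [0, mod), like Spark's Utils.nonNegativeMod."""
    raw = x % mod
    return raw + mod if raw < 0 else raw

def get_partition(key, num_partitions: int) -> int:
    """Assign a key to a partition index (null/None keys go to partition 0)."""
    if key is None:
        return 0
    return non_negative_mod(hash(key), num_partitions)
```

The important property is determinism: the same key always maps to the same partition index for a given partition count, so all records sharing a key end up colocated.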
Hi,
Yes, I'm running the executors with 8 cores each. I have also properly
configured executor memory, driver memory, the number of executors, and so
on in the submit command.
I'm a long-time Spark user, so please let's skip the basic submit-command
configuration and dive into the interesting stuff :)
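For readers following along, the kind of submit command being referred to might look like the following; the resource values are placeholders for illustration, not the poster's actual configuration:

```shell
# Example spark-submit invocation on YARN; all resource values below
# are illustrative placeholders, not the original configuration.
spark-submit \
  --master yarn \
  --num-executors 25 \
  --executor-cores 8 \
  --executor-memory 8g \
  --driver-memory 4g \
  my_job.py
```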
Another strange thing I've
Hi,
I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
I observe a very strange issue.
I run a simple job that reads about 1 TB of JSON logs from a remote HDFS
cluster, converts them to parquet, and then saves them to the local HDFS of
the Hadoop cluster.
I run it with 25 executors
Additionally, if I delete the parquet and recreate it using the same generic
save function with 1000 partitions and Overwrite mode, the size is again
correct.
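A sketch of that recreate step, assuming the PySpark 1.x generic `DataFrame.save` API mentioned above (the paths, app name, and context setup are illustrative, and this needs a live Spark cluster and HDFS to actually run):

```python
# Illustrative only: recreating the parquet output with an explicit
# partition count before an Overwrite-mode save. Paths are placeholders
# and a running Spark/HDFS cluster is required.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-to-parquet")
sqlContext = SQLContext(sc)

# Read the source JSON logs from the remote cluster.
df = sqlContext.read.json("hdfs://remote-cluster/logs/")

# Repartition to 1000 partitions, then use the generic save with
# Overwrite mode; per the observation above, this produces output
# of the expected size.
df.repartition(1000).save(
    "hdfs://local-cluster/warehouse/logs_parquet",
    source="parquet",
    mode="overwrite",
)
```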
Hi,
Kudos on Spark 1.3.x, it's a great release - loving DataFrames!
One thing I noticed after upgrading is that if I use the generic DataFrame
save function with Overwrite mode and a parquet source, it produces a
much larger output parquet file.
Source JSON data: ~500 GB
Originally saved parquet: