Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
Ok, so for JDBC I presume it defaults to a single partition if you don't provide partitioning metadata?

Thanks!

Gary

On 26 October 2017 at 13:43, Daniel Siegmann wrote:
> Those settings apply when a shuffle happens. But they don't affect the way
> the data will be partitioned when it is initially read [...]
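(For reference, "partitioning metadata" here means the JDBC read options below. A minimal sketch, assuming a SparkSession named spark; the URL, table name, and bounds are hypothetical placeholders:)

    // Without partitionColumn/lowerBound/upperBound/numPartitions,
    // a JDBC read issues a single query and yields one partition.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // hypothetical URL
      .option("dbtable", "events")                          // hypothetical table
      .option("partitionColumn", "id") // must be numeric, date, or timestamp
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")    // Spark issues 8 range-bounded queries
      .load()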

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
Those settings apply when a shuffle happens. But they don't affect the way the data will be partitioned when it is initially read, for example spark.read.parquet("path/to/input"). So for HDFS / S3 I think it depends on how the data is split into chunks, but if there are lots of small chunks Spark will [...]
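(A quick way to see this for yourself, assuming a SparkSession named spark and a hypothetical input path:)

    // The partition count after an initial read is driven by the input's
    // file/block layout (and spark.sql.files.maxPartitionBytes), not by
    // spark.sql.shuffle.partitions or spark.default.parallelism.
    val df = spark.read.parquet("path/to/input")
    println(s"partitions after read: ${df.rdd.getNumPartitions}")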

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
Thanks Daniel! I've been wondering that for ages, i.e. where my JDBC-sourced datasets are coming up with 200 partitions on write to S3.

What do you mean by "(except for the initial read)"? Can you explain that a bit further?

Gary Lucas

On 26 October 2017 at 11:28, Daniel Siegmann wrote:
> When [...]
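(A rough sketch of how those 200 output files can be reined in just before the write; the bucket path is a placeholder:)

    // A shuffle (join, groupBy, ...) before the write repartitions the data
    // to spark.sql.shuffle.partitions, which defaults to 200 -- hence 200
    // output files on S3. Coalescing just before the write avoids that
    // without triggering another full shuffle.
    df.coalesce(8)
      .write
      .parquet("s3a://my-bucket/output") // hypothetical bucket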

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
When working with datasets, Spark uses spark.sql.shuffle.partitions. It defaults to 200. Between that and the default parallelism you can control the number of partitions (except for the initial read). More info here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configurati
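(A minimal sketch of that setting in action; the column name is hypothetical:)

    // The setting can be changed per session...
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    // ...and any operation that shuffles (groupBy, join, distinct, ...)
    // now produces 50 partitions instead of the default 200.
    val counts = df.groupBy("some_column").count()
    println(counts.rdd.getNumPartitions) // 50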

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Deepak Sharma
I guess the issue is that spark.default.parallelism is ignored when you are working with DataFrames. It is supposed to work only with raw RDDs.

Thanks
Deepak

On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <noo...@noorul.com> wrote:
> Hi all,
>
> I have the following spark configurati [...]
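(A small sketch of that contrast, assuming a SparkSession named spark:)

    // spark.default.parallelism governs raw RDD operations...
    val rdd = spark.sparkContext.parallelize(1 to 1000)
    println(rdd.getNumPartitions) // follows spark.default.parallelism

    // ...but a DataFrame shuffle ignores it and uses spark.sql.shuffle.partitions.
    val shuffled = spark.range(1000).toDF("id").groupBy("id").count()
    println(shuffled.rdd.getNumPartitions) // 200 unless spark.sql.shuffle.partitions is set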

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
I think we'd need to see the code that loads the df. Parallelism and partition count are related, but they're not the same. I've found the documentation fuzzy on this, but it looks like spark.default.parallelism is what Spark uses for partitioning when it has no other guidance. I'm also under the impression [...]
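(To illustrate that distinction, a minimal check, assuming a SparkSession named spark and an already-loaded df:)

    // defaultParallelism is a scheduler-level hint (typically the total number
    // of executor cores); a dataset's partition count is a property of how
    // that dataset was produced. The two need not match.
    println(spark.sparkContext.defaultParallelism)
    println(df.rdd.getNumPartitions)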

Controlling number of spark partitions in dataframes

2017-10-26 Thread Noorul Islam Kamal Malmiyoda
Hi all,

I have the following spark configuration:

spark.app.name=Test
spark.cassandra.connection.host=127.0.0.1
spark.cassandra.connection.keep_alive_ms=5000
spark.cassandra.connection.port=1
spark.cassandra.connection.timeout_ms=3
spark.cleaner.ttl=3600
spark.default.parallelism=4
spark.m [...]
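(Given where the thread lands, a hedged sketch of setting both knobs when the session is built; the values are illustrative only:)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Test")
      .config("spark.default.parallelism", "4")    // raw RDD operations
      .config("spark.sql.shuffle.partitions", "4") // DataFrame/Dataset shuffles
      .getOrCreate()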