OK, so for JDBC I presume it defaults to a single partition if you don't
provide partitioning metadata?
Thanks!
Gary
On 26 October 2017 at 13:43, Daniel Siegmann wrote:
> Those settings apply when a shuffle happens. But they don't affect the way
> the data will be partitioned when it is initially read. …
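(For illustration, a minimal sketch of the partitioned JDBC read being
discussed; the URL, table, column, and bounds below are placeholders, and an
existing SparkSession named spark is assumed. Without the four partitioning
options, the JDBC source falls back to a single partition.)

    // Placeholders throughout: url, dbtable, and the "id" bounds are examples.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "my_table")
      .option("partitionColumn", "id")   // numeric column to split the reads on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")      // number of parallel JDBC reads
      .load()
    println(df.rdd.getNumPartitions)     // 8; without the options above, 1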
Those settings apply when a shuffle happens. But they don't affect the way
the data will be partitioned when it is initially read, for example
spark.read.parquet("path/to/input"). So for HDFS / S3 I think it depends on
how the data is split into chunks, but if there are lots of small chunks
Spark will end up with lots of small partitions.
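(A minimal sketch of the distinction Daniel describes; "path/to/input" comes
from his example, "someColumn" is hypothetical, and an existing SparkSession
named spark is assumed.)

    // Initial read: partition count follows the input file splits.
    val df = spark.read.parquet("path/to/input")
    println(df.rdd.getNumPartitions)       // driven by the file chunks, not config

    // Any subsequent shuffle uses spark.sql.shuffle.partitions instead.
    val counts = df.groupBy("someColumn").count()
    println(counts.rdd.getNumPartitions)   // 200 by default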
Thanks Daniel!
I've been wondering that for ages!
i.e. where my JDBC-sourced datasets are coming up with 200 partitions on
write to S3.
What do you mean by "(except for the initial read)"?
Can you explain that a bit further?
Gary Lucas
On 26 October 2017 at 11:28, Daniel Siegmann wrote:
> When working with datasets, Spark uses spark.sql.shuffle.partitions. …
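(A sketch of one way to avoid the 200 small output files Gary describes;
dfFromJdbc and the S3 path are hypothetical. coalesce reduces the partition
count without a full shuffle before the write.)

    // dfFromJdbc is assumed to have passed through a shuffle (hence 200 partitions).
    dfFromJdbc
      .coalesce(8)                     // 8 is an arbitrary example target
      .write
      .parquet("s3a://bucket/output")  // placeholder output path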
When working with datasets, Spark uses spark.sql.shuffle.partitions. It
defaults to 200. Between that and the default parallelism you can control
the number of partitions (except for the initial read).
More info here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
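(A minimal sketch of setting both knobs; the values are arbitrary examples.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("example")
      .config("spark.sql.shuffle.partitions", "64")  // Dataset/DataFrame shuffles
      .config("spark.default.parallelism", "64")     // raw RDD operations
      .getOrCreate()

    // spark.sql.shuffle.partitions can also be changed at runtime:
    spark.conf.set("spark.sql.shuffle.partitions", "32")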
I guess the issue is that spark.default.parallelism is ignored when you are
working with DataFrames. It is supposed to apply only to raw RDDs.
Thanks
Deepak
On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
noo...@noorul.com> wrote:
> Hi all,
>
> I have the following spark configuration …
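(A sketch contrasting the two settings, assuming spark.default.parallelism=4
as in the configuration below and an existing SparkSession named spark.)

    import spark.implicits._

    // Raw RDD: parallelize with no slice count uses spark.default.parallelism.
    val rdd = spark.sparkContext.parallelize(1 to 100)
    println(rdd.getNumPartitions)   // 4

    // DataFrame shuffle: spark.default.parallelism is ignored here;
    // spark.sql.shuffle.partitions applies instead.
    val df = rdd.toDF("n")
    println(df.groupBy("n").count().rdd.getNumPartitions)   // 200 by default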
I think we'd need to see the code that loads the df.
Parallelism and partition count are related, but they're not the same. I've
found the documentation fuzzy on this, but it looks like
spark.default.parallelism is what Spark uses for partitioning when it has no
other guidance. I'm also under the impression that …
Hi all,
I have the following spark configuration
spark.app.name=Test
spark.cassandra.connection.host=127.0.0.1
spark.cassandra.connection.keep_alive_ms=5000
spark.cassandra.connection.port=1
spark.cassandra.connection.timeout_ms=3
spark.cleaner.ttl=3600
spark.default.parallelism=4
spark.m…
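(For reference, a sketch of applying settings like these programmatically;
the Cassandra connector options are passed through as plain key/value
config strings.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Test")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .config("spark.default.parallelism", "4")
      .getOrCreate()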