The fact that you have 60 partitions or brokers in Kafka is not, by itself,
directly correlated to the number of Spark Structured Streaming (SSS)
executors. See below.
Spark starts with 200 shuffle partitions by default. However, by default,
Spark/PySpark creates partitions equal to the number of CPU cores in the
node.
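As a rough sketch of where these defaults live (the app name and the override
value of 60 below are illustrative assumptions, not settings from this thread):

from pyspark.sql import SparkSession

# Shuffles use spark.sql.shuffle.partitions (200 unless overridden);
# the number of tasks reading Kafka follows the topic's partition count,
# not the number of executors.
spark = (
    SparkSession.builder
    .appName("partition-defaults")
    .config("spark.sql.shuffle.partitions", "60")  # placeholder override of the 200 default
    .getOrCreate()
)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 60 after the override
print(spark.sparkContext.defaultParallelism)           # typically the total CPU cores available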
You can try the OPTIMIZE command of Delta Lake. That will help you for
sure, as it merges small files. Also, it depends on the file format: if you
are working with Parquet, then small files still should not cause any issues.
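For example, a minimal sketch of running compaction from PySpark (this assumes
the delta-spark Python package, an existing SparkSession named spark, and a
placeholder table path):

from delta.tables import DeltaTable

# Compact small files in the Delta table into larger ones.
delta_table = DeltaTable.forPath(spark, "/data/events")  # placeholder path
delta_table.optimize().executeCompaction()

The same thing can be done in SQL with: OPTIMIZE delta.`/data/events`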
P.
On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong
wrote:
> Hi Raghavendra,
Hi Raghavendra,
Yes, we are trying to reduce the number of files in delta as well (the
small file problem [0][1]).
We already have a scheduled app to compact files, but the number of
files is still large, at 14K files per day.
[0]: https://docs.delta.io/latest/optimizations-oss.html#language-pyt
Hi,
What is the purpose for which you want to use repartition()? Is it to reduce
the number of files in Delta?
Also note that there is an alternative option of using coalesce() instead
of repartition().
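A minimal sketch of the difference (the partition count of 6 is just an
example value):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()
df = spark.range(1_000_000)

# repartition(n) performs a full shuffle and can increase or decrease the partition count
repartitioned = df.repartition(6)

# coalesce(n) merges existing partitions without a shuffle and can only decrease the count
coalesced = df.coalesce(6)

print(repartitioned.rdd.getNumPartitions(), coalesced.rdd.getNumPartitions())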
--
Raghavendra
On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong
wrote:
> Hi all on user@spark:
>
>
Hi all on user@spark:
We are looking for advice and suggestions on how to tune the
.repartition() parameter.
We are using Spark Streaming on our data pipeline to consume messages
and persist them to a Delta Lake
(https://delta.io/learn/getting-started/).
We read messages from a Kafka topic, then
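A minimal sketch of this kind of pipeline, for context (the broker, topic,
paths, and the repartition value of 6 are placeholder assumptions, not actual
settings from this thread; it also assumes the Kafka and Delta packages are on
the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                      # placeholder topic
    .load()
)

query = (
    raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .repartition(6)  # the parameter being tuned; placeholder value
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events")  # placeholder
    .start("/data/events")                                # placeholder table path
)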