[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26532.
----------------------------------
    Resolution: Not A Problem

> repartitionByRange reads source files twice
> -------------------------------------------
>
>                 Key: SPARK-26532
>                 URL: https://issues.apache.org/jira/browse/SPARK-26532
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Structured Streaming
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Mike Dias
>            Priority: Minor
>         Attachments: repartition Stages.png, repartitionByRange Stages.png
>
>
> When using repartitionByRange in the Structured Streaming API to read and
> then write files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, so the
> source is read twice. This is easy to see on a large dataset, where the
> read time doubles.
>
> {code:java}
> $ curl -s -XGET http://localhost:4040/api/v1/applications/<shell_app_id>/stages
> {code}
>
>
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
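
The resolution gives no rationale. One plausible reading, offered here as an assumption and not as the resolver's explanation: range partitioning must know the partition boundaries before it can route any row, and Spark's RangePartitioner derives those boundaries by sampling the input, which costs an extra pass over the source. Hash partitioning, as used by repartition, needs no such pass. The sketch below illustrates the difference in plain Scala without Spark; `PartitioningSketch` and all of its methods are hypothetical names and this is not Spark's implementation.

```scala
// Illustrative sketch only: plain Scala, no Spark. All names are hypothetical.
object PartitioningSketch {

  // Hash partitioning: the target partition depends only on the key,
  // so rows can be routed during a single pass over the input.
  def hashPartition(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Range partitioning, pass 1: derive upper bounds from (a sample of) the data.
  // No row can be routed until this pass has finished.
  def rangeBounds(sample: Seq[String], numPartitions: Int): Vector[String] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty) Vector.empty
    else
      (1 until numPartitions)
        .map(i => sorted(math.max(0, i * sorted.length / numPartitions - 1)))
        .distinct
        .toVector
  }

  // Range partitioning, pass 2: route a row using the precomputed bounds.
  // Keys greater than every bound go to the last partition.
  def rangePartition(key: String, bounds: Vector[String]): Int = {
    val idx = bounds.indexWhere(b => key.compareTo(b) <= 0)
    if (idx == -1) bounds.length else idx
  }
}
```

In this sketch, `rangeBounds` has to consume the data before `rangePartition` can be called at all, which mirrors the two read stages reported in the issue, while `hashPartition` can be applied row by row as the data is first read.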