Mike Dias created SPARK-26532:
---------------------------------

Summary: repartitionByRange strategy reads source files twice
Key: SPARK-26532
URL: https://issues.apache.org/jira/browse/SPARK-26532
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.4.0, 2.3.2
Reporter: Mike Dias
Attachments: repartition Stages.png, repartitionByRange Stages.png

When using repartitionByRange in the Structured Streaming API to read files and then write them out, the source files are read twice.

Example:

{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load

val q = ds.
  repartitionByRange(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}

This execution creates 3 stages: 2 for reading and 1 for writing, so the source is read twice. This is easy to see on a large dataset, where the time spent reading doubles.

{code:java}
$ curl -s -XGET http://localhost:4040/api/v1/applications/<shell_app_id>/stages
{code}

This is very different from the repartition strategy, which creates 2 stages: 1 for reading and 1 for writing.

{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load

val q = ds.
  repartition(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
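
To compare the stage counts of the two strategies without opening the UI, the output of the stages endpoint can be piped through a small counter. This is only a rough sketch: `count_stages` is a hypothetical helper, not part of Spark. It assumes each stage object in the response carries exactly one "stageId" field. It counts every entry returned, so retried stage attempts and stages from later micro-batches are included in the total.

```shell
# Hypothetical helper: count stage entries in the JSON read from stdin.
# Assumption: each stage object contains exactly one "stageId" key.
count_stages() {
  grep -o '"stageId"' | wc -l | tr -d ' '
}

# Usage (the application id stays a placeholder, as in the report):
# curl -s -XGET "http://localhost:4040/api/v1/applications/<shell_app_id>/stages" | count_stages
```

Running this once after the first micro-batch of each query should make the difference described in the report visible as a number, provided no stage was retried.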