[jira] [Created] (SPARK-26532) repartitionByRange strategy reads source files twice

Mike Dias (JIRA) Thu, 03 Jan 2019 21:24:18 -0800

Mike Dias created SPARK-26532:
---------------------------------

             Summary: repartitionByRange strategy reads source files twice
                 Key: SPARK-26532
                 URL: https://issues.apache.org/jira/browse/SPARK-26532
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.0, 2.3.2
            Reporter: Mike Dias
         Attachments: repartition Stages.png, repartitionByRange Stages.png


When repartitionByRange is used in the Structured Streaming API to read files 
and then write them, the source files are read twice. 

Example:
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load

val q = ds.
  repartitionByRange(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
This execution creates 3 stages: 2 for reading and 1 for writing, so the 
source is read twice. The effect is easy to see on a large dataset, where the 
read time doubles.

 
The stages can be listed through the Spark monitoring REST API:
{code:java}
$ curl -s -XGET http://localhost:4040/api/v1/applications/<shell_app_id>/stages
{code}

This is very different from the repartition strategy, which creates 2 stages: 1 
for reading and 1 for writing.
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load

val q = ds.
  repartition(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
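
One likely explanation for the extra read stage (an assumption, not something confirmed in this report) is that range partitioning has to work out partition boundaries from the data before it can route any row, whereas hash partitioning can place each row by looking at that row alone. The sketch below illustrates the difference with plain Scala collections. `PartitioningPasses`, `readSource`, `hashPartition` and `rangePartition` are hypothetical names made up for this illustration, not Spark APIs, and the boundary selection is deliberately simplistic.

```scala
// Illustrative only: a plain Scala collection stands in for the file source.
// None of these names are Spark APIs.
object PartitioningPasses {
  // Counts how many times the "source" has been scanned.
  var scans = 0

  def readSource(): Seq[String] = {
    scans += 1
    Seq("pear", "apple", "fig", "kiwi", "banana", "cherry")
  }

  // Hash partitioning: the target partition depends only on the row itself,
  // so a single scan of the source is enough.
  def hashPartition(numPartitions: Int): Map[Int, Seq[String]] =
    readSource().groupBy(v => math.floorMod(v.hashCode, numPartitions))

  // Range partitioning: boundaries are derived from the data first (scan 1),
  // then every row is routed against those boundaries (scan 2).
  // Assumes a non-empty source.
  def rangePartition(numPartitions: Int): Map[Int, Seq[String]] = {
    val sorted = readSource().sorted
    val bounds = (1 until numPartitions).map(i => sorted(i * sorted.size / numPartitions))
    readSource().groupBy(v => bounds.count(b => b.compareTo(v) <= 0))
  }

  def main(args: Array[String]): Unit = {
    scans = 0
    hashPartition(3)
    println(s"hash partitioning scans: $scans")   // prints 1

    scans = 0
    rangePartition(3)
    println(s"range partitioning scans: $scans")  // prints 2
  }
}
```

If this is what happens here, the second read stage would correspond to the boundary-finding pass, and caching or otherwise reusing the batch between the two passes would be the natural place to look for a fix.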
 


