[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203709#comment-16203709 ]
Steve Loughran commented on SPARK-22240:
----------------------------------------

[~hyukjin.kwon]: we now see that on s3a the filesystem only ever reports one block, which means that when Spark asks the FS "how many blocks is this file in?", the answer is 1. That is information which can be used to determine what gives the best IO performance for multiple executors. In HDFS, for a file in N blocks with R replicas, you get N * R actual blocks you can read in parallel. For S3 it may look like there is just one block, but if you upload in multiple parts (as happens when the write gets above some threshold like 64M), then you do get parallelism on the read. There's no way to determine what the block size was on the upload (maybe we could save it as a header in future), but we can get the FS to tell the query engine what the current block size is, and so be consistent across writes and reads.

Spark does its own partitioning too, with the default minimum split size and other things controlling the decision: what the user asks for, the number of executors, and whether the format says it can be split (anything .gz == false, etc.). So it may just be the multiline case.

The local filesystem also only returns one element for getFileBlockLocations(path, ...), so you can check locally whether the problem exists when multiline = true vs. false.

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a
> property that should be set to have the number of partitions computed
> correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2;
> that's not relevant here.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
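The block-count observation in the comment can be checked directly against the Hadoop FileSystem API. The sketch below assumes a Hadoop client on the classpath; the path is a placeholder, not one from the report:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: ask the filesystem how many block locations it reports for a file.
// HDFS returns one BlockLocation per block (each listing its replica hosts);
// s3a and the local filesystem report a single location covering the whole
// file, which is why Spark sees "1 block".
val conf   = new Configuration()
val path   = new Path("s3a://bucket/data.csv") // placeholder path
val fs     = path.getFileSystem(conf)
val status = fs.getFileStatus(path)
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
println(s"blockSize=${status.getBlockSize} reportedBlocks=${blocks.length}")
{code}

Running this against the same file on HDFS and on s3a would show whether the FS-reported block count, rather than the file contents, is what changed between the two environments.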
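The "Spark does its own partitioning too" point can be sketched numerically. The constants below mirror the documented Spark 2.x defaults spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB); the function is a simplified model of Spark's split sizing, not the actual implementation (the real code also adds a per-file open cost to the byte totals):

{code:java}
// Simplified sketch of how Spark 2.x sizes file splits for splittable sources.
object SplitSketch {
  val defaultMaxSplitBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
  val openCostInBytes      = 4L * 1024 * 1024   // spark.sql.files.openCostInBytes

  // bytesPerCore spreads the total input over the available parallelism.
  def maxSplitBytes(totalBytes: Long, defaultParallelism: Int): Long = {
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Splits for a single file; a non-splittable file (multiLine CSV, .gz, ...)
  // always lands in exactly one partition.
  def numSplits(fileBytes: Long, defaultParallelism: Int, splittable: Boolean): Long =
    if (!splittable) 1L
    else math.ceil(fileBytes.toDouble / maxSplitBytes(fileBytes, defaultParallelism)).toLong
}
{code}

With these defaults, a splittable 14 GiB file yields 112 splits (14 GiB / 128 MiB), in the same ballpark as the 115 partitions reported for Spark 2.0.2; treated as non-splittable, which is what multiLine=true implies, it yields exactly 1, matching the Spark 2.2.0 result.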