[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203709#comment-16203709 ]
Steve Loughran commented on SPARK-22240:
----------------------------------------

[~hyukjin.kwon]: we now see that on s3a the filesystem only ever reports one block, which means that when Spark asks the FS "how many blocks is this file in?", the answer is 1. That is information which can be used to determine what gives the best IO performance for multiple executors. In HDFS, for a file in N blocks with R replicas, you get N * R actual blocks you can read in parallel. For S3 it may look like there is just one block, but if you upload in multiple parts (as happens when the write gets above some threshold like 64M), then you do get parallelism on the read. There's no way to determine what the block size was on the upload (maybe we could save it as a header in future), but we can get the FS to tell the query engine what the current block size is, and so be consistent across writes and reads.

Spark does its own partitioning too, with the default minimum split size and other things controlling the decision: what the user asks for, the number of executors, and whether the format says it can be split (anything .gz == false, etc.). So it may just be the multiline case.

The local filesystem also only returns one element for getFileBlockLocations(path, ...), so you can check locally whether the problem exists when multiline = true vs. false.

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a
> property that should be set to have the number of partitions computed
> correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2;
> that's not relevant here.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
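The block-count observation in the comment can be checked directly against the Hadoop FileSystem API. The sketch below assumes a Hadoop client on the classpath; the path is a placeholder, not one from the report:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: ask the filesystem how many block locations it reports for a file.
// HDFS returns one BlockLocation per block (each listing its replica hosts);
// s3a and the local filesystem report a single location covering the whole
// file, which is why Spark sees "1 block".
val conf   = new Configuration()
val path   = new Path("s3a://bucket/data.csv") // placeholder path
val fs     = path.getFileSystem(conf)
val status = fs.getFileStatus(path)
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
println(s"blockSize=${status.getBlockSize} reportedBlocks=${blocks.length}")
{code}

Running this against the same file on HDFS and on s3a would show whether the FS-reported block count, rather than the file contents, is what changed between the two environments.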
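The "Spark does its own partitioning too" point can be sketched numerically. The constants below mirror the documented Spark 2.x defaults spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB); the function is a simplified model of Spark's split sizing, not the actual implementation (the real code also adds a per-file open cost to the byte totals):

{code:java}
// Simplified sketch of how Spark 2.x sizes file splits for splittable sources.
object SplitSketch {
  val defaultMaxSplitBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
  val openCostInBytes      = 4L * 1024 * 1024   // spark.sql.files.openCostInBytes

  // bytesPerCore spreads the total input over the available parallelism.
  def maxSplitBytes(totalBytes: Long, defaultParallelism: Int): Long = {
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Splits for a single file; a non-splittable file (multiLine CSV, .gz, ...)
  // always lands in exactly one partition.
  def numSplits(fileBytes: Long, defaultParallelism: Int, splittable: Boolean): Long =
    if (!splittable) 1L
    else math.ceil(fileBytes.toDouble / maxSplitBytes(fileBytes, defaultParallelism)).toLong
}
{code}

With these defaults, a splittable 14 GiB file yields 112 splits (14 GiB / 128 MiB), in the same ballpark as the 115 partitions reported for Spark 2.0.2; treated as non-splittable, which is what multiLine=true implies, it yields exactly 1, matching the Spark 2.2.0 result.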