[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203600#comment-16203600 ]
Hyukjin Kwon commented on SPARK-22240:
--------------------------------------

I am sorry for the late response.

{quote}
If it's multiline related, do you think this would apply to all filesystems, or is it possible that s3a is making things worse?
{quote}

Yes, I think it applies to all filesystems.

{quote}
Spark 2.0.2 has builtin support for multiline without the option so I guess having only one partition in Spark 2.2 is kind of fail-safe mechanism and we are just lucky to have never encountered any problems with our files when reading multiline records in Spark 2.0.2.
{quote}

Yea, that's true. The support there did not take into account the case where a row spans multiple blocks. To cover that case, the {{multiLine}} option was introduced, which is consistent with the JSON option.

{quote}
If it's any help I also tried with s3n and it's the same thing. Didn't try with HDFS as I am only interacting with S3 at the moment. If I have a moment I shall try.
{quote}

{quote}
We've got a test in HADOOP-14943 which looks @ part sizing, confirms you only get one part back from s3a right now.
{quote}

A few separate issues could be conflated here. I would suggest testing this without the {{multiLine}} option, [~artb]. I believe that is the easiest way to narrow down the issue. If my understanding is correct, does this still return a single partition even when {{multiLine}} is disabled?

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>        Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
>
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
>
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a
> property that should be set to have the number of partitions computed
> correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark
> 2.0.2; it's not relevant here.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
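To make the suggested check concrete, a minimal spark-shell sketch of the diagnostic Hyukjin proposes: repeat the read from the report with {{multiLine}} disabled and compare partition counts. The {{s3a://<s3_path>}} placeholder is the one from the report above, and the partition counts you get will depend on your file size and filesystem block/split size; nothing here is a confirmed result.

{code:java}
// Same read as in the report, but with multiLine turned off.
// If getNumPartitions is still 1 without multiLine, the problem is not
// limited to the multiLine code path; if it now matches the Spark 2.0.2
// behaviour (many partitions), the single partition is expected, because
// a multiLine CSV file cannot be split on block boundaries.
scala> val input = spark.read.format("csv")
         .option("header", "true")
         .option("delimiter", "|")
         .option("multiLine", "false")
         .load("s3a://<s3_path>")

scala> input.rdd.getNumPartitions
{code}

If the single partition turns out to be the expected multiLine behaviour, an explicit {{input.repartition(n)}} after the read is one way to restore parallelism for downstream stages, at the cost of a shuffle.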