[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217398#comment-16217398 ]
Steve Loughran commented on SPARK-22240:
----------------------------------------

No, Spark 2.2 doesn't fix this. I have to explicitly define the schema as the list of headers -> String, and then reader setup time drops to 6s.

{code}
2017-10-24 19:19:31,593 [ScalaTest-main-running-S3ACommitBulkDataSuite] DEBUG fs.FsUrlStreamHandlerFactory (FsUrlStreamHandlerFactory.java:createURLStreamHandler(107)) - Unknown protocol jar, delegating to default implementation
2017-10-24 19:19:36,402 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO commit.S3ACommitBulkDataSuite (Logging.scala:logInfo(54)) - Duration of set up initial .csv load = 6,195,721,969 nS
{code}

I'm not sure whether this is related to the original bug, though it is potentially part of the issue. What I'm seeing is that either schema inference always takes place, or reading the first line of a .gz file is enough to force reading the entire .gz source file.

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>        Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of partitions correctly in Spark 2.2.0.
>
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
>
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
>
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a property that should be set to have the number of partitions computed correctly.
>
> I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2; that is not relevant here.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
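The workaround the comment describes, supplying the schema up front as header -> String so no inference pass is needed, can be sketched roughly as below. This is a hedged illustration, not the suite's actual code: the header list, app name, and S3 path are placeholders (the real file has 38 columns and the real path is elided in the report).

{code:java}
// Sketch: build an all-String schema from known header names so that
// spark.read does not scan the (possibly gzipped) file to infer types.
// Header names and the load path are illustrative placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ExplicitCsvSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explicit-csv-schema").getOrCreate()

    // Headers known ahead of time; every column declared as nullable String.
    val headers = Seq("PARTY_KEY", "ROW_START_DATE") // ...plus the remaining fields
    val schema = StructType(headers.map(StructField(_, StringType, nullable = true)))

    val input = spark.read
      .format("csv")
      .option("header", "true")
      .option("delimiter", "|")
      .schema(schema)           // explicit schema: no inference pass over the data
      .load("s3a://<s3_path>")  // placeholder path, as in the original report

    input.show(5)
    spark.stop()
  }
}
{code}

With an explicit schema the reader only needs the header row for column alignment; without one, inferSchema (or even header sampling against a .gz source, which is not splittable) can trigger a full read of the file before the job starts.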