[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217398#comment-16217398 ]

Steve Loughran commented on SPARK-22240:
----------------------------------------

No, Spark 2.2 doesn't fix this. I have to explicitly define the schema as the
list of headers -> String, and then reader setup time drops to 6s.
{code}
2017-10-24 19:19:31,593 [ScalaTest-main-running-S3ACommitBulkDataSuite] DEBUG fs.FsUrlStreamHandlerFactory (FsUrlStreamHandlerFactory.java:createURLStreamHandler(107)) - Unknown protocol jar, delegating to default implementation
2017-10-24 19:19:36,402 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  commit.S3ACommitBulkDataSuite (Logging.scala:logInfo(54)) - Duration of set up initial .csv load = 6,195,721,969 nS
{code}
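
For anyone trying the same workaround, a minimal sketch of what "headers -> String" looks like. Only the two column names visible in the report are used here; the rest are placeholders to fill in:
{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// All-string schema built from the known header names, so the reader
// never has to sample the data to infer column types.
val headers = Seq("PARTY_KEY", "ROW_START_DATE")  // ...plus the remaining columns
val schema = StructType(headers.map(name => StructField(name, StringType, nullable = true)))

val input = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .schema(schema)   // explicit schema: skips the inference pass over the file
  .load("s3a://<s3_path>")
{code}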

I'm not sure if this is related to the original bug, though it is potentially
part of the issue: what I'm seeing is that either schema inference always
takes place, or reading the first line of a .gz file is enough to force a
read of the entire .gz source file.
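
A quick way to narrow that down would be to time reader setup with and without an explicit schema (a throwaway sketch; `timed` is just a local helper, and `schema` is the all-string schema from the block above):
{code}
// Throwaway timing helper: measures how long DataFrame setup takes.
def timed[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label%s: ${(System.nanoTime() - t0) / 1e9}%.1f s")
  result
}

// If inference (or just reading the header line) forces a full pass over
// the .gz file, the first call should take far longer than the second.
timed("inferred schema") {
  spark.read.option("header", "true").option("inferSchema", "true").csv("s3a://<s3_path>")
}
timed("explicit schema") {
  spark.read.option("header", "true").schema(schema).csv("s3a://<s3_path>")
}
{code}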

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a
> property that should be set to have the number of partitions computed
> correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark
> 2.0.2; that's not relevant here.


