[ https://issues.apache.org/jira/browse/SPARK-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2119:
------------------------------------
    Target Version/s: 1.1.0

> Reading Parquet InputSplits dominates query execution time when reading off S3
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-2119
>                 URL: https://issues.apache.org/jira/browse/SPARK-2119
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.0
>            Reporter: Michael Armbrust
>            Assignee: Cheng Lian
>            Priority: Critical
>
> Here's the relevant stack trace where things are hanging:
> {code}
>       at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
>       at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
>       at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
>       at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> {code}
> We should parallelize or cache or something here.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
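
For illustration only, here is a rough sketch of what "parallelize or cache" could look like for a slow per-path lookup such as the getFileStatus frame at the top of the trace. It is not the fix that went into Spark or Parquet and it uses no Hadoop or Parquet APIs. The class name CachedParallelStatusLookup, the fetcher function, and the parallelism parameter are all hypothetical; fetcher stands in for whatever remote call is slow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/**
 * Hypothetical helper: runs a slow per-path lookup on a thread pool and
 * memoizes the result so each distinct path is fetched at most once.
 */
public class CachedParallelStatusLookup<S> implements AutoCloseable {
    // Assumption: stands in for a slow remote call (e.g. a file status request).
    private final Function<String, S> fetcher;
    // Caches the in-flight or completed future per path, not the raw value,
    // so concurrent requests for the same path share a single fetch.
    private final Map<String, CompletableFuture<S>> cache = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    public CachedParallelStatusLookup(Function<String, S> fetcher, int parallelism) {
        this.fetcher = fetcher;
        this.pool = Executors.newFixedThreadPool(parallelism);
    }

    /** Returns one result per input path, in input order. */
    public List<S> lookupAll(List<String> paths) {
        List<CompletableFuture<S>> futures = new ArrayList<>();
        for (String path : paths) {
            // Only the cheap task submission happens inside computeIfAbsent;
            // the slow fetch itself runs on the pool.
            futures.add(cache.computeIfAbsent(path,
                    k -> CompletableFuture.supplyAsync(() -> fetcher.apply(k), pool)));
        }
        List<S> results = new ArrayList<>();
        for (CompletableFuture<S> future : futures) {
            // join() rethrows a fetch failure as CompletionException.
            // Note: in this sketch a failed future stays in the cache.
            results.add(future.join());
        }
        return results;
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

With N distinct paths and a pool of size P, the wall-clock cost drops from roughly N sequential round trips to about N/P, and repeated paths (or repeated calls on the same instance) are answered from the cache. Whether cached results can be trusted depends on how often the underlying files change, which this sketch does not address.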