[ https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970286#comment-15970286 ]
Hyukjin Kwon commented on SPARK-12677:
--------------------------------------

Please reopen this if anyone still objects to the current behaviour. I am only resolving it out of respect for the decision made in that PR, and there has been no interest in this issue for quite a long time.

> Lazy file discovery for parquet
> -------------------------------
>
>                 Key: SPARK-12677
>                 URL: https://issues.apache.org/jira/browse/SPARK-12677
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>            Reporter: Tiago Albineli Motta
>            Priority: Minor
>              Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies that the files are readable, but reading them is lazy. By the time the job actually runs, the files may no longer exist or may have changed, and we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not exist: hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-00003-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
> at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> Maybe this could be avoided if sqlContext.read.parquet could receive a function that discovers the files instead, like this: sqlContext.read.parquet( () => files )
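
As a rough illustration of the reporter's idea, here is a user-side sketch rather than anything Spark provides: a hypothetical helper, readParquetFresh, that re-lists the directory immediately before calling sqlContext.read.parquet, so the paths handed to the reader reflect what exists at that moment. The helper name and directory path are assumptions for illustration; this narrows the race window between file discovery and execution but does not make the executors' reads themselves lazy.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical user-side helper (not part of Spark): re-discover the parquet
// files right before the read, so paths deleted since the job was planned are
// never handed to sqlContext.read.parquet in the first place.
def readParquetFresh(sqlContext: SQLContext, dir: String): DataFrame = {
  val path = new Path(dir)
  val fs: FileSystem = path.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
  val files = fs.listStatus(path)               // list what exists *now*
    .map(_.getPath.toString)
    .filter(_.endsWith(".parquet"))
  sqlContext.read.parquet(files: _*)            // footer check runs on fresh paths
}

// Usage (path is illustrative): call this as close as possible to the action
// that actually scans the data, e.g. count() or a write, to minimise the
// window in which the underlying files can change.
// val df = readParquetFresh(sqlContext, "hdfs://mynamenode.com:8020/rec/prefs/2016/01/06")
{code}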