Tiago Albineli Motta created SPARK-12677:
--------------------------------------------

             Summary: Lazy file discovery for parquet
                 Key: SPARK-12677
                 URL: https://issues.apache.org/jira/browse/SPARK-12677
             Project: Spark
          Issue Type: Wish
          Components: SQL
            Reporter: Tiago Albineli Motta
            Priority: Minor


When using sqlContext.read.parquet(files: _*), the driver verifies up front that the files are ok, but actually reading them is lazy. By the time the job runs, some of the files may no longer exist, or may have changed, so we receive this error message:

{noformat}
16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not exist: hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-00003-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
        at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
        at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
        at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
        at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
        at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
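
A minimal sketch of the pattern that opens this window (the paths and the action are hypothetical, for illustration only):

{code:scala}
// File list and footers are resolved eagerly, here on the driver.
val files = Seq(
  "hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-00000.gz.parquet",
  "hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-00001.gz.parquet")
val df = sqlContext.read.parquet(files: _*)

// ... time passes; a retention or compaction job may delete part files ...

// Only now are the files actually read; if one is gone, the task throws
// java.io.FileNotFoundException as in the stack trace above.
df.count()
{code}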

Maybe this could be avoided if sqlContext.read.parquet could receive a function that discovers the files, so the file list is resolved only when the read actually happens. Like this: sqlContext.read.parquet( () => files )
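
Until something like that exists, a user-side approximation of the idea (the helper below is hypothetical, not existing Spark API) is to defer discovery until just before the job is triggered:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: re-run file discovery and build a fresh DataFrame
// right before each action, shrinking the window between listing the
// files and reading them. It does not remove the race, only narrows it.
def freshParquet(sqlContext: SQLContext, dir: String): DataFrame = {
  val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
  val files = fs.listStatus(new Path(dir))
    .map(_.getPath.toString)
    .filter(_.endsWith(".parquet"))
  sqlContext.read.parquet(files: _*)
}

// Usage: discovery happens here, immediately before the action.
freshParquet(sqlContext, "hdfs://mynamenode.com:8020/rec/prefs/2016/01/06").count()
{code}

A proper fix would presumably thread the discovery function into the relation itself, so it is evaluated at planning/execution time rather than when read.parquet is called.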

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
