Thanks, Cheng. We have a workaround in place for Spark 1.3 (remove the .metadata directory); good to know it will be resolved in 1.4.
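
For anyone else who hits this on 1.3, a minimal sketch of the workaround from a spark-shell session (the HDFS path is the one from the trace quoted below; adjust for your own layout):

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Remove the .metadata directory (Avro schemas, descriptor.properties)
  // before Spark 1.3 scans the directory for Parquet footers.
  // The path is illustrative -- taken from the trace quoted below.
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val metadata = new Path("/user/ddrak/parq_dir/.metadata")
  if (fs.exists(metadata)) {
    fs.delete(metadata, true) // recursive delete
  }

  val d = sqlContext.parquetFile("/user/ddrak/parq_dir")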
-Don

On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian <[email protected]> wrote:

> This issue has been fixed recently in Spark 1.4:
> https://github.com/apache/spark/pull/6581
>
> Cheng
>
> On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
>
> I talked to Don outside the list, and he says he's seeing this issue
> with Apache Spark 1.3 too (not just CDH Spark), so it seems like there
> is a real issue here.
>
> On Wed, Jun 3, 2015 at 1:39 PM, Don Drake <[email protected]> wrote:
>
>> While upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I noticed that
>> Spark behaves differently when reading Parquet directories that
>> contain a .metadata directory.
>>
>> Spark 1.2.x would simply ignore the .metadata directory, but now that
>> I'm using Spark 1.3, reading these files raises the following
>> exceptions:
>>
>> scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")
>>
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>
>> scala.collection.parallel.CompositeThrowable: Multiple exceptions
>> thrown during a parallel computation:
>>
>> java.lang.RuntimeException:
>> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a
>> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
>> [116, 34, 10, 125]
>>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
>>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>>   scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>>   scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>>   scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>>   ...
>>
>> java.lang.RuntimeException:
>> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is
>> not a Parquet file. expected magic number at tail [80, 65, 82, 49] but
>> found [116, 34, 10, 125]
>>   (same stack frames as above)
>>   ...
>>
>> java.lang.RuntimeException:
>> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties
>> is not a Parquet file. expected magic number at tail [80, 65, 82, 49]
>> but found [117, 101, 116, 10]
>>   (same stack frames as above)
>>   ...
>>
>>   at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
>>   at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
>>   at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
>>   at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
>>   at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
>>   at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
>>   at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>>   at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>>   at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>>   at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> When I remove the .metadata directory, Spark reads these Parquet
>> files just fine.
>>
>> I feel that Spark should ignore dot files/directories when attempting
>> to read Parquet files. I'm seeing this in CDH 5.4.2 (Spark 1.3.0 +
>> patches).
>>
>> Thoughts?
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> http://www.MailLaunder.com/
>> 800-733-2143
>
> --
> Marcelo

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143
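
P.S. For context on the error message: [80, 65, 82, 49] is ASCII for "PAR1", the magic bytes Parquet expects at the tail of every file, so any non-Parquet file (an .avsc schema, a .properties file) naturally fails the footer check. The 1.4 fix linked above makes the directory scan skip such entries; conceptually it amounts to a predicate like the following (a sketch of the idea only, not the actual patch -- see PR 6581 for the real change):

  import org.apache.hadoop.fs.Path

  // Rough idea of the filtering applied when collecting Parquet
  // footers: skip hidden entries such as .metadata so only real
  // data files are checked for the PAR1 magic bytes.
  def shouldScan(path: Path): Boolean =
    !path.getName.startsWith(".")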
