Re: Skip Corrupted Parquet blocks / footer.
After checking the code, I think there are a few issues with the ignoreCorruptFiles config, so you can't actually use it with Parquet files right now. I opened a JIRA, https://issues.apache.org/jira/browse/SPARK-19082, and also submitted a PR for it.

- Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tp20418p20466.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Skip Corrupted Parquet blocks / footer.
I forgot to mention: another option is to replace readAllFootersInParallel with our own parallel reading logic, so that we can ignore corrupt files.

- Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
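The idea of custom parallel reading logic that can skip corrupt files might look roughly like the sketch below. It is only an illustration of the pattern, not the code from the Spark PR: `ParallelFooterRead`, `readAll`, and the `readFooter` function parameter are hypothetical names, and the caller supplies whatever actually reads a footer.

```scala
import java.io.IOException

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of Spark or Parquet.
object ParallelFooterRead {

  /**
   * Reads the footer of every path in parallel.
   * When `ignoreCorruptFiles` is true, paths whose read fails are dropped;
   * otherwise the first failure (in path order) is rethrown as an IOException.
   */
  def readAll[A](paths: Seq[String], ignoreCorruptFiles: Boolean)(readFooter: String => A)
                (implicit ec: ExecutionContext): Seq[(String, A)] = {
    // Wrap each read in Try so that one corrupt file cannot fail the combined future.
    val attempts = paths.map(p => Future(p -> Try(readFooter(p))))
    val results = Await.result(Future.sequence(attempts), Duration.Inf)

    results.flatMap {
      case (p, Success(footer))                  => Some(p -> footer)
      case (_, Failure(_)) if ignoreCorruptFiles => None
      case (p, Failure(e))                       => throw new IOException(s"Could not read footer: $p", e)
    }
  }
}
```

Because each read is wrapped in `Try` before the futures are combined, the decision to skip or fail is made once, after all reads finish, which is the control that readAllFootersInParallel does not offer.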
Re: Skip Corrupted Parquet blocks / footer.
Hi,

The method readAllFootersInParallel is implemented in Parquet's ParquetFileReader, so the Spark config "spark.sql.files.ignoreCorruptFiles" doesn't apply to it.

Reading all footers in parallel speeds up the task. However, we can't control whether it ignores corrupt files or not.

Of course, we could read the footers sequentially and ignore the corrupt ones, but that might be inefficient. Since this is a relatively rare corner case, I don't expect we will get this.

Maybe Parquet can implement an option to ignore corrupt files. Even so, we can't expect an updated Parquet implementation to be available to Spark very soon.

- Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
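The sequential alternative mentioned above (read each footer in turn and ignore the corrupt ones) can be sketched as follows. This is a minimal illustration under stated assumptions: `SequentialFooterScan`, `scan`, and the `read` function parameter are hypothetical names, and nothing here is taken from Spark or Parquet.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of Spark or Parquet.
object SequentialFooterScan {

  /**
   * Applies `read` to each path in order and splits the outcome into
   * readable paths (with their result) and unreadable paths (with the error).
   */
  def scan[A](paths: Seq[String])(read: String => A): (Seq[(String, A)], Seq[(String, Throwable)]) = {
    val attempts = paths.map(p => p -> Try(read(p)))
    val readable = attempts.collect { case (p, Success(a)) => p -> a }
    val corrupt  = attempts.collect { case (p, Failure(e)) => p -> e }
    (readable, corrupt)
  }
}
```

As a possible workaround on the user side, `read` could be something like `p => sqlContext.read.parquet(p).schema`, and the readable paths could then be passed to a single `sqlContext.read.parquet(...)` call. That costs one extra read attempt per file, which is the inefficiency the message refers to.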
Re: Skip Corrupted Parquet blocks / footer.
Yes! I am using Spark 2.1.0. I hope the command I used to set the conf is correct:

    sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")

or

    sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")
Re: Skip Corrupted Parquet blocks / footer.
Yes! I am using Spark 2.1. I hope I am using the right syntax for setting the conf:

    sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")

or

    sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")

On 04/01/2017 1:43 a.m. (GMT+05:30), Ryan Blue wrote:
> Are you using Spark 2.1? The usual entry point for Spark 2.x is spark
> rather than sqlContext.
Re: Skip Corrupted Parquet blocks / footer.
Khyati,

Are you using Spark 2.1? The usual entry point for Spark 2.x is spark rather than sqlContext.

rb

On Tue, Jan 3, 2017 at 11:03 AM, khyati wrote:
> I tried setting spark.sql.files.ignoreCorruptFiles = true by using
> commands,
> [...]
> Please let me know if I am missing anything.

--
Ryan Blue
Software Engineer
Netflix
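For reference, the `spark` entry point is a SparkSession. A minimal sketch of the same setup through that entry point is shown below; the application name is a made-up placeholder, and the paths are the ones from the quoted message. Note that this only shows where the setting goes: according to other replies in this thread, the setting did not take effect for Parquet footer reads in 2.1.0 (see SPARK-19082).

```scala
import org.apache.spark.sql.SparkSession

object IgnoreCorruptFilesExample {
  def main(args: Array[String]): Unit = {
    // In spark-shell a session named `spark` already exists; this builder
    // is only needed in a standalone application.
    val spark = SparkSession.builder()
      .appName("ignore-corrupt-files") // hypothetical application name
      .enableHiveSupport()             // Spark 2.x counterpart of `new HiveContext(sc)`; needs Hive classes on the classpath
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    // The same setting can also be changed on a running session.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    val newDataDF = spark.read.parquet(
      "/data/tempparquetdata/corruptblock.0",
      "/data/tempparquetdata/data1.parquet")
    newDataDF.show()

    spark.stop()
  }
}
```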
Re: Skip Corrupted Parquet blocks / footer.
Hi Reynold Xin,

I tried setting spark.sql.files.ignoreCorruptFiles = true by using the commands

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")

or

    sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")

but I am still getting an error while reading Parquet files using

    val newDataDF = sqlContext.read.parquet("/data/tempparquetdata/corruptblock.0", "/data/tempparquetdata/data1.parquet")

Error:

    ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
    java.io.IOException: Could not read footer: java.lang.RuntimeException: hdfs://192.168.1.53:9000/data/tempparquetdata/corruptblock.0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [65, 82, 49, 10]
        at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)

Please let me know if I am missing anything.
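The byte values in the error are ASCII codes: [80, 65, 82, 49] spells "PAR1", the marker a Parquet file ends with, while [65, 82, 49, 10] is "AR1" followed by a line feed. That pattern would be consistent with one extra newline byte having been appended to the file, although the message alone does not prove it. The sketch below shows the same tail check for a local file; `ParquetMagicCheck` and its methods are hypothetical names, and a file on HDFS would need the Hadoop FileSystem API instead of `RandomAccessFile`.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical helper for local files only; not part of Spark or Parquet.
object ParquetMagicCheck {
  // "PAR1" in ASCII is the byte sequence 80, 65, 82, 49.
  val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** Returns the last four bytes of the file, or None if the file is shorter than that. */
  def tailBytes(path: String): Option[Array[Byte]] = {
    val file = new RandomAccessFile(path, "r")
    try {
      if (file.length() < Magic.length) None
      else {
        val buf = new Array[Byte](Magic.length)
        file.seek(file.length() - Magic.length)
        file.readFully(buf)
        Some(buf)
      }
    } finally file.close()
  }

  /** True when the file ends with the Parquet magic bytes. */
  def hasParquetMagic(path: String): Boolean =
    tailBytes(path).exists(bytes => Arrays.equals(bytes, Magic))
}
```

Passing this check only means the trailing marker is present; the footer metadata or the data blocks can still be damaged, so it is a cheap pre-filter rather than a validity test.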
Re: Skip Corrupted Parquet blocks / footer.
In Spark 2.1, set spark.sql.files.ignoreCorruptFiles to true.

On Sun, Jan 1, 2017 at 1:11 PM, khyati wrote:
> Hi,
>
> I am trying to read the multiple parquet files in sparksql. In one dir there
> are two files, of which one is corrupted. While trying to read these files,
> sparksql throws Exception for the corrupted file.
>
> val newDataDF = sqlContext.read.parquet("/data/testdir/data1.parquet","/data/testdir/corruptblock.0")
> newDataDF.show
>
> throws Exception.
>
> Is there any way to just skip the file having corrupted block/footer and
> just read the file/files which are proper?
>
> Thanks
Re: Skip Corrupted Parquet blocks / footer.
You will have to change the metadata file under the _spark_metadata folder to remove the listing of corrupt files.

Thanks,
Shobhit G

On Dec 31, 2016, at 8:11 PM, khyati [via Apache Spark Developers List] wrote:
> [...]
> Is there any way to just skip the file having corrupted block/footer and just
> read the file/files which are proper?

- Regards,
Abhi