Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh

After checking the code, I think there are a few issues with this
ignoreCorruptFiles config, so you can't actually use it with Parquet files
right now.

I opened a JIRA https://issues.apache.org/jira/browse/SPARK-19082 and also
submitted a PR for it.


-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh


Forgot to say: another option is to replace readAllFootersInParallel with
our own parallel reading logic, so we can ignore corrupt files.
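
A rough sketch of what such logic could look like (not the actual PR; it
assumes parquet-hadoop's single-file ParquetFileReader.readFooter and the
Scala 2.11 parallel collections that Spark 2.1 ships with):

import scala.util.{Failure, Success, Try}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// Read each footer independently and in parallel, dropping files whose
// footer can't be parsed instead of failing the whole job.
def readFootersIgnoringCorrupt(
    conf: Configuration,
    paths: Seq[Path]): Seq[(Path, ParquetMetadata)] = {
  paths.par.flatMap { path =>
    Try(ParquetFileReader.readFooter(conf, path)) match {
      case Success(footer) => Some(path -> footer)
      case Failure(e) =>
        // The "expected magic number at tail" error ends up here.
        System.err.println(s"Skipping corrupt file $path: ${e.getMessage}")
        None
    }
  }.seq
}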







-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh

Hi,

The method readAllFootersInParallel is implemented in Parquet's
ParquetFileReader, so the Spark config "spark.sql.files.ignoreCorruptFiles"
doesn't apply to it.

Reading all footers in parallel speeds up the task, but we can't control
whether corrupt files are ignored.

Of course we could read the footers sequentially and ignore the corrupt
ones, but that would be inefficient. Since this is a relatively rare corner
case, I don't expect we'll do that.

Parquet could implement an option to ignore corrupt files, but even so, we
can't expect an updated Parquet release to be available to Spark very soon.
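
For what it's worth, a minimal sketch of that sequential fallback, again
assuming parquet-hadoop's single-file ParquetFileReader.readFooter:

import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val conf = new Configuration()
val paths = Seq(
  "/data/tempparquetdata/corruptblock.0",
  "/data/tempparquetdata/data1.parquet").map(new Path(_))

// Read the footers one at a time; a file whose footer fails to parse
// simply drops out of the result instead of aborting the whole read.
val footers = paths.flatMap(p => Try(ParquetFileReader.readFooter(conf, p)).toOption)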








-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread khyati
Yes! I'm using Spark 2.1.0. I hope the commands used to set the conf are correct.

sqlContext.setConf("spark.sql.files.ignoreCorruptFiles","true") /
sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")
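
One quick way to double-check that the setting took effect (a sketch; both
calls should echo the key set to "true"):

sqlContext.getConf("spark.sql.files.ignoreCorruptFiles")
sqlContext.sql("SET spark.sql.files.ignoreCorruptFiles").show(false)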








Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread khyati.shah
Yes! I'm using Spark 2.1. I hope I am using the right syntax for setting the conf.

sqlContext.setConf("spark.sql.files.ignoreCorruptFiles","true") /
sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")



Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread Ryan Blue
Khyati,

Are you using Spark 2.1? The usual entry point for Spark 2.x is spark
rather than sqlContext.

rb
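
For instance, in the 2.x shell (a sketch using the built-in spark session
and one of the paths from your mail):

// In Spark 2.x the SparkSession ("spark" in the shell) replaces
// SQLContext/HiveContext as the entry point, so set the conf on it:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val newDataDF = spark.read.parquet("/data/tempparquetdata/data1.parquet")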


-- 
Ryan Blue
Software Engineer
Netflix


Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread khyati
Hi Reynold Xin,

I tried setting spark.sql.files.ignoreCorruptFiles = true by using the commands

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.setConf("spark.sql.files.ignoreCorruptFiles","true") /
sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")

but I am still getting an error while reading the Parquet files using
val newDataDF =
sqlContext.read.parquet("/data/tempparquetdata/corruptblock.0","/data/tempparquetdata/data1.parquet")

Error: ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.io.IOException: Could not read footer: java.lang.RuntimeException:
hdfs://192.168.1.53:9000/data/tempparquetdata/corruptblock.0 is not a
Parquet file. expected magic number at tail [80, 65, 82, 49] but found [65,
82, 49, 10]
at
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)


Please let me know if I am missing anything.







Re: Skip Corrupted Parquet blocks / footer.

2017-01-01 Thread Reynold Xin
In Spark 2.1, set spark.sql.files.ignoreCorruptFiles to true.
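
A minimal sketch with the paths from the original mail (though, as the
replies above show, the flag did not yet cover Parquet footer reading in
2.1.0):

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// With the flag honored, the corrupt file is skipped and only the rows of
// the readable file come back.
val newDataDF = spark.read.parquet("/data/testdir/data1.parquet",
  "/data/testdir/corruptblock.0")
newDataDF.show()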

On Sun, Jan 1, 2017 at 1:11 PM, khyati  wrote:

> Hi,
>
> I am trying to read multiple Parquet files in Spark SQL. In one dir there
> are two files, of which one is corrupted. While trying to read these files,
> Spark SQL throws an exception for the corrupted file.
>
> val newDataDF =
> sqlContext.read.parquet("/data/testdir/data1.parquet","/data/testdir/corruptblock.0")
> newDataDF.show
>
> throws an exception.
>
> Is there any way to just skip the file having a corrupted block/footer and
> read only the files that are proper?
>
> Thanks
>


Re: Skip Corrupted Parquet blocks / footer.

2017-01-01 Thread Abhishek
You will have to edit the metadata file under the _spark_metadata folder to
remove the listing of the corrupt files.
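
A heavily hedged sketch of that clean-up: the exact layout of the
_spark_metadata log is an implementation detail of the streaming file sink,
so this just drops the entries that mention the corrupt file by name (the
path is hypothetical):

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Hypothetical location of one batch file in the sink's metadata log.
val metadataFile = Paths.get("/data/testdir/_spark_metadata/0")

// Keep every entry except those referencing the corrupt data file.
val kept = Files.readAllLines(metadataFile).asScala
  .filterNot(_.contains("corruptblock.0"))
Files.write(metadataFile, kept.asJava)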

Thanks,
Shobhit G 






-
Regards, 
Abhi