Re: SparkSQL overwrite parquet file does not generate _common_metadata
I'm using 1.0.4.

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian <lian.cs@gmail.com> wrote:

> Hm, which version of Hadoop are you using? Actually, there should also be a _metadata file together with _common_metadata. I was using Hadoop 2.4.1, btw. I'm not sure whether the Hadoop version matters here, but I have observed cases where Spark behaves differently because of semantic differences in the same API across Hadoop versions.
>
> Cheng
>
> [...]
Re: SparkSQL overwrite parquet file does not generate _common_metadata
Thanks for the information. Verified that the _common_metadata and _metadata files are missing in this case when using Hadoop 1.0.4. Would you mind opening a JIRA for this?

Cheng

On 3/27/15 2:40 PM, Pei-Lun Lee wrote:

> I'm using 1.0.4.
>
> Thanks,
> --
> Pei-Lun
>
> [...]
Re: SparkSQL overwrite parquet file does not generate _common_metadata
JIRA ticket created at: https://issues.apache.org/jira/browse/SPARK-6581

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian <lian.cs@gmail.com> wrote:

> Thanks for the information. Verified that the _common_metadata and _metadata files are missing in this case when using Hadoop 1.0.4. Would you mind opening a JIRA for this?
>
> Cheng
>
> [...]
Re: SparkSQL overwrite parquet file does not generate _common_metadata
I couldn't reproduce this with the following spark-shell snippet:

    scala> import sqlContext.implicits._
    scala> Seq((1, 2)).toDF("a", "b")
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file is typically much smaller than _metadata because it doesn't contain row group information, and it can thus be faster to read than _metadata.

Cheng

On 3/26/15 12:48 PM, Pei-Lun Lee wrote:

> Hi,
>
> When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether or not it overwrites an existing dir. Is this expected behavior? And what is the benefit of _common_metadata? Does reading perform better when it is present?
>
> Thanks,
> --
> Pei-Lun
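As a quick way to see which of the side files discussed above a given save actually produced, here is a minimal plain-Scala sketch (no Spark required; the directory path and the `summaryFiles` helper name are illustrative assumptions, not Spark API):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Report which Parquet side files are present in an output directory.
// _metadata carries the footers of every row group, while
// _common_metadata carries only the merged schema, which is why it is
// smaller and cheaper to read.
def summaryFiles(dir: String): Set[String] = {
  val sideFiles = Set("_SUCCESS", "_metadata", "_common_metadata")
  Files.list(Paths.get(dir)).iterator().asScala
    .map(_.getFileName.toString)
    .filter(sideFiles) // a Set is a predicate: keep only side files
    .toSet
}
```

Running it against the two listings reported in this thread would return a set containing `_common_metadata` for the plain `save("xxx")` case but not for the `SaveMode.Overwrite` case under Hadoop 1.0.4.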
Re: SparkSQL overwrite parquet file does not generate _common_metadata
Hi Cheng,

On my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:

    peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
    total 32
    -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
    -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

while res0.save("xxx") produces:

    peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
    total 40
    -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
    -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
    -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian <lian.cs@gmail.com> wrote:

> I couldn't reproduce this with the following spark-shell snippet:
>
>     scala> import sqlContext.implicits._
>     scala> Seq((1, 2)).toDF("a", "b")
>     scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>     scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>
> The _common_metadata file is typically much smaller than _metadata because it doesn't contain row group information, and it can thus be faster to read than _metadata.
>
> Cheng
>
> [...]