Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
I'm using Hadoop 1.0.4

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote:

  Hm, which version of Hadoop are you using? Actually there should also be
 a _metadata file together with _common_metadata. I was using Hadoop 2.4.1
 btw. I'm not sure whether Hadoop version matters here, but I did observe
 cases where Spark behaves differently because of semantic differences of
 the same API in different Hadoop versions.

 Cheng

 On 3/27/15 11:33 AM, Pei-Lun Lee wrote:

 Hi Cheng,

  on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:

  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
 total 32
 -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

  while res0.save("xxx") produces:

  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
 total 40
 -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

 On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian lian.cs@gmail.com wrote:

  I couldn’t reproduce this with the following spark-shell snippet:

 scala> import sqlContext.implicits._
 scala> Seq((1, 2)).toDF("a", "b")
 scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
 scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

 The _common_metadata file is typically much smaller than _metadata,
 because it doesn’t contain row group information, and thus can be faster to
 read than _metadata.

 Cheng

 On 3/26/15 12:48 PM, Pei-Lun Lee wrote:

 Hi,

  When I save a parquet file with SaveMode.Overwrite, it never generates
 _common_metadata, whether it overwrites an existing dir or not.
 Is this expected behavior?
 And what is the benefit of _common_metadata? Will reading perform better
 when it is present?

  Thanks,
 --
 Pei-Lun

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Cheng Lian
Thanks for the information. Verified that the _common_metadata and
_metadata files are missing in this case when using Hadoop 1.0.4. Would
you mind opening a JIRA for this?


Cheng

On 3/27/15 2:40 PM, Pei-Lun Lee wrote:

I'm using Hadoop 1.0.4

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote:


Hm, which version of Hadoop are you using? Actually there should
also be a _metadata file together with _common_metadata. I was
using Hadoop 2.4.1 btw. I'm not sure whether Hadoop version
matters here, but I did observe cases where Spark behaves
differently because of semantic differences of the same API in
different Hadoop versions.

Cheng

On 3/27/15 11:33 AM, Pei-Lun Lee wrote:

Hi Cheng,

on my computer, executing res0.save("xxx",
org.apache.spark.sql.SaveMode.Overwrite) produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 32
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

while res0.save("xxx") produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 40
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian lian.cs@gmail.com wrote:

I couldn’t reproduce this with the following spark-shell snippet:

scala> import sqlContext.implicits._
scala> Seq((1, 2)).toDF("a", "b")
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file is typically much smaller than
_metadata, because it doesn’t contain row group information,
and thus can be faster to read than _metadata.

Cheng

On 3/26/15 12:48 PM, Pei-Lun Lee wrote:


Hi,

When I save a parquet file with SaveMode.Overwrite, it never
generates _common_metadata, whether it overwrites an existing
dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading
perform better when it is present?

Thanks,
--
Pei-Lun

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
JIRA ticket created at:
https://issues.apache.org/jira/browse/SPARK-6581

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian lian.cs@gmail.com wrote:

  Thanks for the information. Verified that the _common_metadata and
 _metadata files are missing in this case when using Hadoop 1.0.4. Would
 you mind opening a JIRA for this?

 Cheng

 On 3/27/15 2:40 PM, Pei-Lun Lee wrote:

 I'm using Hadoop 1.0.4

  Thanks,
 --
 Pei-Lun

 On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote:

  Hm, which version of Hadoop are you using? Actually there should also
 be a _metadata file together with _common_metadata. I was using Hadoop
 2.4.1 btw. I'm not sure whether Hadoop version matters here, but I did
 observe cases where Spark behaves differently because of semantic
 differences of the same API in different Hadoop versions.

 Cheng

 On 3/27/15 11:33 AM, Pei-Lun Lee wrote:

 Hi Cheng,

  on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:

  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
 total 32
 -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

  while res0.save("xxx") produces:

  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
 total 40
 -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

 On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  I couldn’t reproduce this with the following spark-shell snippet:

 scala> import sqlContext.implicits._
 scala> Seq((1, 2)).toDF("a", "b")
 scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
 scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

 The _common_metadata file is typically much smaller than _metadata,
 because it doesn’t contain row group information, and thus can be faster to
 read than _metadata.

 Cheng

 On 3/26/15 12:48 PM, Pei-Lun Lee wrote:

 Hi,

  When I save a parquet file with SaveMode.Overwrite, it never generates
 _common_metadata, whether it overwrites an existing dir or not.
 Is this expected behavior?
 And what is the benefit of _common_metadata? Will reading perform
 better when it is present?

  Thanks,
 --
 Pei-Lun

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian

I couldn’t reproduce this with the following spark-shell snippet:

scala> import sqlContext.implicits._
scala> Seq((1, 2)).toDF("a", "b")
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file is typically much smaller than _metadata, 
because it doesn’t contain row group information, and thus can be faster 
to read than _metadata.
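
As a minimal sketch of where the summary files matter when reading the
data back (assuming the same xxx output path as in the snippet above,
and the Spark 1.3 parquetFile reader; schema discovery is where the
smaller _common_metadata footer can save time):

scala> val df = sqlContext.parquetFile("xxx")
scala> df.printSchema()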


Cheng

On 3/26/15 12:48 PM, Pei-Lun Lee wrote:


Hi,

When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.

Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform
better when it is present?


Thanks,
--
Pei-Lun

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
Hi Cheng,

on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 32
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

while res0.save("xxx") produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 40
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian lian.cs@gmail.com wrote:

  I couldn’t reproduce this with the following spark-shell snippet:

 scala> import sqlContext.implicits._
 scala> Seq((1, 2)).toDF("a", "b")
 scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
 scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

 The _common_metadata file is typically much smaller than _metadata,
 because it doesn’t contain row group information, and thus can be faster to
 read than _metadata.

 Cheng

 On 3/26/15 12:48 PM, Pei-Lun Lee wrote:

   Hi,

  When I save a parquet file with SaveMode.Overwrite, it never generates
 _common_metadata, whether it overwrites an existing dir or not.
 Is this expected behavior?
 And what is the benefit of _common_metadata? Will reading perform better
 when it is present?

  Thanks,
 --
 Pei-Lun

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi,

When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform better
when it is present?
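
As a minimal sketch of the save call in question (the sample DataFrame
and the xxx output path are placeholders):

scala> import sqlContext.implicits._
scala> val df = Seq((1, 2)).toDF("a", "b")
scala> df.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)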

Thanks,
--
Pei-Lun