JIRA ticket created at: https://issues.apache.org/jira/browse/SPARK-6581
Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Thanks for the information. Verified that the _common_metadata and
> _metadata files are missing in this case when using Hadoop 1.0.4. Would you
> mind opening a JIRA for this?
>
> Cheng
>
> On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
>
> I'm using 1.0.4
>
> Thanks,
> --
> Pei-Lun
>
> On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hm, which version of Hadoop are you using? Actually there should also
>> be a _metadata file together with _common_metadata. I was using Hadoop
>> 2.4.1, btw. I'm not sure whether the Hadoop version matters here, but I did
>> observe cases where Spark behaves differently because of semantic
>> differences of the same API in different Hadoop versions.
>>
>> Cheng
>>
>> On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
>>
>> Hi Cheng,
>>
>> on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:
>>
>> peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
>> total 32
>> -rwxrwxrwx 1 peilunlee staff   0 Mar 27 11:29 _SUCCESS*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00001.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00002.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00003.parquet*
>> -rwxrwxrwx 1 peilunlee staff 488 Mar 27 11:29 part-r-00004.parquet*
>>
>> while res0.save("xxx") produces:
>>
>> peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
>> total 40
>> -rwxrwxrwx 1 peilunlee staff   0 Mar 27 11:29 _SUCCESS*
>> -rwxrwxrwx 1 peilunlee staff 250 Mar 27 11:29 _common_metadata*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00001.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00002.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00003.parquet*
>> -rwxrwxrwx 1 peilunlee staff 488 Mar 27 11:29 part-r-00004.parquet*
>>
>> On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> I couldn't reproduce this with the following spark-shell snippet:
>>>
>>> scala> import sqlContext.implicits._
>>> scala> Seq((1, 2)).toDF("a", "b")
>>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>>
>>> The _common_metadata file is typically much smaller than _metadata,
>>> because it doesn't contain row group information, and thus can be faster to
>>> read than _metadata.
>>>
>>> Cheng
>>>
>>> On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
>>>
>>> Hi,
>>>
>>> When I save a parquet file with SaveMode.Overwrite, it never generates
>>> _common_metadata, whether it overwrites an existing dir or not.
>>> Is this expected behavior?
>>> And what is the benefit of _common_metadata? Will reading perform
>>> better when it is present?
>>>
>>> Thanks,
>>> --
>>> Pei-Lun
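For anyone comparing their own output directories against the listings in this thread, here is a minimal plain-Scala sketch (no Spark required; the object and method names are hypothetical, not part of any Spark API) that checks which Parquet summary files are present in a directory listing:

```scala
// Hypothetical helper: given the file names in a Parquet output directory,
// report which summary files (_metadata, _common_metadata) are present.
object SummaryCheck {
  private val summaryNames = Set("_metadata", "_common_metadata")

  // Returns the summary files found in the listing, sorted by name.
  def summaryFiles(listing: Seq[String]): Seq[String] =
    listing.filter(summaryNames.contains).sorted

  def main(args: Array[String]): Unit = {
    // Listing shape from the SaveMode.Overwrite case above: no summary files.
    val overwrite = Seq("_SUCCESS", "part-r-00001.parquet", "part-r-00002.parquet")
    // Listing shape from the plain save() case: _common_metadata is present.
    val plainSave = Seq("_SUCCESS", "_common_metadata", "part-r-00001.parquet")
    println(summaryFiles(overwrite))
    println(summaryFiles(plainSave))
  }
}
```

An empty result for the Overwrite listing and `List(_common_metadata)` for the plain-save listing reproduces the difference Pei-Lun observed.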