Re: parquet file doubts

2015-12-08 Thread Cheng Lian
Cc'd the Parquet dev list. I originally meant to discuss this issue on the
Parquet dev list but sent it to the wrong mailing list. However, I think
it's OK to discuss it here, since lots of Spark users use Parquet and this
information should be generally useful.


Comments inlined.

On 12/7/15 10:34 PM, Shushant Arora wrote:

> How do I read it using parquet-tools? When I ran
>
>     hadoop parquet.tools.Main meta prquetfilename
>
> I didn't get any information about min and max values.

I didn't realize that you meant to inspect min/max values, since what you
asked was how to find the version of the Parquet library that was used to
generate the Parquet file.


Currently parquet-tools doesn't print min/max statistics information. 
I'm afraid you'll have to do it programmatically.
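
One way to do it programmatically, sketched in Python with pyarrow rather
than parquet-mr. pyarrow is not mentioned anywhere in this thread and is
assumed to be installed; the helper names collect_min_max and read_min_max
are made up for illustration. The sketch walks the file footer and lists
min/max for every column chunk that carries statistics.

```python
def collect_min_max(metadata):
    """Return (row_group, column_path, min, max) for every column chunk
    that carries min/max statistics.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData.
    """
    rows = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            stats = chunk.statistics  # None when the writer stored no statistics
            if stats is None or not stats.has_min_max:
                continue
            rows.append((rg_index, chunk.path_in_schema, stats.min, stats.max))
    return rows


def read_min_max(path):
    """Open a Parquet file with pyarrow (assumed installed) and collect its statistics."""
    # Imported lazily so collect_min_max itself has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    return collect_min_max(pq.ParquetFile(path).metadata)
```

Calling read_min_max("file.parquet") (the path is a placeholder) should
return one tuple per column chunk that has statistics. Files written by a
library version that predates statistics simply yield an empty list.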

> How can I see the Parquet version of my file? Are min/max statistics
> specific to some Parquet version, or have they been available since the
> beginning?

AFAIK, min/max statistics were added in 1.5.0:
https://github.com/apache/parquet-mr/blob/parquet-1.5.0/parquet-column/src/main/java/parquet/column/statistics/Statistics.java

However, I failed to find the corresponding JIRA ticket or pull request.






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






RE: parquet file doubts

2015-12-07 Thread Singh, Abhijeet
Yes, Parquet has min/max.





Re: parquet file doubts

2015-12-07 Thread Shushant Arora
How do I read it using parquet-tools? When I ran

    hadoop parquet.tools.Main meta prquetfilename

I didn't get any information about min and max values.

How can I see the Parquet version of my file? Are min/max statistics
specific to some Parquet version, or have they been available since the
beginning?




Re: parquet file doubts

2015-12-06 Thread Cheng Lian
cc parquet-dev list (it would be nice to always do so for these general 
questions.)


Cheng

On 12/6/15 3:10 PM, Shushant Arora wrote:

> Hi
>
> I have a few doubts about the Parquet file format.
>
> 1. Does Parquet keep min/max statistics like ORC does? How can I see the
> Parquet version (whether it's 1.1, 1.2, or 1.3) of a Parquet file
> generated using Hive, custom MR, or AvroParquetOutputFormat?

Yes, Parquet also keeps row group statistics. You may check the Parquet 
file using the parquet-meta CLI tool in parquet-tools (see 
https://github.com/Parquet/parquet-mr/issues/321 for details), then look 
for the "creator" field of the file. For programmatic access, check for 
o.a.p.hadoop.metadata.FileMetaData.createdBy.
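
Once the creator / createdBy string is in hand, the writer version can be
pulled out of it. A minimal sketch in Python, assuming the string follows
the convention "<application> version <version> (build <hash>)"; the
function name parse_created_by and the build hash in the example are made
up for illustration, and strings that do not follow the convention are
reported as None rather than guessed at.

```python
import re

# Assumed convention: "<application> version <version> (build <hash>)".
_CREATED_BY = re.compile(
    r"^(?P<app>.+?)\s+version\s+(?P<version>\S+)"
    r"(?:\s+\(build\s+(?P<build>[^)]*)\))?\s*$"
)


def parse_created_by(created_by):
    """Split a created_by string into (application, version, build).

    Returns None when the string does not follow the assumed convention,
    which can happen with files from older or third-party writers.
    """
    match = _CREATED_BY.match(created_by or "")
    if match is None:
        return None
    return match.group("app"), match.group("version"), match.group("build")


# Hypothetical value; a real one comes from the file footer.
print(parse_created_by("parquet-mr version 1.6.0 (build abc123)"))
# prints ('parquet-mr', '1.6.0', 'abc123')
```

Comparing the extracted version against 1.5.0 is then enough to tell
whether the writer could have stored min/max statistics at all.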


> 2. How do I sort Parquet records while generating a Parquet file using
> AvroParquetOutputFormat?

AvroParquetOutputFormat is not a file format; it is only responsible for
converting Avro records to Parquet records. How are you using
AvroParquetOutputFormat? Any example snippets?
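
A toy illustration in plain Python of why record order matters for the
row group statistics discussed above. It does not use
AvroParquetOutputFormat or any Parquet library. It assumes only that the
writer stores records in the order it receives them, so any sorting has to
happen before records reach it. The function row_group_ranges and the key
values are made up for illustration.

```python
def row_group_ranges(values, rows_per_group):
    """Chunk values in write order and return the (min, max) of each chunk,
    mimicking per-row-group min/max statistics."""
    groups = [
        values[i:i + rows_per_group]
        for i in range(0, len(values), rows_per_group)
    ]
    return [(min(group), max(group)) for group in groups]


keys = [42, 7, 19, 3, 88, 61, 25, 50]  # hypothetical sort-key values

# Written in arrival order: the ranges of the two groups overlap.
unsorted_ranges = row_group_ranges(keys, rows_per_group=4)
# Sorted before writing: each group covers a disjoint range.
sorted_ranges = row_group_ranges(sorted(keys), rows_per_group=4)

print(unsorted_ranges)  # prints [(3, 42), (25, 88)]
print(sorted_ranges)    # prints [(3, 25), (42, 88)]
```

With disjoint ranges, a reader that filters on the key can skip whole row
groups by looking at min/max alone; with overlapping ranges it usually
cannot.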


> Thanks



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: parquet file doubts

2015-12-06 Thread Cheng Lian
Oh sorry... At first I meant to cc the spark-user list since Shushant and I
had discussed some Spark-related issues before. Then I realized
that this is a pure Parquet issue, but forgot to change the cc list.
Thanks for pointing this out! Please ignore this thread.


Cheng








Re: parquet file doubts

2015-12-06 Thread Ted Yu
Cheng:
I only see user@spark in the CC.

FYI
