I've cc'd the Parquet dev list. I originally meant to discuss this issue on the Parquet dev list but sent it to the wrong mailing list. That said, I think it's fine to discuss it here, since many Spark users work with Parquet and this information should be generally useful.

Comments inline.

On 12/7/15 10:34 PM, Shushant Arora wrote:
    How do I read it using parquet-tools? When I ran

        hadoop parquet.tools.Main meta parquetfilename

    I didn't get any info about min and max values.

I didn't realize that you wanted to inspect min/max values, since what you originally asked was how to find the version of the Parquet library used to generate the Parquet file.

Currently parquet-tools doesn't print min/max statistics information. I'm afraid you'll have to do it programmatically.
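A minimal sketch of the programmatic route, for illustration only. It uses Python with pyarrow rather than parquet-mr (a swapped-in library, not the one this thread was using), and the helper names `describe_stats` and `print_min_max` are hypothetical. It walks every row group and column chunk in the file footer and prints the recorded min/max, if any.

```python
def describe_stats(column_path, stats):
    """Format the statistics of one column chunk.

    `stats` is expected to look like a pyarrow Statistics object
    (attributes has_min_max, min, max) or be None when the writer
    recorded no statistics for the chunk.
    """
    if stats is None or not stats.has_min_max:
        return f"{column_path}: no min/max recorded"
    return f"{column_path}: min={stats.min!r} max={stats.max!r}"


def print_min_max(path):
    """Print min/max for every column chunk of the Parquet file at `path`."""
    # Imported here so describe_stats stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            line = describe_stats(chunk.path_in_schema, chunk.statistics)
            print(f"row group {rg}: {line}")
```

Calling `print_min_max` with the path of a Parquet file prints one line per column chunk. Chunks written without statistics are reported as such rather than raising an error.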

    How can I see the Parquet version of my file? Is min/max specific to some Parquet version, or has it been available since the beginning?

AFAIK, it was added in 1.5.0: https://github.com/apache/parquet-mr/blob/parquet-1.5.0/parquet-column/src/main/java/parquet/column/statistics/Statistics.java

However, I couldn't find the corresponding JIRA ticket or pull request.


On Mon, Dec 7, 2015 at 6:51 PM, Singh, Abhijeet <absi...@informatica.com> wrote:

    Yes, Parquet has min/max.

    *From:* Cheng Lian [mailto:l...@databricks.com]
    *Sent:* Monday, December 07, 2015 11:21 AM
    *To:* Ted Yu
    *Cc:* Shushant Arora; user@spark.apache.org
    *Subject:* Re: parquet file doubts

    Oh sorry... At first I meant to cc the spark-user list, since Shushant
    and I had discussed some Spark-related issues before. Then I
    realized that this is a pure Parquet issue, but forgot to change
    the cc list. Thanks for pointing this out! Please ignore this thread.

    Cheng

    On 12/7/15 12:43 PM, Ted Yu wrote:

        Cheng:

        I only see user@spark in the CC.

        FYI

        On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian
        <l...@databricks.com> wrote:

        Cc'ing the parquet-dev list (it would be nice to always do so
        for general questions like these).

        Cheng

        On 12/6/15 3:10 PM, Shushant Arora wrote:

        Hi

        I have a few doubts about the Parquet file format.

        1. Does Parquet keep min/max statistics like ORC does? How can I
        see the Parquet version (whether it's 1.1, 1.2, or 1.3) of a
        Parquet file generated using Hive, custom MR, or
        AvroParquetOutputFormat?

        Yes, Parquet also keeps row group statistics. You may check
        the Parquet file using the parquet-meta CLI tool in
        parquet-tools (see
        https://github.com/Parquet/parquet-mr/issues/321 for details),
        then look for the "creator" field of the file. For
        programmatic access, check for
        o.a.p.hadoop.metadata.FileMetaData.createdBy.
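Once the creator string is in hand, the writer version can be pulled out of it. The sketch below assumes the string has the shape `parquet-mr version 1.6.0 (build abc123)`. That shape is an assumption about typical parquet-mr output, not something stated in this thread, and `parse_created_by` is a hypothetical helper.

```python
import re

# Assumed shape: "<application> version <version> (build <hash>)",
# with the "(build ...)" part optional.
_CREATED_BY = re.compile(
    r"^(?P<app>.+?) version (?P<version>\S+)(?: \(build ?(?P<build>.*)\))?$"
)


def parse_created_by(created_by):
    """Return (application, version, build) or None if the shape differs."""
    match = _CREATED_BY.match(created_by.strip())
    if match is None:
        return None
    return match.group("app"), match.group("version"), match.group("build")
```

Strings that do not follow the assumed shape yield `None`, so callers can fall back to showing the raw creator value.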


        2. How do I sort Parquet records while generating a Parquet
        file using AvroParquetOutputFormat?

        AvroParquetOutputFormat is not a format. It's just responsible
        for converting Avro records to Parquet records. How are you
        using AvroParquetOutputFormat? Any example snippets?


        Thanks


