/cc Spark user list
I'm confused here: you mentioned that you were writing Parquet files
using MR jobs. What's the relation between that Parquet writing task and
this JavaPairRDD one? Is it a separate problem?
Spark supports dynamic partitioning (e.g. df.write.partitionBy("col1",
"col2").format("<data source name>").save(path)), and there's a
spark-avro
data source. If you are writing Avro records to multiple partitions,
these two should help.
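For example, the write could look roughly like this (an untested sketch,
assuming a DataFrame df with partition columns col1/col2 and the
Databricks spark-avro package on the classpath; all names here are
illustrative):

df.write
  .partitionBy("col1", "col2")
  .format("com.databricks.spark.avro")  // spark-avro data source
  .save("/path/to/output")              // one sub-directory per partition value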
Cheng
On 11/19/15 4:30 PM, Shushant Arora wrote:
Thanks Cheng.
I have used AvroParquetOutputFormat and it works fine.
My requirement is now to handle writing to multiple folders at the same
time. Basically, I want to write the JavaPairRDD<Void, GenericRecord> to
multiple folders based on the final Hive partitions where this RDD will
land. Have you used multiple output formats in Spark?
On Fri, Nov 13, 2015 at 3:56 PM, Cheng Lian <lian.cs....@gmail.com
<mailto:lian.cs....@gmail.com>> wrote:
Oh I see. Then parquet-avro should probably be more useful. AFAIK,
parquet-hive is only used internally in Hive. I don't see anyone
using it directly.
In general, you can first parse your text data, assemble the parsed
rows into Avro records, and then write those records to Parquet.
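A rough sketch of that flow (assuming parquet-mr/parquet-avro 1.8+ and an
Avro schema named schema with "key"/"value" fields; the parsing logic
below is only illustrative):

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Writer backed by the Avro Parquet data model (AvroWriteSupport)
val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/data.parquet"))
  .withSchema(schema)
  .build()

for (line <- scala.io.Source.fromFile("/tmp/data.txt").getLines()) {
  val Array(key, value) = line.split("\t", 2)
  val record = new GenericData.Record(schema)  // assemble an Avro record
  record.put("key", key.toInt)
  record.put("value", value)
  writer.write(record)                         // write it as a Parquet record
}
writer.close()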
BTW, Spark 1.2 also provides Parquet support. Since you're trying
to convert text data, I guess you probably don't have any nested
data. In that case, Spark 1.2 should be enough. It's not that
Spark 1.2 can't deal with nested data; the concern is interoperability
with Hive, because in the early days the Parquet spec itself didn't
specify how to write nested data. You may refer to this link for
more details:
http://spark.apache.org/docs/1.2.1/sql-programming-guide.html#parquet-files
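For instance, the flow in that guide is roughly the following (an
untested sketch; it assumes sc is a SparkContext and the input is
tab-separated key/value text):

case class Record(key: Int, value: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts RDD[Record] to a SchemaRDD

val records = sc.textFile("/tmp/data.txt").map { line =>
  val Array(key, value) = line.split("\t", 2)
  Record(key.toInt, value)
}
records.saveAsParquetFile("/tmp/data.parquet")  // Spark 1.2.x API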
Cheng
On 11/13/15 6:11 PM, Shushant Arora wrote:
No, I don't have the data loaded into Hive in text form - that question
was just to understand the internals of the approach Hive takes.
I want to write Parquet files directly from an MR job. For that, which
approach is better: the Hive Parquet data model or the Avro Parquet
data model?
On Fri, Nov 13, 2015 at 3:24 PM, Cheng Lian
<lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:
If you are already able to load the text data into Hive, then
using Hive itself to convert the data is obviously the
easiest and most compatible way. For example:
CREATE TABLE text_table (key INT, value STRING);
LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE text_table;
CREATE TABLE parquet_table
STORED AS PARQUET
AS SELECT * FROM text_table;
Cheng
On 11/13/15 5:13 PM, Shushant Arora wrote:
Thanks !
So which one is better for dumping text data into Hive using a
custom MR/Spark job - the Hive Parquet data model using Hive
Writables, or the Avro Parquet data model using Avro objects?
On Fri, Nov 13, 2015 at 12:45 PM, Cheng Lian
<lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:
ParquetOutputFormat is not a data model. A data model
provides a WriteSupport to ParquetOutputFormat to tell
Parquet how to convert upper-level domain objects (Hive
Writables, Avro records, etc.) into Parquet records. So
all the other data models use it for writing Parquet files.
Hive does have a Parquet data model. If you create a
Parquet table in Hive like "CREATE TABLE t (key INT,
value STRING) STORED AS PARQUET", it invokes the Hive
Parquet data model when reading/writing table t. In the
case you mentioned, records in the text table are first
extracted by Hive into Hive Writables, and then the Hive
Parquet data model converts those Writables into Parquet
records.
Cheng
On 11/13/15 2:37 PM, Shushant Arora wrote:
Thanks Cheng.
I have Spark version 1.2 deployed on my cluster, so for
the time being I cannot use the Spark SQL functionality directly.
I will try AvroParquetOutputFormat. I just want to
know how AvroParquetOutputFormat is better than using
ParquetOutputFormat directly. Also, is there a Hive object
model? I mean, when I create a Parquet table in Hive
and insert data into it from a text table, which
object model does Hive use internally?
Thanks
Shushant
On Fri, Nov 13, 2015 at 9:14 AM, Cheng Lian
<lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>>
wrote:
If I understand your question correctly, you are
trying to write Parquet files using a specific
Parquet data model, and expect to load them into
Hive, right?
Spark also implements a Parquet data model, which
converts Spark SQL rows into Parquet records. If
you're already using Spark, then this can be very
convenient. For example, in Spark 1.5, assuming
that you've already deployed Spark against an
existing Hive metastore, you may easily save a
DataFrame as a Hive Parquet table like this:
sqlContext.range(10).write.format("parquet").saveAsTable("t")
Then you should be able to read this Parquet table
from the Hive side.
If you are going to use parquet-mr directly, then
parquet-avro is always recommended since it's the
most standard one out there. But please use
parquet-mr 1.8+, because earlier versions don't
write correct LIST and MAP structures. Also, when
interacting with Hive, all fields in your Avro
schema should be marked as optional because Hive
doesn't support required fields.
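For illustration, a schema with every field made optional (nullable
with a null default) could look like this; the record and field
names here are made up:

val schema = new org.apache.avro.Schema.Parser().parse(
  """{
    |  "type": "record",
    |  "name": "Example",
    |  "fields": [
    |    {"name": "key",   "type": ["null", "int"],    "default": null},
    |    {"name": "value", "type": ["null", "string"], "default": null}
    |  ]
    |}""".stripMargin)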
Cheng
On Fri, Nov 13, 2015 at 1:56 AM, Shushant Arora
<shushantaror...@gmail.com
<mailto:shushantaror...@gmail.com>> wrote:
Hi Cheng
I got your reference from the Spark mailing
list. I have a few doubts about Parquet - it would
be helpful if you could help here.
I have a requirement to write data in Parquet
format to a Hive table. I am using a Spark job.
In Parquet there are so many object models to
write to files:
AvroParquetOutputFormat, ProtoParquetOutputFormat, ThriftParquetOutputFormat, ParquetOutputFormat.
1. Which one is better and recommended over the others?
2. Which one is best for Hive?
3. Do you have any sample program in Hadoop MR
or Spark for this?
Thanks
Shushant