On 10/27/2014 12:20 PM, Suraj Nayak wrote:
On Tue, Oct 28, 2014 at 12:14 AM, Suraj Nayak <[email protected]> wrote:

    Hi Ryan,

    Thanks for the detailed info on total memory used.

    The output table is partitioned by 2 columns: one column has 2 output
    partitions and the other has 187 partitions.

    parquet.block.size is at its default; I have not specified it anywhere.
    If you can help me find the exact value, that would be helpful.

    I am interested in knowing which tools can get this done. Kindly share
    them :)

Suraj,

The parquet.block.size should be 128MB if you've not changed it. You can always find this value in the configuration properties of your MR job (or the underlying job in the tracker if you're using Hive).
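
If you want to check programmatically, here's a minimal sketch (assuming a stock Hadoop client on the classpath) that prints the effective value, falling back to 128MB:

    // Print the effective parquet.block.size from the job/site configuration.
    // 134217728 bytes = 128MB, the Parquet default.
    import org.apache.hadoop.conf.Configuration;

    public class BlockSizeCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        long blockSize = conf.getLong("parquet.block.size", 134217728L);
        System.out.println("parquet.block.size = " + blockSize + " bytes");
      }
    }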

If you're comfortable writing your own MR job to do this conversion, then that works. You would just create keys from the data that match the partition scheme you're using with Hive. Your mapper creates the key for a record and writes the (key, record) pair, and the reducer just writes all of the values it receives.
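
As a rough sketch of the shape of that job (the tab-delimited input and the column positions are just placeholders; adapt them to your data and your two partition columns):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PartitionShuffle {

      // Mapper: build a key from the two partition columns so every record
      // for one Hive partition ends up in the same reduce group.
      public static class PartitionKeyMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
          String[] fields = record.toString().split("\t");
          // Placeholder: fields[0] and fields[1] are the partition columns.
          context.write(new Text(fields[0] + "/" + fields[1]), record);
        }
      }

      // Reducer: just write out every record it receives, so each output
      // partition's data is written together.
      public static class PassThroughReducer
          extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
          for (Text record : records) {
            context.write(NullWritable.get(), record);
          }
        }
      }
    }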

If you don't want to do this yourself, you can take a look at Kite, which has this already built so that you can call it from a command-line interface [1].

rb

[1]: http://kitesdk.org/docs/current/guide/Using-the-Kite-CLI-to-Create-a-Dataset/


--
Ryan Blue
Software Engineer
Cloudera, Inc.
