Hello All,
I am new to Parquet, so please forgive me if this question has been asked before. I was
considering using Parquet for a project, so I have been experimenting with it. The simple
thing I did was read Avro objects from HDFS and write them back to HDFS in the Parquet
file format, using a Crunch pipeline. One thing I noticed is that it requires much more
heap to run the pipeline. When I was not using the Parquet file format, the memory
settings were 2 GB heap and 4 GB virtual. When I switched to the Parquet file format, the
settings required to run the pipeline were 8 GB heap and 10 GB virtual; if I give it less
memory, the task throws a heap error. I haven't changed any other settings. I was using a
complex, multilevel nested Avro object, and the total number of records was 150k. Here is
the code snippet:
// Read Avro records from HDFS and write them back out as Parquet via Crunch.
final Pipeline pipeline = new MRPipeline(ParquetTest.class, "ParquetTestPipeline", config);
final PCollection<Person> persons = pipeline.read(
    From.avroFile(new Path("..hadfs source Path..."), Avros.records(Person.class)));
final AvroParquetFileTarget parquetFileTarget =
    new AvroParquetFileTarget("..hadfs target Path...");
pipeline.write(persons, parquetFileTarget);
pipeline.done();
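For context, this is roughly how the memory settings mentioned above are passed in. The
property names below are the Hadoop 2 / YARN ones and the values are illustrative; my
actual job configuration may differ slightly depending on the Hadoop version:

// Illustrative fragment only: roughly how the 8 GB heap / 10 GB virtual settings
// described above end up in the Configuration handed to the MRPipeline constructor.
final Configuration config = new Configuration();
config.set("mapreduce.map.memory.mb", "10240");     // container memory limit (roughly the "virtual" figure above)
config.set("mapreduce.map.java.opts", "-Xmx8192m"); // map task JVM heap (~8 GB)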
I was using Parquet version 1.4.1.
My question is: why does running the pipeline with the Parquet file format take so much
more memory? Is it because building a row group requires holding all of its records in
the heap? Or am I doing something wrong?
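If it is the row group buffering, would shrinking the row group help? A minimal sketch of
what I mean, assuming parquet.block.size and parquet.page.size are the right knobs to set
on the job Configuration:

// Sketch of what I would try, assuming parquet.block.size controls how much data
// each writer buffers per row group before flushing to the file.
config.setInt("parquet.block.size", 32 * 1024 * 1024); // smaller row groups -> less buffered per writer
config.setInt("parquet.page.size", 1024 * 1024);       // page size within each row group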
Thanks,
Chandan Biswas