Hello All,
I am new to Parquet, so please forgive me if this question has been asked before. I was
considering using Parquet for a project, so I have been experimenting with it. The simple
thing I did was read Avro objects from HDFS and write them back to HDFS in the Parquet
file format, using a Crunch pipeline. One thing I noticed is that it requires much more
heap to run the pipeline. When I was not using the Parquet file format, the memory
settings were 2 GB heap and 4 GB virtual. When I switched to the Parquet file format, the
settings required to run the pipeline were 8 GB heap and 10 GB virtual; if I give it less
memory, the task throws a heap error. I haven't changed any other settings. I was using a
complex, multilevel nested Avro object, and the total number of records was 150k. Here is
the code snippet:
// Read Avro records from HDFS and write them back out as Parquet via Crunch.
final Pipeline pipeline = new MRPipeline(ParquetTest.class, "ParquetTestPipeline", config);
final PCollection<Person> persons = pipeline.read(
    From.avroFile(new Path("..hadfs source Path..."), Avros.records(Person.class)));
final AvroParquetFileTarget parquetFileTarget =
    new AvroParquetFileTarget("..hadfs target Path...");
pipeline.write(persons, parquetFileTarget);
pipeline.done();
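For context, this is roughly how the memory settings mentioned above are passed in. The
property names below are the Hadoop 2 / YARN ones and the values are illustrative; my
actual job configuration may differ slightly depending on the Hadoop version:

// Illustrative fragment only: roughly how the 8 GB heap / 10 GB virtual settings
// described above end up in the Configuration handed to the MRPipeline constructor.
final Configuration config = new Configuration();
config.set("mapreduce.map.memory.mb", "10240");     // container memory limit (roughly the "virtual" figure above)
config.set("mapreduce.map.java.opts", "-Xmx8192m"); // map task JVM heap (~8 GB)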
I was using Parquet version 1.4.1.
My question is: why does running the pipeline with the Parquet file format take so much
more memory? Is it because building a row group requires holding all of its records in
the heap? Or am I doing something wrong?
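If it is the row group buffering, would shrinking the row group help? A minimal sketch of
what I mean, assuming parquet.block.size and parquet.page.size are the right knobs to set
on the job Configuration:

// Sketch of what I would try, assuming parquet.block.size controls how much data
// each writer buffers per row group before flushing to the file.
config.setInt("parquet.block.size", 32 * 1024 * 1024); // smaller row groups -> less buffered per writer
config.setInt("parquet.page.size", 1024 * 1024);       // page size within each row group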
Thanks,
Chandan Biswas