Sushant Pandey created GOBBLIN-383:
--------------------------------------
Summary: Compaction job's output is not compressed
Key: GOBBLIN-383
URL: https://issues.apache.org/jira/browse/GOBBLIN-383
Project: Apache Gobblin
Issue Type: Bug
Components: gobblin-compaction
Affects Versions: 0.11.0
Reporter: Sushant Pandey
Assignee: Issac Buenrostro
Attachments: mr_compact.txt
The output of a compaction job run on Snappy-compressed Avro files is not
compressed. As a result, the output file is considerably larger than the
combined size of the input files. The compaction job is running with the
following parameters:
{color:#333333}{{fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020}}
{{writer.fs.uri=${fs.uri}}}
{{job.name=CompactKafkaMR}}
{{job.group=PNDA}}
{{mr.job.max.mappers=5}}
{{compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder}}
{{compaction.input.dir=/user/pnda/PNDA_datasets/datasets}}
{{compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8}}
{{compaction.input.subdir=.}}
{{compaction.dest.subdir=.}}
{{compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH}}
{{compaction.timebased.max.time.ago=10d}}
{{compaction.timebased.min.time.ago=1h}}
{{compaction.input.deduplicated=true}}
{{compaction.output.deduplicated=true}}
{{compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator}}
{{compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner}}
{{compaction.timezone=UTC}}
{{compaction.job.overwrite.output.dir=true}}
{{compaction.recompact.from.input.for.late.data=true}}{color}
I tried the following configuration options, with no success:
{{mapreduce.output.fileoutputformat.compress=true}}
{{mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec}}
{{mapreduce.output.fileoutputformat.compress.type=RECORD}}
{{writer.codec.type=SNAPPY}}
{{writer.builder.class=gobblin.writer.AvroDataWriterBuilder}}
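
To check whether a given output file really is uncompressed, one option is to read the avro.codec entry from the header of the Avro object container file. That entry records the codec used for the data blocks, and the Avro specification treats a missing entry as "null" (no compression). The following is a minimal Python sketch and is not part of Gobblin. The function names are made up for illustration, and it assumes the file has first been copied from HDFS to local disk.

```python
import io

AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def _read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (accum >> 1) ^ -(accum & 1)


def _read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then that many bytes."""
    length = _read_long(stream)
    if length < 0:
        raise ValueError("negative length in Avro header")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_codec(stream):
    """Return the codec name stored in an Avro container file header."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the metadata map
            break
        if count < 0:
            # A negative count is followed by the block size in bytes.
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    # A missing avro.codec entry means the "null" codec.
    return metadata.get("avro.codec", b"null").decode("utf-8")


def read_avro_codec_from_bytes(data):
    """Convenience wrapper for data that is already in memory."""
    return read_avro_codec(io.BytesIO(data))
```

For a locally copied file, opening it in binary mode and passing the file object to read_avro_codec is enough. If the compaction output reports null while the input files report snappy, that would be consistent with the size difference described above.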