[
https://issues.apache.org/jira/browse/GOBBLIN-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushant Pandey updated GOBBLIN-383:
-----------------------------------
Attachment: file_size.txt
> Compaction job's output is not compressed
> -----------------------------------------
>
> Key: GOBBLIN-383
> URL: https://issues.apache.org/jira/browse/GOBBLIN-383
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-compaction
> Affects Versions: 0.11.0
> Reporter: Sushant Pandey
> Assignee: Issac Buenrostro
> Priority: Major
> Attachments: file_size.txt, mr_compact.txt
>
>
> The output of a compaction job run on Snappy-compressed Avro files is not
> compressed. As a result, the output file is considerably larger than the
> combined size of the input files.
> The job is configured to run with the following parameters:
> {color:#333333}{{fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020}}
> {{writer.fs.uri=$\{fs.uri}}}
> {{job.name=CompactKafkaMR}}
> {{job.group=PNDA}}
> {{mr.job.max.mappers=5}}
> {{compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder}}
> {{compaction.input.dir=/user/pnda/PNDA_datasets/datasets}}
> {{compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8}}
> {{compaction.input.subdir=.}}
> {{compaction.dest.subdir=.}}
>
> {{compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH}}
> {{compaction.timebased.max.time.ago=10d}}
> {{compaction.timebased.min.time.ago=1h}}
> {{compaction.input.deduplicated=true}}
> {{compaction.output.deduplicated=true}}
>
> {{compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator}}
>
> {{compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner}}
> {{compaction.timezone=UTC}}
> {{compaction.job.overwrite.output.dir=true}}
> {{compaction.recompact.from.input.for.late.data=true}}{color}
>
> Tried the following configuration options, with no success:
> {{mapreduce.output.fileoutputformat.compress=true}}
>
> {{mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec}}
> {{mapreduce.output.fileoutputformat.compress.type=RECORD}}
> {{writer.codec.type=SNAPPY}}
> {{writer.builder.class=gobblin.writer.AvroDataWriterBuilder}}
>
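File size alone is an indirect signal. An Avro object container file records its block codec in the `avro.codec` header entry, so reading that entry shows directly whether an output file was written with Snappy. The following is a minimal sketch, not part of Gobblin, which assumes the output file has first been copied to the local filesystem; the function name `read_avro_codec` is hypothetical. It parses only the container header as laid out in the Avro specification and uses no third-party libraries.

```python
import io

AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro container file


def _read_long(stream):
    """Decode one Avro long (zigzag, variable-length) from the stream."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def _read_bytes(stream):
    """Decode one length-prefixed Avro bytes/string value."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_codec(stream):
    """Return the codec name stored in an Avro container header.

    A missing avro.codec entry means the blocks are uncompressed,
    which the Avro specification names the "null" codec.
    """
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:          # a zero count terminates the metadata map
            break
        if count < 0:           # negative count is followed by a byte size
            count = -count
            _read_long(stream)  # block size in bytes, not needed here
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            meta[key] = _read_bytes(stream)
    return meta.get("avro.codec", b"null").decode("utf-8")
```

Opening a local copy of an input file and of the compacted output file in binary mode and passing each to `read_avro_codec` should return `snappy` for a Snappy-compressed file and `null` for an uncompressed one.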
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)