[ https://issues.apache.org/jira/browse/GOBBLIN-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sushant Pandey updated GOBBLIN-383:
-----------------------------------
Description:
The output of a compaction job on Snappy-compressed Avro files is not
compressed. As a result, the output file is considerably larger than the
combined size of the input files.
The job is configured to run with the following parameters:
{color:#333333}{{fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020}}
{{writer.fs.uri=$\{fs.uri}}}
{{job.name=CompactKafkaMR}}
{{job.group=PNDA}}
{{mr.job.max.mappers=5}}
{{compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder}}
{{compaction.input.dir=/user/pnda/PNDA_datasets/datasets}}
{{compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8}}
{{compaction.input.subdir=.}}
{{compaction.dest.subdir=.}}
{{compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH}}
{{compaction.timebased.max.time.ago=10d}}
{{compaction.timebased.min.time.ago=1h}}
{{compaction.input.deduplicated=true}}
{{compaction.output.deduplicated=true}}
{{compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator}}
{{compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner}}
{{compaction.timezone=UTC}}
{{compaction.job.overwrite.output.dir=true}}
{{compaction.recompact.from.input.for.late.data=true}}{color}
I tried the following configuration options, with no success:
{{mapreduce.output.fileoutputformat.compress=true}}
{{mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec}}
{{mapreduce.output.fileoutputformat.compress.type=RECORD}}
{{writer.codec.type=SNAPPY}}
{{writer.builder.class=gobblin.writer.AvroDataWriterBuilder}}
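One way to confirm whether a given output file is actually compressed is to read the {{avro.codec}} entry from its header. The sketch below is a minimal, library-free reader based on the Avro object container file layout (4-byte magic, a metadata map of string keys to byte values, then a 16-byte sync marker). The function names {{read_avro_codec}} and {{codec_of_file}} are hypothetical helpers for illustration and are not part of Gobblin or Avro. It assumes the file has been copied to the local filesystem first.

```python
AVRO_MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        value = byte[0]
        accum |= (value & 0x7F) << shift
        if not (value & 0x80):
            break
        shift += 7
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read one length-prefixed Avro bytes/string value."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_codec(stream):
    """Return the codec name stored in an Avro container file header.

    A missing avro.codec entry means the blocks are uncompressed ("null").
    """
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break  # a zero count terminates the metadata map
        if count < 0:
            # A negative count is followed by the block size in bytes.
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata.get("avro.codec", b"null").decode("utf-8")


def codec_of_file(path):
    """Convenience wrapper for a file on the local filesystem."""
    with open(path, "rb") as handle:
        return read_avro_codec(handle)
```

Calling {{codec_of_file}} on one input file and one compacted output file should show whether the codec differs between them: a result of {{snappy}} indicates Snappy-compressed blocks, while {{null}} indicates uncompressed blocks.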
was:
The output of a compaction job on Snappy-compressed Avro files is not
compressed. As a result, the output file is considerably larger than the
combined size of the input files.
The compaction job is running with the following parameters:
{color:#333333}{{fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020}}
{{writer.fs.uri=${fs.uri}}}
{{job.name=CompactKafkaMR}}
{{job.group=PNDA}}
{{mr.job.max.mappers=5}}
{{compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder}}
{{compaction.input.dir=/user/pnda/PNDA_datasets/datasets}}
{{compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8}}
{{compaction.input.subdir=.}}
{{compaction.dest.subdir=.}}
{{compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH}}
{{compaction.timebased.max.time.ago=10d}}
{{compaction.timebased.min.time.ago=1h}}
{{compaction.input.deduplicated=true}}
{{compaction.output.deduplicated=true}}
{{compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator}}
{{compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner}}
{{compaction.timezone=UTC}}
{{compaction.job.overwrite.output.dir=true}}
{{compaction.recompact.from.input.for.late.data=true}}{color}
I tried the following configuration options, with no success:
{{mapreduce.output.fileoutputformat.compress=true}}
{{mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec}}
{{mapreduce.output.fileoutputformat.compress.type=RECORD}}
{{writer.codec.type=SNAPPY}}
{{writer.builder.class=gobblin.writer.AvroDataWriterBuilder}}
> Compaction job's output is not compressed
> -----------------------------------------
>
> Key: GOBBLIN-383
> URL: https://issues.apache.org/jira/browse/GOBBLIN-383
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-compaction
> Affects Versions: 0.11.0
> Reporter: Sushant Pandey
> Assignee: Issac Buenrostro
> Priority: Major
> Attachments: mr_compact.txt
>