[jira] [Commented] (HADOOP-15421) Stabilise/formalise the JSON _SUCCESS format used in the S3A committers
[ https://issues.apache.org/jira/browse/HADOOP-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503612#comment-16503612 ]

Ryan Blue commented on HADOOP-15421:
------------------------------------

+1 for including the committer's full classname. Enough classes are shaded and relocated that it would be useful.

> Stabilise/formalise the JSON _SUCCESS format used in the S3A committers
> ------------------------------------------------------------------------
>
> Key: HADOOP-15421
> URL: https://issues.apache.org/jira/browse/HADOOP-15421
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.2.0
> Reporter: Steve Loughran
> Priority: Major
>
> The S3A committers rely on an atomic PUT to save a JSON summary of the job to
> the destination FS, containing files, statistics, etc. This is for internal
> testing, but it turns out to be useful for Spark integration testing, Hive, etc.
> IBM's Stocator also generates a manifest.
> Proposed: come up with an (extensible) design that we are happy with as a
> long-lived format.
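For reference, a sketch of what a versioned, extensible manifest could look like. Every field name and value below is illustrative only, not the committed format; the committer classname field is the one being voted on above.

{code:title=Illustrative _SUCCESS manifest (all field names are assumptions)}
{
  "version": 1,
  "committer": "org.apache.hadoop.fs.s3a.commit.staging.PartitionedStagingCommitter",
  "timestamp": 1528243200000,
  "filenames": [
    "part-00000-4f8a1b2c.parquet",
    "part-00001-4f8a1b2c.parquet"
  ],
  "metrics": {
    "files_committed": 2,
    "bytes_committed": 1073741824
  },
  "extensions": {}
}
{code}

A top-level version field plus an open extensions map is one simple way to keep such a format long-lived: readers ignore unknown keys, and incompatible changes bump the version.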
[jira] [Commented] (HADOOP-15421) Stabilise/formalise the JSON _SUCCESS format used in the S3A committers
[ https://issues.apache.org/jira/browse/HADOOP-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458794#comment-16458794 ]

Ryan Blue commented on HADOOP-15421:
------------------------------------

I think this makes sense. As long as there is a _SUCCESS file, it may as well be used to pass the scope, i.e., which files were successful. What statistics/metrics are you adding to the file?
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295256#comment-16295256 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

Support for brotli, zstd, and lz4 is in Parquet master. It will be out in the next release.

> Add Brotli compression codec
> ----------------------------
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
> Issue Type: Improvement
> Components: io
> Affects Versions: 2.7.2
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, HADOOP-13126.3.patch, HADOOP-13126.4.patch, HADOOP-13126.5.patch
>
> I've been testing [Brotli|https://github.com/google/brotli/], a new
> compression library based on LZ77 from Google. Google's [brotli
> benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
> look really good and we're also seeing a significant improvement in
> compression size, compression speed, or both.
>
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
>
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
>
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
>
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
>
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
>
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
>
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
>
> Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9.
> Another test resulted in a slightly larger Brotli file than gzip produced,
> but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
>
> [Brotli is licensed with the MIT license|https://github.com/google/brotli/blob/master/LICENSE],
> and the [JNI library jbrotli is ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].
[jira] [Commented] (HADOOP-15107) Prove the correctness of the new committers, or fix where they are not correct
[ https://issues.apache.org/jira/browse/HADOOP-15107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286673#comment-16286673 ]

Ryan Blue commented on HADOOP-15107:
------------------------------------

For the definition of correctness, I think we will need two, based on which possible failures are handled:

* Task-level failure tolerance: the committer can handle any task failure, including a failure during task commit.
* Job-level failure tolerance: the committer can handle a failure during job commit.

The contribution of the multi-part committer is that it handles task-level failure without a copy and minimizes the impact of job-level failure. But it doesn't guarantee job-level failure tolerance: a failure during job commit can still leave partial output.

> Prove the correctness of the new committers, or fix where they are not correct
> -------------------------------------------------------------------------------
>
> Key: HADOOP-15107
> URL: https://issues.apache.org/jira/browse/HADOOP-15107
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.1.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
>
> I'm writing the paper on the committers, which, being a proper paper, requires me to show that the committers work.
> # Define the requirements of a "correct" committed job (this applies to the FileOutputCommitter too).
> # Show that the Staging committer meets these requirements (most of this is implicit in that it uses the V1 FileOutputCommitter to marshall .pendingset lists from committed tasks to the final destination, where they are read and committed).
> # Show that the magic committer also works.
>
> I'm now not sure that the magic committer works.
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280469#comment-16280469 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

I'm happy to if there is interest from reviewers.
[jira] [Commented] (HADOOP-15003) Merge S3A committers into trunk: Yetus patch checker
[ https://issues.apache.org/jira/browse/HADOOP-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244232#comment-16244232 ]

Ryan Blue commented on HADOOP-15003:
------------------------------------

I don't think the partitioned committer should continue the _SUCCESS marker convention. Nothing that writes partitioned data currently depends on _SUCCESS markers, and the markers are unreliable for partitioned writes (what happens when you're appending data to a partition?), so it's easy to avoid the problem entirely.

We implemented a property that allows users to opt in to having _SUCCESS created, for the directory output committer only. It creates the _SUCCESS marker after all other operations have finished, because that is when we can guarantee that the write was successful. It doesn't delete other markers, because there are no well-defined semantics for _SUCCESS with overwrite.

> Merge S3A committers into trunk: Yetus patch checker
> -----------------------------------------------------
>
> Key: HADOOP-15003
> URL: https://issues.apache.org/jira/browse/HADOOP-15003
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.0.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-041.patch, HADOOP-13786-042.patch, HADOOP-13786-043.patch, HADOOP-13786-044.patch, HADOOP-13786-045.patch, HADOOP-13786-046.patch
>
> This is a Yetus-only JIRA created to have Yetus review the
> HADOOP-13786/HADOOP-14971 patch as a .patch file, as the review PR
> [https://github.com/apache/hadoop/pull/282] is stopping this happening in
> HADOOP-14971.
> Reviews should go into the PR/other task.
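A minimal sketch of the opt-in ordering described in that comment. The property key is hypothetical, standing in for whatever key is actually used (Hadoop's own FileOutputCommitter has a similar switch, mapreduce.fileoutputcommitter.marksuccessfuljobs); the point is only that the marker is created after every other commit operation has finished.

{code:title=Sketch: _SUCCESS only after job commit completes (property name is an assumption)}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class MarkerOnSuccessCommitter extends FileOutputCommitter {
  // Hypothetical opt-in key, standing in for the property described above.
  private static final String CREATE_MARKER = "committer.create.success.marker";
  private final Path outputPath;

  public MarkerOnSuccessCommitter(Path outputPath, TaskAttemptContext context)
      throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  @Override
  public void commitJob(JobContext context) throws IOException {
    // Finish every other commit operation first; only then is the write
    // known to be successful. (Assumes the built-in marker is disabled via
    // mapreduce.fileoutputcommitter.marksuccessfuljobs=false.)
    super.commitJob(context);
    Configuration conf = context.getConfiguration();
    if (conf.getBoolean(CREATE_MARKER, false)) {
      Path marker = new Path(outputPath, FileOutputCommitter.SUCCEEDED_FILE_NAME);
      FileSystem fs = marker.getFileSystem(conf);
      fs.create(marker, true).close();  // zero-byte marker, written last
    }
  }
}
{code}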
[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202748#comment-16202748 ]

Ryan Blue commented on HADOOP-13786:
------------------------------------

I'll try to take a look as well.

> Add S3Guard committer for zero-rename commits to S3 endpoints
> --------------------------------------------------------------
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-036.patch, HADOOP-13786-037.patch, HADOOP-13786-038.patch, HADOOP-13786-039.patch, HADOOP-13786-HADOOP-13345-001.patch, HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-007.patch, HADOOP-13786-HADOOP-13345-009.patch, HADOOP-13786-HADOOP-13345-010.patch, HADOOP-13786-HADOOP-13345-011.patch, HADOOP-13786-HADOOP-13345-012.patch, HADOOP-13786-HADOOP-13345-013.patch, HADOOP-13786-HADOOP-13345-015.patch, HADOOP-13786-HADOOP-13345-016.patch, HADOOP-13786-HADOOP-13345-017.patch, HADOOP-13786-HADOOP-13345-018.patch, HADOOP-13786-HADOOP-13345-019.patch, HADOOP-13786-HADOOP-13345-020.patch, HADOOP-13786-HADOOP-13345-021.patch, HADOOP-13786-HADOOP-13345-022.patch, HADOOP-13786-HADOOP-13345-023.patch, HADOOP-13786-HADOOP-13345-024.patch, HADOOP-13786-HADOOP-13345-025.patch, HADOOP-13786-HADOOP-13345-026.patch, HADOOP-13786-HADOOP-13345-027.patch, HADOOP-13786-HADOOP-13345-028.patch, HADOOP-13786-HADOOP-13345-029.patch, HADOOP-13786-HADOOP-13345-030.patch, HADOOP-13786-HADOOP-13345-031.patch, HADOOP-13786-HADOOP-13345-032.patch, HADOOP-13786-HADOOP-13345-033.patch, HADOOP-13786-HADOOP-13345-035.patch, cloud-intergration-test-failure.log, objectstore.pdf, s3committer-master.zip
>
> A goal of this code is "support O(1) commits to S3 repositories in the
> presence of failures". Implement it, including whatever is needed to
> demonstrate the correctness of the algorithm (that is, assuming that S3Guard
> provides a consistent view of the presence/absence of blobs, show that we can
> commit directly).
>
> I consider ourselves free to expose the blobstore-ness of the S3 output
> streams (i.e. not visible until the close()), if we need to use that to allow
> us to abort commit operations.
[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927084#comment-15927084 ]

Ryan Blue commented on HADOOP-13786:
------------------------------------

On metrics: our layer on top of the staging committer tracks metrics and sends them to the job committer using the same PendingCommit that already gets serialized. That's an easy way to get more data back to the job committer, which then accumulates the number of files, sizes, etc. and stores it somewhere (or logs it?).
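A rough sketch of that pattern, purely for illustration: the class and field names below are assumptions modeled on the s3committer code, not the actual PendingCommit definition.

{code:title=Sketch: carrying metrics in the serialized per-task commit data (names are assumptions)}
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// The task-side payload that is already serialized and handed to the job
// committer; adding a metrics map reuses that existing channel.
public class PendingCommit implements Serializable {
  private String destinationKey;            // final S3 key for the file
  private String uploadId;                  // multipart upload to complete
  private Map<String, Long> metrics = new HashMap<>();

  public void addMetric(String name, long value) {
    metrics.merge(name, value, Long::sum);
  }

  public Map<String, Long> metrics() {
    return metrics;
  }
}
{code}

On the job-commit side, accumulation is then just a fold over the deserialized commits, e.g. {{commit.metrics().forEach((k, v) -> totals.merge(k, v, Long::sum))}} per task, before storing or logging the totals.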
[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905876#comment-15905876 ]

Ryan Blue commented on HADOOP-13786:
------------------------------------

On UUID suffixes: the option is needed for the "append" conflict resolution. Without it, you can easily overwrite files from previous writes. We also use it to identify files across partitions added in the same batch. I think it is worth keeping if this committer is to be used for more than just creating or replacing a single directory of files.
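For illustration, a sketch of that naming scheme. The exact suffix placement is an assumption, but the properties are the ones described above: one UUID per write batch, so appended files cannot collide with earlier writes, and all files from the same batch share an identifier across partitions.

{code:title=Sketch: per-batch UUID suffixes (suffix placement is an assumption)}
import java.util.UUID;

public final class BatchFileNames {
  // One UUID per write batch: shared by every file in the batch, and
  // different from the UUID of any previous write to the same partitions.
  private static final String BATCH_ID = UUID.randomUUID().toString();

  private BatchFileNames() {
  }

  public static String withBatchSuffix(String name) {
    int dot = name.lastIndexOf('.');
    return dot < 0
        ? name + "-" + BATCH_ID
        : name.substring(0, dot) + "-" + BATCH_ID + name.substring(dot);
  }
}
{code}

With this, an "append" write of {{part-00000.parquet}} lands as {{part-00000-<uuid>.parquet}} and cannot overwrite the {{part-00000-<other-uuid>.parquet}} left by an earlier batch.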
[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905498#comment-15905498 ]

Ryan Blue commented on HADOOP-13786:
------------------------------------

For the staging committer drawbacks, I think there's a clear path to avoiding them.

The committer is not intended to instantiate its own S3Client. It does for testing, but when it is integrated with S3A it should be passed a configured client when it is instantiated, or should use package-local access to get one from the S3A FS object. In other words, the default {{findClient}} method shouldn't be used; we don't use it other than for testing. My intent was for S3A to have a {{FileSystem#newOutputCommitter(Path, JobContext)}} factory method. That way, the FS can pass its internal S3 client instead of instantiating two.

The storage on local disk isn't a requirement. We can replace that with an output stream that buffers in memory and sends parts to S3 when they are ready (we're planning on doing this eventually). This is just waiting on a stable API we can rely on that can close a stream but not commit the data. Since the committer API right now expects tasks to create files underneath the work path, we'll have to figure out how tasks can get a multi-part stream that is committed later without using a different method.

We can also pass in a thread pool if there is a better one to use. I think this is separate enough that it should be easy.
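A sketch of that factory hook. {{newOutputCommitter}} does not exist on FileSystem, so everything below is a proposal-shaped assumption rather than a real API; it only shows how S3A could hand its already-configured client to the committer instead of letting the committer build a second one.

{code:title=Sketch: FileSystem-level committer factory (hypothetical API)}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;

// Hypothetical extension point: a filesystem that can build its own
// committer. S3A would override this and pass its internal S3 client
// into the committer it returns, so no second client is instantiated.
public abstract class CommitterAwareFileSystem extends FileSystem {
  public OutputCommitter newOutputCommitter(Path output, JobContext context)
      throws IOException {
    throw new UnsupportedOperationException(
        "No output committer for scheme " + getScheme());
  }
}
{code}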
[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903352#comment-15903352 ]

Ryan Blue commented on HADOOP-13786:
------------------------------------

Is there a branch where I can take a look at the S3A test issues? I can probably get them working.
[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901663#comment-15901663 ]

Ryan Blue commented on HADOOP-13786:
------------------------------------

Thanks for doing so much to get this in! Let me know when it is a good time to have a look at it or if you need anything. There are a few extension points that we rely on, like [getFinalOutputPath|https://github.com/rdblue/s3committer/blob/master/src/main/java/com/netflix/bdp/s3/S3MultipartOutputCommitter.java#L99-L113], that I'd like to make sure are kept if possible.
[jira] [Updated] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated HADOOP-13786:
-------------------------------

Attachment: s3committer-master.zip

I'm attaching the S3 committer code to show intent to contribute.
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550286#comment-15550286 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

Brotli compression isn't splittable, but can be used with Hadoop-friendly container formats like Parquet. Using those formats is a best practice anyway, so it shouldn't matter that you can't easily split files when you use Brotli as an outer wrapper.
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated HADOOP-13126:
-------------------------------

Attachment: HADOOP-13126.5.patch
[jira] [Comment Edited] (HADOOP-12878) Impersonate hosts in s3a for better data locality handling
[ https://issues.apache.org/jira/browse/HADOOP-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365275#comment-15365275 ]

Ryan Blue edited comment on HADOOP-12878 at 7/6/16 10:55 PM:
-------------------------------------------------------------

FileInputFormat works slightly differently. First, the [split size|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L445] is calculated from the file's reported block size and the current min and max split sizes. Then, [the file is broken into N splits|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L410-416] of that size, where {{N = Math.ceil(fileLength / splitSize)}}. The block locations are then used to determine [where each split is located|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L448], based on the split's starting offset.

The result is that {{getFileBlockLocations}} can return a single location for the entire file and you'll still end up with N roughly block-sized splits. This is what enables you to get more parallelism by setting smaller split sizes, even if the resulting splits don't correspond to different blocks. In our environment, we use a 64MB S3 block size and don't see a bottleneck from one input split per file.

> Impersonate hosts in s3a for better data locality handling
> -----------------------------------------------------------
>
> Key: HADOOP-12878
> URL: https://issues.apache.org/jira/browse/HADOOP-12878
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Thomas Demoor
> Assignee: Thomas Demoor
>
> Currently, {{localhost}} is passed as the locality for each block, causing all
> blocks involved in a job to initially target the same node (the RM), before
> being moved by the scheduler (to a rack-local node). This reduces parallelism
> for jobs (with short-lived mappers).
> We should mimic Azure's implementation: a config setting
> {{fs.s3a.block.location.impersonatedhost}} where the user can enter the list
> of hostnames in the cluster to return from {{getFileBlockLocations}}.
> Possible optimization: for larger systems, it might be better to return N
> (5?) random hostnames to prevent passing a huge array (the downstream code
> assumes size = O(3)).
[jira] [Commented] (HADOOP-12878) Impersonate hosts in s3a for better data locality handling
[ https://issues.apache.org/jira/browse/HADOOP-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365275#comment-15365275 ]

Ryan Blue commented on HADOOP-12878:
------------------------------------

FileInputFormat works slightly differently. First, the [split size|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L445] is calculated from the file's reported block size and the current min and max split sizes. Then, the file is broken into N splits of that size, where {{N = Math.ceil(fileLength / splitSize)}}. The block locations are then used to determine where each split is located, based on the split's starting offset.

The result is that {{getFileBlockLocations}} can return a single location for the entire file and you'll still end up with N roughly block-sized splits. This is what enables you to get more parallelism by setting smaller split sizes, even if the resulting splits don't correspond to different blocks. In our environment, we use a 64MB S3 block size and don't see a bottleneck from one input split per file.
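Restating the arithmetic as a runnable sketch (the formula mirrors FileInputFormat.computeSplitSize; the constants are example values only):

{code:title=Sketch: split count is driven by splitSize, not real block locations}
public class SplitMath {
  // Same formula as FileInputFormat.computeSplitSize().
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long fileLength = 1024L * 1024 * 1024;               // 1 GiB file (example)
    long blockSize = 64L * 1024 * 1024;                  // reported 64MB block size
    long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
    long n = (fileLength + splitSize - 1) / splitSize;   // N = ceil(length / splitSize)
    System.out.println(n + " splits");                   // 16 splits, even if the
    // file reported a single block location for its entire length
  }
}
{code}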
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated HADOOP-13126:
-------------------------------

Attachment: HADOOP-13126.4.patch

Adding a patch that fixes the checkstyle issues.
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328634#comment-15328634 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

I'm attaching a new patch that depends on jbrotli 0.5.0. That version fixes the issue I noted above where Brotli doesn't consume all of its input. The new patch also adds BrotliCodec to the codec service loader so it is available automatically when needed to read .br files. We've been testing this code for a couple weeks and it seems to be working and stable.
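To show what the service-loader registration buys, a sketch of reading a .br file by extension-based codec lookup. This assumes the patch's BrotliCodec is on the classpath and registered as described above; the CompressionCodecFactory calls themselves are standard Hadoop API.

{code:title=Sketch: resolving BrotliCodec through the codec service loader}
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadBrotliFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);  // e.g. hdfs:///data/file.br
    // ServiceLoader-discovered codecs are included automatically, so no
    // io.compression.codecs entry is needed for the .br extension.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
    FileSystem fs = path.getFileSystem(conf);
    try (InputStream in = codec.createInputStream(fs.open(path))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // decompressed bytes
    }
  }
}
{code}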
[jira] [Issue Comment Deleted] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated HADOOP-13126:
-------------------------------

Comment: was deleted
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328619#comment-15328619 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

I'm attaching a new version of this patch that depends on [~marki]'s 0.5.0 release. That fixes the bug I noted above where Brotli doesn't consume all of the input buffer. This also adds BrotliCodec to the codec service loader and tests that it is loaded correctly. We've been running tests on this code for a few weeks and it appears to be stable.
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated HADOOP-13126:
-------------------------------

Attachment: HADOOP-13126.3.patch
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated HADOOP-13126:
-------------------------------

Attachment: HADOOP-13126.2.patch
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278673#comment-15278673 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

The results above show the comparison with Snappy. The file is less than half the size and compression took about the same amount of time. Comparing to LZ4 would be interesting. It isn't supported by Parquet so it's a bit harder for me to drop into my test case.
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278659#comment-15278659 ]

Ryan Blue commented on HADOOP-13126:
------------------------------------

[~andrew.wang], you guys are probably interested in this.
[jira] [Commented] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278653#comment-15278653 ] Ryan Blue commented on HADOOP-13126: [~marki], could you review this patch also?
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated HADOOP-13126: Attachment: (was: HADOOP-13126.1.patch)
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated HADOOP-13126: Attachment: HADOOP-13126.1.patch
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated HADOOP-13126: Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-13126) Add Brotli compression codec
[ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated HADOOP-13126: Attachment: HADOOP-13126.1.patch
[jira] [Created] (HADOOP-13126) Add Brotli compression codec
Ryan Blue created HADOOP-13126:
----------------------------------
Summary: Add Brotli compression codec
Key: HADOOP-13126
URL: https://issues.apache.org/jira/browse/HADOOP-13126
Project: Hadoop Common
Issue Type: Improvement
Components: io
Reporter: Ryan Blue
Assignee: Ryan Blue

I've been testing [Brotli|https://github.com/google/brotli/], a new compression library based on LZ77 from Google. Google's [brotli benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf] look really good and we're also seeing a significant improvement in compression size, compression speed, or both.

{code:title=Brotli preliminary test results}
[blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite

real    1m17.106s
user    1m30.804s
sys     0m4.404s

[blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite

real    1m16.640s
user    1m24.244s
sys     0m6.412s

[blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite

real    3m39.496s
user    3m48.736s
sys     0m3.880s

[blue@work Downloads]$ ls -l
-rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
-rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
-rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
{code}

Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. Another test resulted in a slightly larger Brotli file than gzip produced, but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.

[Brotli is licensed with the MIT license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI library jbrotli is ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].
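As background for reviewers: once a codec like this is on the classpath, Hadoop picks it up through the {{io.compression.codecs}} configuration property and resolves it via {{CompressionCodecFactory}}. Below is a minimal lookup sketch; the class name {{org.apache.hadoop.io.compress.BrotliCodec}} and the {{.br}} default extension are assumptions for illustration, not taken from the attached patch.

{code:title=Hypothetical codec registration and lookup (Java)}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class BrotliCodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical class name; the real one would come from the patch.
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.BrotliCodec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // The factory resolves codecs by file extension; this assumes the
    // codec's getDefaultExtension() returns ".br".
    CompressionCodec codec = factory.getCodec(new Path("test.br"));
    System.out.println(codec == null
        ? "brotli codec not registered"
        : "resolved codec: " + codec.getClass().getName());
  }
}
{code}

Registering through configuration rather than hard-coding the codec keeps the dependency optional, which matters here because the JNI library may not be available on every platform.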
[jira] [Commented] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls
[ https://issues.apache.org/jira/browse/HADOOP-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150751#comment-15150751 ] Ryan Blue commented on HADOOP-12810: Thanks for reviewing and committing this so quickly, [~vinayrpet]!
[jira] [Updated] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls
[ https://issues.apache.org/jira/browse/HADOOP-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated HADOOP-12810: Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls
[ https://issues.apache.org/jira/browse/HADOOP-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated HADOOP-12810: Attachment: HADOOP-12810.1.patch Adding a patch that fixes the problem.
[jira] [Created] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls
Ryan Blue created HADOOP-12810:
----------------------------------
Summary: FileSystem#listLocatedStatus causes unnecessary RPC calls
Key: HADOOP-12810
URL: https://issues.apache.org/jira/browse/HADOOP-12810
Project: Hadoop Common
Issue Type: Bug
Components: fs, fs/s3
Affects Versions: 2.7.2
Reporter: Ryan Blue
Assignee: Ryan Blue

{{FileSystem#listLocatedStatus}} lists the files in a directory and then calls {{getFileBlockLocations(stat.getPath(), ...)}} for each one instead of {{getFileBlockLocations(stat, ...)}}. The path-based overload just calls {{getFileStatus}} to rebuild a file status from the path and then delegates to the status-based overload, so every file triggers an unnecessary {{getFileStatus}} call.

This is particularly bad for S3, where {{getFileStatus}} is expensive. Avoiding the extra call improved input split calculation time for a data set in S3 by ~20x: from 10 minutes to 25 seconds.
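To make the pattern concrete, here is a simplified before/after sketch of the call described above; it illustrates the redundant lookup rather than reproducing the exact patch.

{code:title=Sketch of the redundant lookup (Java)}
import java.io.IOException;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class LocatedStatusSketch {
  // Before: the Path overload internally calls getFileStatus(path) to
  // rebuild the status we already hold -- one extra RPC (or S3 HEAD) per file.
  static BlockLocation[] viaPath(FileSystem fs, FileStatus stat)
      throws IOException {
    return fs.getFileBlockLocations(stat.getPath(), 0, stat.getLen());
  }

  // After: passing the FileStatus we already have skips the extra lookup.
  static BlockLocation[] viaStatus(FileSystem fs, FileStatus stat)
      throws IOException {
    return fs.getFileBlockLocations(stat, 0, stat.getLen());
  }
}
{code}

For a directory of N files, the first form issues N extra {{getFileStatus}} calls during split calculation, which is where the ~20x improvement on S3 comes from.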