[jira] [Commented] (HADOOP-15421) Stabilise/formalise the JSON _SUCCESS format used in the S3A committers

2018-06-06 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503612#comment-16503612
 ] 

Ryan Blue commented on HADOOP-15421:


+1 for the committer's full classname. Enough classes are shaded and relocated 
that it would be useful.

> Stabilise/formalise the JSON _SUCCESS format used in the S3A committers
> ---
>
> Key: HADOOP-15421
> URL: https://issues.apache.org/jira/browse/HADOOP-15421
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.2.0
> Reporter: Steve Loughran
> Priority: Major
>
> The S3A committers rely on an atomic PUT to save a JSON summary of the job to 
> the dest FS, containing files, statistics, etc. This is for internal testing, 
> but it turns out to be useful for Spark integration testing, Hive, etc.
> IBM's Stocator also generated a manifest.
> Proposed: come up with an (extensible) design that we are happy with as a 
> long-lived format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15421) Stabilise/formalise the JSON _SUCCESS format used in the S3A committers

2018-04-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458794#comment-16458794
 ] 

Ryan Blue commented on HADOOP-15421:


I think this makes sense. As long as there is a _SUCCESS file, it may as well 
be used to pass the scope, i.e., which files were successful. What 
statistics/metrics are you adding to the file?







[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2017-12-18 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295256#comment-16295256
 ] 

Ryan Blue commented on HADOOP-13126:


Support for brotli, zstd, and lz4 is in Parquet master. It will be out in the 
next release.

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
> Issue Type: Improvement
> Components: io
> Affects Versions: 2.7.2
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, 
> HADOOP-13126.3.patch, HADOOP-13126.4.patch, HADOOP-13126.5.patch
>
>
> I've been testing [Brotli|https://github.com/google/brotli/], a new 
> compression library based on LZ77 from Google. Google's [brotli 
> benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
>  look really good and we're also seeing a significant improvement in 
> compression size, compression speed, or both.
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
> Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. 
> Another test resulted in a slightly larger Brotli file than gzip produced, 
> but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
> [Brotli is licensed with the MIT 
> license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI 
> library jbrotli is 
> ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].






[jira] [Commented] (HADOOP-15107) Prove the correctness of the new committers, or fix where they are not correct

2017-12-11 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286673#comment-16286673
 ] 

Ryan Blue commented on HADOOP-15107:


For the definition of correctness, I think we will need two, based on the 
failures that are handled. Task-level failure tolerance: the committer can 
handle any task failure, including during task commit. Job-level failure 
tolerance: the committer can handle failure during job commit. The contribution 
of the multi-part committer is that it handles task-level failure without a 
copy and minimizes the impact of job-level failure. But it doesn't guarantee 
job-level failure tolerance if the job commit fails.

> Prove the correctness of the new committers, or fix where they are not correct
> --
>
> Key: HADOOP-15107
> URL: https://issues.apache.org/jira/browse/HADOOP-15107
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.1.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
>
> I'm writing about the paper on the committers, one which, being a proper 
> paper, requires me to show the committers work.
> # define the requirements of a "Correct" committed job (this applies to the 
> FileOutputCommitter too)
> # show that the Staging committer meets these requirements (most of this is 
> implicit in that it uses the V1 FileOutputCommitter to marshal .pendingset 
> lists from committed tasks to the final destination, where they are read and 
> committed).
> # Show the magic committer also works.
> I'm now not sure that the magic committer works.






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2017-12-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280469#comment-16280469
 ] 

Ryan Blue commented on HADOOP-13126:


I'm happy to if there is interest from reviewers.







[jira] [Commented] (HADOOP-15003) Merge S3A committers into trunk: Yetus patch checker

2017-11-08 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244232#comment-16244232
 ] 

Ryan Blue commented on HADOOP-15003:


I don't think the partitioned committer should continue the _SUCCESS marker 
convention. Nothing that writes partitioned data currently depends on _SUCCESS 
markers, so the problem is easy to avoid entirely. The markers are also 
unreliable: what happens when you're appending data to a partition?

We implemented a property that allows users to opt in to have _SUCCESS created 
for the directory output committer only. It creates the _SUCCESS marker after 
all other operations have finished because that's when we can guarantee that 
the write was successful. It doesn't delete other markers because there are no 
well-defined semantics for _SUCCESS with overwrite.
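
A minimal sketch of the ordering described above, written against the local filesystem with java.nio rather than S3 or the Hadoop committer API. The class name, the {{createSuccessMarker}} flag (standing in for the opt-in property, whose real name is not given here), and the method names are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch: write _SUCCESS only after every other commit step has finished. */
final class SuccessMarkerSketch {
  static final String SUCCESS_MARKER = "_SUCCESS";

  /**
   * Commits the given file names into outputDir, then optionally writes the marker.
   * createSuccessMarker stands in for the opt-in property; its real name is an assumption.
   */
  static void commitJob(Path outputDir, List<String> fileNames, boolean createSuccessMarker)
      throws IOException {
    Files.createDirectories(outputDir);
    for (String name : fileNames) {
      // Stand-in for completing one pending upload.
      Files.write(outputDir.resolve(name), new byte[0]);
    }
    // Every other operation has finished, so the write is known to have succeeded.
    if (createSuccessMarker) {
      Path marker = outputDir.resolve(SUCCESS_MARKER);
      if (!Files.exists(marker)) {
        Files.createFile(marker);
      }
    }
    // Existing markers are left alone: _SUCCESS has no well-defined overwrite semantics.
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("committer-sketch");
    commitJob(dir, List.of("part-0000", "part-0001"), true);
    System.out.println("marker present: " + Files.exists(dir.resolve(SUCCESS_MARKER)));
  }
}
```

Any failure in the loop throws before the marker is written, so a reader that sees _SUCCESS can assume the preceding steps completed.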

> Merge S3A committers into trunk: Yetus patch checker
> 
>
> Key: HADOOP-15003
> URL: https://issues.apache.org/jira/browse/HADOOP-15003
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.0.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-041.patch, HADOOP-13786-042.patch, 
> HADOOP-13786-043.patch, HADOOP-13786-044.patch, HADOOP-13786-045.patch, 
> HADOOP-13786-046.patch
>
>
> This is a Yetus only JIRA created to have Yetus review the 
> HADOOP-13786/HADOOP-14971 patch as a .patch file, as the review PR 
> [https://github.com/apache/hadoop/pull/282] is stopping this happening in 
> HADOOP-14971.
> Reviews should go into the PR/other task






[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to S3 endpoints

2017-10-12 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202748#comment-16202748
 ] 

Ryan Blue commented on HADOOP-13786:


I'll try to take a look as well.

> Add S3Guard committer for zero-rename commits to S3 endpoints
> -
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-036.patch, HADOOP-13786-037.patch, 
> HADOOP-13786-038.patch, HADOOP-13786-039.patch, 
> HADOOP-13786-HADOOP-13345-001.patch, HADOOP-13786-HADOOP-13345-002.patch, 
> HADOOP-13786-HADOOP-13345-003.patch, HADOOP-13786-HADOOP-13345-004.patch, 
> HADOOP-13786-HADOOP-13345-005.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-007.patch, 
> HADOOP-13786-HADOOP-13345-009.patch, HADOOP-13786-HADOOP-13345-010.patch, 
> HADOOP-13786-HADOOP-13345-011.patch, HADOOP-13786-HADOOP-13345-012.patch, 
> HADOOP-13786-HADOOP-13345-013.patch, HADOOP-13786-HADOOP-13345-015.patch, 
> HADOOP-13786-HADOOP-13345-016.patch, HADOOP-13786-HADOOP-13345-017.patch, 
> HADOOP-13786-HADOOP-13345-018.patch, HADOOP-13786-HADOOP-13345-019.patch, 
> HADOOP-13786-HADOOP-13345-020.patch, HADOOP-13786-HADOOP-13345-021.patch, 
> HADOOP-13786-HADOOP-13345-022.patch, HADOOP-13786-HADOOP-13345-023.patch, 
> HADOOP-13786-HADOOP-13345-024.patch, HADOOP-13786-HADOOP-13345-025.patch, 
> HADOOP-13786-HADOOP-13345-026.patch, HADOOP-13786-HADOOP-13345-027.patch, 
> HADOOP-13786-HADOOP-13345-028.patch, HADOOP-13786-HADOOP-13345-028.patch, 
> HADOOP-13786-HADOOP-13345-029.patch, HADOOP-13786-HADOOP-13345-030.patch, 
> HADOOP-13786-HADOOP-13345-031.patch, HADOOP-13786-HADOOP-13345-032.patch, 
> HADOOP-13786-HADOOP-13345-033.patch, HADOOP-13786-HADOOP-13345-035.patch, 
> cloud-intergration-test-failure.log, objectstore.pdf, s3committer-master.zip
>
>
> A goal of this code is "support O(1) commits to S3 repositories in the 
> presence of failures". Implement it, including whatever is needed to 
> demonstrate the correctness of the algorithm. (that is, assuming that s3guard 
> provides a consistent view of the presence/absence of blobs, show that we can 
> commit directly).
> I consider us free to expose the blobstore-ness of the s3 output 
> streams (i.e. not visible until the close()), if we need to use that to allow 
> us to abort commit operations.






[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints

2017-03-15 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927084#comment-15927084
 ] 

Ryan Blue commented on HADOOP-13786:


On metrics: our layer on top of the staging committer tracks metrics and sends 
them to the job committer using the same PendingCommit that already gets 
serialized. That's an easy way to get more data back to the job committer, 
which then accumulates the number of files, sizes, etc. and stores it somewhere 
(or logs it?).
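
A rough sketch of that flow, assuming a simplified commit record. The {{PendingCommit}} class here is a hypothetical stand-in with made-up metric fields, not the real class, and {{JobTotals}} and {{aggregate}} are illustrative names only.

```java
import java.io.Serializable;
import java.util.List;

/** Hypothetical sketch: task metrics travel inside the already-serialized commit record. */
final class CommitMetricsSketch {

  /** Stand-in for the per-task pending commit; the metric fields are assumptions. */
  static final class PendingCommit implements Serializable {
    private static final long serialVersionUID = 1L;
    final String taskId;
    final int fileCount;
    final long totalBytes;

    PendingCommit(String taskId, int fileCount, long totalBytes) {
      this.taskId = taskId;
      this.fileCount = fileCount;
      this.totalBytes = totalBytes;
    }
  }

  /** Totals accumulated by the job committer. */
  static final class JobTotals {
    final int files;
    final long bytes;

    JobTotals(int files, long bytes) {
      this.files = files;
      this.bytes = bytes;
    }
  }

  /** Job commit: sum the metrics carried by every task's pending commit. */
  static JobTotals aggregate(List<PendingCommit> commits) {
    int files = 0;
    long bytes = 0L;
    for (PendingCommit c : commits) {
      files += c.fileCount;
      bytes += c.totalBytes;
    }
    return new JobTotals(files, bytes);
  }

  public static void main(String[] args) {
    JobTotals totals = aggregate(List.of(
        new PendingCommit("task_0", 2, 1024L),
        new PendingCommit("task_1", 3, 4096L)));
    // Store or log the totals; here they are simply printed.
    System.out.println("files=" + totals.files + " bytes=" + totals.bytes);
  }
}
```

Because the record is serialized anyway, adding fields to it needs no new channel between tasks and the job committer.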

> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> 
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch, 
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, 
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-007.patch, HADOOP-13786-HADOOP-13345-009.patch, 
> HADOOP-13786-HADOOP-13345-010.patch, HADOOP-13786-HADOOP-13345-011.patch, 
> HADOOP-13786-HADOOP-13345-012.patch, HADOOP-13786-HADOOP-13345-013.patch, 
> s3committer-master.zip






[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints

2017-03-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905876#comment-15905876
 ] 

Ryan Blue commented on HADOOP-13786:


On UUID suffixes: the option is needed for the "append" conflict resolution. 
Without it, you can easily overwrite files from previous writes. We also use it 
to identify files across partitions added in the same batch. I think it is 
worth keeping if this committer is to be used for more than just creating or 
replacing a single directory of files.
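
To illustrate why the suffix matters for "append", here is a small sketch of one way to add a per-job UUID to an output file name. The helper name {{withUuid}} and the exact placement of the suffix (before the extension) are assumptions made for illustration, not necessarily what the committer does.

```java
import java.util.UUID;

/** Hypothetical sketch: make output names unique per job so appends never collide. */
final class UuidSuffixSketch {

  /** Inserts "-<uuid>" before the extension, or appends it when there is none. */
  static String withUuid(String fileName, String uuid) {
    int dot = fileName.lastIndexOf('.');
    if (dot <= 0) {
      return fileName + "-" + uuid;
    }
    return fileName.substring(0, dot) + "-" + uuid + fileName.substring(dot);
  }

  public static void main(String[] args) {
    // One UUID per write job: every file in the batch shares it, in every partition.
    String jobUuid = UUID.randomUUID().toString();
    System.out.println(withUuid("part-00000.parquet", jobUuid));
    System.out.println(withUuid("part-00001.parquet", jobUuid));
  }
}
```

Two jobs that both produce part-00000.parquet in the same partition end up with different names, and the shared suffix identifies which files were added together.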

> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> 
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch, 
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, 
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-007.patch, HADOOP-13786-HADOOP-13345-009.patch, 
> HADOOP-13786-HADOOP-13345-010.patch, HADOOP-13786-HADOOP-13345-011.patch, 
> s3committer-master.zip






[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints

2017-03-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905498#comment-15905498
 ] 

Ryan Blue commented on HADOOP-13786:


For the staging committer drawbacks, I think there's a clear path to avoid them.

The committer is not intended to instantiate its own S3Client. It does so for 
testing, but when it is integrated with S3A it should be passed a configured 
client when it is instantiated, or should use package-private access to get one 
from the S3A FS object. In other words, the default {{findClient}} method 
shouldn't be used; we don't use it other than for testing. My intent was for 
S3A to have a {{FileSystem#newOutputCommitter(Path, JobContext)}} factory 
method. That way, the FS can pass its internal S3 client instead of 
instantiating two.

The storage on local disk isn't a requirement. We can replace that with an 
output stream that buffers in memory and sends parts to S3 when they are ready 
(we're planning on doing this eventually). This is just waiting on a stable API 
to rely on that can close a stream, but not commit data. Since the committer 
API right now expects tasks to create files underneath the work path, we'll 
have to figure out how tasks can get a multi-part stream that is committed 
later without using a different method.

We can also pass in a thread-pool if there is a better one to use. I think this 
is separate enough that it should be easy.
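
As a rough illustration of the in-memory alternative to local disk, the sketch below buffers writes and hands each full part to a sink. Everything here is hypothetical: {{PartSink}} stands in for an S3 multipart part upload, and no real S3 client or Hadoop API is used. The point it tries to show is that {{close()}} sends the last part but does not complete the upload, leaving the commit for later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical sketch: buffer in memory and emit parts as they fill up. */
final class PartBufferingSketch {

  /** Stand-in for uploading one part of a multipart upload. */
  interface PartSink {
    void uploadPart(int partNumber, byte[] data) throws IOException;
  }

  static final class PartBufferingOutputStream extends OutputStream {
    private final int partSize;
    private final PartSink sink;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int nextPart = 1;
    private boolean closed;

    PartBufferingOutputStream(int partSize, PartSink sink) {
      this.partSize = partSize;
      this.sink = sink;
    }

    @Override
    public void write(int b) throws IOException {
      buffer.write(b);
      if (buffer.size() >= partSize) {
        flushPart();
      }
    }

    private void flushPart() throws IOException {
      sink.uploadPart(nextPart++, buffer.toByteArray());
      buffer.reset();
    }

    /** Sends any remaining bytes as the final part; deliberately does not commit. */
    @Override
    public void close() throws IOException {
      if (closed) {
        return;
      }
      closed = true;
      if (buffer.size() > 0) {
        flushPart();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    PartSink printer = (n, data) ->
        System.out.println("part " + n + ": " + data.length + " bytes");
    try (OutputStream out = new PartBufferingOutputStream(4, printer)) {
      out.write(new byte[10]);
    }
  }
}
```

A committer built on this would record the part numbers per file and complete or abort the upload at job commit.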

> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> 
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch, 
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, 
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-007.patch, HADOOP-13786-HADOOP-13345-009.patch, 
> HADOOP-13786-HADOOP-13345-010.patch, s3committer-master.zip






[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints

2017-03-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903352#comment-15903352
 ] 

Ryan Blue commented on HADOOP-13786:


Is there a branch where I can take a look at the S3A test issue? I can probably 
get the tests working.

> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> 
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch, 
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, 
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-007.patch, HADOOP-13786-HADOOP-13345-009.patch, 
> s3committer-master.zip






[jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints

2017-03-08 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901663#comment-15901663
 ] 

Ryan Blue commented on HADOOP-13786:


Thanks for doing so much to get this in! Let me know when it is a good time to 
have a look at it or if you need anything. There are a few extension points 
that we rely on, like 
[getFinalOutputPath|https://github.com/rdblue/s3committer/blob/master/src/main/java/com/netflix/bdp/s3/S3MultipartOutputCommitter.java#L99-L113],
 that I would like to keep if possible.

> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> 
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch, 
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, 
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-007.patch, s3committer-master.zip






[jira] [Updated] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints

2017-03-07 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13786:
---
Attachment: s3committer-master.zip

I'm attaching the S3 committer code to show intent to contribute.

> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> 
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch, 
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch, 
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-007.patch, s3committer-master.zip






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2016-10-05 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550286#comment-15550286
 ] 

Ryan Blue commented on HADOOP-13126:


Brotli compression isn't splittable, but can be used with Hadoop-friendly 
container formats like Parquet. Using those formats is a best practice anyway, 
so it shouldn't matter that you can't easily split files when you use Brotli as 
an outer wrapper.







[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-10-05 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: HADOOP-13126.5.patch







[jira] [Comment Edited] (HADOOP-12878) Impersonate hosts in s3a for better data locality handling

2016-07-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365275#comment-15365275
 ] 

Ryan Blue edited comment on HADOOP-12878 at 7/6/16 10:55 PM:
-

FileInputFormat works slightly differently. First, the [split 
size|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L445]
 is calculated from the file's reported block size and the current min and max 
split sizes. Then, [the file is broken into N 
splits|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L410-416]
 of that size, where {{N = Math.ceil(fileLength / splitSize)}}. The block 
locations are then used to determine [where each split is 
located|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L448],
 based on the split's starting offset.

The result is that {{getFileBlockLocations}} can return a single location for 
the entire file and you'll still end up with N roughly block-sized splits. This 
is what enables you to get more parallelism by setting smaller split sizes, 
even if the resulting splits don't correspond to different blocks. In our 
environment, we use a 64MB S3 block size and don't see a bottleneck from one 
input split per file.
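
A minimal Java sketch of the split arithmetic described above, for illustration only. It is not the Hadoop source: the class and method names ({{SplitSketch}}, {{computeSplitSize}}, {{splitCount}}, {{splitOffsets}}) and the 200MB file length are assumptions made for the example. The count uses integer ceiling division, because dividing two Java longs truncates before {{Math.ceil}} could round up. Hadoop's own loop also tolerates a slightly oversized final split, so real counts can differ by one; that detail is left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of FileInputFormat-style split planning; hypothetical names throughout. */
final class SplitSketch {

  /** Clamp the file's reported block size between the min and max split sizes. */
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  /** N = ceil(fileLength / splitSize), computed without floating point. */
  static long splitCount(long fileLength, long splitSize) {
    return (fileLength + splitSize - 1) / splitSize;
  }

  /** Starting offset of each split; locality is looked up from these offsets. */
  static List<Long> splitOffsets(long fileLength, long splitSize) {
    List<Long> offsets = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += splitSize) {
      offsets.add(offset);
    }
    return offsets;
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;    // 64MB S3 block size, as in the comment
    long fileLength = 200L * 1024 * 1024;  // assumed 200MB file
    long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
    System.out.println("splitSize = " + splitSize);
    System.out.println("splits    = " + splitCount(fileLength, splitSize));
    System.out.println("offsets   = " + splitOffsets(fileLength, splitSize));
  }
}
```

Under these assumptions the 200MB file is planned as four splits, even if {{getFileBlockLocations}} reports one location for the whole file. Lowering the max split size below the block size increases the count further.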


was (Author: rdblue):
FileInputFormat works slightly differently. First, the [split 
size|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L445]
 is calculated from the file's reported block size and the current min and max 
split sizes. Then, the file is broken into N splits of that size, where {{N = 
Math.ceil(fileLength / splitSize)}}. The block locations are then used to 
determine where each split is located, based on the split's starting offset.

The result is that {{getFileBlockLocations}} can return a single location for 
the entire file and you'll still end up with N roughly block-sized splits. This 
is what enables you to get more parallelism by setting smaller split sizes, 
even if the resulting splits don't correspond to different blocks. In our 
environment, we use a 64MB S3 block size and don't see a bottleneck from one 
input split per file.

> Impersonate hosts in s3a for better data locality handling
> --
>
> Key: HADOOP-12878
> URL: https://issues.apache.org/jira/browse/HADOOP-12878
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Thomas Demoor
>Assignee: Thomas Demoor
>
> Currently, {{localhost}} is passed as locality for each block, causing all 
> blocks involved in a job to initially target the same node (RM), before being 
> moved by the scheduler (to a rack-local node). This reduces parallelism for 
> jobs (with short-lived mappers). 
> We should mimic Azure's implementation: a config setting 
> {{fs.s3a.block.location.impersonatedhost}} where the user can enter the list 
> of hostnames in the cluster to return to {{getFileBlockLocations}}. 
> Possible optimization: for larger systems, it might be better to return N 
> (5?) random hostnames to prevent passing a huge array (the downstream code 
> assumes size = O(3)).
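
The possible optimization in the description, returning a few random hostnames instead of the whole list, could look roughly like the Java sketch below. This is only an illustration under stated assumptions: the class {{ImpersonatedHosts}}, the method {{pick}}, and the comma-separated format of the configured value are hypothetical and are not taken from any patch on this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: choose a small random subset of configured hostnames. */
final class ImpersonatedHosts {

  /**
   * Return up to n hostnames chosen at random from a comma-separated list.
   * The comma-separated format is an assumption for this sketch.
   */
  static String[] pick(String configuredHosts, int n, Random random) {
    List<String> hosts = new ArrayList<>();
    for (String host : configuredHosts.split(",")) {
      String trimmed = host.trim();
      if (!trimmed.isEmpty()) {
        hosts.add(trimmed);
      }
    }
    // Shuffle, then keep the first n entries so the returned array stays small.
    Collections.shuffle(hosts, random);
    int count = Math.max(0, Math.min(n, hosts.size()));
    return hosts.subList(0, count).toArray(new String[0]);
  }

  public static void main(String[] args) {
    String configured = "node1, node2, node3, node4, node5, node6, node7, node8";
    System.out.println(Arrays.toString(pick(configured, 5, new Random())));
  }
}
```

Picking a fresh subset per call would spread the reported locations across the cluster while keeping the array close to the size the downstream code expects.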






[jira] [Commented] (HADOOP-12878) Impersonate hosts in s3a for better data locality handling

2016-07-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365275#comment-15365275
 ] 

Ryan Blue commented on HADOOP-12878:


FileInputFormat works slightly differently. First, the [split 
size|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L445]
 is calculated from the file's reported block size and the current min and max 
split sizes. Then, the file is broken into N splits of that size, where {{N = 
Math.ceil(fileLength / splitSize)}}. The block locations are then used to 
determine where each split is located, based on the split's starting offset.

The result is that {{getFileBlockLocations}} can return a single location for 
the entire file and you'll still end up with N roughly block-sized splits. This 
is what enables you to get more parallelism by setting smaller split sizes, 
even if the resulting splits don't correspond to different blocks. In our 
environment, we use a 64MB S3 block size and don't see a bottleneck from one 
input split per file.

> Impersonate hosts in s3a for better data locality handling
> --
>
> Key: HADOOP-12878
> URL: https://issues.apache.org/jira/browse/HADOOP-12878
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.8.0
>Reporter: Thomas Demoor
>Assignee: Thomas Demoor
>






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-06-14 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: HADOOP-13126.4.patch

Adding a patch that fixes the checkstyle issues.

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, 
> HADOOP-13126.3.patch, HADOOP-13126.4.patch
>
>






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2016-06-13 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328634#comment-15328634
 ] 

Ryan Blue commented on HADOOP-13126:


I'm attaching a new patch that depends on jbrotli 0.5.0. That version fixes the 
issue I noted above where Brotli doesn't consume all of its input. The new 
patch also adds BrotliCodec to the codec service loader so it is available 
automatically when needed to read .br files.

We've been testing this code for a couple weeks and it seems to be working and 
stable.

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, 
> HADOOP-13126.3.patch
>
>






[jira] [Issue Comment Deleted] (HADOOP-13126) Add Brotli compression codec

2016-06-13 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Comment: was deleted

(was: I'm attaching a new version of this patch that depends on [~marki]'s 
0.5.0 release. That fixes the bug I noted above where Brotli doesn't consume 
all of the input buffer. This also adds BrotliCodec to the codec service loader 
and tests that it is loaded correctly.

We've been running tests on this code for a few weeks and it appears to be 
stable.)

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, 
> HADOOP-13126.3.patch
>
>






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2016-06-13 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328619#comment-15328619
 ] 

Ryan Blue commented on HADOOP-13126:


I'm attaching a new version of this patch that depends on [~marki]'s 0.5.0 
release. That fixes the bug I noted above where Brotli doesn't consume all of 
the input buffer. This also adds BrotliCodec to the codec service loader and 
tests that it is loaded correctly.

We've been running tests on this code for a few weeks and it appears to be 
stable.

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, 
> HADOOP-13126.3.patch
>
>






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-06-13 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: HADOOP-13126.3.patch

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, 
> HADOOP-13126.3.patch
>
>






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-05-11 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: HADOOP-13126.2.patch

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch
>
>






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278673#comment-15278673
 ] 

Ryan Blue commented on HADOOP-13126:


The results above show the comparison with Snappy. The file is less than half 
the size and compression took about the same amount of time. Comparing to LZ4 
would be interesting. It isn't supported by Parquet so it's a bit harder for me 
to drop into my test case.
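
As a quick check of the "less than half the size" and "about the same amount of time" statements, the Java sketch below recomputes the ratios from the file sizes and wall-clock times in the preliminary test results in the issue description. The class name {{BrotliNumbers}} and its helper are made up for this example; only the constants come from the issue.

```java
import java.util.Locale;

/** Recomputes size and time ratios from the figures quoted in the issue. */
final class BrotliNumbers {
  // File sizes in bytes, from the `ls -l` listing in the issue description.
  static final long BROTLI_BYTES = 1_068_821_936L;
  static final long GZIP_BYTES = 1_421_601_880L;
  static final long SNAPPY_BYTES = 2_265_950_833L;

  // Wall-clock ("real") times converted to seconds.
  static final double SNAPPY_SECONDS = 1 * 60 + 17.106;
  static final double BROTLI_SECONDS = 1 * 60 + 16.640;
  static final double GZIP_SECONDS = 3 * 60 + 39.496;

  static double ratio(double numerator, double denominator) {
    return numerator / denominator;
  }

  public static void main(String[] args) {
    System.out.printf(Locale.ROOT, "brotli/snappy size: %.3f%n",
        ratio(BROTLI_BYTES, SNAPPY_BYTES));
    System.out.printf(Locale.ROOT, "brotli/gzip size:   %.3f%n",
        ratio(BROTLI_BYTES, GZIP_BYTES));
    System.out.printf(Locale.ROOT, "brotli/snappy time: %.3f%n",
        ratio(BROTLI_SECONDS, SNAPPY_SECONDS));
    System.out.printf(Locale.ROOT, "gzip/brotli time:   %.3f%n",
        ratio(GZIP_SECONDS, BROTLI_SECONDS));
  }
}
```

On these figures the Brotli file is roughly 47% of the Snappy file and about three quarters of the gzip file, with Brotli and Snappy finishing within a second of each other. This is a single run on one file, so it should be read as indicative only.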

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278659#comment-15278659
 ] 

Ryan Blue commented on HADOOP-13126:


[~andrew.wang], you guys are probably interested in this.

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>






[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278653#comment-15278653
 ] 

Ryan Blue commented on HADOOP-13126:


[~marki], could you review this patch also?

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: (was: HADOOP-13126.1.patch)

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: HADOOP-13126.1.patch

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>
> I've been testing [Brotli|https://github.com/google/brotli/], a new 
> compression library based on LZ77 from Google. Google's [brotli 
> benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
>  look really good and we're also seeing a significant improvement in 
> compression size, compression speed, or both.
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
> Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. 
> Another test resulted in a slightly larger Brotli file than gzip produced, 
> but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
> [Brotli is licensed with the MIT 
> license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI 
> library jbrotli is 
> ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Status: Patch Available  (was: Open)

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>
> I've been testing [Brotli|https://github.com/google/brotli/], a new 
> compression library based on LZ77 from Google. Google's [brotli 
> benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
>  look really good and we're also seeing a significant improvement in 
> compression size, compression speed, or both.
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
> Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. 
> Another test resulted in a slightly larger Brotli file than gzip produced, 
> but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
> [Brotli is licensed with the MIT 
> license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI 
> library jbrotli is 
> ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].






[jira] [Updated] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-13126:
---
Attachment: HADOOP-13126.1.patch

> Add Brotli compression codec
> 
>
> Key: HADOOP-13126
> URL: https://issues.apache.org/jira/browse/HADOOP-13126
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-13126.1.patch
>
>
> I've been testing [Brotli|https://github.com/google/brotli/], a new 
> compression library based on LZ77 from Google. Google's [brotli 
> benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
>  look really good and we're also seeing a significant improvement in 
> compression size, compression speed, or both.
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
> Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. 
> Another test resulted in a slightly larger Brotli file than gzip produced, 
> but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
> [Brotli is licensed with the MIT 
> license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI 
> library jbrotli is 
> ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].






[jira] [Created] (HADOOP-13126) Add Brotli compression codec

2016-05-10 Thread Ryan Blue (JIRA)
Ryan Blue created HADOOP-13126:
--

 Summary: Add Brotli compression codec
 Key: HADOOP-13126
 URL: https://issues.apache.org/jira/browse/HADOOP-13126
 Project: Hadoop Common
  Issue Type: Improvement
  Components: io
Reporter: Ryan Blue
Assignee: Ryan Blue


I've been testing [Brotli|https://github.com/google/brotli/], a new compression 
library based on LZ77 from Google. Google's [brotli 
benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
 look really good and we're also seeing a significant improvement in 
compression size, compression speed, or both.

{code:title=Brotli preliminary test results}
[blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite

real    1m17.106s
user    1m30.804s
sys     0m4.404s

[blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite

real    1m16.640s
user    1m24.244s
sys     0m6.412s

[blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite

real    3m39.496s
user    3m48.736s
sys     0m3.880s

[blue@work Downloads]$ ls -l
-rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
-rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
-rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
{code}

Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. 
Another test resulted in a slightly larger Brotli file than gzip produced, but 
Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
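
As a rough cross-check of the figures in the transcript above, the size and wall-clock ratios can be worked out directly. This is only a sketch: the byte counts and `real` times are copied from the transcript, while the helper names `size_ratio` and `speedup` are made up for illustration. The timings cover the whole `parquet` conversion run, not compression alone, and the "4x faster" figure comes from a separate test whose numbers are not shown here.

```python
# Byte counts from the `ls -l` output and `real` times from the `time` output
# in the transcript above. Nothing else is measured here.
SIZES_BYTES = {
    "brotli": 1068821936,
    "gzip": 1421601880,
    "snappy": 2265950833,
}
WALL_SECONDS = {
    "snappy": 1 * 60 + 17.106,
    "brotli": 1 * 60 + 16.640,
    "gzip": 3 * 60 + 39.496,
}


def size_ratio(codec, baseline):
    """Output size of `codec` as a fraction of the `baseline` output size."""
    return SIZES_BYTES[codec] / SIZES_BYTES[baseline]


def speedup(codec, baseline):
    """How many times longer the `baseline` run took than the `codec` run."""
    return WALL_SECONDS[baseline] / WALL_SECONDS[codec]


if __name__ == "__main__":
    for baseline in ("gzip", "snappy"):
        print(
            f"brotli vs {baseline}: "
            f"size ratio {size_ratio('brotli', baseline):.3f}, "
            f"wall-clock speedup {speedup('brotli', baseline):.2f}x"
        )
```

On these numbers the Brotli file is about 75% of the gzip file and about 47% of the snappy file, and the gzip run took a little under 3x as long as the Brotli run.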

[Brotli is licensed with the MIT 
license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI 
library jbrotli is 
ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].






[jira] [Commented] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls

2016-02-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150751#comment-15150751
 ] 

Ryan Blue commented on HADOOP-12810:


Thanks for reviewing and committing this so quickly, [~vinayrpet]!

> FileSystem#listLocatedStatus causes unnecessary RPC calls
> -
>
> Key: HADOOP-12810
> URL: https://issues.apache.org/jira/browse/HADOOP-12810
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs, fs/s3
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.7.3
>
> Attachments: HADOOP-12810.1.patch
>
>
> {{FileSystem#listLocatedStatus}} lists the files in a directory and then 
> calls {{getFileBlockLocations(stat.getPath(), ...)}} for each instead of 
> {{getFileBlockLocations(stat, ...)}}. That function with the path arg just 
> calls {{getFileStatus}} to get another file status from the path and calls 
> the file status version, so this ends up calling {{getFileStatus}} 
> unnecessarily.
> This is particularly bad for S3, where {{getFileStatus}} is expensive. 
> Avoiding the extra call improved input split calculation time for a data set 
> in S3 by ~20x: from 10 minutes to 25 seconds.





[jira] [Updated] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls

2016-02-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-12810:
---
Status: Patch Available  (was: Open)

> FileSystem#listLocatedStatus causes unnecessary RPC calls
> -
>
> Key: HADOOP-12810
> URL: https://issues.apache.org/jira/browse/HADOOP-12810
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs, fs/s3
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-12810.1.patch
>
>
> {{FileSystem#listLocatedStatus}} lists the files in a directory and then 
> calls {{getFileBlockLocations(stat.getPath(), ...)}} for each instead of 
> {{getFileBlockLocations(stat, ...)}}. That function with the path arg just 
> calls {{getFileStatus}} to get another file status from the path and calls 
> the file status version, so this ends up calling {{getFileStatus}} 
> unnecessarily.
> This is particularly bad for S3, where {{getFileStatus}} is expensive. 
> Avoiding the extra call improved input split calculation time for a data set 
> in S3 by ~20x: from 10 minutes to 25 seconds.





[jira] [Updated] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls

2016-02-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated HADOOP-12810:
---
Attachment: HADOOP-12810.1.patch

Adding a patch that fixes the problem.

> FileSystem#listLocatedStatus causes unnecessary RPC calls
> -
>
> Key: HADOOP-12810
> URL: https://issues.apache.org/jira/browse/HADOOP-12810
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs, fs/s3
>Affects Versions: 2.7.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: HADOOP-12810.1.patch
>
>
> {{FileSystem#listLocatedStatus}} lists the files in a directory and then 
> calls {{getFileBlockLocations(stat.getPath(), ...)}} for each instead of 
> {{getFileBlockLocations(stat, ...)}}. That function with the path arg just 
> calls {{getFileStatus}} to get another file status from the path and calls 
> the file status version, so this ends up calling {{getFileStatus}} 
> unnecessarily.
> This is particularly bad for S3, where {{getFileStatus}} is expensive. 
> Avoiding the extra call improved input split calculation time for a data set 
> in S3 by ~20x: from 10 minutes to 25 seconds.





[jira] [Created] (HADOOP-12810) FileSystem#listLocatedStatus causes unnecessary RPC calls

2016-02-16 Thread Ryan Blue (JIRA)
Ryan Blue created HADOOP-12810:
--

 Summary: FileSystem#listLocatedStatus causes unnecessary RPC calls
 Key: HADOOP-12810
 URL: https://issues.apache.org/jira/browse/HADOOP-12810
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs, fs/s3
Affects Versions: 2.7.2
Reporter: Ryan Blue
Assignee: Ryan Blue


{{FileSystem#listLocatedStatus}} lists the files in a directory and then calls 
{{getFileBlockLocations(stat.getPath(), ...)}} for each instead of 
{{getFileBlockLocations(stat, ...)}}. The overload that takes a path just 
calls {{getFileStatus}} to fetch another file status for that path and then 
calls the file status overload, so every listed file incurs an unnecessary 
{{getFileStatus}} call.

This is particularly bad for S3, where {{getFileStatus}} is expensive. Avoiding 
the extra call improved input split calculation time for a data set in S3 by 
~20x: from 10 minutes to 25 seconds.
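
The call pattern described above can be illustrated with a small toy model. This is not Hadoop code: `CountingFileSystem`, its methods, and `count_calls` are hypothetical names invented for this sketch, written in Python rather than Java to keep it short. It only counts how many simulated `getFileStatus` calls each variant makes per listed file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatus:
    path: str
    length: int


class CountingFileSystem:
    """Toy stand-in for a file system; counts simulated getFileStatus calls."""

    def __init__(self, files):
        self._files = {p: FileStatus(p, n) for p, n in files.items()}
        self.get_file_status_calls = 0

    def list_status(self, directory):
        # One listing call already returns a status for every file.
        return [s for p, s in sorted(self._files.items()) if p.startswith(directory)]

    def get_file_status(self, path):
        # The expensive per-file lookup (a remote request on S3).
        self.get_file_status_calls += 1
        return self._files[path]

    def block_locations_for_status(self, stat, start, length):
        # Status overload: uses the status it is given, no extra lookup.
        return [(stat.path, start, min(length, stat.length))]

    def block_locations_for_path(self, path, start, length):
        # Path overload: re-fetches the status, then delegates.
        stat = self.get_file_status(path)
        return self.block_locations_for_status(stat, start, length)

    def list_located_status_before(self, directory):
        # Pattern described in the issue: pass stat.path, discarding the status.
        return [
            (s, self.block_locations_for_path(s.path, 0, s.length))
            for s in self.list_status(directory)
        ]

    def list_located_status_after(self, directory):
        # Pattern after the fix: pass the status that the listing returned.
        return [
            (s, self.block_locations_for_status(s, 0, s.length))
            for s in self.list_status(directory)
        ]


def count_calls(method_name, n_files=1000):
    """Run one variant over n_files files and return the lookup count."""
    fs = CountingFileSystem({f"/data/part-{i:05d}": 128 for i in range(n_files)})
    getattr(fs, method_name)("/data/")
    return fs.get_file_status_calls


if __name__ == "__main__":
    print("before:", count_calls("list_located_status_before"))
    print("after: ", count_calls("list_located_status_after"))
```

In this model the "before" variant performs one extra lookup per listed file while the "after" variant performs none, and both return the same results. The real saving depends on how expensive each lookup is for the underlying store.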


