[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1113172048

merged

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1097968517

@mehakmeet thanks, yes, sounds like it. file a JIRA
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1096910060

thanks for the reviews; updated the PR
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1091538089

OK, that suggestion from Thomas about having the checksum FS pass the open straight down is wrong, as it means the opened file bypasses checksum verification. I can't see a good way of passing down the openFile() options while still using the existing open() call, so I'm going to revert on the basis that "the key role here is protecting data in HDFS". No options will get down to the local/raw local FS, but that's OK.
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1090315539

testing:
* s3 london, markers keep, scale
* azure cardiff (to make sure I've not broken anything there in the move to openFile() in distcp)
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1090308409

really need reviews of this @mukund-thakur @mehakmeet @bibinchundatt @dannycjones @surendralilhore

This patch needs to go in before any other input stream optimisations so that:
1. we can cut that HEAD request overhead on small files
2. distcp and fsshell can tell the streams that they are reading the whole file, so they should do big reads and expect no backwards seeks
3. parquet and orc libs can switch to this to get the same benefits

Although #2975 sets it up, this PR doesn't include abfs handling of the file length option as an alternative to the file status. I've looked at it, but need a plan for etag tracking: we will have to replicate the bit in the s3a code where the first GET's etag is picked up and used from then on. A future piece of work. This PR does contain the tests that are needed there, though...
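For context, the kind of call the points above describe can be sketched against the Hadoop `openFile()` builder API (a hedged sketch, not code from this PR: it assumes the `fs.option.openfile.*` option keys standardised by HADOOP-16202 and needs hadoop-common on the classpath to compile):

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class OpenFileSketch {

  /**
   * Open a file for one sequential read of the whole object.
   * Passing the FileStatus (or just the length) lets an object store
   * skip the HEAD request it would otherwise issue inside open().
   */
  static FSDataInputStream openWholeFile(FileSystem fs, FileStatus status)
      throws Exception {
    return fs.openFile(status.getPath())
        // hint: sequential whole-file read, no backwards seeks expected
        .opt("fs.option.openfile.read.policy", "whole-file")
        // supply the known status so open() needs no HEAD/getFileStatus
        .withFileStatus(status)
        .build()   // returns a CompletableFuture<FSDataInputStream>
        .get();
  }
}
```

This is the pattern distcp and fsshell can use once the API is in; libraries holding only the length could instead set `fs.option.openfile.length`.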
steveloughran commented on PR #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1087397803

```
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:29:import java.util.concurrent.CompletableFuture;:8: Unused import - java.util.concurrent.CompletableFuture. [UnusedImports]
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:35:import org.apache.hadoop.fs.impl.AbstractFSBuilderImpl;:8: Unused import - org.apache.hadoop.fs.impl.AbstractFSBuilderImpl. [UnusedImports]
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:37:import org.apache.hadoop.fs.impl.OpenFileParameters;:8: Unused import - org.apache.hadoop.fs.impl.OpenFileParameters. [UnusedImports]
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:44:import org.apache.hadoop.util.LambdaUtils;:8: Unused import - org.apache.hadoop.util.LambdaUtils. [UnusedImports]
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:47:import static org.apache.hadoop.fs.Options.OpenFileOptions.FS_OPTION_OPENFILE_STANDARD_OPTIONS;:15: Unused import - org.apache.hadoop.fs.Options.OpenFileOptions.FS_OPTION_OPENFILE_STANDARD_OPTIONS. [UnusedImports]
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/impl/IOStatisticsBinding.java:528: /**: First sentence should end with a period. [JavadocStyle]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java:618: public static int DEFAULT_ASYNC_DRAIN_THRESHOLD = 16_000;:21: Name 'DEFAULT_ASYNC_DRAIN_THRESHOLD' must match pattern '^[a-z][a-zA-Z0-9]*$'. [StaticVariableName]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java:618: public static int DEFAULT_ASYNC_DRAIN_THRESHOLD = 16_000;:21: Variable 'DEFAULT_ASYNC_DRAIN_THRESHOLD' must be private and have accessor methods. [VisibilityModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:117: @Retries.OnceTranslated: 'method def modifier' has incorrect indentation level 4, expected level should be 2. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:121: try (DurationInfo ignored = new DurationInfo(LOG, false, "%s", action)) {: 'try' has incorrect indentation level 6, expected level should be 4. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:122: return operation.apply();: 'try' child has incorrect indentation level 8, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:123: } catch (AmazonClientException e) {: 'try rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:124: throw S3AUtils.translateException(action, path, e);: 'catch' child has incorrect indentation level 8, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:125: }: 'catch rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:126: }: 'method def rcurly' has incorrect indentation level 4, expected level should be 2. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:137: @Retries.OnceTranslated: 'method def modifier' has incorrect indentation level 4, expected level should be 2. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:144: try {: 'try' has incorrect indentation level 6, expected level should be 4. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:145: return invokeTrackingDuration(tracker, operation);: 'try' child has incorrect indentation level 8, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:146: } catch (AmazonClientException e) {: 'try rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:147: throw S3AUtils.translateException(action, path, e);: 'catch' child has incorrect indentation level 8, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:148: }: 'catch rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:149: }: 'method def
```
steveloughran commented on pull request #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1086028402

* rebase against trunk
* the s3a input stream will drain the inner stream asynchronously in seek/unbuffer-related calls if the number of bytes to drain is greater than a new config/openFile option `fs.s3a.input.async.drain.threshold`; the default is 16000 bytes, which seems a good number in my long-haul experiments. Draining also goes through a 16k byte buffer, which speeds it up. This aims to reduce the cost of seeking where bytes do need to be discarded.

The time to drain is also measured. It's the max time which can be high, as it is the time to read the bytes remaining in the current read. Abort is less expensive in the actual abort; it's the negotiation of a new TLS channel later which costs.

```
(stream_read_remote_stream_aborted.mean=(samples=2, sum=2, mean=1.))
(stream_read_remote_stream_drain.mean=(samples=13, sum=29, mean=2.2308)));
(stream_read_remote_stream_aborted.max=1)
(stream_read_remote_stream_drain.max=25));
```

The `ITestS3AInputStreamPerformance` suite also sets the file length in every openFile call, so it skips all the HEAD requests at stream creation and saves a few seconds overall, showing the benefit is tangible.
full stats from a remote testrun

```
2022-04-01 15:49:10,887 [JUnit] INFO s3a.AbstractS3ATestBase (AbstractS3ATestBase.java:dumpFileSystemIOStatistics(123)) - Aggregate FileSystem Statistics
counters=((action_executor_acquired=1) (action_file_opened=8) (action_http_get_request=15) (action_http_head_request=25) (audit_request_execution=76) (audit_span_creation=50) (directories_created=8) (directories_deleted=7) (fake_directories_deleted=1) (files_created=1) (files_deleted=1) (object_bulk_delete_request=2) (object_delete_objects=9) (object_delete_request=7) (object_list_request=18) (object_metadata_request=25) (object_put_bytes=32768) (object_put_request=9) (object_put_request_completed=9) (op_create=1) (op_delete=8) (op_get_file_status=9) (op_mkdirs=8) (op_open=8) (store_io_request=78) (stream_aborted=2) (stream_read_bytes=93473433) (stream_read_bytes_backwards_on_seek=12713984) (stream_read_bytes_discarded_in_abort=43889622) (stream_read_bytes_discarded_in_close=252395) (stream_read_close_operations=8) (stream_read_closed=13) (stream_read_fully_operations=8) (stream_read_opened=15) (stream_read_operations=6124) (stream_read_operations_incomplete=6071) (stream_read_remote_stream_aborted=2) (stream_read_remote_stream_drain=13) (stream_read_seek_backward_operations=4) (stream_read_seek_bytes_discarded=45092691) (stream_read_seek_bytes_skipped=55054163) (stream_read_seek_forward_operations=175) (stream_read_seek_operations=179) (stream_read_seek_policy_changed=9) (stream_read_total_bytes=138818519) (stream_write_block_uploads=1) (stream_write_bytes=32768) (stream_write_total_data=65536));
gauges=((stream_write_block_uploads_pending=1));
minimums=((action_executor_acquired.min=0) (action_file_opened.min=0) (action_http_get_request.min=31) (action_http_head_request.min=21) (object_bulk_delete_request.min=37) (object_delete_request.min=28) (object_list_request.min=27) (object_put_request.min=60) (op_create.min=61) (op_delete.min=28) (op_get_file_status.min=35) (op_mkdirs.min=155) (stream_read_remote_stream_aborted.min=1) (stream_read_remote_stream_drain.min=0));
maximums=((action_executor_acquired.max=0) (action_file_opened.max=0) (action_http_get_request.max=730) (action_http_head_request.max=1663) (object_bulk_delete_request.max=84) (object_delete_request.max=35) (object_list_request.max=648) (object_put_request.max=205) (op_create.max=61) (op_delete.max=159) (op_get_file_status.max=1669) (op_mkdirs.max=769) (stream_read_remote_stream_aborted.max=1) (stream_read_remote_stream_drain.max=25));
means=((action_executor_acquired.mean=(samples=1, sum=0, mean=0.)) (action_file_opened.mean=(samples=8, sum=0, mean=0.)) (action_http_get_request.mean=(samples=15, sum=2752, mean=183.4667)) (action_http_head_request.mean=(samples=25, sum=7360, mean=294.4000)) (object_bulk_delete_request.mean=(samples=2, sum=121, mean=60.5000)) (object_delete_request.mean=(samples=7, sum=213, mean=30.4286)) (object_list_request.mean=(samples=18, sum=1520, mean=84.)) (object_put_request.mean=(samples=9, sum=797, mean=88.5556)) (op_create.mean=(samples=1, sum=61, mean=61.)) (op_delete.mean=(samples=8, sum=373, mean=46.6250)) (op_get_file_status.mean=(samples=9, sum=6793, mean=754.7778)) (op_mkdirs.mean=(samples=8, sum=2006, mean=250.7500)) (stream_read_remote_stream_aborted.mean=(samples=2, sum=2, mean=1.)) (stream_read_remote_stream_drain.mean=(samples=13, sum=29,
```
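The async-drain heuristic described above can be sketched in plain Java. This is an illustration only, not the actual S3AInputStream code: the class and method names are hypothetical, but the 16000-byte threshold and the 16 KiB drain buffer mirror the numbers in the comment.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainSketch {

  /** Mirrors fs.s3a.input.async.drain.threshold (default 16000 bytes). */
  static final long DRAIN_THRESHOLD = 16_000;

  /**
   * Decide whether to hand draining off to a background thread:
   * below the threshold, draining inline is cheap enough; above it,
   * the caller should not block while the remaining bytes are read.
   */
  static boolean shouldDrainAsync(long bytesRemaining) {
    return bytesRemaining > DRAIN_THRESHOLD;
  }

  /**
   * Drain the rest of a stream through a 16 KiB buffer, returning
   * the number of bytes discarded.
   */
  static long drain(InputStream in) throws IOException {
    byte[] buffer = new byte[16 * 1024];
    long discarded = 0;
    int read;
    while ((read = in.read(buffer)) >= 0) {
      discarded += read;
    }
    return discarded;
  }

  public static void main(String[] args) throws IOException {
    // 40 000 bytes left in the current GET: over the threshold,
    // so a real stream would drain (or abort) asynchronously.
    System.out.println(shouldDrainAsync(40_000)); // true
    System.out.println(drain(new ByteArrayInputStream(new byte[40_000]))); // 40000
  }
}
```

The remaining trade-off the comment describes — drain vs abort — is that abort itself is cheap, but the next request then pays for a new TLS channel, which is why small remainders are worth draining.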
steveloughran commented on pull request #2584: URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1075138032

checkstyle
```
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:120:import org.apache.hadoop.fs.s3a.select.InternalSelectConstants;:8: Unused import - org.apache.hadoop.fs.s3a.select.InternalSelectConstants. [UnusedImports]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:194:import static org.apache.hadoop.fs.impl.AbstractFSBuilderImpl.rejectUnknownMandatoryKeys;:15: Unused import - org.apache.hadoop.fs.impl.AbstractFSBuilderImpl.rejectUnknownMandatoryKeys. [UnusedImports]
```

javac warnings
```
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:4099:14:[deprecation] getDefaultBlockSize() in FileSystem has been deprecated
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/CopyFromLocalOperation.java:235:16:[unchecked] unchecked method invocation: method sort in interface List is applied to given types
```