[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-29 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1113172048

   merged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-13 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1097968517

   @mehakmeet thanks, yes, sounds like it. File a JIRA.



[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-12 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1096910060

   Thanks for the reviews; updated the PR.



[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-07 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1091538089

   OK, that suggestion from Thomas about having ChecksumFileSystem pass the options down is wrong, as it means the opened file bypasses the checksum verification.
   
   I can't see a good way of passing down the openFile() options while still using the existing open() call, so I'm going to revert, on the basis that "the key role here is protecting data in HDFS". No options will get down to the local/raw local FS, but that's OK.



[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-06 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1090315539

   Testing:
   
   * S3 London, markers keep, scale
   * Azure Cardiff (to make sure I've not broken anything there in the move to openFile() in distcp)



[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-06 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1090308409

   Really need reviews of this @mukund-thakur @mehakmeet @bibinchundatt @dannycjones @surendralilhore
   
   This patch needs to go in before any other input stream optimisations so that
   1. we can cut that HEAD request overhead on small files
   2. distcp and fsshell can tell the streams that they are reading the whole file, so they should do big reads and expect no backwards seeks
   3. Parquet and ORC libs can switch to this to get the same benefits
   
   Although #2975 sets it up, this PR doesn't include ABFS handling of the file length option as an alternative to the file status.
   
   I've looked at it, but need a plan for etag tracking: we will have to replicate the bit in the S3A code where the first GET's etag is picked up and used from then on. A future piece of work. This PR does contain the tests that are needed there, though...
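   The file-length option discussed above can be sketched as a minimal, self-contained model: when the caller already knows the object length, the store can skip its HEAD probe. This is illustrative only; the class and method names below are hypothetical, not the actual S3A/ABFS code.

   ```java
   import java.util.Optional;
   import java.util.concurrent.atomic.AtomicInteger;

   public class OpenFileLengthSketch {
       // counts simulated HEAD requests, so the saving is visible
       static final AtomicInteger HEAD_REQUESTS = new AtomicInteger();

       // simulates one HEAD request against the store
       static long headRequest() {
           HEAD_REQUESTS.incrementAndGet();
           return 1024L; // the "discovered" object length
       }

       // issue a HEAD only when the caller did not supply a length
       static long resolveLength(Optional<Long> suppliedLength) {
           return suppliedLength.orElseGet(OpenFileLengthSketch::headRequest);
       }

       public static void main(String[] args) {
           long a = resolveLength(Optional.of(4096L)); // caller knows the length: no HEAD
           long b = resolveLength(Optional.empty());   // unknown length: one HEAD issued
           System.out.println(a + " " + b + " heads=" + HEAD_REQUESTS.get());
           // prints "4096 1024 heads=1"
       }
   }
   ```

   In the real API the length would be passed as an option on the openFile() builder rather than as a parameter, but the effect is the same: one fewer round trip per file opened.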
   



[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-04 Thread GitBox


steveloughran commented on PR #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1087397803

   ```
   ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:29:import java.util.concurrent.CompletableFuture;:8: Unused import - java.util.concurrent.CompletableFuture. [UnusedImports]
   ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:35:import org.apache.hadoop.fs.impl.AbstractFSBuilderImpl;:8: Unused import - org.apache.hadoop.fs.impl.AbstractFSBuilderImpl. [UnusedImports]
   ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:37:import org.apache.hadoop.fs.impl.OpenFileParameters;:8: Unused import - org.apache.hadoop.fs.impl.OpenFileParameters. [UnusedImports]
   ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:44:import org.apache.hadoop.util.LambdaUtils;:8: Unused import - org.apache.hadoop.util.LambdaUtils. [UnusedImports]
   ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:47:import static org.apache.hadoop.fs.Options.OpenFileOptions.FS_OPTION_OPENFILE_STANDARD_OPTIONS;:15: Unused import - org.apache.hadoop.fs.Options.OpenFileOptions.FS_OPTION_OPENFILE_STANDARD_OPTIONS. [UnusedImports]
   ./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/impl/IOStatisticsBinding.java:528: /**: First sentence should end with a period. [JavadocStyle]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java:618: public static int DEFAULT_ASYNC_DRAIN_THRESHOLD = 16_000;:21: Name 'DEFAULT_ASYNC_DRAIN_THRESHOLD' must match pattern '^[a-z][a-zA-Z0-9]*$'. [StaticVariableName]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java:618: public static int DEFAULT_ASYNC_DRAIN_THRESHOLD = 16_000;:21: Variable 'DEFAULT_ASYNC_DRAIN_THRESHOLD' must be private and have accessor methods. [VisibilityModifier]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:117: @Retries.OnceTranslated: 'method def modifier' has incorrect indentation level 4, expected level should be 2. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:121: try (DurationInfo ignored = new DurationInfo(LOG, false, "%s", action)) {: 'try' has incorrect indentation level 6, expected level should be 4. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:122: return operation.apply();: 'try' child has incorrect indentation level 8, expected level should be 6. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:123: } catch (AmazonClientException e) {: 'try rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:124: throw S3AUtils.translateException(action, path, e);: 'catch' child has incorrect indentation level 8, expected level should be 6. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:125: }: 'catch rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:126: }: 'method def rcurly' has incorrect indentation level 4, expected level should be 2. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:137: @Retries.OnceTranslated: 'method def modifier' has incorrect indentation level 4, expected level should be 2. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:144: try {: 'try' has incorrect indentation level 6, expected level should be 4. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:145: return invokeTrackingDuration(tracker, operation);: 'try' child has incorrect indentation level 8, expected level should be 6. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:146: } catch (AmazonClientException e) {: 'try rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:147: throw S3AUtils.translateException(action, path, e);: 'catch' child has incorrect indentation level 8, expected level should be 6. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:148: }: 'catch rcurly' has incorrect indentation level 6, expected level should be 4. [Indentation]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java:149: }: 'method def
   ```

[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-04-01 Thread GitBox


steveloughran commented on pull request #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1086028402


   * rebase against trunk
   * the S3A input stream will drain the inner stream asynchronously in seek/unbuffer-related calls if the number of bytes to drain exceeds a new config/openFile option `fs.s3a.input.async.drain.threshold`; the default is 16,000 bytes, which seems a good number in my long-haul experiments. Draining is also done into a 16 KB buffer, which speeds it up.
   
   This aims to reduce the cost of seeking where bytes do need to be discarded.
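   The threshold decision described above can be sketched as follows; this is a minimal, illustrative model, not the real S3A internals. The constant mirrors the 16,000-byte default named in this comment, and the class/method names are hypothetical.

   ```java
   public class DrainThresholdSketch {
       // default of fs.s3a.input.async.drain.threshold, per the comment above
       static final long ASYNC_DRAIN_THRESHOLD = 16_000;

       // small residues are drained inline on the calling thread;
       // larger ones are handed off to an executor so seek() returns sooner
       static boolean drainAsync(long bytesRemainingInStream) {
           return bytesRemainingInStream > ASYNC_DRAIN_THRESHOLD;
       }

       public static void main(String[] args) {
           System.out.println(drainAsync(512));        // false: cheap inline drain
           System.out.println(drainAsync(10_000_000)); // true: async drain
       }
   }
   ```

   The trade-off: an inline drain blocks the caller for the time it takes to read the leftover bytes, while an async hand-off keeps the connection busy in the background but lets the seek proceed immediately.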
   
   The time to drain is also measured. It's the max time which can be high, as it is the time to read the bytes remaining in the current read. Abort is less expensive in the actual abort; it's the negotiation of a new TLS channel later which costs.
   
   ```
   (stream_read_remote_stream_aborted.mean=(samples=2, sum=2, mean=1.))
   (stream_read_remote_stream_drain.mean=(samples=13, sum=29, mean=2.2308)));
   
   (stream_read_remote_stream_aborted.max=1)
   (stream_read_remote_stream_drain.max=25));
   ```
   
   The `ITestS3AInputStreamPerformance` suite also sets the file length in every openFile() call, so it skips the HEAD requests entirely, saving a few seconds overall and showing the gain is tangible.
   
   full stats from a remote testrun
   
   ```
   
   2022-04-01 15:49:10,887 [JUnit] INFO  s3a.AbstractS3ATestBase (AbstractS3ATestBase.java:dumpFileSystemIOStatistics(123)) - Aggregate FileSystem Statistics counters=((action_executor_acquired=1)
   (action_file_opened=8)
   (action_http_get_request=15)
   (action_http_head_request=25)
   (audit_request_execution=76)
   (audit_span_creation=50)
   (directories_created=8)
   (directories_deleted=7)
   (fake_directories_deleted=1)
   (files_created=1)
   (files_deleted=1)
   (object_bulk_delete_request=2)
   (object_delete_objects=9)
   (object_delete_request=7)
   (object_list_request=18)
   (object_metadata_request=25)
   (object_put_bytes=32768)
   (object_put_request=9)
   (object_put_request_completed=9)
   (op_create=1)
   (op_delete=8)
   (op_get_file_status=9)
   (op_mkdirs=8)
   (op_open=8)
   (store_io_request=78)
   (stream_aborted=2)
   (stream_read_bytes=93473433)
   (stream_read_bytes_backwards_on_seek=12713984)
   (stream_read_bytes_discarded_in_abort=43889622)
   (stream_read_bytes_discarded_in_close=252395)
   (stream_read_close_operations=8)
   (stream_read_closed=13)
   (stream_read_fully_operations=8)
   (stream_read_opened=15)
   (stream_read_operations=6124)
   (stream_read_operations_incomplete=6071)
   (stream_read_remote_stream_aborted=2)
   (stream_read_remote_stream_drain=13)
   (stream_read_seek_backward_operations=4)
   (stream_read_seek_bytes_discarded=45092691)
   (stream_read_seek_bytes_skipped=55054163)
   (stream_read_seek_forward_operations=175)
   (stream_read_seek_operations=179)
   (stream_read_seek_policy_changed=9)
   (stream_read_total_bytes=138818519)
   (stream_write_block_uploads=1)
   (stream_write_bytes=32768)
   (stream_write_total_data=65536));
   
   gauges=((stream_write_block_uploads_pending=1));
   
   minimums=((action_executor_acquired.min=0)
   (action_file_opened.min=0)
   (action_http_get_request.min=31)
   (action_http_head_request.min=21)
   (object_bulk_delete_request.min=37)
   (object_delete_request.min=28)
   (object_list_request.min=27)
   (object_put_request.min=60)
   (op_create.min=61)
   (op_delete.min=28)
   (op_get_file_status.min=35)
   (op_mkdirs.min=155)
   (stream_read_remote_stream_aborted.min=1)
   (stream_read_remote_stream_drain.min=0));
   
   maximums=((action_executor_acquired.max=0)
   (action_file_opened.max=0)
   (action_http_get_request.max=730)
   (action_http_head_request.max=1663)
   (object_bulk_delete_request.max=84)
   (object_delete_request.max=35)
   (object_list_request.max=648)
   (object_put_request.max=205)
   (op_create.max=61)
   (op_delete.max=159)
   (op_get_file_status.max=1669)
   (op_mkdirs.max=769)
   (stream_read_remote_stream_aborted.max=1)
   (stream_read_remote_stream_drain.max=25));
   
   means=((action_executor_acquired.mean=(samples=1, sum=0, mean=0.))
   (action_file_opened.mean=(samples=8, sum=0, mean=0.))
   (action_http_get_request.mean=(samples=15, sum=2752, mean=183.4667))
   (action_http_head_request.mean=(samples=25, sum=7360, mean=294.4000))
   (object_bulk_delete_request.mean=(samples=2, sum=121, mean=60.5000))
   (object_delete_request.mean=(samples=7, sum=213, mean=30.4286))
   (object_list_request.mean=(samples=18, sum=1520, mean=84.))
   (object_put_request.mean=(samples=9, sum=797, mean=88.5556))
   (op_create.mean=(samples=1, sum=61, mean=61.))
   (op_delete.mean=(samples=8, sum=373, mean=46.6250))
   (op_get_file_status.mean=(samples=9, sum=6793, mean=754.7778))
   (op_mkdirs.mean=(samples=8, sum=2006, mean=250.7500))
   (stream_read_remote_stream_aborted.mean=(samples=2, sum=2, mean=1.))
   (stream_read_remote_stream_drain.mean=(samples=13, sum=29, 
   ```

[GitHub] [hadoop] steveloughran commented on pull request #2584: HADOOP-16202. Enhance openFile() for better read performance against object stores

2022-03-22 Thread GitBox


steveloughran commented on pull request #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-1075138032


   checkstyle
   
   ```
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:120:import org.apache.hadoop.fs.s3a.select.InternalSelectConstants;:8: Unused import - org.apache.hadoop.fs.s3a.select.InternalSelectConstants. [UnusedImports]
   ./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:194:import static org.apache.hadoop.fs.impl.AbstractFSBuilderImpl.rejectUnknownMandatoryKeys;:15: Unused import - org.apache.hadoop.fs.impl.AbstractFSBuilderImpl.rejectUnknownMandatoryKeys. [UnusedImports]
   ```
   
   javac warnings
   ```
   hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:4099:14:[deprecation] getDefaultBlockSize() in FileSystem has been deprecated
   hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/CopyFromLocalOperation.java:235:16:[unchecked] unchecked method invocation: method sort in interface List is applied to given types
   ```

