[jira] [Resolved] (HADOOP-19206) Hadoop release contains a 530MB bundle-2.23.19.jar

2024-06-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HADOOP-19206.
-
Resolution: Duplicate

Resolving this as a duplicate of HADOOP-19083.

> Hadoop release contains a 530MB bundle-2.23.19.jar
> --
>
> Key: HADOOP-19206
> URL: https://issues.apache.org/jira/browse/HADOOP-19206
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: build
>Reporter: Tsz-wo Sze
>Priority: Major
>
> The size of the Hadoop binary release (v3.4.0) is 1.7 GB.
> {code:java}
> hadoop-3.4.0$ du -h -d 1 .
> 2.0M  ./bin
> 260K  ./libexec
>  72K  ./include
> 212K  ./sbin
> 184K  ./etc
> 232K  ./licenses-binary
> 316M  ./lib
> 1.4G  ./share
> 1.7G  .
> {code}
> A large component is bundle-2.23.19.jar, which is the [AWS Java SDK :: 
> Bundle|https://mvnrepository.com/artifact/software.amazon.awssdk/bundle/2.23.19].
> {code:java}
> hadoop-3.4.0$ ls -lh share/hadoop/tools/lib/bundle-2.23.19.jar
> -rw-r--r--@ 1 szetszwo  staff   530M Mar  4 15:41 
> share/hadoop/tools/lib/bundle-2.23.19.jar
> {code}
> We should revisit whether such a large jar really needs to be included in 
> the release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-19206) Hadoop release contains a 530MB bundle-2.23.19.jar

2024-06-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856639#comment-17856639
 ] 

Tsz-wo Sze commented on HADOOP-19206:
-

[~ayushtkn], thanks for pointing out HADOOP-19083.

[~ste...@apache.org], sure, I will see how I can help.

> Hadoop release contains a 530MB bundle-2.23.19.jar
> --
>
> Key: HADOOP-19206
> URL: https://issues.apache.org/jira/browse/HADOOP-19206
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: build
>Reporter: Tsz-wo Sze
>Priority: Major
>






[jira] [Updated] (HADOOP-19208) ABFS: Fixing logic to determine HNS nature of account to avoid extra getAcl() calls

2024-06-20 Thread Rakesh Radhakrishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Radhakrishnan updated HADOOP-19208:
--
   Fix Version/s: (was: 3.5.0)
  (was: 3.4.1)
Target Version/s: 3.5.0, 3.4.1  (was: 3.4.1)
  Status: Patch Available  (was: Open)

> ABFS: Fixing logic to determine HNS nature of account to avoid extra getAcl() 
> calls
> ---
>
> Key: HADOOP-19208
> URL: https://issues.apache.org/jira/browse/HADOOP-19208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/azure
>Affects Versions: 3.4.0
>Reporter: Anuj Modi
>Assignee: Anuj Modi
>Priority: Major
>
> The ABFS driver needs to know the type of account being used. It relies on 
> the user to declare the account type via the config 
> `fs.azure.account.hns.enabled`.
> If not configured, the driver makes a getAcl() call to determine the account 
> type.
> The expectation is that getAcl() will fail with 400 Bad Request when made on 
> an FNS account. Any other response, including 200 or 404, indicates the 
> account is HNS.
> Today the logic only checks for status codes 200 or 400. In case of 404, 
> nothing is inferred, and getAcl() is called again and again until a 200 or 
> 400 arrives.
> The fix is to update the logic so that if getAcl() fails with 400, it is an 
> FNS account; in all other cases it is an HNS account. In case of throttling, 
> if all retries are exhausted, FS init itself will fail.
> This also fixes a test case failing on trunk: 
> {{testGetAclCallOnHnsConfigAbsence(org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemInitAndCreate)}}
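The proposed decision rule is simple enough to sketch. The following is a minimal, hypothetical illustration; the class, method, and enum names are assumptions, not the actual ABFS driver code:

```java
// Sketch of the proposed HNS-detection rule: a 400 Bad Request from
// getAcl() means an FNS account; any other completed response
// (200, 404, ...) is treated as HNS. Throttling is left to the retry
// policy; if retries are exhausted, FS init fails instead of looping.
public class HnsDetectionSketch {
    enum AccountType { HNS, FNS }

    static AccountType fromGetAclStatus(int httpStatus) {
        // The old logic inferred nothing for codes other than 200/400,
        // triggering repeated getAcl() calls; this decides on any code.
        return httpStatus == 400 ? AccountType.FNS : AccountType.HNS;
    }

    public static void main(String[] args) {
        System.out.println(fromGetAclStatus(400)); // FNS
        System.out.println(fromGetAclStatus(404)); // HNS
    }
}
```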






[jira] [Updated] (HADOOP-19120) [ABFS]: ApacheHttpClient adaptation as network library

2024-06-20 Thread Pranav Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Saxena updated HADOOP-19120:
---
Status: Patch Available  (was: Open)

https://github.com/apache/hadoop/pull/6633

> [ABFS]: ApacheHttpClient adaptation as network library
> --
>
> Key: HADOOP-19120
> URL: https://issues.apache.org/jira/browse/HADOOP-19120
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/azure
>Affects Versions: 3.5.0
>Reporter: Pranav Saxena
>Assignee: Pranav Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 3.4.1
>
>
> Apache HttpClient is more feature-rich and flexible, and gives the 
> application more granular control over networking parameters.
> ABFS currently relies on the JDK net library. This library is managed by 
> OpenJDK and has no performance problems. However, it limits the application's 
> control over networking: there are very few APIs and hooks exposed that the 
> application can use to gather metrics or to choose which connections are 
> reused and when. ApacheHttpClient provides hooks to fetch important metrics 
> and to control networking parameters.
> A custom connection-pool implementation is used, adapted from the JDK8 
> connection pooling. Reasons for doing this:
> 1. The PoolingHttpClientConnectionManager heuristic caches all the reusable 
> connections it has created, whereas the JDK implementation caches only a 
> limited number of connections, given by the JVM system property 
> "http.maxConnections" (default 5 when unset). Connection-establishment 
> latency increased when all the connections were cached. Hence the pooling 
> heuristic of the JDK net library was adapted.
> 2. PoolingHttpClientConnectionManager expects the application to provide 
> `setMaxPerRoute` and `setMaxTotal`, which it uses as the total number of 
> connections it can create. For applications using ABFS, it is not feasible 
> to provide a value when initialising the connectionManager. The JDK 
> implementation has no cap on the number of connections it can have open at a 
> given moment. Hence the pooling heuristic of the JDK net library was adapted.
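The bounded keep-alive behaviour described in point 1 can be sketched as follows. This is an illustrative stand-in for the adapted JDK-style pool, not the actual ABFS implementation; the class and method names are assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal JDK-style bounded keep-alive cache: at most
// "http.maxConnections" idle connections are retained (default 5),
// mirroring the JDK net library heuristic described above.
public class KeepAliveCacheSketch<C> {
    private final int maxCached = Integer.getInteger("http.maxConnections", 5);
    private final Deque<C> idle = new ArrayDeque<>();

    /** Return a cached idle connection, or null if a new one must be opened. */
    public synchronized C get() {
        return idle.pollFirst();
    }

    /** Cache a reusable connection; reject it (caller closes it) if full. */
    public synchronized boolean put(C conn) {
        if (idle.size() >= maxCached) {
            return false;
        }
        idle.addFirst(conn);
        return true;
    }
}
```

Unlike PoolingHttpClientConnectionManager, no up-front `setMaxTotal` is needed here: the cap comes from the system property, so callers never have to size the pool when initialising the connection manager.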






[jira] [Updated] (HADOOP-19120) [ABFS]: ApacheHttpClient adaptation as network library

2024-06-20 Thread Pranav Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Saxena updated HADOOP-19120:
---
Fix Version/s: 3.5.0
   3.4.1

> [ABFS]: ApacheHttpClient adaptation as network library
> --
>
> Key: HADOOP-19120
> URL: https://issues.apache.org/jira/browse/HADOOP-19120
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/azure
>Affects Versions: 3.5.0
>Reporter: Pranav Saxena
>Assignee: Pranav Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 3.4.1
>
>






[jira] [Resolved] (HADOOP-19203) WrappedIO BulkDelete API to raise IOEs as UncheckedIOExceptions

2024-06-20 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved HADOOP-19203.
-
Fix Version/s: 3.4.1
   Resolution: Fixed

> WrappedIO BulkDelete API to raise IOEs as UncheckedIOExceptions
> ---
>
> Key: HADOOP-19203
> URL: https://issues.apache.org/jira/browse/HADOOP-19203
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 3.4.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 3.4.1
>
>
> It's easier to invoke methods through reflection via parquet/iceberg 
> DynMethods if the invoked method raises unchecked exceptions, because 
> DynMethods does not then rewrap the raised exception in a generic 
> RuntimeException.
> Catching the IOEs and wrapping them as UncheckedIOExceptions makes it much 
> easier to unwrap the IOEs after the invocation.
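The wrap/unwrap pattern being described can be sketched generically (hypothetical helper names; not the actual WrappedIO code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Wrap IOExceptions as UncheckedIOException at the reflection boundary,
// so DynMethods-style invokers see an unchecked exception rather than a
// generic RuntimeException, and the original IOException can be
// recovered after the invocation.
public class WrappedIoSketch {
    interface IoCall<T> { T run() throws IOException; }

    /** Run an IO operation, rethrowing any IOException unchecked. */
    static <T> T uncheck(IoCall<T> call) {
        try {
            return call.run();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** After a reflective call: recover the original IOException, if any. */
    static IOException unwrap(RuntimeException e) {
        return e instanceof UncheckedIOException
            ? ((UncheckedIOException) e).getCause()
            : null;
    }
}
```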






[jira] [Updated] (HADOOP-19208) ABFS: Fixing logic to determine HNS nature of account to avoid extra getAcl() calls

2024-06-20 Thread Anuj Modi (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuj Modi updated HADOOP-19208:
---
Description: 
The ABFS driver needs to know the type of account being used. It relies on 
the user to declare the account type via the config 
`fs.azure.account.hns.enabled`.
If not configured, the driver makes a getAcl() call to determine the account 
type.

The expectation is that getAcl() will fail with 400 Bad Request when made on 
an FNS account. Any other response, including 200 or 404, indicates the 
account is HNS.

Today the logic only checks for status codes 200 or 400. In case of 404, 
nothing is inferred, and getAcl() is called again and again until a 200 or 
400 arrives.

The fix is to update the logic so that if getAcl() fails with 400, it is an 
FNS account; in all other cases it is an HNS account. In case of throttling, 
if all retries are exhausted, FS init itself will fail.

This also fixes a test case failing on trunk: 
{{testGetAclCallOnHnsConfigAbsence(org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemInitAndCreate)}}

  was:
The ABFS driver needs to know the type of account being used. It relies on 
the user to declare the account type via the config 
`fs.azure.account.hns.enabled`.
If not configured, the driver makes a getAcl() call to determine the account 
type.

The expectation is that getAcl() will fail with 400 Bad Request when made on 
an FNS account. Any other response, including 200 or 404, indicates the 
account is HNS.

Today the logic only checks for status codes 200 or 400. In case of 404, 
nothing is inferred, and getAcl() is called again and again until a 200 or 
400 arrives.

The fix is to update the logic so that if getAcl() fails with 400, it is an 
FNS account; in all other cases it is an HNS account. In case of throttling, 
if all retries are exhausted, FS init itself will fail.


> ABFS: Fixing logic to determine HNS nature of account to avoid extra getAcl() 
> calls
> ---
>
> Key: HADOOP-19208
> URL: https://issues.apache.org/jira/browse/HADOOP-19208
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/azure
>Affects Versions: 3.4.0
>Reporter: Anuj Modi
>Assignee: Anuj Modi
>Priority: Major
> Fix For: 3.5.0, 3.4.1
>
>


