[ 
https://issues.apache.org/jira/browse/HADOOP-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631774#comment-16631774
 ] 

Allen Wittenauer commented on HADOOP-15797:
-------------------------------------------

First, let's put aside the issue of s3guard. It breaks things, as we'll see 
in a bit.

Second, let's also remember that the whole point of this code is to pull things 
OUT of the default classpath.  

Now, what do 'builtin' and 'optional' mean?

builtin = required by a command.  For example, hadoop distcp requires its jar 
at runtime.  It is not needed any other time, so it doesn't make sense to put 
it AND ANY DEPENDENCIES on the classpath all the time.

optional = optional features the USER wants to enable.  All of these features 
need to always be available at runtime. Prior to s3guard, this was ALL of the 
non-core file systems: S3, Azure, etc, etc.  Users enable these features using 
the HADOOP_OPTIONAL_TOOLS environment variable. Again, if I don't access S3 
from my cluster, I don't want the AWS jars AND ANY DEPENDENCIES on the 
classpath.  
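
To make the 'optional' side concrete, here is a minimal sketch. HADOOP_OPTIONAL_TOOLS is the variable named above and is assumed here to take a comma-separated list of module names. The optional_tool_enabled helper is hypothetical, not part of Hadoop; it only illustrates the opt-in idea that a module contributes nothing unless the user lists it.

```shell
#!/usr/bin/env bash
# Normally set in hadoop-env.sh; assumed to be a comma-separated module list.
export HADOOP_OPTIONAL_TOOLS="hadoop-aws,hadoop-azure"

# Hypothetical helper (NOT a Hadoop function): succeeds only if the named
# module appears as a whole entry in HADOOP_OPTIONAL_TOOLS.
optional_tool_enabled() {
  local module=$1
  case ",${HADOOP_OPTIONAL_TOOLS}," in
    *",${module},"*) return 0 ;;
    *) return 1 ;;
  esac
}

for m in hadoop-aws hadoop-azure hadoop-azure-datalake; do
  if optional_tool_enabled "${m}"; then
    echo "${m}: enabled"
  else
    echo "${m}: not enabled"
  fi
done
```

Wrapping both the list and the module name in commas keeps hadoop-azure from matching hadoop-azure-datalake by prefix.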

It's also worthwhile pointing out that removing all of these jars from the 
default classpath, in addition to allowing more user freedom, greatly speeds 
the system up when measured across all java launches.

That said, it is now easy to see the problem that s3guard presents and how it 
is an outlier.  s3guard is a built-in command that depends upon components that 
are also optional. IMO: using s3guard to determine any sort of functionality for 
the rest of the system is completely and totally wrong.

That said, what makes anyone think that "hadoop_add_to_classpath_tools 
hadoop-azure" should work?  optional bits come as shellprofiles, not as hooks 
for built-ins.  I mean the documentation here literally says:

{code}
## @description  Run libexec/tools/module.sh to add to the classpath
## @description  environment
{code}
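
As a rough illustration of why the call fails for a module that has no helper script, here is a simplified model. It is NOT the real hadoop-functions.sh code. TOOLS_DIR stands in for libexec/tools, and the function name, the CLASSPATH_MODEL variable and the jar path are all made up for the sketch. It assumes, as the issue description suggests, that only modules with a tools-builtin listing get a helper there.

```shell
#!/usr/bin/env bash
# Simplified model only; every name below is hypothetical.
TOOLS_DIR=$(mktemp -d)   # stand-in for libexec/tools

# In this model only the builtin module has a helper script.
cat > "${TOOLS_DIR}/hadoop-aws.sh" <<'EOF'
CLASSPATH_MODEL="${CLASSPATH_MODEL:+${CLASSPATH_MODEL}:}share/hadoop/tools/lib/hadoop-aws.jar"
EOF

# Source the per-module helper if it exists, otherwise report an error.
add_to_classpath_tools_model() {
  local module=$1
  if [ -f "${TOOLS_DIR}/${module}.sh" ]; then
    . "${TOOLS_DIR}/${module}.sh"
  else
    echo "ERROR: tools helper ${module}.sh was not found" >&2
    return 1
  fi
}

add_to_classpath_tools_model hadoop-aws
add_to_classpath_tools_model hadoop-azure || true   # no helper, nothing added
echo "classpath: ${CLASSPATH_MODEL}"
rm -rf "${TOOLS_DIR}"
```

In this model the hadoop-azure call adds nothing because there is no helper to source, which matches the behavior described in the report.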

If you want per-user settings for this (which is also weird, but whatever), 
then modifying .hadoop-env is the way to go.
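
For the per-user case, a minimal sketch of what ~/.hadoop-env might contain. It assumes that HADOOP_OPTIONAL_TOOLS is honored when set there and that the file is read early enough for the setting to take effect; neither was verified here.

```shell
# ~/.hadoop-env (per-user settings)
# Assumption: the same comma-separated variable used in hadoop-env.sh.
HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
```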

> optional / builtin modules confused for cloud storage
> -----------------------------------------------------
>
>                 Key: HADOOP-15797
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15797
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/adl, fs/azure, fs/s3
>    Affects Versions: 3.2.0, 3.1.1
>            Reporter: Sean Mackrory
>            Priority: Major
>
> Throwing this in your .hadooprc results in hadoop-aws being in the classpath 
> but not hadoop-azure*:
> {quote}
> hadoop_add_to_classpath_tools hadoop-aws
> hadoop_add_to_classpath_tools hadoop-azure
> hadoop_add_to_classpath_tools hadoop-azure-datalake
> {quote}
> It would seem that the core issue is that this requires the module to have 
> listed its dependencies in MODULE_NAME.tools-builtin.txt, whereas the Azure 
> connectors only have them listed in MODULE_NAME.tools-optional.txt. S3 does 
> both, and there's a comment in its POM about how it needs to do this because 
> of the "hadoop s3guard" CLI.
> Maybe there's some history that I'm missing here, but I think what's wrong 
> here is that hadoop_add_to_classpath should get what it needs from optional 
> modules. builtin modules shouldn't even need hadoop_add_to_classpath to be 
> added anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
