Re: Spark Dataset API for secondary sorting

2019-12-24 Thread Akira Ajisaka
Hi Daniel,

This is the user mailing list for Apache Hadoop, not Apache Spark.
Please use the Apache Spark mailing lists instead:
https://spark.apache.org/community.html

-Akira

On Tue, Dec 3, 2019 at 1:00 AM Daniel Zhang  wrote:

> Hi, Spark Users:
>
> I have a question about the way I use the Spark Dataset API for my case.
>
> Suppose the "ds_old" dataset has 100 records with 10 unique values of
> $"col1", and consider the following pseudo-code:
>
> val ds_new = ds_old.repartition(5, $"col1")
>   .sortWithinPartitions($"col2")
>   .mapPartitions(new MergeFun)
>
> class MergeFun extends MapPartitionsFunction[InputCaseClass, OutputCaseClass] {
>   override def call(input: util.Iterator[InputCaseClass]): util.Iterator[OutputCaseClass] = { ... }
> }
>
>
> I have some questions related to "partition" defined in the above API, and
> below is my understanding:
>
> 1) repartition(5, $"col1") means distributing all 100 records, covering 10
> unique col1 values, into 5 partitions. There is no guarantee of how many, or
> which, unique col1 values each of the 5 partitions will get, but with a
> well-balanced hash function each partition should end up close to the average
> (10/5 = 2 unique values per partition) when the number of unique values is large.
> 2) sortWithinPartitions($"col2") is one of the parts I want to clear up
> here. What exactly does sortWithinPartitions mean here? I want the data
> sorted by "col2" within each unique value of "col1", but the Spark API
> uses the "partition" term so much in this case. I DON'T WANT the 100
> records sorted within each of the 5 partitions, but within each unique value
> of "col1". I believe this assumption is right, as we use "repartition" with
> "col1" first. Please confirm this.
> 3) mapPartitions(new MergeFun) is another part I want to clear up. I
> originally assumed that my merge function would be called/invoked once per
> unique col1 value (in this case, 10 invocations). But after testing, I
> found out that it is in fact called ONCE per partition of the 5 partitions.
> So in this sense, the meaning of "partition" in this API (mapPartitions) IS
> DIFFERENT from the meaning of "partition" in "sortWithinPartitions",
> correct? Or is my understanding of "partition" in sortWithinPartitions also
> WRONG?
>
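Regarding points 2) and 3) above, a minimal sketch of the pattern, assuming a
hypothetical case class In(col1, col2, payload) in place of InputCaseClass:
sortWithinPartitions sorts each physical partition as a whole (not each col1
group), so leading the sort with col1 is what keeps every col1 group contiguous
and ordered by col2, and mapPartitions is likewise invoked once per physical
partition.

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Hypothetical stand-in for InputCaseClass, for illustration only.
    case class In(col1: String, col2: Long, payload: String)

    def prepare(spark: SparkSession, ds_old: Dataset[In]): Dataset[In] = {
      import spark.implicits._
      ds_old
        .repartition(5, $"col1")                 // rows sharing a col1 value always land in the same partition
        .sortWithinPartitions($"col1", $"col2")  // sorts each physical partition; col1 first keeps each
                                                 // group contiguous and ordered by col2 in that partition
    }

With sortWithinPartitions($"col2") alone, rows belonging to different col1
values are interleaved inside a partition, ordered only by col2.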
> In summary, here are my questions:
> 1) We don't want to use the "aggregation" API because, in my case, some
> unique values of "col1" COULD contain a large number of records, and sorting
> the data in a specified order per col1 helps our business case for the
> merge logic a lot.
> 2) We don't want to use a "window" function, as the merge logic is really an
> aggregation: there will be only one output record per grouping (col1). So even
> though a "window" function comes with sorting, it doesn't fit this case.
> 3) The unique value count of "col1" is unpredictable for Spark, I
> understand that. But I wonder if there is an API that is called per grouping
> (per col1), instead of per partition (the 5 partitions in this case).
> 4) If such an API doesn't exist and we have to use MapPartitionsFunction
> (the Iterator is much preferred here, as we don't need to worry about OOM due
> to data skew), my follow-up question is whether Spark guarantees that the data
> within each partition arrives in (col1, col2) order, with the API usage shown
> above? Or will Spark deliver the data of each partition sorted by "col2" for
> the first unique value of col1, then sorted by "col2" for the second unique
> value of col1, and so on?
> Another challenge is that if our merge function can expect the data in this
> order but has to generate the output per grouping of col1, in an Iterator
> format, does Spark have an existing example to refer to?
>
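Regarding summary question 3): the Dataset API that is invoked per grouping
rather than per partition is groupByKey followed by flatMapGroups (or
mapGroups). A minimal sketch, assuming hypothetical case classes In and Out in
place of InputCaseClass/OutputCaseClass; note that flatMapGroups gives no
ordering guarantee for the values of a group, so if the merge logic needs col2
order, the group has to be sorted (and therefore buffered) inside the function.

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Hypothetical stand-ins for InputCaseClass / OutputCaseClass.
    case class In(col1: String, col2: Long, payload: String)
    case class Out(col1: String, merged: String)

    def mergePerGroup(spark: SparkSession, ds: Dataset[In]): Dataset[Out] = {
      import spark.implicits._
      ds.groupByKey(_.col1)                  // one logical group per unique col1 value
        .flatMapGroups { (col1, rows) =>     // invoked once per group, not once per physical partition
          // Sort here because the iterator's order is not guaranteed; this
          // buffers one whole group in memory, unlike the streaming mapPartitions route.
          val sorted = rows.toSeq.sortBy(_.col2)
          Iterator(Out(col1, sorted.map(_.payload).mkString("|")))
        }
    }

If buffering a whole group is not acceptable because of data skew, the
repartition + sortWithinPartitions + mapPartitions route sketched earlier, with
the merge function detecting col1 boundaries itself, avoids that buffering.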
> Thanks
>
> Yong
>


Re: hadoop java compatability

2019-12-24 Thread Akira Ajisaka
Hi Augustine,

Java 11 is not supported even in the latest release of Apache Hadoop.
I hope Apache Hadoop 3.3.0 will support Java 11 (runtime support only), but
3.3.0 has not been released yet.
(Our company (Yahoo! JAPAN) builds trunk with OpenJDK 8 and runs an HDFS dev
cluster with OpenJDK 11 successfully.)

CDH 6 backports some patches from trunk, and that's why it supports
OpenJDK 11.

> Which jdk's/jvm's are supported
Apache Hadoop currently uses OpenJDK for its build environment, and that's why
OpenJDK should be supported by the community.
https://github.com/apache/hadoop/blob/rel/release-3.2.1/dev-support/docker/Dockerfile#L92
Other JDKs/JVMs should work well, but I think they are not officially
supported.

I'll update the wiki to include that information.

Regards,
Akira


On Wed, Dec 4, 2019 at 3:36 AM Augustine Calvino  wrote:

> I want to know what Java versions are compatible with the latest Hadoop.
> The most that the documentation says is to check the wiki, which is obviously
> out of date since it only lists Java 6 and 7. That wiki has a link to another
> wiki page, which is more helpful, as it actually lists the compatible versions
> of Java. However, it doesn't say which jdk's/jvm's are supported (Oracle vs
> OpenJDK vs AdoptOpenJDK, or HotSpot vs OpenJ9). Also, that page says that
> Hadoop 3 is only compatible with Java 8, but Cloudera's CDH 6, which uses
> Hadoop 3, apparently supports OpenJDK 11.
>
> We would really like to use Java 11 with our new Hadoop builds due to the
> improved garbage collection, and would also prefer to use non-Oracle JDK
> builds, but the messaging about what is supported seems sparse and
> sometimes contradictory. Any clarification would be appreciated.
>
> Thanks!
>
> - Augustine
>


Re: How can we access multiple Kerberos-enabled Hadoop with different users in single JVM process

2019-12-24 Thread tobe
Thanks @Thibault, but I don't think this resolves our problem.

According to the StackOverflow answers, we should set multiple realms in
the same jaas.conf file and update the system properties to specify the
Kerberos principal. But in our process, we need to handle multiple
requests at the same time, and setting the system properties may affect
other users' requests.

In our scenario, we want to set up the web service with multiple keytab
files belonging to multiple KDCs, and request secured HDFS with these keytab
files at the same time. We can cache the UGI object for each keytab file;
this works for multiple users in the same KDC, but not for multiple KDCs in
the same process.
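
For reference, a minimal sketch of the per-keytab UGI caching described above,
written against the public UserGroupInformation API (the principal names,
keytab paths and HDFS path below are placeholders). As noted, this serves
several users of one KDC; it does not by itself make a second KDC reachable,
because the JVM still resolves realms through a single krb5.conf.

    import java.security.PrivilegedExceptionAction
    import scala.collection.concurrent.TrieMap
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
    import org.apache.hadoop.security.UserGroupInformation

    object UgiCache {
      private val conf = new Configuration()
      conf.set("hadoop.security.authentication", "kerberos")
      UserGroupInformation.setConfiguration(conf)

      // One keytab login per principal; the resulting UGI is reused for later requests.
      private val cache = TrieMap.empty[String, UserGroupInformation]

      def ugiFor(principal: String, keytab: String): UserGroupInformation =
        cache.getOrElseUpdate(principal,
          UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab))

      // Run an HDFS call as the given user.
      def listAs(principal: String, keytab: String, dir: String): Array[FileStatus] =
        ugiFor(principal, keytab).doAs(new PrivilegedExceptionAction[Array[FileStatus]] {
          override def run(): Array[FileStatus] =
            FileSystem.get(conf).listStatus(new Path(dir))
        })
    }

    // e.g. UgiCache.listAs("alice@REALM.ONE", "/keytabs/alice.keytab", "/user/alice")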


Regards

On Tue, Dec 24, 2019 at 6:56 PM Thibault VERBEQUE <
thibault.verbe...@omnilog.fr> wrote:

> Hi,
>
>
>
> You can configure multiple realms inside /etc/krb5.conf on linux hosts, it
> will also require relevant DNS configuration and network access in order to
> work. See: https://stackoverflow.com/questions/26382936/multiple-realms
>
> Proxy-users won’t help you in any way; proxy-users rely on the fact that
> your service A from realm A can authenticate against service B from realm
> B, but if service A’s ticket is untrusted by service B from realm B,
> authentication will fail (that’s why cross-realm trust works). By the way,
> proxy-users only request a token for user ‘toto’, but you already have user
> ‘toto’’s keytab, so it’s irrelevant.
>
> If you need something slightly more dynamic (with regard to keytab
> management), you can take a look at S4U2proxy and S4U2self, but those require
> admin access to the KDC in order to be configured, and a compatible Kerberos
> implementation (FreeIPA, MIT Kerberos or Active Directory).
>
>
>
> Regards
>
>
>
> *From:* tobe 
> *Sent:* Tuesday, December 24, 2019 08:15
> *To:* Vinod Kumar Vavilapalli 
> *Cc:* user.hadoop 
> *Subject:* Re: How can we access multiple Kerberos-enabled Hadoop with
> different users in single JVM process
>
>
>
> Thanks @Vinod, and proxy-users was considered.
>
>
>
> But what we want to support is accessing multiple secured Hadoop clusters.
> To initialize the Kerberos credentials, we need to configure the
> /etc/krb5.conf file. If we want to access two different Kerberos services
> (with different KDCs), we cannot run one JVM process with two
> /etc/krb5.conf files. That is why cross-realm can work: we only need to
> log in with one KDC. We can obtain the users' keytab files, so proxying is
> not the critical problem for us.
>
>
>
> Please correct me if proxy-users can proxy different users from multiple
> secured Hadoop clusters.
>
>
>
>
>
> Regards
>
>
>
> On Tue, Dec 24, 2019 at 1:14 PM Vinod Kumar Vavilapalli <
> vino...@apache.org> wrote:
>
> You are looking for the proxy-users pattern. See here:
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html
>
>
>
> Thanks
>
> +Vinod
>
>
>
> On Dec 24, 2019, at 9:49 AM, tobe  wrote:
>
>
>
> Currently Hadoop relies on Kerberos for authentication and
> authorization. For a single user, we can initialize clients with keytab
> files on the command line or in a Java program.
>
> But sometimes we need to access Hadoop as multiple users. For example, we
> build a web service to view users' HDFS files. We have authorization to
> get the user name and use that user's keytab to log in before requesting HDFS.
> However, this doesn't work for multiple Hadoop clusters and multiple KDCs.
>
> Currently the only way to do that is to enable cross-realm trust for these
> KDCs. But in some scenarios we cannot change the KDC configuration, and we
> want a single process to switch the Kerberos user on the fly without much
> overhead.
>
> Here are the related discussions on StackOverflow:
>
> · https://stackoverflow.com/questions/15126295/using-java-programmatically-log-in-multiple-kerberos-realms-with-different-keyta#
> · https://stackoverflow.com/questions/57008499/data-transfer-between-two-kerberos-secured-cluster
> · https://stackoverflow.com/questions/22047145/hadoop-distcp-between-two-securedkerberos-clusters
> · https://stackoverflow.com/questions/39648106/access-two-secured-kerberos-hadoop-hbase-clusters-from-the-same-process
> · https://stackoverflow.com/questions/1437281/reload-kerberos-config-in-java-without-restarting-jvm
>
>
>
> Regards
>
>
>
>


RE: How can we access multiple Kerberos-enabled Hadoop with different users in single JVM process

2019-12-24 Thread Thibault VERBEQUE
Hi,

You can configure multiple realms inside /etc/krb5.conf on Linux hosts; it will
also require relevant DNS configuration and network access in order to work.
See: https://stackoverflow.com/questions/26382936/multiple-realms
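
As a rough sketch of such a multi-realm /etc/krb5.conf (the realm names, KDC
hosts and domains below are placeholders, and the actual values depend on your
environment; default_realm still has to be a single realm):

    [libdefaults]
        default_realm = REALM.ONE

    [realms]
        REALM.ONE = {
            kdc = kdc1.one.example.com
            admin_server = kdc1.one.example.com
        }
        REALM.TWO = {
            kdc = kdc1.two.example.com
            admin_server = kdc1.two.example.com
        }

    [domain_realm]
        .one.example.com = REALM.ONE
        .two.example.com = REALM.TWO
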
Proxy-users won’t help you in any way; proxy-users rely on the fact that your
service A from realm A can authenticate against service B from realm B, but if
service A’s ticket is untrusted by service B from realm B, authentication will
fail (that’s why cross-realm trust works). By the way, proxy-users only request
a token for user ‘toto’, but you already have user ‘toto’’s keytab, so it’s
irrelevant.
If you need something slightly more dynamic (with regard to keytab management),
you can take a look at S4U2proxy and S4U2self, but those require admin access to
the KDC in order to be configured, and a compatible Kerberos implementation
(FreeIPA, MIT Kerberos or Active Directory).

Regards

From: tobe 
Sent: Tuesday, December 24, 2019 08:15
To: Vinod Kumar Vavilapalli 
Cc: user.hadoop 
Subject: Re: How can we access multiple Kerberos-enabled Hadoop with different
users in single JVM process

Thanks @Vinod, and proxy-users was considered.

But what we want to support is accessing multiple secured Hadoop clusters. To
initialize the Kerberos credentials, we need to configure the /etc/krb5.conf
file. If we want to access two different Kerberos services (with different
KDCs), we cannot run one JVM process with two /etc/krb5.conf files. That is why
cross-realm can work: we only need to log in with one KDC. We can obtain the
users' keytab files, so proxying is not the critical problem for us.

Please correct me if proxy-users can proxy different users from multiple 
secured Hadoop clusters.


Regards

On Tue, Dec 24, 2019 at 1:14 PM Vinod Kumar Vavilapalli
<vino...@apache.org> wrote:
You are looking for the proxy-users pattern. See here: 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html

Thanks
+Vinod


On Dec 24, 2019, at 9:49 AM, tobe <tobeg3oo...@gmail.com> wrote:

Currently Hadoop relies on Kerberos for authentication and authorization. For a
single user, we can initialize clients with keytab files on the command line or
in a Java program.

But sometimes we need to access Hadoop as multiple users. For example, we build
a web service to view users' HDFS files. We have authorization to get the user
name and use that user's keytab to log in before requesting HDFS. However, this
doesn't work for multiple Hadoop clusters and multiple KDCs.

Currently the only way to do that is to enable cross-realm trust for these
KDCs. But in some scenarios we cannot change the KDC configuration, and we want
a single process to switch the Kerberos user on the fly without much overhead.

Here are the related discussions on StackOverflow:

· https://stackoverflow.com/questions/15126295/using-java-programmatically-log-in-multiple-kerberos-realms-with-different-keyta#
· https://stackoverflow.com/questions/57008499/data-transfer-between-two-kerberos-secured-cluster
· https://stackoverflow.com/questions/22047145/hadoop-distcp-between-two-securedkerberos-clusters
· https://stackoverflow.com/questions/39648106/access-two-secured-kerberos-hadoop-hbase-clusters-from-the-same-process
· https://stackoverflow.com/questions/1437281/reload-kerberos-config-in-java-without-restarting-jvm

Regards