[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

Steve Loughran (JIRA) Wed, 13 Feb 2019 03:42:27 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767077#comment-16767077
 ]


Steve Loughran commented on SPARK-23534:
----------------------------------------

bq. am curious to know if hadoop3 offers much performance benefit

If you are using S3 as the destination of work, you get an output committer 
whose commit cost is O(files), not O(data), and which can cope with an 
inconsistent store (HADOOP-13786). I'm not sure what else you can point to and 
say "tangible speedup", though you can point to things and say "tangible 
functionality improvement".
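
For reference, a minimal sketch of the Spark settings usually documented for 
switching on the S3A committers. The helper names (s3a_committer_conf, 
to_submit_args) are made up for illustration, and the two 
org.apache.spark.internal.io.cloud classes assume the optional 
spark-hadoop-cloud module is on the classpath; check the property names 
against the docs for your Hadoop and Spark versions before relying on them.

```python
# Sketch only: builds the Spark conf entries commonly documented for the
# S3A committers. The helper names below are hypothetical, not a Spark API.

VALID_COMMITTERS = ("directory", "partitioned", "magic")


def s3a_committer_conf(name="directory"):
    """Return Spark conf entries that select one of the S3A committers."""
    if name not in VALID_COMMITTERS:
        raise ValueError(f"unknown S3A committer: {name}")
    return {
        # Which S3A committer to use.
        "spark.hadoop.fs.s3a.committer.name": name,
        # Route s3a:// output through the S3A committer factory.
        "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
        # Spark-side bindings; assumed to come from spark-hadoop-cloud.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def to_submit_args(conf):
    """Flatten a conf dict into spark-submit style '--conf key=value' pairs."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(s3a_committer_conf("directory"))))
```

The dict form keeps the settings in one place so they can be passed to 
spark-submit or written into spark-defaults.conf.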

With Hadoop 3.2, Spark can generate delegation tokens for an S3 filesystem 
during spark-submit (HADOOP-14556) and include them in the YARN app launch. 
This lets you deploy a cluster in EC2 with the VMs running under an IAM role 
with lower privileges than you have: a generated session login and your 
encryption secrets come with the job. This is very slick. And if you ask for 
role delegation tokens, the generated token is restricted to the specific S3 
bucket and DynamoDB table you are working with. Video of distcp in action: 
https://www.youtube.com/watch?v=rpyLkDEzIxI
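
As a rough sketch of what selecting session versus role delegation tokens 
looks like in configuration: the helper name s3a_delegation_conf and the 
example role ARN are made up for illustration, and the property and class 
names are the ones I'd expect from the S3A delegation token documentation, so 
verify them against your Hadoop release.

```python
# Sketch only: picks the S3A delegation token binding for a Spark job.
# s3a_delegation_conf is a hypothetical helper, not part of Spark or Hadoop.

SESSION_BINDING = "org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding"
ROLE_BINDING = "org.apache.hadoop.fs.s3a.auth.delegation.RoleTokenBinding"


def s3a_delegation_conf(role_arn=None):
    """Return Spark conf entries enabling S3A delegation tokens.

    Without a role ARN, session tokens are requested. With one, role tokens
    are requested, which are the more restricted kind described above.
    """
    if role_arn is None:
        return {"spark.hadoop.fs.s3a.delegation.token.binding": SESSION_BINDING}
    return {
        "spark.hadoop.fs.s3a.delegation.token.binding": ROLE_BINDING,
        # The role to assume when the token is issued.
        "spark.hadoop.fs.s3a.assumed.role.arn": role_arn,
    }


if __name__ == "__main__":
    # Placeholder ARN, for illustration only.
    example_arn = "arn:aws:iam::123456789012:role/example-role"
    for key, value in sorted(s3a_delegation_conf(example_arn).items()):
        print(f"--conf {key}={value}")
```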

Hadoop 3.2 also ships with the abfs:// connector to Azure Data Lake Storage 
Gen2, Microsoft's latest iteration of Azure storage.
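
For anyone who hasn't seen one, abfs:// paths take the form 
abfs://<container>@<account>.dfs.core.windows.net/<path>, with abfss:// as the 
TLS variant. A small sketch of building one follows; the helper name abfs_uri 
and the container and account names are made up for illustration.

```python
# Sketch only: assembles an ABFS URI from its parts.
# abfs_uri is a hypothetical helper; the names passed in are placeholders.

def abfs_uri(container, account, path="", secure=False):
    """Build an abfs:// (or abfss:// when secure=True) URI."""
    scheme = "abfss" if secure else "abfs"
    host = f"{account}.dfs.core.windows.net"
    return f"{scheme}://{container}@{host}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(abfs_uri("data", "myaccount", "/logs/2019"))
```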

> Spark run on Hadoop 3.0.0
> -------------------------
>
>                 Key: SPARK-23534
>                 URL: https://issues.apache.org/jira/browse/SPARK-23534
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.3.0
>            Reporter: Saisai Shao
>            Priority: Major
>
> Major Hadoop vendors have already moved, or will move, to Hadoop 3.0, so we 
> should make sure Spark can run with Hadoop 3.0. This Jira tracks the work to 
> make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test whether there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).



