[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389730#comment-15389730 ]
Steve Loughran commented on SPARK-7481:
---------------------------------------

ps, latest s3a state:
# [Object stores in production|http://slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production]
# [Latest s3a docs|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md]

The options I'm going to recommend for working with ORC (or other data read with forward & backward seeks) are:
{code}
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.fs.s3a.experimental.input.fadvise random
spark.hadoop.fs.s3a.readahead.range 131072
spark.hadoop.fs.s3a.socket.send.buffer 16384
spark.hadoop.fs.s3a.socket.recv.buffer 16384
{code}
There are some other tunables as well (those range and buffer values among them). fadvise=random is fantastic on random access/positioned reads, but it kills sequential scans through things like CSV files, so use it only when appropriate. Spark will automatically get the speedups in S3A. What it will also need (and which I haven't started on) is tuning the Spark code itself the way that [~rajesh.balamohan] and [~ashutoshc] are doing for Hive: cache and re-use all FileStatus results, use listFiles(recursive=true) for tree listing, move all rename/deletes for cleanup off the query path, etc, etc.
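As a sketch of how the options above might be wired up (the property names and values come straight from the comment; the helper names here are illustrative, not part of any Spark or Hadoop API), they can be rendered either as spark-defaults.conf lines or as spark-submit --conf arguments:

```python
# The recommended S3A random-IO settings from the comment, as a dict.
S3A_RANDOM_IO_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.fs.s3a.experimental.input.fadvise": "random",
    "spark.hadoop.fs.s3a.readahead.range": "131072",
    "spark.hadoop.fs.s3a.socket.send.buffer": "16384",
    "spark.hadoop.fs.s3a.socket.recv.buffer": "16384",
}

def as_defaults_conf(conf):
    """Render the options as spark-defaults.conf lines (key<space>value)."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))

def as_submit_args(conf):
    """Render the options as spark-submit --conf key=value arguments."""
    return [f"--conf {key}={value}" for key, value in sorted(conf.items())]
```

Because all of the fs.s3a.* keys carry the spark.hadoop. prefix, Spark propagates them into the Hadoop Configuration that the S3A filesystem client actually reads.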
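The "cache and re-use all FileStatus results" idea can be sketched very roughly as a memoizing wrapper around a recursive listing call. This is purely illustrative (ListingCache and list_files are made-up names, not Hadoop or Spark APIs); real code would go through the Hadoop FileSystem API and would also need cache invalidation on writes:

```python
class ListingCache:
    """Illustrative sketch: remember the result of an expensive recursive
    listing (one LIST round-trip per prefix against an object store) so
    that repeated planning passes over the same path reuse it."""

    def __init__(self, list_files):
        # list_files stands in for FileSystem.listFiles(path, recursive=True)
        self._list_files = list_files
        self._cache = {}

    def list(self, path):
        # First call issues the listing; later calls hit the cache.
        if path not in self._cache:
            self._cache[path] = list(self._list_files(path))
        return self._cache[path]
```

The point of listFiles(recursive=true), as opposed to walking the tree directory by directory, is that against S3 it can be served by flat paged LIST requests rather than one round trip per directory.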
This patch is just step 1: packaging, basic integration tests and hadoop-aws regression testing, not the tuning which Spark will need for maximum object store performance (none of which will hurt HDFS performance, BTW).

> Add spark-cloud module to pull in aws+azure object store FS accessors; test
> integration
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-7481
>                 URL: https://issues.apache.org/jira/browse/SPARK-7481
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 1.3.1
>            Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies
> of spark in a 2.6+ profile need to add the relevant object store packages
> (hadoop-aws, hadoop-openstack, hadoop-azure)
>
> this adds more stuff to the client bundle, but will mean a single spark
> package can talk to all of the stores.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)