Re: new object store driver for Spark

2016-03-22 Thread Benjamin Kim
Hi Gil,

Our company currently uses S3 heavily for data storage. Once the pending 
patch is out, can you explain what benefits this connector would bring for 
S3? I have also heard of Swift from others. Can you explain the pros and 
cons of Swift compared to HDFS? A brief summary is fine, or just point me 
to material that will help me understand it better.

Thanks,
Ben

> On Mar 22, 2016, at 6:35 AM, Gil Vernik wrote:
> 
> We recently released an object store connector for Spark: 
> https://github.com/SparkTC/stocator 
> Currently this connector contains a driver for Swift-based object stores 
> (such as SoftLayer or any other Swift cluster), but it can easily support 
> additional object stores.
> There is a pending patch to support the Amazon S3 object store. 
> 
> The major highlight is that this connector doesn't create any temporary 
> files, and so it achieves very fast response times when Spark persists data 
> in the object store.
> The new connector supports speculative mode and covers various failure 
> scenarios (such as two Spark tasks writing into the same object, partially 
> corrupted data due to run-time exceptions in the Spark master, etc.). It 
> also covers 
> https://issues.apache.org/jira/browse/SPARK-10063 
> and other known issues.
> 
> The detailed fault-tolerance algorithm will be released very soon. For now, 
> those who are interested can view the implementation in the code itself.
> 
> https://github.com/SparkTC/stocator 
> contains all the details on how to set up 
> and use the connector with Spark.
> 
> A series of tests showed that the new connector achieves a 70% improvement 
> for write operations from Spark to Swift and about a 30% improvement for 
> read operations from Swift into Spark (compared to the existing driver 
> that Spark uses to integrate with objects stored in Swift). 
> 
> There is ongoing work to add more coverage and fix some known bugs and 
> limitations.
> 
> All the best
> Gil
> 



new object store driver for Spark

2016-03-22 Thread Gil Vernik
We recently released an object store connector for Spark: 
https://github.com/SparkTC/stocator
Currently this connector contains a driver for Swift-based object stores 
(such as SoftLayer or any other Swift cluster), but it can easily support 
additional object stores.
There is a pending patch to support the Amazon S3 object store. 

The major highlight is that this connector doesn't create any temporary 
files, and so it achieves very fast response times when Spark persists data 
in the object store.
The new connector supports speculative mode and covers various failure 
scenarios (such as two Spark tasks writing into the same object, partially 
corrupted data due to run-time exceptions in the Spark master, etc.). It also 
covers https://issues.apache.org/jira/browse/SPARK-10063 and other known 
issues.
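
To illustrate why avoiding temporary files matters on an object store, here 
is a toy sketch. It is not the connector's actual algorithm (that is in the 
repository), and every name in it (ToyObjectStore, write_via_temporary, 
write_directly, the out/ paths) is hypothetical. It only assumes the usual 
object store behaviour that a "rename" is really a copy followed by a delete, 
and counts the requests each approach issues.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store; counts requests."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename(self, src, dst):
        # Assumption: no native rename, so it costs a copy plus a delete.
        self.copy(src, dst)
        self.delete(src)


def write_via_temporary(store, parts):
    """Write each part under a temporary prefix, then rename it into place."""
    for name, data in parts.items():
        store.put(f"out/_temporary/{name}", data)
    for name in parts:
        store.rename(f"out/_temporary/{name}", f"out/{name}")


def write_directly(store, parts):
    """Write each part straight to its final name, with no temporary objects."""
    for name, data in parts.items():
        store.put(f"out/{name}", data)


parts = {f"part-{i:05d}": b"x" * 10 for i in range(4)}

with_temp = ToyObjectStore()
write_via_temporary(with_temp, parts)

direct = ToyObjectStore()
write_directly(direct, parts)

print(with_temp.requests, direct.requests)  # 12 4
```

Both runs end with the same four objects, but the temporary-file approach 
issues three requests per part instead of one. The sketch ignores failure 
handling and speculative tasks, which is where the real work is.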

The detailed fault-tolerance algorithm will be released very soon. For 
now, those who are interested can view the implementation in the code itself.

https://github.com/SparkTC/stocator contains all the details on how to set up 
and use the connector with Spark.

A series of tests showed that the new connector achieves a 70% improvement 
for write operations from Spark to Swift and about a 30% improvement for 
read operations from Swift into Spark (compared to the existing driver 
that Spark uses to integrate with objects stored in Swift). 

There is ongoing work to add more coverage and fix some known bugs and 
limitations.

All the best
Gil