Re: Question about Spark and filesystems

2017-01-03 Thread Steve Loughran

On 18 Dec 2016, at 19:50, joa...@verona.se wrote:

Since each Spark worker node needs to access the same files, we have
tried using HDFS. This worked, but there were some oddities that made me
a bit uneasy. For dependency-hell reasons I compiled a modified Spark,
and this version exhibited the odd behaviour with HDFS. The problem
might have nothing to do with HDFS, but the situation made me curious
about the alternatives.

What were the oddities?


Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi,

If you are concerned with the performance of the alternative filesystems
(i.e. needing a caching client), you can use Alluxio on top of NFS, Ceph,
GlusterFS, or other/multiple storage systems. Especially since your working
sets will not be huge, you will most likely be able to keep all the relevant
data in Alluxio during computation, giving you the flexibility to store your
data in your preferred storage system without performance penalties.
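
For example, reading a dataset through Alluxio from Spark looks roughly like
this (a minimal sketch from spark-shell, assuming the Alluxio client jar is on
the Spark classpath and fs.alluxio.impl is configured; the master hostname,
port, and paths below are just placeholders):

    // Sketch only: the Alluxio master address and paths are illustrative.
    // In spark-shell a SparkContext is already available as `sc`.

    // Read a file that Alluxio serves (and caches) from the underlying store.
    val lines = sc.textFile("alluxio://alluxio-master:19998/data/input.txt")
    println(lines.count())

    // Write results back through Alluxio; whether they are persisted to the
    // underlying store depends on the configured Alluxio write type.
    lines.saveAsTextFile("alluxio://alluxio-master:19998/data/output")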

Hope this helps,
Calvin

On Sun, Dec 18, 2016 at 11:23 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> I am using Gluster and I have decent performance with basic maintenance
> effort. An advantage of Gluster is that you can plug Alluxio on top to
> improve performance, but I still need to validate that...
>
> On 18 Dec 2016 8:50 PM,  wrote:
>
>> Hello,
>>
>> We are trying out Spark for some file processing tasks.
>>
>> Since each Spark worker node needs to access the same files, we have
>> tried using HDFS. This worked, but there were some oddities that made me
>> a bit uneasy. For dependency-hell reasons I compiled a modified Spark,
>> and this version exhibited the odd behaviour with HDFS. The problem
>> might have nothing to do with HDFS, but the situation made me curious
>> about the alternatives.
>>
>> Now I'm wondering what kind of file system would be suitable for our
>> deployment.
>>
>> - There won't be a great number of nodes. Maybe 10 or so.
>>
>> - The datasets won't be big by big-data standards (maybe a couple of
>>   hundred GB)
>>
>> So maybe I could just use an NFS server with a caching client?
>> Or should I try Ceph or GlusterFS?
>>
>> Does anyone have any experiences to share?
>>
>> --
>> Joakim Verona
>> joa...@verona.se
>>


Re: Question about Spark and filesystems

2016-12-18 Thread vincent gromakowski
I am using Gluster and I have decent performance with basic maintenance
effort. An advantage of Gluster is that you can plug Alluxio on top to
improve performance, but I still need to validate that...
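
In case it is useful, pointing Spark at the volume is just plain file:// paths
over the FUSE mount (a minimal sketch from spark-shell; the mount point
/mnt/gluster and file names are placeholders, and it assumes the volume is
mounted at the same path on every worker):

    // Sketch only: assumes the GlusterFS volume is FUSE-mounted at
    // /mnt/gluster on the driver and on every worker node.
    // In spark-shell a SparkContext is already available as `sc`.

    // file:// works because every executor sees the same mount point.
    val records = sc.textFile("file:///mnt/gluster/datasets/events/*.csv")
    println(records.count())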

On 18 Dec 2016 8:50 PM,  wrote:

> Hello,
>
> We are trying out Spark for some file processing tasks.
>
> Since each Spark worker node needs to access the same files, we have
> tried using HDFS. This worked, but there were some oddities that made me
> a bit uneasy. For dependency-hell reasons I compiled a modified Spark,
> and this version exhibited the odd behaviour with HDFS. The problem
> might have nothing to do with HDFS, but the situation made me curious
> about the alternatives.
>
> Now I'm wondering what kind of file system would be suitable for our
> deployment.
>
> - There won't be a great number of nodes. Maybe 10 or so.
>
> - The datasets won't be big by big-data standards (maybe a couple of
>   hundred GB)
>
> So maybe I could just use an NFS server with a caching client?
> Or should I try Ceph or GlusterFS?
>
> Does anyone have any experiences to share?
>
> --
> Joakim Verona
> joa...@verona.se
>


Question about Spark and filesystems

2016-12-18 Thread joakim
Hello,

We are trying out Spark for some file processing tasks.

Since each Spark worker node needs to access the same files, we have
tried using HDFS. This worked, but there were some oddities that made me
a bit uneasy. For dependency-hell reasons I compiled a modified Spark,
and this version exhibited the odd behaviour with HDFS. The problem
might have nothing to do with HDFS, but the situation made me curious
about the alternatives.

Now I'm wondering what kind of file system would be suitable for our
deployment.

- There won't be a great number of nodes. Maybe 10 or so.

- The datasets won't be big by big-data standards (maybe a couple of
  hundred GB)

So maybe I could just use an NFS server with a caching client?
Or should I try Ceph or GlusterFS?
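
To make the access pattern concrete, the jobs essentially do something like
this (a minimal sketch from spark-shell; the mount point and file names are
placeholders, and it assumes whatever filesystem we pick ends up mounted at
the same path on every node):

    // Sketch of the access pattern only; /mnt/shared stands in for NFS,
    // CephFS, GlusterFS, or whatever is mounted on all nodes.
    // In spark-shell a SparkContext is already available as `sc`.
    val input = sc.textFile("file:///mnt/shared/input/*.txt")
    val processed = input.filter(_.nonEmpty)   // placeholder transformation
    processed.saveAsTextFile("file:///mnt/shared/output/run-1")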

Does anyone have any experiences to share?

-- 
Joakim Verona
joa...@verona.se

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org