Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben,

You can replace HDFS with a number of storage systems, since Spark is
compatible with other backends such as S3. This would allow you to scale
your compute nodes solely to add compute power rather than disk space. You
can deploy Alluxio on your compute nodes to offset the performance impact
of decoupling your compute and storage, and to unify multiple storage
spaces if you would still like to use HDFS, S3, and/or other storage
solutions in tandem. Here is an article which describes a similar
architecture.

Hope this helps,
Calvin
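
The unified-namespace idea above can be sketched with Alluxio's mount
commands. This is only an illustration: the bucket name, paths, namenode
address, and credential option keys below are placeholders, and the exact
CLI flags may differ between Alluxio versions.

```
# Mount an S3 bucket and an HDFS directory under one Alluxio namespace.
# All names and addresses here are made up for illustration.
$ ./bin/alluxio fs mount /s3-data s3a://my-bucket/data \
    --option aws.accessKeyId=<ACCESS_KEY> \
    --option aws.secretKey=<SECRET_KEY>
$ ./bin/alluxio fs mount /hdfs-data hdfs://namenode:8020/data

# Spark can then address both stores through a single scheme, e.g.
#   alluxio://alluxio-master:19998/s3-data/events
#   alluxio://alluxio-master:19998/hdfs-data/events
```

With a layout like this, moving data between HDFS and S3 becomes a change
of mount point rather than a change to application code.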



Re: Remove dependence on HDFS

2017-02-13 Thread Saisai Shao
IIUC, Spark isn't strongly bound to HDFS; it uses a common FileSystem layer
that supports different FS implementations, and HDFS is just one option.
You could also use S3 as a backend FS; from Spark's point of view, the
different FS implementations are transparent.
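
As a minimal sketch of that transparency, only the URI scheme changes when
swapping backends. This assumes the `hadoop-aws` module is on the
classpath; the bucket name, namenode address, and paths are made up, and
credentials would normally come from `core-site.xml` or instance roles
rather than being set in code.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-backend-sketch")
  .getOrCreate()

// Illustrative only -- prefer core-site.xml or IAM roles for credentials.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<access-key>")
hadoopConf.set("fs.s3a.secret.key", "<secret-key>")

// The application code is identical; only the scheme differs:
val fromHdfs = spark.read.parquet("hdfs://namenode:8020/data/events")
val fromS3   = spark.read.parquet("s3a://my-bucket/data/events")
```

Because the FileSystem API hides the backend, the same job can be pointed
at HDFS, S3, or any other supported store by changing the input URI.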





Re: Remove dependence on HDFS

2017-02-12 Thread ayan guha
How about adding more NFS storage?

Best Regards,
Ayan Guha


Re: Remove dependence on HDFS

2017-02-12 Thread Sean Owen
Data has to live somewhere -- how do you not add storage but store more
data?  Alluxio is not persistent storage, and S3 isn't on your premises.



Re: Remove dependence on HDFS

2017-02-12 Thread Jörn Franke
You have to carefully consider whether your strategy makes sense given your
users' workloads; hence, I am not sure your reasoning holds.

However, you can, for example, install OpenStack Swift as an object store and
use it as your storage. HDFS in this case can be used as a temporary store
and/or for checkpointing. Alternatively, you can do this fully in-memory with
Ignite or Alluxio.

S3 is the cloud storage provided by Amazon, so it is not on-premise. You can
do the same as described above, but using S3 instead of Swift.
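
As a rough illustration of the Swift route, Hadoop's `hadoop-openstack`
module exposes Swift through `swift://` URIs configured in `core-site.xml`.
The provider name, endpoint, and credentials below are placeholders; check
the module's documentation for the exact property names in your Hadoop
version.

```xml
<!-- core-site.xml fragment: Swift via hadoop-openstack (illustrative values) -->
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.myprovider.auth.url</name>
    <value>http://keystone.example.com:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.myprovider.username</name>
    <value>swift-user</value>
  </property>
  <property>
    <name>fs.swift.service.myprovider.password</name>
    <value>swift-password</value>
  </property>
  <property>
    <name>fs.swift.service.myprovider.tenant</name>
    <value>my-tenant</value>
  </property>
</configuration>
```

Spark jobs could then read paths of the form
`swift://container.myprovider/path/to/data`, with HDFS retained only for
temporary data and checkpoints as described above.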




Remove dependence on HDFS

2017-02-11 Thread Benjamin Kim
Has anyone got some advice on how to remove the reliance on HDFS for storing 
persistent data? We have an on-premise Spark cluster. It seems like a waste of 
resources to keep adding nodes solely for lack of storage space. I would 
rather add more powerful nodes at a less frequent rate, to address the lack of 
processing power, than add less powerful nodes at a more frequent rate just to 
handle the ever-growing data. Can anyone point me in the right direction? Is 
Alluxio a good solution? S3? I would like to hear your thoughts.

Cheers,
Ben 
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org