As Vincent mentioned earlier, I think Alluxio can work for this. You can mount your (potentially remote) storage systems into Alluxio <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html> and deploy Alluxio co-located with the compute cluster. The compute framework will still achieve data locality, because the Alluxio workers are co-located, even though the underlying storage systems may be remote. You can also use tiered storage <http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html> to deploy with memory only, or with other physical media as well.
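As a rough sketch of the setup described above (bucket names, hostnames, and paths here are made up, and exact flags may differ by Alluxio version), mounting remote stores into the Alluxio namespace and reading through it looks roughly like this:

```shell
# Mount a (hypothetical) S3 bucket into the Alluxio namespace;
# jobs then see it under a unified alluxio:// path.
alluxio fs mount /mnt/s3data s3a://my-bucket/datasets

# An existing remote HDFS can be mounted alongside it.
alluxio fs mount /mnt/hdfs hdfs://remote-namenode:8020/warehouse

# Spark (or any HCFS-compatible framework) reads through Alluxio, so
# hot data is served from the co-located workers' storage tiers.
spark-submit --class MyJob myjob.jar \
  alluxio://alluxio-master:19998/mnt/s3data
```

The point of the design is that the compute cluster only ever addresses the single `alluxio://` namespace, regardless of where each mounted store physically lives.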
Here are some blogs which use a similar architecture: Alluxio with Minio <https://www.alluxio.com/blog/scalable-genomics-data-processing-pipeline-with-alluxio-mesos-and-minio>, Alluxio with HDFS <https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>, and Alluxio with S3 <https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>.

Hope that helps,
Gene

On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> As a matter of interest, what is the best way of creating virtualised
> clusters all pointing to the same physical data?
>
> thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 1 June 2017 at 09:27, vincent gromakowski
> <vincent.gromakow...@gmail.com> wrote:
>
>> If mandatory, you can use a local cache like Alluxio.
>>
>> On 1 June 2017 at 10:23 AM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Vincent. I assume by physical data locality you mean going
>>> through Isilon and HCFS rather than through direct HDFS.
>>>
>>> I also agree with you that the shared network could be an issue.
>>> However, this approach reduces data redundancy (you no longer need
>>> replication factor 3 in HDFS) and lets you build virtual clusters on
>>> the same data. One cluster for read/writes and another for reads?
>>> That is what has been suggested!
>>>
>>> Regards,
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 1 June 2017 at 08:55, vincent gromakowski
>>> <vincent.gromakow...@gmail.com> wrote:
>>>
>>>> I don't recommend this kind of design, because you lose physical data
>>>> locality and you will be affected by "bad neighbours" that are also
>>>> using the network storage... We have one similar design, but
>>>> restricted to small clusters (more for experiments than production).
>>>>
>>>> On 1 June 2017 at 09:47, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks Jorn,
>>>>>
>>>>> This was a proposal made by someone, as the firm is already using
>>>>> this tool on other SAN-based storage and wants to extend it to Big
>>>>> Data.
>>>>>
>>>>> On paper it seems like a good idea; in practice it may be a Wandisco
>>>>> scenario again. Of course, as ever, one needs to ask EMC for
>>>>> reference calls and whether anyone is using this product in anger.
>>>>>
>>>>> At the end of the day it's not HDFS. It is OneFS with an HCFS API.
>>>>> However, that may suit our needs. But we would need to PoC it and
>>>>> test it thoroughly!
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 1 June 2017 at 08:21, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have done this (not with Isilon, but with another storage
>>>>>> system). It can be efficient for small clusters, depending on how
>>>>>> you design the network.
>>>>>>
>>>>>> What I have also seen is the microservice approach with object
>>>>>> stores (e.g. S3 in the cloud, Swift on premise), which is somewhat
>>>>>> similar.
>>>>>>
>>>>>> If you want additional performance, you could fetch the data from
>>>>>> the object stores and store it temporarily in a local HDFS. Not
>>>>>> sure to what extent this affects regulatory requirements, though.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> On 31 May 2017, at 18:07, Mich Talebzadeh
>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I realize this may not have direct relevance to Spark, but has
>>>>>> anyone tried to create virtualized HDFS clusters using tools like
>>>>>> Isilon or similar?
>>>>>>
>>>>>> The prime motive behind this approach is to minimize the
>>>>>> propagation or copying of data, which has regulatory implications.
>>>>>> In short, you want your data to be in one place regardless of the
>>>>>> artefacts, such as Spark, used against it.
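Jörn's suggestion of staging object-store data into a local HDFS for performance can be sketched with standard Hadoop tooling (the bucket name and paths below are hypothetical, and the `distcp` step assumes the S3A connector is configured on the cluster):

```shell
# Bulk-copy a dataset from the object store into the cluster-local
# HDFS; hadoop distcp is the standard tool for this.
hadoop distcp s3a://my-bucket/raw/events hdfs:///staging/events

# Run the compute job against the fast local copy.
spark-submit --class MyJob myjob.jar hdfs:///staging/events

# Remove the staged copy afterwards, so the object store remains the
# single system of record (relevant to the regulatory concern in the
# thread).
hdfs dfs -rm -r -skipTrash /staging/events
```

Whether a temporary local copy like this is acceptable under the "data stays in one place" requirement is exactly the open question Jörn raises.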
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dr Mich Talebzadeh