As Vincent mentioned earlier, I think Alluxio can work for this. You can mount your (potentially remote) storage systems into Alluxio <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html> and deploy Alluxio co-located with the compute cluster. The compute framework will still achieve data locality, because the Alluxio workers are co-located, even though the underlying storage systems may be remote. You can also use tiered storage <http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html> to deploy with memory only, or with other physical media as well.
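As a rough sketch of the setup described above (bucket names, hostnames, and paths here are made up, and exact flags may differ by Alluxio version), mounting remote stores into the Alluxio namespace and reading through it looks roughly like this:

```shell
# Mount a (hypothetical) S3 bucket into the Alluxio namespace;
# jobs then see it under a unified alluxio:// path.
alluxio fs mount /mnt/s3data s3a://my-bucket/datasets

# An existing remote HDFS can be mounted alongside it.
alluxio fs mount /mnt/hdfs hdfs://remote-namenode:8020/warehouse

# Spark (or any HCFS-compatible framework) reads through Alluxio, so
# hot data is served from the co-located workers' storage tiers.
spark-submit --class MyJob myjob.jar \
  alluxio://alluxio-master:19998/mnt/s3data
```

The point of the design is that the compute cluster only ever addresses the single `alluxio://` namespace, regardless of where each mounted store physically lives.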
Here are some blogs which use a similar architecture: Alluxio with Minio <https://www.alluxio.com/blog/scalable-genomics-data-processing-pipeline-with-alluxio-mesos-and-minio>, Alluxio with HDFS <https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>, and Alluxio with S3 <https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>.

Hope that helps,
Gene

On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> As a matter of interest, what is the best way of creating virtualised
> clusters all pointing to the same physical data?
>
> thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 1 June 2017 at 09:27, vincent gromakowski
> <vincent.gromakow...@gmail.com> wrote:
>
>> If mandatory, you can use a local cache like Alluxio.
>>
>> On 1 June 2017 at 10:23 AM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Vincent. I assume by physical data locality you mean going
>>> through Isilon and HCFS rather than through direct HDFS.
>>>
>>> I also agree with you that the shared network could be an issue.
>>> However, this approach reduces data redundancy (you no longer need
>>> replication factor 3 in HDFS) and lets you build virtual clusters on
>>> the same data. One cluster for read/writes and another for reads?
>>> That is what has been suggested!
>>>
>>> Regards,
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 1 June 2017 at 08:55, vincent gromakowski
>>> <vincent.gromakow...@gmail.com> wrote:
>>>
>>>> I don't recommend this kind of design, because you lose physical data
>>>> locality and you will be affected by "bad neighbours" that are also
>>>> using the network storage... We have one similar design, but
>>>> restricted to small clusters (more for experiments than production).
>>>>
>>>> On 1 June 2017 at 09:47, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks Jorn,
>>>>>
>>>>> This was a proposal made by someone, as the firm is already using
>>>>> this tool on other SAN-based storage and wants to extend it to Big
>>>>> Data.
>>>>>
>>>>> On paper it seems like a good idea; in practice it may be a Wandisco
>>>>> scenario again. Of course, as ever, one needs to ask EMC for
>>>>> reference calls and whether anyone is using this product in anger.
>>>>>
>>>>> At the end of the day it's not HDFS. It is OneFS with an HCFS API.
>>>>> However, that may suit our needs. But we would need to PoC it and
>>>>> test it thoroughly!
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 1 June 2017 at 08:21, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have done this (not with Isilon, but with another storage
>>>>>> system). It can be efficient for small clusters, depending on how
>>>>>> you design the network.
>>>>>>
>>>>>> What I have also seen is the microservice approach with object
>>>>>> stores (e.g. S3 in the cloud, Swift on premise), which is somewhat
>>>>>> similar.
>>>>>>
>>>>>> If you want additional performance, you could fetch the data from
>>>>>> the object stores and store it temporarily in a local HDFS. Not
>>>>>> sure to what extent this affects regulatory requirements, though.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> On 31 May 2017, at 18:07, Mich Talebzadeh
>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I realize this may not have direct relevance to Spark, but has
>>>>>> anyone tried to create virtualized HDFS clusters using tools like
>>>>>> Isilon or similar?
>>>>>>
>>>>>> The prime motive behind this approach is to minimize the
>>>>>> propagation or copying of data, which has regulatory implications.
>>>>>> In short, you want your data to be in one place regardless of the
>>>>>> artefacts, such as Spark, used against it.
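Jörn's suggestion of staging object-store data into a local HDFS for performance can be sketched with standard Hadoop tooling (the bucket name and paths below are hypothetical, and the `distcp` step assumes the S3A connector is configured on the cluster):

```shell
# Bulk-copy a dataset from the object store into the cluster-local
# HDFS; hadoop distcp is the standard tool for this.
hadoop distcp s3a://my-bucket/raw/events hdfs:///staging/events

# Run the compute job against the fast local copy.
spark-submit --class MyJob myjob.jar hdfs:///staging/events

# Remove the staged copy afterwards, so the object store remains the
# single system of record (relevant to the regulatory concern in the
# thread).
hdfs dfs -rm -r -skipTrash /staging/events
```

Whether a temporary local copy like this is acceptable under the "data stays in one place" requirement is exactly the open question Jörn raises.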
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dr Mich Talebzadeh