On HDFS you have storage policies where you can define SSD, archive tiers etc.: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
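For example, something like the following should tag a directory with a policy (a rough, untested sketch; the path and policy name are just placeholders, and it assumes the DataNodes declare matching storage types such as [SSD] or [ARCHIVE] in dfs.datanode.data.dir; on older 2.x releases setStoragePolicy may only be on DistributedFileSystem rather than FileSystem):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Tag a directory so newly written blocks go to archive storage.
  // CLI equivalent: hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
  val fs = FileSystem.get(new Configuration())
  fs.setStoragePolicy(new Path("/data/archive"), "COLD")  // other policies: HOT, WARM, ALL_SSD, ONE_SSD

  // Blocks that already exist only move after running the mover tool:
  //   hdfs mover -p /data/archive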
Not sure if this is a similar offering to what you refer to. OpenStack Swift is similar to S3, but for your own data centre: https://docs.openstack.org/developer/swift/associated_projects.html

> On 15. Jun 2017, at 21:55, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> In Isilon etc. you have SSD, a middle layer and an archive layer where data is
> moved. Can that be implemented in HDFS itself, Jörn? What is Swift? Is that a
> low-level archive disk?
>
> thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>> On 15 June 2017 at 20:42, Jörn Franke <jornfra...@gmail.com> wrote:
>> Well, this happens also if you use Amazon EMR - most data will be stored on
>> S3, and there you also have no data locality. You can move it temporarily to
>> HDFS or in-memory (Ignite), and you can use sampling etc. to avoid the need
>> to process all the data. In fact, that is done in Spark machine learning
>> algorithms (stochastic gradient descent etc.). This avoids having to move
>> all the data through the network, and you lose only a little precision
>> (and you can reason about that statistically).
>> For a lot of data I also see the trend that companies move it anyway to
>> cheap object storage (Swift etc.) to reduce cost - particularly because it
>> is not used often.
>>
>>
>>> On 15. Jun 2017, at 21:34, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> thanks Jörn.
>>>
>>> If the idea is to separate compute from data using Isilon etc., then one is
>>> going to lose the locality of data.
>>>
>>> Also, the argument is that we would like to run queries/reports against two
>>> independent clusters simultaneously, so do this:
>>>
>>> Use Isilon OneFS for Big Data to migrate two independent Hadoop clusters
>>> into Isilon OneFS
>>> Locate data from each cluster in its own zone in Isilon
>>> Run queries to combine data from each zone
>>> Use BlueData to create virtual Hadoop clusters on top of Isilon, so one
>>> isolates the performance impact of analytics/Data Science versus other users
>>>
>>> Now that is easier said than done, as usual. First you have to migrate the
>>> two existing clusters' data into zones in Isilon. Then you are effectively
>>> separating compute from data, so data locality is lost. This is no different
>>> from your Spark cluster accessing data from each cluster. There are a lot
>>> of tangential arguments here, like Isilon will use RAID and you don't need
>>> to replicate your data with replication factor 3. Even including Isilon
>>> licensing cost, the total cost goes down!
>>>
>>> The side effect is the network, now that you have lost data locality. How
>>> fast is your network going to be to handle the throughput? Networks are
>>> shared across, say, a bank, unless you spend $$$ creating private InfiniBand
>>> networks. Standard 10 Gbit/s is not going to be good enough.
>>>
>>> Also, in reality BlueData does not need Isilon; it runs on HP and other
>>> hardware as well. In Apache Hadoop 3.0 the Docker engine on YARN is
>>> available - alpha currently, to be released at the end of this year. As we
>>> have not started on Isilon, it may be worth looking at this as well?
>>>
>>> Cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising from
>>> such loss, damage or destruction.
>>>
>>>
>>>> On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> It does not matter to Spark; you just put the HDFS URL of the namenode
>>>> there. Of course the issue is that you lose data locality, but this would
>>>> also be the case for Oracle.
>>>>
>>>>> On 15. Jun 2017, at 18:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> With Spark, how easy is it to fetch data from two different clusters and
>>>>> do a join in Spark?
>>>>>
>>>>> I can use two JDBC connections to join two tables from two different
>>>>> Oracle instances in Spark by creating two DataFrames and joining
>>>>> them together.
>>>>>
>>>>> Would that be possible for data residing on two different HDFS clusters?
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>>>> loss, damage or destruction of data or any other property which may arise
>>>>> from relying on this email's technical content is explicitly disclaimed.
>>>>> The author will in no case be liable for any monetary damages arising
>>>>> from such loss, damage or destruction.
>>>
>
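Coming back to the original question at the bottom of the thread: as Jörn says, you just fully qualify each namenode in the path. A minimal sketch in Scala (namenode hosts, ports, paths and the join key are all placeholders; both clusters must of course be reachable from the executors):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("cross-cluster-join").getOrCreate()

  // One DataFrame per cluster, each read through its own namenode URI
  val dfA = spark.read.parquet("hdfs://namenodeA:8020/warehouse/sales")
  val dfB = spark.read.parquet("hdfs://namenodeB:8020/warehouse/customers")

  // Join exactly as with two JDBC-backed DataFrames; data locality is lost for
  // whichever cluster the executors are not co-located with
  val joined = dfA.join(dfB, Seq("customer_id"))

  // Per Jörn's earlier point, sampling one side first cuts the network traffic
  // at the cost of a little precision (the fraction here is arbitrary)
  val joinedSample = dfA.sample(withReplacement = false, fraction = 0.1)
    .join(dfB, Seq("customer_id"))

  joined.show()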