On HDFS you have storage policies where you can define SSD, archive tiers etc.: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
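For example, something like the following should tag a directory with a policy (a rough, untested sketch; the path and policy name are just placeholders, and it assumes the DataNodes declare matching storage types such as [SSD] or [ARCHIVE] in dfs.datanode.data.dir; on older 2.x releases setStoragePolicy may only be on DistributedFileSystem rather than FileSystem):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Tag a directory so newly written blocks go to archive storage.
  // CLI equivalent: hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
  val fs = FileSystem.get(new Configuration())
  fs.setStoragePolicy(new Path("/data/archive"), "COLD")  // other policies: HOT, WARM, ALL_SSD, ONE_SSD

  // Blocks that already exist only move after running the mover tool:
  //   hdfs mover -p /data/archive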
Not sure if this is a similar offering to what you refer to. OpenStack Swift is similar to S3, but for your own data centre: https://docs.openstack.org/developer/swift/associated_projects.html

> On 15. Jun 2017, at 21:55, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> In Isilon etc. you have SSD, a middle layer and an archive layer where data is
> moved. Can that be implemented in HDFS itself, Jörn? What is Swift? Is that a
> low-level archive disk?
>
> thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>> On 15 June 2017 at 20:42, Jörn Franke <jornfra...@gmail.com> wrote:
>> Well, this happens also if you use Amazon EMR - most data will be stored on
>> S3, and there you also have no data locality. You can move it temporarily to
>> HDFS or in-memory (Ignite), and you can use sampling etc. to avoid the need
>> to process all the data. In fact, that is done in Spark machine learning
>> algorithms (stochastic gradient descent etc.). This avoids having to move
>> all the data through the network, and you lose only a little precision
>> (and you can reason about that statistically).
>> For a lot of data I also see the trend that companies move it anyway to
>> cheap object storage (Swift etc.) to reduce cost - particularly because it
>> is not used often.
>>
>>
>>> On 15. Jun 2017, at 21:34, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> thanks Jörn.
>>>
>>> If the idea is to separate compute from data using Isilon etc., then one is
>>> going to lose the locality of data.
>>>
>>> Also, the argument is that we would like to run queries/reports against two
>>> independent clusters simultaneously, so do this:
>>>
>>> Use Isilon OneFS for Big Data to migrate two independent Hadoop clusters
>>> into Isilon OneFS
>>> Locate data from each cluster in its own zone in Isilon
>>> Run queries to combine data from each zone
>>> Use BlueData to create virtual Hadoop clusters on top of Isilon, so one
>>> isolates the performance impact of analytics/Data Science versus other users
>>>
>>> Now that is easier said than done, as usual. First you have to migrate the
>>> two existing clusters' data into zones in Isilon. Then you are effectively
>>> separating compute from data, so data locality is lost. This is no different
>>> from your Spark cluster accessing data from each cluster. There are a lot
>>> of tangential arguments here, like Isilon will use RAID and you don't need
>>> to replicate your data with replication factor 3. Even including Isilon
>>> licensing cost, the total cost goes down!
>>>
>>> The side effect is the network, now that you have lost data locality. How
>>> fast is your network going to be to handle the throughput? Networks are
>>> shared across, say, a bank, unless you spend $$$ creating private InfiniBand
>>> networks. Standard 10 Gbit/s is not going to be good enough.
>>>
>>> Also, in reality BlueData does not need Isilon; it runs on HP and other
>>> hardware as well. In Apache Hadoop 3.0 the Docker engine on YARN is
>>> available - alpha currently, to be released at the end of this year. As we
>>> have not started on Isilon, it may be worth looking at this as well?
>>>
>>> Cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising from
>>> such loss, damage or destruction.
>>>
>>>
>>>> On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> It does not matter to Spark; you just put the HDFS URL of the namenode
>>>> there. Of course the issue is that you lose data locality, but this would
>>>> also be the case for Oracle.
>>>>
>>>>> On 15. Jun 2017, at 18:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> With Spark, how easy is it to fetch data from two different clusters and
>>>>> do a join in Spark?
>>>>>
>>>>> I can use two JDBC connections to join two tables from two different
>>>>> Oracle instances in Spark by creating two DataFrames and joining
>>>>> them together.
>>>>>
>>>>> Would that be possible for data residing on two different HDFS clusters?
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>>>> loss, damage or destruction of data or any other property which may arise
>>>>> from relying on this email's technical content is explicitly disclaimed.
>>>>> The author will in no case be liable for any monetary damages arising
>>>>> from such loss, damage or destruction.
>>>
>
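Coming back to the original question at the bottom of the thread: as Jörn says, you just fully qualify each namenode in the path. A minimal sketch in Scala (namenode hosts, ports, paths and the join key are all placeholders; both clusters must of course be reachable from the executors):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("cross-cluster-join").getOrCreate()

  // One DataFrame per cluster, each read through its own namenode URI
  val dfA = spark.read.parquet("hdfs://namenodeA:8020/warehouse/sales")
  val dfB = spark.read.parquet("hdfs://namenodeB:8020/warehouse/customers")

  // Join exactly as with two JDBC-backed DataFrames; data locality is lost for
  // whichever cluster the executors are not co-located with
  val joined = dfA.join(dfB, Seq("customer_id"))

  // Per Jörn's earlier point, sampling one side first cuts the network traffic
  // at the cost of a little precision (the fraction here is arbitrary)
  val joinedSample = dfA.sample(withReplacement = false, fraction = 0.1)
    .join(dfB, Seq("customer_id"))

  joined.show()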