Sorry, cannot help you there - I do not know the cost for Isilon. I also cannot predict what the majority will do ...
> On 18. Jun 2017, at 21:49, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> thanks Jörn.
>
> I have been told that Hadoop 3 (alpha testing now) will support Docker and
> virtualised Hadoop clusters.
>
> Also, if we decided to use something like Isilon and BlueData to create
> zoning (meaning two different Hadoop clusters migrated to Isilon storage,
> each residing in its own zone/compartment) and virtualised clusters, we
> would have to migrate two separate physical Hadoop clusters to Isilon and
> then create the structure.
>
> My point is, if we went that way we would have to weigh up the cost and
> effort of migrating two Hadoop clusters to Isilon, versus adding one Hadoop
> cluster to the other to make one cluster out of two, still with the
> underlying HDFS file system. And then of course: how many companies are
> going this way, and what is the overriding reason to use such an approach?
> What will happen if we have performance issues - where do we pinpoint the
> bottleneck, Isilon or the third-party Hadoop vendor? There is really no
> community to rely on either.
>
> Your thoughts?
>
> Thanks
>
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
>> On 15 June 2017 at 21:27, Jörn Franke <jornfra...@gmail.com> wrote:
>> On HDFS you have storage policies where you can define SSD etc.:
>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
>>
>> Not sure if this is a similar offering to what you refer to.
>>
>> OpenStack Swift is similar to S3 but for your own data center:
>> https://docs.openstack.org/developer/swift/associated_projects.html
>>
>>> On 15. Jun 2017, at 21:55, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> In Isilon etc. you have an SSD tier, a middle tier and an archive tier
>>> where data is moved. Can that be implemented in HDFS itself, Jörn? What
>>> is Swift? Is that a low-level archive disk?
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
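
For concreteness, the HDFS storage policies referenced above are set per
path with the standard CLI, assuming the DataNode data directories have been
tagged with storage types ([SSD], [DISK], [ARCHIVE]) in hdfs-site.xml. A
minimal sketch - the paths and policy choices are illustrative only:

    # List the policies the cluster supports (HOT, WARM, COLD, ONE_SSD, ALL_SSD, ...)
    hdfs storagepolicies -listPolicies

    # Pin a hot dataset to SSD-backed volumes (illustrative path)
    hdfs storagepolicies -setStoragePolicy -path /data/hot -policy ALL_SSD

    # Age a cold dataset onto archive volumes
    hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD

    # A policy only applies to newly written blocks; run the mover to
    # migrate existing blocks onto the right storage type
    hdfs mover -p /data/cold

That gives roughly the SSD / middle / archive tiering described for Isilon,
though movement between tiers is explicit (via the mover) rather than
automatic.
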
>>>> On 15 June 2017 at 20:42, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> Well, this happens also if you use Amazon EMR - most data will be stored
>>>> on S3, and there you also have no data locality. You can move it
>>>> temporarily to HDFS or in-memory (Ignite), and you can use sampling etc.
>>>> to avoid the need to process all the data. In fact, that is done in
>>>> Spark machine learning algorithms (stochastic gradient descent etc.).
>>>> This avoids having to move all the data through the network, and you
>>>> lose only a little precision (and you can statistically reason about
>>>> that).
>>>>
>>>> For a lot of data I also see the trend that companies move it to cheap
>>>> object storage (Swift etc.) anyway to reduce cost - particularly because
>>>> it is not used often.
>>>>
>>>>> On 15. Jun 2017, at 21:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> thanks Jörn.
>>>>>
>>>>> If the idea is to separate compute from data using Isilon etc., then
>>>>> one is going to lose the locality of data.
>>>>>
>>>>> Also, the argument is that we would like to run queries/reports against
>>>>> two independent clusters simultaneously, so:
>>>>>
>>>>> 1. Use Isilon OneFS for Big Data to migrate two independent Hadoop
>>>>>    clusters into Isilon OneFS.
>>>>> 2. Locate data from each cluster in its own zone in Isilon.
>>>>> 3. Run queries to combine data from each zone.
>>>>> 4. Use BlueData to create virtual Hadoop clusters on top of Isilon, so
>>>>>    one isolates the performance impact of analytics/Data Science versus
>>>>>    other users.
>>>>>
>>>>> Now that is easier said than done, as usual. First you have to migrate
>>>>> the two existing clusters' data into zones in Isilon. Then you are
>>>>> effectively separating compute from data, so data locality is lost.
>>>>> This is no different from your Spark cluster accessing data from each
>>>>> cluster. There are a lot of tangential arguments here - for example,
>>>>> Isilon uses RAID, so you don't need to replicate your data at R3. Even
>>>>> including the Isilon licensing cost, the total cost goes down!
>>>>>
>>>>> The side effect is the network, now that you have lost data locality:
>>>>> how fast is your network going to be to handle the throughput? Networks
>>>>> are shared across, say, a bank, unless you spend $$$ creating private
>>>>> InfiniBand networks. A standard 10 Gbit/s network is not going to be
>>>>> good enough.
>>>>>
>>>>> Also, in reality BlueData does not need Isilon; it runs on HP and other
>>>>> hardware as well. In Apache Hadoop 3.0 the Docker engine on YARN is
>>>>> available - alpha currently, to be released at the end of this year. As
>>>>> we have not started on Isilon, it may be worth looking at this too?
>>>>>
>>>>> Cheers
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>> On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>> It does not matter to Spark - you just put the HDFS URL of the
>>>>>> namenode there. Of course the issue is that you lose data locality,
>>>>>> but that would also be the case for Oracle.
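
A minimal sketch of that cross-cluster join, assuming both namenodes are
reachable from the Spark executors (hosts, ports, paths and the join column
are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cross-cluster-join").getOrCreate()

    // A fully qualified HDFS URL selects the cluster; nothing else is
    // needed as long as both namenodes are reachable (illustrative hosts/paths).
    val dfA = spark.read.parquet("hdfs://namenode-a:8020/data/trades")
    val dfB = spark.read.parquet("hdfs://namenode-b:8020/data/positions")

    // The join runs in Spark as usual; at least one side is read remotely,
    // so there is no data locality and the network carries the data.
    val joined = dfA.join(dfB, Seq("instrument_id"))
    joined.show()
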
>>>>>>> On 15. Jun 2017, at 18:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> With Spark, how easy is it to fetch data from two different clusters
>>>>>>> and do a join in Spark?
>>>>>>>
>>>>>>> I can use two JDBC connections to join two tables from two different
>>>>>>> Oracle instances in Spark, by creating two DataFrames and joining
>>>>>>> them together.
>>>>>>>
>>>>>>> Would that be possible for data residing on two different HDFS
>>>>>>> clusters?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
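
For reference, the JDBC variant described above looks roughly like this
(connection URLs, credentials and table names are illustrative, and the
Oracle JDBC driver must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("two-oracle-join").getOrCreate()

    // One DataFrame per Oracle instance (all connection details illustrative)
    val df1 = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@host1:1521:DB1")
      .option("dbtable", "SCOTT.SALES")
      .option("user", "scott")
      .option("password", "...")
      .load()

    val df2 = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@host2:1521:DB2")
      .option("dbtable", "SCOTT.CHANNELS")
      .option("user", "scott")
      .option("password", "...")
      .load()

    // The join itself is executed by Spark, not by either database
    val joined = df1.join(df2, Seq("CHANNEL_ID"))
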