In Isilon etc. you have an SSD layer, a middle layer and an archive layer
that data is moved between. Can that be implemented in HDFS itself, Jörn?
And what is Swift? Is that a low-level archive disk?
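
From what I can see, HDFS itself does have heterogeneous storage along these
lines (storage types DISK, SSD and ARCHIVE, with policies such as HOT, WARM
and COLD), though I have not tried it myself. Something like this, where the
path is just an example:

    # list the built-in storage policies
    hdfs storagepolicies -listPolicies

    # pin cold data to the ARCHIVE tier
    hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD

    # migrate existing blocks to match the new policy
    hdfs mover -p /data/cold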

thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 June 2017 at 20:42, Jörn Franke <jornfra...@gmail.com> wrote:

> Well this happens also if you use Amazon EMR - most data will be stored on
> S3, and there you also have no data locality. You can move it temporarily to
> HDFS or in-memory (Ignite), and you can use sampling etc. to avoid the need
> to process all the data. In fact, that is done in Spark machine learning
> algorithms (stochastic gradient descent etc.). This avoids moving all the
> data through the network, and you lose only a little precision (and you can
> statistically reason about that).
> For a lot of data I also see the trend that companies move it anyway to
> cheap object storage (Swift etc.) to reduce cost - particularly because it
> is not used often.
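>
> As a rough sketch of the sampling idea in spark-shell (the path, column
> name and the 1% fraction are just placeholders):
>
>     // read the full dataset lazily; nothing is scanned yet
>     val df = spark.read.parquet("hdfs://namenode:8020/data/events")
>     // work on an approximate 1% sample instead of all rows
>     val sample = df.sample(withReplacement = false, fraction = 0.01, seed = 42)
>     sample.groupBy("some_column").count().show()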
>
>
> On 15. Jun 2017, at 21:34, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> thanks Jörn.
>
> If the idea is to separate compute from data using Isilon etc. then one is
> going to lose data locality.
>
> Also, the argument is that we would like to run queries/reports against two
> independent clusters simultaneously, so the plan is to:
>
>
>    1. Use Isilon OneFS
>    <https://en.wikipedia.org/wiki/OneFS_distributed_file_system> for Big
>    Data, migrating the two independent Hadoop clusters into Isilon OneFS
>    2. Locate data from each cluster in its own access zone in Isilon
>    3. Run queries that combine data from each zone
>    4. Use BlueData
>    <https://www.bluedata.com/blog/2016/10/next-generation-big-data-with-dell-and-emc/>
>    to create virtual Hadoop clusters on top of Isilon, so one isolates the
>    performance impact of analytics/Data Science from other users
>
>
> Now that is easier said than done, as usual. First you have to migrate the
> two existing clusters' data into zones in Isilon. Then you are effectively
> separating compute from data, so data locality is lost. This is no different
> from your Spark cluster accessing data from each cluster. There are a lot of
> tangential arguments here, for example that Isilon uses its own RAID-style
> protection, so you don't need to replicate your data three ways (HDFS
> replication factor 3). Even including Isilon licensing cost, the total
> cost goes down!
>
> The side effect is the network, now that you have lost data locality: how
> fast does your network have to be to handle the throughput? Networks are
> shared across, say, a bank, unless you spend $$$ creating private InfiniBand
> networks. Standard 10 Gbit/s is not going to be good enough.
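>
> As a rough back-of-envelope (my own numbers, not a measurement): 10 Gbit/s
> is at best about 1.25 GB/s, so pulling a 10 TB table across the wire takes
>
>     10,000 GB / 1.25 GB/s = 8,000 s, i.e. roughly 2.2 hours
>
> at full line rate, before any contention from everything else sharing the
> network.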
>
> Also, in reality BlueData does not need Isilon; it runs on HP and other
> hardware as well. In Apache Hadoop 3.0 a Docker engine on YARN becomes
> available - currently alpha, due to be released at the end of this year. As
> we have not started on Isilon, it may be worth looking at this as well?
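>
> From the Hadoop 3.0 alpha documentation, enabling it looks roughly like the
> following yarn-site.xml fragment (property names may still change before the
> GA release):
>
>     <property>
>       <name>yarn.nodemanager.container-executor.class</name>
>       <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
>     </property>
>     <property>
>       <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
>       <value>default,docker</value>
>     </property>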
>
> Cheers
>
>
> On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> It does not matter to Spark, you just put the HDFS URL of the namenode
>> there. Of course the issue is that you lose data locality, but this would
>> also be the case for Oracle.
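>>
>> For example, in spark-shell (host names, ports, paths and the join column
>> are made up):
>>
>>     val df1 = spark.read.parquet("hdfs://namenode1:8020/data/tableA")
>>     val df2 = spark.read.parquet("hdfs://namenode2:8020/data/tableB")
>>     // the join shuffles rows from both clusters over the network
>>     val joined = df1.join(df2, Seq("id"))
>>     joined.show()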
>>
>> On 15. Jun 2017, at 18:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> With Spark, how easy is it to fetch data from two different clusters and
>> do a join?
>>
>> I can use two JDBC connections to join two tables from two different
>> Oracle instances in Spark, by creating two DataFrames and joining them
>> together.
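>>
>> i.e. something along these lines (connection details and the join key are
>> made up):
>>
>>     val sales = spark.read.format("jdbc")
>>       .option("url", "jdbc:oracle:thin:@host1:1521:DB1")
>>       .option("dbtable", "scott.sales")
>>       .option("user", "scott").option("password", "***")
>>       .load()
>>     val customers = spark.read.format("jdbc")
>>       .option("url", "jdbc:oracle:thin:@host2:1521:DB2")
>>       .option("dbtable", "scott.customers")
>>       .option("user", "scott").option("password", "***")
>>       .load()
>>     val joined = sales.join(customers, Seq("customer_id"))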
>>
>> Would that be possible for data residing on two different HDFS clusters?
>>
>> thanks
>>
>>
>
