Thanks Jörn.

If the idea is to separate compute from data using Isilon etc., then one is
going to lose data locality.

Also, the argument is that we would like to run queries/reports against two
independent clusters simultaneously, so the plan would be:


   1. Use Isilon OneFS
   <https://en.wikipedia.org/wiki/OneFS_distributed_file_system> for Big
   Data to migrate the two independent Hadoop clusters onto Isilon OneFS
   2. Locate the data from each cluster in its own access zone in Isilon
   3. Run queries that combine data from both zones (see the Spark sketch
   after this list)
   4. Use BlueData
<https://www.bluedata.com/blog/2016/10/next-generation-big-data-with-dell-and-emc/>
   to create virtual Hadoop clusters on top of Isilon, so that the
   performance impact of analytics/Data Science is isolated from other users
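
To make step 3 concrete, here is a minimal Spark sketch of what a
cross-zone join could look like. The zone hostnames, paths, and column
names are hypothetical; the only real requirement is that each path
carries a fully qualified HDFS URI, so one Spark session can read from
both sources:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CrossZoneJoin")
      .getOrCreate()

    // Each Isilon access zone (or each HDFS cluster) is addressed by its
    // own fully qualified URI -- hostnames and paths are made up here.
    val dfA = spark.read.parquet("hdfs://zoneA.isilon.local:8020/data/trades")
    val dfB = spark.read.parquet("hdfs://zoneB.isilon.local:8020/data/positions")

    // Join the two sources on a hypothetical common key and inspect.
    val joined = dfA.join(dfB, Seq("account_id"))
    joined.show()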


Now that is easier said than done, as usual. First you have to migrate the
data from the two existing clusters into zones in Isilon. Then you are
effectively separating compute from data, so data locality is lost. This is
no different from your Spark cluster accessing data from each cluster.
There are a lot of tangential arguments here, like the fact that Isilon
uses its own RAID-style protection, so you don't need to replicate your
data with the HDFS replication factor of 3 (R3). Even including the Isilon
licensing cost, the total cost goes down!
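
As a rough back-of-envelope (the Isilon overhead figure is my assumption,
not a vendor quote): storing 1 PB of data on HDFS with replication factor
3 needs 3 PB of raw disk, whereas parity-style protection at, say, 20-30%
overhead needs roughly 1.2-1.3 PB. That difference in raw capacity is
where the licensing cost can be absorbed.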

The side effect, now that you have lost data locality, is the network: how
fast does your network have to be to handle the required throughput?
Networks are shared across, say, a bank, unless you spend $$$ creating
private InfiniBand networks. Standard 10 Gbit/s is not going to be good
enough.
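
Again as a rough illustration with assumed figures: a 10 Gbit/s link tops
out at about 1.25 GB/s, so a single scan of a 10 TB dataset over that link
takes on the order of 10,000 / 1.25 ≈ 8,000 seconds, i.e. well over two
hours, before contention from other users of the shared network is even
counted. Co-located HDFS disks would serve much of that read locally.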

Also, in reality BlueData does not need Isilon; it runs on HP and other
hardware as well. In Apache Hadoop 3.0, the Docker engine on YARN is
available. It is currently in alpha and will be released at the end of this
year. As we have not started on Isilon, it may be worth looking at this as
well?
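
For reference (based on the current alpha documentation, so the property
names may still change before release): Docker support is switched on
through the LinuxContainerExecutor, with
yarn.nodemanager.runtime.linux.allowed-runtimes set to "default,docker" in
yarn-site.xml, and an application then opts in per container via the
YARN_CONTAINER_RUNTIME_TYPE=docker and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE
environment variables.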

Cheers




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote:

> It does not matter to Spark you just put the HDFS URL of the namenode
> there. Of course the issue is that you lose data locality, but this would
> be also the case for Oracle.
>
> On 15. Jun 2017, at 18:03, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi,
>
> With Spark, how easy is it to fetch data from two different clusters and
> do a join in Spark?
>
> I can use two JDBC connections to join two tables from two different
> Oracle instances in Spark, through creating two DataFrames and joining
> them together.
>
> Would that be possible for data residing on two different HDFS clusters?
>
> thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
