Re: Possible DR solution

deepak.subhramanian Sat, 12 Nov 2016 11:28:40 -0800


-------- Original message --------
From: Timur Shenkao <t...@timshenkao.su>
Date: 12/11/2016 09:17 (GMT-08:00)
To: Mich Talebzadeh <mich.talebza...@gmail.com>, user@spark.apache.org
Subject: Re: Possible DR solution

Hi guys!

1) Though it's quite interesting, I believe that this discussion is not about 
Spark :)
2) If you are interested, there is a solution by Cloudera: 
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_bdr_replication_intro.html
(it requires that the *source cluster* has a Cloudera Enterprise license, so it's 
not free). 
Correct me if I'm wrong, but I don't remember a specialized replication solution 
by Hortonworks (Atlas, Falcon, etc. are not precisely about inter-cluster 
replication).
Some projects from the Hadoop ecosystem try to implement replication of their 
own: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html
3) Read this discussion: 
https://community.hortonworks.com/questions/29645/hdfs-replication-for-dr.html
4) I prefer bash scripts / Python scripts / Oozie jobs + distcp: it's free 
and I control precisely what's going on. But in the case of huge clusters and 
sophisticated logic, this approach becomes cumbersome.
5) Don't forget about security and encryption: your sensitive data may be read 
by third-party agents during replication.
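
The "scripts + distcp" approach in point 4 could look roughly like the sketch below. It is only an illustration, not anyone's production script: the NameNode addresses, the path and the helper names `build_distcp_command` and `replicate` are hypothetical, while `-update`, `-pugp`, `-m` and `-bandwidth` are standard DistCp options. By default it only prints the command; actually running it assumes a Hadoop client on the PATH.

```python
import subprocess


def build_distcp_command(src, dst, max_maps=20, bandwidth_mb=50):
    """Build a `hadoop distcp` command line for an incremental copy.

    The default values are illustrative assumptions, not recommendations.
    """
    return [
        "hadoop", "distcp",
        "-update",                        # copy only files that differ on the target
        "-pugp",                          # preserve user, group and permissions
        "-m", str(max_maps),              # cap the number of simultaneous map tasks
        "-bandwidth", str(bandwidth_mb),  # per-map bandwidth limit in MB/s
        src,
        dst,
    ]


def replicate(src, dst, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = build_distcp_command(src, dst)
    print(" ".join(cmd))
    if not dry_run:
        # Raises CalledProcessError if distcp exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical primary and DR NameNode addresses.
    replicate("hdfs://primary-nn:8020/data/warehouse",
              "hdfs://dr-nn:8020/data/warehouse")
```

A wrapper like this is typically scheduled from cron or an Oozie coordinator. For point 5, the copy itself does not encrypt anything, so wire encryption has to be configured on the clusters separately.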

On Sat, Nov 12, 2016 at 6:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:
Thanks Jorn.
The way WanDisco promotes itself is as doing block-level replication. As I 
understand it, you modify core-site.xml and add a couple of network server 
locations there. They call this tool Fusion. There are at least 2 Fusion servers 
for high availability. Each one, among other things, has a database of its own. 
Once the client interacts with HDFS, the Fusion server behaves like a sniffer 
with its own port. As soon as the first HDFS block of 256 MB, out of say a file 
of 30 GB, is written, it starts sending that block to the recipient. The laws of 
physics, the pipeline size etc. apply here. That is up to the consumer. It can 
replicate 10 files at the same time etc. So that is all. It is a known 
technology now labeled as streaming. So in summary it does not have to wait for 
the full file to be written to HDFS before replicating blocks. That is where it 
scores.
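
The arithmetic behind block-level streaming can be sketched as follows. The 256 MB block size and the 30 GB file size are the figures from this message; the link speed is a made-up assumption, and the model ignores write time, protocol overhead and parallel streams, so it only illustrates why shipping per block can start long before the file is complete.

```python
import math

BLOCK_MB = 256          # HDFS block size quoted in the message
FILE_MB = 30 * 1024     # 30 GB file quoted in the message
LINK_MB_PER_S = 100     # assumed WAN throughput; purely illustrative


def block_count(file_mb, block_mb):
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_mb / block_mb)


def transfer_seconds(size_mb, link_mb_per_s):
    """Idealised time to push size_mb over the link, ignoring overhead."""
    return size_mb / link_mb_per_s


blocks = block_count(FILE_MB, BLOCK_MB)
print(f"blocks in file: {blocks}")
# Block-level streaming can ship the first block once 256 MB has landed,
# i.e. after 1/blocks of the file has been written.
print(f"fraction written before replication can start: 1/{blocks}")
print(f"seconds to ship one block: {transfer_seconds(BLOCK_MB, LINK_MB_PER_S):.2f}")
print(f"seconds to ship whole file: {transfer_seconds(FILE_MB, LINK_MB_PER_S):.1f}")
```

With these figures the file spans 120 blocks, so a batch copy that waits for the file to close starts only after all 120 have been written.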
It helps WAN work. Say the primary/active HDFS is in London and the replicate 
is in Singapore. Users in Singapore can then see the replicated data (eventually), 
once it gets there. It can obviously be used for DR; in that case it is like a 
hot standby (borrowing terminology from Sybase). In contrast, one can do the 
same with periodic loads using homemade tools or tools like BDR from Cloudera.
I mentioned that Hive is going to have its metastore on HBase as well and that 
can be a potential problem. The site is here 
They are claiming there are no competitors in the market for their streaming HA 
product.
HTH


Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 12 November 2016 at 11:17, Jörn Franke <jornfra...@gmail.com> wrote:
What is wrong with good old batch transfer for moving data from one 
cluster to another? I assume your use case is only business continuity in case 
of disasters such as data center loss, which are unlikely to happen (which 
does not mean they never happen) and where you could afford to lose one day 
(or hour) of data (depends!).
Nevertheless, I assume he refers to the Hadoop storage policies: 
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
, but this still only works within the same cluster. 
You could also develop a custom secondary file system, similar to the Ignite 
Cache filesystem, that sits on top of HDFS and, as soon as it receives data, 
sends it to another cluster and provides it to HDFS. Not knowing WanDisco, I 
assume that is what it does. Given the prices (and the fact that clusters tend 
to grow) you may want to evaluate whether buying or building makes sense. In any 
case, it also requires an evaluation of network throughput, because this may 
become the bottleneck somewhere (either within the cluster or, more likely, 
between data centers).
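
As a rough way to do the throughput evaluation mentioned above, one could estimate the sustained bandwidth a periodic batch copy needs. Every number below (daily change volume, batch window, link speed, usable share of the link) is a hypothetical input, not a figure from this thread.

```python
def required_mbit_per_s(delta_gb, window_hours):
    """Sustained throughput (Mbit/s) needed to copy delta_gb within window_hours."""
    megabits = delta_gb * 1024 * 8      # GB -> MB -> Mbit (binary GB, for simplicity)
    return megabits / (window_hours * 3600)


def fits(delta_gb, window_hours, link_mbit_per_s, usable_fraction=0.7):
    """True if the batch fits in the window using only part of the link."""
    needed = required_mbit_per_s(delta_gb, window_hours)
    return needed <= link_mbit_per_s * usable_fraction


# Hypothetical inputs: 2 TB of changed data per day, 6-hour window, 1 Gbit/s link.
need = required_mbit_per_s(2048, 6)
print(f"required: {need:.0f} Mbit/s, fits: {fits(2048, 6, 1000)}")
```

If the batch does not fit, the options are a longer window, a faster link, or copying smaller increments more often, which moves the design toward the streaming approach discussed in this thread.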
As you mentioned, Hbase & Co may require a special consideration for the case 
that data is in-memory and not yet persisted.
On Sat, Nov 12, 2016 at 12:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:
Thanks Vince.
Can you provide more details on this, please?



On 12 November 2016 at 09:52, vincent gromakowski 
<vincent.gromakow...@gmail.com> wrote:
An HDFS tiering policy with good tags should be similar.

On 11 Nov 2016 at 11:19 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
I really don't see why one would want to set up streaming replication, except 
in situations where functionality similar to that of transactional databases is 
required in big data.



On 11 November 2016 at 17:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
I think it differs in that it starts streaming data through its own port as 
soon as the first block has landed, so the granularity is a block.
However, think of it as Oracle GoldenGate replication or SAP replication for 
databases. The only difference is that if there is corruption in a block, with 
HDFS it will be replicated, much like SRDF, 
whereas with Oracle or SAP it is log-based replication, which stops when it 
encounters corruption.
Replication depends on the block, so it can replicate Hive metadata, fsimage 
etc., but it cannot replicate the HBase memstore if HBase crashes.
So that is the gist of it: streaming replication as opposed to snapshot. 
Sounds familiar. Think of it as log shipping in the old Oracle days versus 
GoldenGate etc.
HTH



On 11 November 2016 at 17:14, Deepak Sharma <deepakmc...@gmail.com> wrote:
The reason being that you can set up HDFS duplication to some other cluster on your own.

On Nov 11, 2016 22:42, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
Reason being?



On 11 November 2016 at 17:11, Deepak Sharma <deepakmc...@gmail.com> wrote:
This is a waste of money, I guess.

On Nov 11, 2016 22:41, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
It starts at $4,000 per node per year, all inclusive.
With a discount it can be halved, but the price is per node, so if you have 
5 nodes in primary and 5 nodes in DR we are already talking about $40K.
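
The cost arithmetic above can be written out as a small sketch. The $4,000 list price, the 50% discount and the 5 + 5 node layout come from this message; the function name is hypothetical, and the model assumes every node in both clusters needs a license.

```python
def annual_license_cost(primary_nodes, dr_nodes, price_per_node=4000, discount=0.0):
    """Yearly cost when every node in both clusters is licensed."""
    nodes = primary_nodes + dr_nodes
    return nodes * price_per_node * (1 - discount)


print(annual_license_cost(5, 5))                 # list price for 10 nodes
print(annual_license_cost(5, 5, discount=0.5))   # with the price halved
```

Because the cost scales linearly with node count, it grows with the cluster, which is worth keeping in mind when comparing against a self-built distcp setup.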
HTH



On 11 November 2016 at 16:43, Mudit Kumar <mkumar...@sapient.com> wrote:
Is it feasible cost-wise?

Thanks,
Mudit

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, November 11, 2016 2:56 PM
To: user @spark
Subject: Possible DR solution
 


Hi,

Has anyone had experience of using WanDisco block replication to create a 
fault-tolerant solution to DR in Hadoop?

The product claims that it starts replicating as soon as the first data block 
lands on HDFS, taking the block and sending it to the DR/replicate site. The 
idea is that this is faster than doing it through traditional HDFS copy tools, 
which are normally batch oriented.

It also claims to replicate Hive metadata.

I wanted to gauge if anyone has used it or a competitor product. The claim is 
that they do not have competitors!

Thanks
