-------- Original message --------
From: Timur Shenkao <t...@timshenkao.su>
Date: 12/11/2016 09:17 (GMT-08:00)
To: Mich Talebzadeh <mich.talebza...@gmail.com>, user@spark.apache.org
Subject: Re: Possible DR solution
Hi guys!
1) Though it's quite interesting, I believe that this discussion is not about
Spark :)
2) If you are interested, there is a solution from Cloudera:
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_bdr_replication_intro.html
(it requires that the *source cluster* has a Cloudera Enterprise license, so it's not
free).
Correct me if I'm wrong, but I don't remember a specialized replication solution from
Hortonworks (Atlas, Falcon, etc. are not precisely about inter-cluster replication).
Some projects in the Hadoop ecosystem try to implement replication of their
own:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html
3) Read this discussion
https://community.hortonworks.com/questions/29645/hdfs-replication-for-dr.html
4) I prefer bash scripts / Python scripts / Oozie jobs + distcp: it's free
and I control precisely what's going on. But with huge clusters and
sophisticated logic, this approach becomes cumbersome.
5) Don't forget about security & encryption: your sensitive data may be read by
third-party agents during replication
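For point 4, a minimal sketch of what such a script could look like in Python. The NameNode addresses, the path, and the defaults for maps and bandwidth are made up for illustration; -update, -p, -delete, -m and -bandwidth are standard distcp options. It only prints the command unless dry_run is switched off.

```python
import shlex
import subprocess


def build_distcp_command(source, target, max_maps=20, bandwidth_mb=None,
                         delete_missing=False):
    """Return the argument list for an incremental `hadoop distcp` run."""
    # -update copies only files that are missing or differ on the target;
    # -p preserves file attributes such as ownership and permissions.
    cmd = ["hadoop", "distcp", "-update", "-p"]
    if delete_missing:
        # -delete removes target files that no longer exist on the source.
        cmd.append("-delete")
    # -m caps the number of simultaneous copy tasks (maps).
    cmd += ["-m", str(max_maps)]
    if bandwidth_mb is not None:
        # -bandwidth limits each map to this many MB per second.
        cmd += ["-bandwidth", str(bandwidth_mb)]
    cmd += [source, target]
    return cmd


def run_distcp(cmd, dry_run=True):
    """Print the command (dry run) or execute it; return the exit code."""
    if dry_run:
        print(shlex.join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical cluster addresses and path, for illustration only.
    command = build_distcp_command(
        "hdfs://primary-nn:8020/data/warehouse",
        "hdfs://dr-nn:8020/data/warehouse",
        bandwidth_mb=50,
    )
    run_distcp(command)
```

Scheduling is left to cron or Oozie. For point 5, the traffic between the clusters still has to be protected separately, since this sketch does nothing about encryption.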
On Sat, Nov 12, 2016 at 6:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
Thanks Jorn.
The way WanDisco promotes itself is that it does block-level replication. As I
understand it, you modify core-site.xml and add a couple of network server locations
there. They call this tool Fusion. There are at least 2 Fusion servers for high
availability, and each one, among other things, has a database of its own. Once the
client interacts with HDFS, the Fusion server behaves like a sniffer with its
own port. As soon as the first HDFS block of 256MB (out of, say, a 30GB file) is
written, it starts sending that block to the recipient. The laws of physics, the
pipeline size etc. apply here. That is up to the consumer; it can handle 10 files at
the same time, etc. So that is all. It is a known technology, now labeled as
streaming. In summary, it does not have to wait for the full file to be
written to HDFS before replicating blocks. That is where it scores.
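To make the difference concrete, here is a rough back-of-the-envelope model in Python. It is not how Fusion is implemented, only an illustration of the timing argument. The write and WAN throughput figures are invented; the 30GB file and 256MB block come from the example above.

```python
import math


def replication_finish_times(file_mb, block_mb, write_mb_s, wan_mb_s):
    """Return (batch_finish_s, streaming_finish_s) for a single file.

    batch: the copy starts only once the whole file is on HDFS.
    streaming: each block is sent as soon as it is written and the
    WAN link is free (one block in flight at a time).
    """
    n_blocks = math.ceil(file_mb / block_mb)
    written_at = 0.0    # time at which the current block is fully written
    link_free_at = 0.0  # time at which the WAN link becomes idle again
    remaining = file_mb
    for _ in range(n_blocks):
        size = min(block_mb, remaining)
        remaining -= size
        written_at += size / write_mb_s
        start = max(written_at, link_free_at)
        link_free_at = start + size / wan_mb_s
    batch_finish = file_mb / write_mb_s + file_mb / wan_mb_s
    return batch_finish, link_free_at


if __name__ == "__main__":
    # Assumed throughputs: 200 MB/s local write, 100 MB/s across the WAN.
    batch, streaming = replication_finish_times(30 * 1024, 256, 200, 100)
    print(f"batch copy finished after     {batch:.1f} s")
    print(f"streaming copy finished after {streaming:.1f} s")
```

In this model the streaming copy is ahead by roughly the time it takes to write the file locally, and the WAN link remains the limit either way.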
It helps WAN work. Say the primary/active HDFS is in London and the replica
is in Singapore. Users in Singapore can then see the replicated data (eventually)
when it gets there. It can obviously be used for DR, in which case it is like a hot
standby (borrowing terminology from Sybase). In contrast, one can do the same
with periodic loads using homemade tools or tools like BDR from Cloudera.
I mentioned that Hive is going to have its metastore on HBase as well, and that
can be a potential problem. The site is here
They claim there are no competitors in the market for their streaming HA
product.
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction
of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.
On 12 November 2016 at 11:17, Jörn Franke <jornfra...@gmail.com> wrote:
What is wrong with good old batch transfer for moving data from one
cluster to another? I assume your use case is only business continuity in case
of disasters such as data center loss, which are unlikely to happen (which
does not mean they never happen), and where you could afford to lose one day
(or hour) of data (depends!).
Nevertheless, I assume he refers to the Hadoop storage policies:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
but this still only works within the same cluster.
You could also develop a custom secondary file system, similar to the Ignite
cache file system, that sits on top of HDFS and, as soon as it receives data,
sends it to another cluster and also passes it on to HDFS. I don't know WanDisco,
but I assume that is what it does. Given the prices (and the fact that clusters tend
to grow), you may want to evaluate whether buying or building makes sense. In any
case, it also requires evaluating network throughput, because this may become the
bottleneck somewhere (either within the cluster or, more likely, between data
centers).
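A quick way to sanity-check the throughput question for a batch approach is to estimate how long one day's data takes over the inter-data-center link. The sketch below is only arithmetic; the 70% link efficiency default and the numbers in the example call are assumptions, not measurements.

```python
def transfer_hours(data_gb, link_gbit_s, efficiency=0.7):
    """Hours needed to move data_gb gigabytes over a link_gbit_s link.

    efficiency is the assumed usable fraction of the nominal bandwidth
    (protocol overhead, other traffic sharing the link).
    """
    if link_gbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    usable_gbyte_s = link_gbit_s * efficiency / 8  # gigabits -> gigabytes
    return data_gb / usable_gbyte_s / 3600


def fits_window(data_gb, link_gbit_s, window_hours, efficiency=0.7):
    """True if the transfer completes within the batch window."""
    return transfer_hours(data_gb, link_gbit_s, efficiency) <= window_hours


if __name__ == "__main__":
    # Hypothetical figures: 5 TB of new data per day over a 1 Gbit/s link.
    hours = transfer_hours(5000, 1)
    print(f"{hours:.1f} h needed; fits a 6 h window: {fits_window(5000, 1, 6)}")
```

If the daily delta does not fit the window, a periodic copy falls further behind each day, and that is when the buy-or-build question becomes pressing.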
As you mentioned, HBase & Co. may require special consideration for the case
where data is in memory and not yet persisted.
On Sat, Nov 12, 2016 at 12:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
Thanks Vince.
Can you provide more details on this, please?
Dr Mich Talebzadeh
On 12 November 2016 at 09:52, vincent gromakowski
<vincent.gromakow...@gmail.com> wrote:
An HDFS tiering policy with good tags should be similar.
On 11 Nov 2016 11:19 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
I really don't see why one would want to set up streaming replication, except in
situations where functionality similar to transactional databases is required
in big data.
Dr Mich Talebzadeh
On 11 November 2016 at 17:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
I think it differs in that it starts streaming data through its own port as soon as
the first block has landed. So the granularity is a block.
However, think of it as Oracle GoldenGate replication or SAP replication for
databases. The only difference is that if there is corruption in a block, with HDFS
it will be replicated, much like SRDF,
whereas with Oracle or SAP it is log-based replication, which stops when it
encounters corruption.
Replication depends on the block, so it can replicate Hive metadata, the fsimage
etc., but it cannot replicate the HBase memstore if HBase crashes.
So that is the gist of it: streaming replication as opposed to snapshots.
Sounds familiar? Think of it as log shipping in the old Oracle days versus
GoldenGate etc.
hth
Dr Mich Talebzadeh
On 11 November 2016 at 17:14, Deepak Sharma <deepakmc...@gmail.com> wrote:
The reason being that you can set up HDFS duplication to some other cluster on your own.
On Nov 11, 2016 22:42, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
Reason being?
Dr Mich Talebzadeh
On 11 November 2016 at 17:11, Deepak Sharma <deepakmc...@gmail.com> wrote:
This is a waste of money, I guess.
On Nov 11, 2016 22:41, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
It starts at $4,000 per node per year, all inclusive.
With a discount it can be halved, but the price is per node, so if you have
5 nodes in primary and 5 nodes in DR we are talking about $40K already.
HTH
Dr Mich Talebzadeh
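The arithmetic behind that figure, as a small Python sketch. The $4,000 list price and the possible halving come from the message above; treating every node in both clusters as licensed is how the $40K was reached, and the function name is just for illustration.

```python
def annual_licence_cost(primary_nodes, dr_nodes, price_per_node=4000, discount=0.0):
    """Yearly cost when every node in both clusters needs a licence.

    discount is a fraction, e.g. 0.5 for the "halved" case.
    """
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    nodes = primary_nodes + dr_nodes
    return nodes * price_per_node * (1 - discount)


if __name__ == "__main__":
    print(annual_licence_cost(5, 5))                # list price
    print(annual_licence_cost(5, 5, discount=0.5))  # with the price halved
```

The cost grows linearly with the node count, so it is worth rerunning the numbers with the cluster size expected in a year or two.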
On 11 November 2016 at 16:43, Mudit Kumar <mkumar...@sapient.com> wrote:
Is it feasible cost-wise?
Thanks,
Mudit
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, November 11, 2016 2:56 PM
To: user @spark
Subject: Possible DR solution
Hi,
Has anyone had experience of using
WanDisco block replication to create a fault-tolerant DR solution in Hadoop?
The product claims that it starts replicating as soon as the first data block
lands on HDFS: it takes the block and sends it to the DR/replica site. The idea
is that this is faster than doing it through traditional HDFS copy tools, which are
normally batch-oriented.
It also claims to replicate Hive metadata.
I wanted to gauge whether anyone has used it or a competitor product. The claim is
that they do not have competitors!
Thanks
Dr Mich Talebzadeh