Planning for unexpected outages with HBase is a very good idea. At a minimum, there will likely be points in time when you want to change HBase configuration, apply some patched jars, etc. A staging area that can buffer data for later processing, rather than dropping it on the floor, makes this process much easier.

Apache Kafka is just one tool that can help with implementing such a staging area -- Apache NiFi is another you might want to look at. I'll avoid making any suggestions as to how you should do it because I don't know your requirements (nor do I really care to *wink*). There are lots of tools in this space; you'll need to evaluate them against your own requirements to decide what would work best. On the Kafka replay question, see the second sketch below the quoted message.
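Purely for illustration, here is a minimal sketch of the decoupling a Kafka-backed staging area buys you. Everything in it is an assumption, not a recommendation: the topic name staging.events, the Phoenix table EVENTS(ID VARCHAR PRIMARY KEY, PAYLOAD VARCHAR), and the hostnames are all made up. The point is that offsets are committed only after a successful upsert, so you can stop the consumer while HBase is down for patching and events simply accumulate in the topic.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StagingDrain {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");  // assumption: broker address
        props.put("group.id", "edw-staging-drain");    // assumption: consumer group name
        props.put("enable.auto.commit", "false");      // commit offsets only after a successful upsert
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             // Phoenix thick-driver JDBC URL points at the ZooKeeper quorum (hostname made up)
             Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host")) {
            consumer.subscribe(Collections.singletonList("staging.events"));
            PreparedStatement upsert =
                    conn.prepareStatement("UPSERT INTO EVENTS (ID, PAYLOAD) VALUES (?, ?)");
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    upsert.setString(1, r.key());
                    upsert.setString(2, r.value());
                    upsert.executeUpdate();
                }
                conn.commit();         // Phoenix buffers mutations client-side until commit
                consumer.commitSync(); // offsets advance only after the data is in HBase
            }
        }
    }
}

Again, sketch only -- NiFi gives you similar buffering with a very different operational model.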

Ash N wrote:

Hello,

We are building an enterprise data warehouse (EDW) on Phoenix (HBase).
Please refer to the attached diagram.

The EDW supports a unified architecture that serves both streaming and
batch use cases.

I am recommending a staging area that is source-compliant (i.e., one that
mimics the source structure). In the EDW path, data is always loaded into
staging first and then moved into the EDW.

Folks don't like the idea because of the additional hop. They say the hop
is unnecessary and will cause latency issues.

I am saying the latency can be handled in two ways:

1. The caching layer will take care of it.
2. If the system is designed properly, latency is a function of hardware.

What are your thoughts?

One other question: is Kafka required at all?
It was introduced in the architecture so that we can replay messages in
case of Kinesis connectivity issues.
Is there a better way to do it?

Help, as always, is appreciated.


[attached: architecture diagram]



thanks,
-ash
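
Coming back to the replay question above: if the messages already land in a Kafka topic, replay usually doesn't need extra machinery -- a consumer can just rewind its offsets. A minimal sketch, reusing the hypothetical staging.events topic from the first example (the fresh group.id and the hostname are, again, assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayFromStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");  // assumption: broker address
        // A fresh group.id has no stored offsets, so auto.offset.reset applies.
        props.put("group.id", "edw-replay-" + System.currentTimeMillis());
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("staging.events"));
            consumer.poll(Duration.ofSeconds(1));            // join the group, get partition assignments
            consumer.seekToBeginning(consumer.assignment()); // explicit rewind of every assigned partition
            // ...then consume and re-process exactly as in the drain loop above.
        }
    }
}

Whether that beats what you have with Kinesis depends, once more, on your requirements -- note that the topic's retention settings bound how far back you can replay.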



