I’m sorry, but I have to revisit this. Going through the reply below, I 
realized that I didn’t quite get my question answered. Let me be more explicit 
about the scenario.

There is a bug in the transactional system.
The data gets written to HDFS where it winds up in Hive.
Somebody notices that their report is off/the numbers don’t look right.
We investigate and find the bug in the transactional system.

Question: Can we then go back into HDFS and rid ourselves of the bad records? 
If not, what is the recommended course of action?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 4:20 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I am assuming you mean the batch jobs that are/were used in the old world for 
data cleansing. 

As far as I understand, there is no hard and fast rule for it; it depends on 
the functional and system requirements of the use case. 

It is also dependent on the technology being used and how it manages 'deletion'.

E.g. in HBase or Cassandra, you can write batch jobs which clean, correct or 
remove unwanted or incorrect data, and then the underlying stores usually have 
a concept of compaction, which not only defragments data files but also, at 
that point, removes from disk all the entries marked as deleted.
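
To make the mark-then-compact idea concrete, here is a minimal sketch in 
Python. It is a toy in-memory model only, not the HBase or Cassandra API; the 
class ToyStore and all of its methods are hypothetical names made up for 
illustration.

```python
# Toy model of tombstone-based deletion followed by compaction.
# This is NOT the HBase or Cassandra API; every name here is hypothetical.

class ToyStore:
    def __init__(self):
        # Append-only log of (key, value, is_tombstone) entries.
        self.log = []

    def put(self, key, value):
        self.log.append((key, value, False))

    def delete(self, key):
        # A delete only appends a marker; nothing is removed from the log yet.
        self.log.append((key, None, True))

    def get(self, key):
        # The most recent entry for a key wins.
        for k, value, is_tombstone in reversed(self.log):
            if k == key:
                return None if is_tombstone else value
        return None

    def compact(self):
        # Keep only the latest entry per key, and drop keys whose
        # latest entry is a tombstone.
        latest = {}
        for k, value, is_tombstone in self.log:
            latest[k] = (value, is_tombstone)
        self.log = [(k, v, False) for k, (v, dead) in latest.items() if not dead]


store = ToyStore()
store.put("order-1", 100)
store.put("order-2", -999)   # bad record produced by the buggy source system
store.delete("order-2")      # the cleansing batch job marks it as deleted
entries_before = len(store.log)   # bad record and its marker are still stored
store.compact()
entries_after = len(store.log)    # only the surviving record remains
```

In this model the bad record stops being readable as soon as it is marked, but 
it keeps occupying space until compact() runs.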

But there are considerations to be aware of, given that compaction is a heavy 
process and in some cases (e.g. Cassandra) there can be problems when there is 
too much data to be removed. Not only that, in some cases data that is marked 
for deletion can, until it is actually compacted away, slow down normal 
operations of the data store as well.

In HBase's case, one can also leverage the versioning mechanism: the 
aforementioned batch job can simply overwrite the same row key, and the 
previous version would no longer be the latest. If the max-versions parameter 
is configured as 1, then no previous version would be maintained (physically it 
would still exist until removed at compaction time, but it would not be 
queryable).
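
A rough sketch of that overwrite behaviour, again as a toy Python model rather 
than the HBase client API (VersionedCell, max_versions and the method names are 
assumptions of mine):

```python
# Toy model of a versioned cell with a max-versions limit.
# Not the HBase client API; all names are hypothetical.

class VersionedCell:
    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        # (timestamp, value) pairs, physically retained until compaction.
        self.versions = []

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))

    def _newest_first(self):
        return sorted(self.versions, key=lambda tv: tv[0], reverse=True)

    def read(self):
        # Only the newest max_versions entries are visible to queries.
        return [value for _, value in self._newest_first()[: self.max_versions]]

    def compact(self):
        # Compaction physically drops versions beyond the limit.
        self.versions = self._newest_first()[: self.max_versions]


cell = VersionedCell(max_versions=1)
cell.put(1, "bad value")        # written by the buggy source system
cell.put(2, "corrected value")  # written later by the cleansing batch job
visible = cell.read()
stored_before = len(cell.versions)  # old version is hidden but still stored
cell.compact()
stored_after = len(cell.versions)
```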

In the end, cleansing can be done either before or after loading, but given 
the append-only, no-hard-delete design of most NoSQL stores, I would say it is 
easier to do the cleaning before the data is loaded into the NoSQL store. Of 
course, it bears repeating that it depends on the use case.
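
As a sketch of what cleansing before the load could look like, assuming records 
arrive as Python dicts (the field names "id" and "amount" and the validation 
rules are made up for illustration, not taken from any real system):

```python
# Pre-load cleansing sketch: split incoming records into clean and rejected
# sets before anything is written to the store. Rules are hypothetical.

def is_valid(record):
    # Hypothetical rules: a non-empty id and a non-negative numeric amount.
    amount = record.get("amount")
    return (
        bool(record.get("id"))
        and isinstance(amount, (int, float))
        and amount >= 0
    )


def split_records(records):
    clean, rejected = [], []
    for record in records:
        (clean if is_valid(record) else rejected).append(record)
    return clean, rejected


incoming = [
    {"id": "a1", "amount": 25.0},
    {"id": "", "amount": 10.0},     # missing id
    {"id": "a3", "amount": -4.0},   # negative amount
    {"id": "a4", "amount": "n/a"},  # non-numeric amount
]
clean, rejected = split_records(incoming)
```

Only the clean list would be loaded; the rejected list can be kept aside for 
investigation rather than silently dropped.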

Having said that, on a side note and a bit off-topic, it reminds me of the 
Lambda Architecture (LA), which combines batch and real-time computation for 
big data using various technologies. It uses the idea of constant periodic 
refreshes to reload the data, and within this periodic refresh the expectation 
is that any invalid older data will be corrected and overwritten by the new 
refresh load. Thus the 'batch part' of the LA basically takes care of data 
cleansing by reloading everything. But LA is mostly for those systems which are 
OK with eventually consistent behavior, and it might not be suitable for some 
systems.
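
A very small sketch of the reload-everything idea, assuming the batch view is 
just a per-customer total rebuilt from scratch on each refresh (the function 
refresh_batch_view and the record layout are hypothetical):

```python
# Sketch of a periodic full refresh: the view is rebuilt from the source on
# every run, so a correction at the source replaces the earlier bad result.

def refresh_batch_view(source_records):
    view = {}
    for record in source_records:
        customer = record["customer"]
        view[customer] = view.get(customer, 0) + record["amount"]
    return view


# First refresh: the source still contains a bad record.
source = [
    {"customer": "acme", "amount": 10},
    {"customer": "acme", "amount": -999},  # junk from the buggy system
]
view_before_fix = refresh_batch_view(source)

# The bad record is corrected at the source; the next refresh overwrites
# the whole view instead of patching it in place.
source[1] = {"customer": "acme", "amount": 5}
view_after_fix = refresh_batch_view(source)
```

Until the next refresh runs, readers still see the old totals, which is the 
eventually consistent behavior mentioned above.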

Regards,
Shahab



On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA 
<adaryl.wakefi...@hotmail.com> wrote:

  In the old world, data cleaning used to be a large part of the data warehouse 
load. Now that we’re working in a schemaless environment, I’m not sure where 
data cleansing is supposed to take place. NoSQL sounds fun because 
theoretically you just drop everything in but transactional systems that 
generate the data are still full of bugs and create junk data. 

  My question is, where does data cleaning/master data management/CDI belong in 
a modern data architecture? Before it hits Hadoop? After?

  B.
