Re: Data cleansing in modern data architecture

2014-08-09 Thread Sriram Ramachandrasekaran
While I may not have enough context on your entire processing pipeline, here are my thoughts. 1. It's always useful to keep the raw data, irrespective of whether it was right or wrong. The way to look at it is, it's the source of truth at timestamp t. 2. Note that you only know that the data at timestamp

Re: Data cleansing in modern data architecture

2014-08-09 Thread Adaryl "Bob" Wakefield, MBA
Or...as an alternative, since HBase uses HDFS to store its data, can we get around the no-editing-files rule by dropping structured data into HBase? That way, we have data in HDFS that can be deleted. Any real problem with that idea? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics 91
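A minimal sketch of the idea, using the HBase Java client (newer Connection/Table style); the table name and row key here are hypothetical and only illustrate that HBase gives row-level deletes even though the HFiles it keeps in HDFS are immutable:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteBadRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "transactions" and "txn-12345" are hypothetical names for this example.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("transactions"))) {
            // HBase records the delete as a tombstone and physically drops the
            // cells at the next major compaction, so the application sees a
            // mutable table even though the underlying HDFS files never change.
            Delete del = new Delete(Bytes.toBytes("txn-12345"));
            table.delete(del);
        }
    }
}
```

The trade-off is that the data now lives behind the HBase API rather than as plain files, so anything that expects to scan raw HDFS files (e.g. Hive external tables over directories) has to go through an HBase storage handler instead.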

Re: Data cleansing in modern data architecture

2014-08-09 Thread Adaryl "Bob" Wakefield, MBA
Answer: No, we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great an idea. It sounds like it would be more appropriate to cut a hadoop dat
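A minimal sketch of that "replace the whole file, not the record" approach, assuming the data is laid out with one directory per load date under the Hive warehouse (the paths and table name are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DropBadPartition {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical layout: one partition directory per load date.
        // HDFS cannot edit a file in place, but it can delete a whole
        // directory, so the unit of correction is the partition, not the record.
        Path badPartition = new Path("/user/hive/warehouse/transactions/load_date=2014-08-08");
        if (fs.exists(badPartition)) {
            fs.delete(badPartition, true); // recursive delete of the partition directory
        }
        // The corrected data is then re-ingested into the same path from the
        // upstream source of truth.
    }
}
```

This is also why day (or hour) partitioning is usually preferred over appending to one ever-growing file: the smaller the partition, the cheaper it is to throw away and rebuild when bad records are discovered.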

Top K words problem

2014-08-09 Thread Zhige Xin
I have a question about Hadoop: how do I modify the WordCount program to output the top K words by number of occurrences? The naive method is to count and then sort, but it needs too many lines of code and is not elegant. Another approach uses a data structure called TreeMap to solve this problem, wh
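One common pattern along those lines (sketched here with hypothetical class names) is to keep a TreeMap bounded at K entries inside the reducer and emit the survivors in cleanup(); with the standard WordCount mapper and a single reduce task, the output is the global top K:

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: assumes the standard WordCount mapper emitting
// (word, 1) pairs and a job configured with a single reduce task.
public class TopKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int K = 10;
    // Keyed by count; in this simple sketch, two words with the same count
    // overwrite each other, so ties are not fully handled.
    private final TreeMap<Integer, String> topK = new TreeMap<>();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        topK.put(sum, word.toString());
        if (topK.size() > K) {
            topK.remove(topK.firstKey()); // evict the current smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the retained words in descending order of count.
        for (Map.Entry<Integer, String> e : topK.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
        }
    }
}
```

The same bounded-TreeMap trick can also be applied in each mapper's cleanup() to cut down the data shuffled to the single reducer on large corpora.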

Can anyone help me resolve this Error: unable to create new native thread

2014-08-09 Thread Chris MacKenzie
Hi, I've scrabbled around looking for a fix for a while and have set the soft ulimit size to 13172. I'm using Hadoop 2.4.1. Thanks in advance, Chris MacKenzie telephone: 0131 332 6967 email: stu...@chrismackenziephotography.co.uk corporate: www.chrismackenziephotography.co.uk

Re: Data cleansing in modern data architecture

2014-08-09 Thread Adaryl "Bob" Wakefield, MBA
I’m sorry, but I have to revisit this again. Going through the reply below, I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario. There is a bug in the transactional system. The data gets written to HDFS, where it winds up in Hive. Somebody notices that