Re: Data cleansing in modern data architecture

2014-08-24 Thread Peyman Mohajerian
> Hi Bob, the answer to your original question depends entirely on the procedures and conventions set forth for your data warehouse. So only you can answer it. If you're asking for best practices, it still ...

Re: Data cleansing in modern data architecture

2014-08-18 Thread Jens Scheidtmann
Hi Bob, the answer to your original question depends entirely on the procedures and conventions set forth for your data warehouse. So only you can answer it. If you're asking for best practices, it still depends:
- How large are your files?
- Have you enough free space for recoding?
- Are you ...

Re: Data cleansing in modern data architecture

2014-08-10 Thread Adaryl Bob Wakefield, MBA
> While I may not have enough context on your entire processing pipeline, here are my thoughts. 1. It's always useful to have raw data, irrespective of whether it was right or wrong ...

Re: Data cleansing in modern data architecture

2014-08-10 Thread Sriram Ramachandrasekaran
While I may not have enough context on your entire processing pipeline, here are my thoughts. 1. It's always useful to have raw data, irrespective ...
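
For illustration only (this layout is hypothetical, not something spelled out in the thread): acting on that advice usually means keeping an immutable raw zone and deriving a cleansed zone from it, so the original records stay available for reprocessing:

  /data/raw/events/dt=2014-08-09/    ingested as-is, never edited
  /data/clean/events/dt=2014-08-09/  rebuilt from raw by the cleansing jobs

Cleansing jobs read from raw and write to clean; nothing that was ingested is ever modified.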

Re: Data cleansing in modern data architecture

2014-08-10 Thread Bertrand Dechoux
On Saturday, August 09, 2014, Sriram Ramachandrasekaran wrote: > While I may not have enough context on your entire processing pipeline ...

Re: Data cleansing in modern data architecture

2014-08-10 Thread Adaryl Bob Wakefield, MBA
> Well, keeping bad data has its use too. I assume you know about temporal databases. Back to your use case, if you only need to remove a few records from HDFS files, the easiest might be during the reading ...
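
To make the read-time option concrete, here is a minimal sketch (mine, not from the thread) of a Hadoop mapper that skips bad records as it reads, so no file ever has to be edited in place. The isBad() rule is a placeholder; real validation rules depend on your data.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Drops bad records at read time instead of rewriting files on HDFS.
public class FilterBadRecordsMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit only records that pass validation; bad rows are skipped,
    // so the cleansing happens on read, not by editing files.
    if (!isBad(value.toString())) {
      context.write(NullWritable.get(), value);
    }
  }

  private boolean isBad(String line) {
    // Placeholder rule: empty lines or too few delimited fields.
    return line.isEmpty() || line.split(",").length < 3;
  }
}

The same filter-on-read idea can be expressed as a WHERE clause in Hive or Pig over the raw files.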

Re: Data cleansing in modern data architecture

2014-08-09 Thread Adaryl Bob Wakefield, MBA
On Sunday, July 20, 2014, Shahab Yunus wrote: > I am assuming you meant the batch jobs that are/were used in the old world for data cleansing. As far as I ...

Re: Data cleansing in modern data architecture

2014-08-09 Thread Adaryl Bob Wakefield, MBA
On Saturday, August 09, 2014, Adaryl Bob Wakefield, MBA wrote: > Answer: No, we can't get rid of bad records. We have to go back ...

Re: Data cleansing in modern data architecture

2014-08-09 Thread Sriram Ramachandrasekaran
> Answer: No, we can't get rid of bad records. We have to go back and rebuild the entire file. We can't edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn't that great of an idea. It sounds like it would ...
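
Since HDFS files are write-once, "rebuild the entire file" in practice means: stream the old file, drop the bad records, write a new file, and swap it in. A rough sketch using the FileSystem API (the paths and the bad-record test are hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rewrites an HDFS file without its bad records, then swaps it in.
public class RebuildFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path original = new Path("/data/events/part-00000");     // hypothetical
    Path rebuilt  = new Path("/data/events/part-00000.tmp");

    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(original)));
         PrintWriter out = new PrintWriter(
             new OutputStreamWriter(fs.create(rebuilt, true)))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.isEmpty()) {        // placeholder bad-record test
          out.println(line);
        }
      }
    }

    // Swap: drop the old file and move the rebuilt one into its place.
    fs.delete(original, false);
    fs.rename(rebuilt, original);
  }
}

Note the swap is not atomic: a reader that opens the path between the delete and the rename will fail, which is part of why filtering bad records out at read time is often the simpler option.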

Data cleansing in modern data architecture

2014-07-20 Thread Adaryl Bob Wakefield, MBA
In the old world, data cleaning used to be a large part of the data warehouse load. Now that we're working in a schemaless environment, I'm not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but transactional systems that ...

Re: Data cleansing in modern data architecture

2014-07-20 Thread Shahab Yunus
I am assuming you meant the batch jobs that are/were used in the old world for data cleansing. As far as I understand, there is no hard and fast rule for it; it depends on the functional and system requirements of the use case. It is also dependent on the technology being used and how it manages ...