Re: Merging small files

2014-07-20 Thread Kilaru, Sambaiah
This is not the place to discuss the merits or demerits of MapR, but small files screw up very badly with MapR. Small files go into one container (to fill up 256MB or whatever the container size is), and with locality most of the mappers go to the same three datanodes. You should be looking into the SequenceFile format.
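
Not from the thread, but a minimal sketch of the SequenceFile suggestion, assuming the stock Hadoop Java API; the SmallFilePacker class name, paths, and key/value choices are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory full of small files
        Path outputFile = new Path(args[1]); // one large SequenceFile

        // key = original file name, value = raw bytes of the small file
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, outputFile, Text.class, BytesWritable.class);
        try {
          for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.isDirectory()) {
              continue;
            }
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try {
              in.readFully(contents);
            } finally {
              in.close();
            }
            writer.append(new Text(status.getPath().getName()),
                          new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }

Downstream MapReduce jobs can then read the packed file with SequenceFileInputFormat instead of opening thousands of tiny files.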

Re: Merging small files

2014-07-20 Thread Edward Capriolo
Don't have time to read the thread, but in case it has not been mentioned: Unleash filecrusher! https://github.com/edwardcapriolo/filecrush

Re: Merging small files

2014-07-20 Thread Adaryl Bob Wakefield, MBA
It isn’t? I don’t wanna hijack the thread or anything, but it seems to me that MapR is an implementation of Hadoop, and this is a great place to discuss its merits vis-à-vis the Hortonworks or Cloudera offering. A little bit more on topic: every single thing I read or watch about Hadoop says

Re: Merging small files

2014-07-20 Thread Shahab Yunus
As for why it isn't appropriate to discuss vendor-specific topics too much on a vendor-neutral Apache mailing list, check out this thread: http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E You can always

Re: Merging small files

2014-07-20 Thread Adaryl Bob Wakefield, MBA
“Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc.” Yeah good point. Adaryl Bob Wakefield, MBA

Re: Merging small files

2014-07-20 Thread Shashidhar Rao
Spring Batch is used to process the files, which come in EDI, CSV, and XML formats, and to store them into Oracle after processing, but this is for a very small division. Imagine invoices generated by roughly 5 million customers every week from all stores plus online purchases. Time to process such
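
As a rough illustration of the kind of Spring Batch setup described above (a sketch, not code from the thread), assuming Spring Batch 3.x-style Java configuration, an Invoice POJO, and an Oracle DataSource defined elsewhere; file, table, and column names are made up:

    import javax.sql.DataSource;

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
    import org.springframework.batch.item.file.mapping.DefaultLineMapper;
    import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.FileSystemResource;

    @Configuration
    @EnableBatchProcessing
    public class InvoiceLoadConfig {

      // Invoice is a hypothetical POJO with invoiceId, customerId, amount properties.
      // Reads one CSV invoice file; EDI/XML inputs would need their own readers.
      @Bean
      public FlatFileItemReader<Invoice> invoiceReader() {
        FlatFileItemReader<Invoice> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource("/data/incoming/invoices.csv"));
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames(new String[] {"invoiceId", "customerId", "amount"});
        BeanWrapperFieldSetMapper<Invoice> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Invoice.class);
        DefaultLineMapper<Invoice> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        reader.setLineMapper(lineMapper);
        return reader;
      }

      // Writes each chunk of invoices into Oracle with batched JDBC inserts.
      @Bean
      public JdbcBatchItemWriter<Invoice> invoiceWriter(DataSource oracleDataSource) {
        JdbcBatchItemWriter<Invoice> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(oracleDataSource);
        writer.setSql("INSERT INTO invoices (invoice_id, customer_id, amount) "
            + "VALUES (:invoiceId, :customerId, :amount)");
        writer.setItemSqlParameterSourceProvider(
            new BeanPropertyItemSqlParameterSourceProvider<Invoice>());
        return writer;
      }

      @Bean
      public Step loadStep(StepBuilderFactory steps, DataSource oracleDataSource) {
        return steps.get("loadInvoices")
            .<Invoice, Invoice>chunk(1000)
            .reader(invoiceReader())
            .writer(invoiceWriter(oracleDataSource))
            .build();
      }

      @Bean
      public Job invoiceJob(JobBuilderFactory jobs, Step loadStep) {
        return jobs.get("invoiceJob").start(loadStep).build();
      }
    }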

Re: Merging small files

2014-07-20 Thread Kilaru, Sambaiah
I had experience with MapR where small files are much worse. MapR can keep (only keep) small files better, agreed. But storing is not the answer: you want to run a job, and what happens? A container stores the files and the container gets replicated, which means one container (of 256MB or 128MB or whatever

Re: Merging small files

2014-07-20 Thread Adaryl Bob Wakefield, MBA
Yeah, I’m sorry, I’m not talking about processing the files in Oracle. I mean collect/store invoices in Oracle, then flush them in a batch to Hadoop. This is not real time, right? So you take your EDI, CSV, and XML from their sources, store them in Oracle, and once you have a decent size, flush them to
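
To make the "store in Oracle, then flush a batch to Hadoop" idea concrete, here is a hedged sketch (not from the thread; in practice a tool such as Sqoop would usually handle this step), assuming a JDBC connection to Oracle and the Hadoop FileSystem API, with placeholder connection details, query, and staging path:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InvoiceFlush {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; in a real setup these come from configuration.
        Connection db = DriverManager.getConnection(
            "jdbc:oracle:thin:@//oraclehost:1521/ORCL", "user", "password");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(
            new Path("/staging/invoices/invoices-" + System.currentTimeMillis() + ".csv"));

        Statement stmt = db.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT invoice_id, customer_id, amount FROM invoices WHERE flushed = 'N'");
        try {
          while (rs.next()) {
            // Write one CSV line per invoice row into the HDFS staging file.
            String line = rs.getLong("invoice_id") + "," + rs.getLong("customer_id")
                + "," + rs.getBigDecimal("amount") + "\n";
            out.write(line.getBytes("UTF-8"));
          }
        } finally {
          rs.close();
          stmt.close();
          out.close();
          db.close();
        }
      }
    }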

Data cleansing in modern data architecture

2014-07-20 Thread Adaryl Bob Wakefield, MBA
In the old world, data cleansing used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but transactional systems that

Re: Merging small files

2014-07-20 Thread Mark Kerzner
Bob, you don't have to wait for batch. Here is my project (under development) where I am using Storm for continuous file processing: https://github.com/markkerzner/3VEed Mark
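
Not Mark's actual code, but a minimal sketch of what a Storm file-processing topology can look like, assuming Storm's 0.9.x API; the drop directory, component names, and the naive delete-after-read behavior are illustrative only:

    import java.io.File;
    import java.nio.file.Files;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class FileTopology {

      // Spout: polls a drop directory and emits one tuple per new file.
      public static class FileSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
          this.collector = collector;
        }

        public void nextTuple() {
          File[] files = new File("/data/incoming").listFiles();
          if (files != null) {
            for (File f : files) {
              try {
                String contents = new String(Files.readAllBytes(f.toPath()), "UTF-8");
                collector.emit(new Values(f.getName(), contents));
                f.delete(); // naive "consumed" marker; a real spout would ack/retry
              } catch (Exception e) {
                // swallowed for brevity; real code should report the failure
              }
            }
          }
          try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("filename", "contents"));
        }
      }

      // Bolt: splits a file into records; EDI transaction sets would need real parsing.
      public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
          String filename = tuple.getStringByField("filename");
          for (String record : tuple.getStringByField("contents").split("\n")) {
            collector.emit(new Values(filename, record));
          }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("filename", "record"));
        }
      }

      public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("files", new FileSpout());
        builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("files");
        new LocalCluster().submitTopology("file-topology", new Config(), builder.createTopology());
      }
    }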

Re: Merging small files

2014-07-20 Thread Adaryl Bob Wakefield, MBA
That’s an interesting use case for Storm. Usually people talk about Storm in terms of processing things like Twitter streams or events like web logs. I've never seen it used for processing files, especially EDI files, which usually come in as groups of transactions instead of atomic events

Re: Data cleansing in modern data architecture

2014-07-20 Thread Shahab Yunus
I am assuming you meant the batch jobs that are/were used in the old world for data cleansing. As far as I understand, there is no hard and fast rule for it; it depends on the functional and system requirements of the use case. It is also dependent on the technology being used and how it manages

Webhdfs - namenoderpcaddress is not specified exception

2014-07-20 Thread Pham Phuong Tu
I am a new user of WebHDFS. My Hadoop version is 2.4 (Hortonworks 2.1). I am getting an exception when I run the command below: curl -i -L "http://192.168.1.115:50075/webhdfs/v1/data/test1.txt?op=OPEN" HTTP/1.1 400 Bad Request Cache-Control: no-cache Expires: Sat, 19 Jul 2014 03:50:58 GMT Date:
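
That particular 400 usually shows up when the request goes straight to a DataNode's HTTP port (50075) instead of the NameNode's (50070 by default): the NameNode adds the namenoderpcaddress parameter when it redirects the client to a DataNode. Not from the post, but a small sketch of reading the same file through the webhdfs:// FileSystem against the NameNode, assuming default ports:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class WebHdfsRead {
      public static void main(String[] args) throws Exception {
        // Point at the NameNode's HTTP port (50070 by default), not the DataNode's 50075.
        // The host below is the one from the original post and may differ in your cluster.
        FileSystem fs = FileSystem.get(
            URI.create("webhdfs://192.168.1.115:50070"), new Configuration());
        FSDataInputStream in = fs.open(new Path("/data/test1.txt"));
        try {
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          in.close();
        }
      }
    }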

Re: Replace a block with a new one

2014-07-20 Thread Zesheng Wu
Thanks for the reply, Arpit. Yes, we need to do this regularly. The original requirement is that we want to do RAID (based on Reed-Solomon erasure codes) on our HDFS cluster. When a block is corrupted or missing, the degraded read needs quick recovery of the block. We are considering how
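
As a rough illustration of the trade-off being described (not from the thread): with a (6, 3) Reed-Solomon code, each stripe of 6 data blocks carries 3 parity blocks, and any 6 of the 9 blocks are enough to reconstruct the rest, so a degraded read can rebuild one lost block from the survivors while the storage overhead is 1.5x instead of the 3x of triple replication.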