This is not the place to discuss the merits or demerits of MapR, but small files
behave very badly with MapR.
Small files go into one container (filling up 256 MB, or whatever the container
size is), and with locality most
of the mappers go to just three datanodes.
You should be looking into sequence file format.
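The point of a sequence file here is that many small records get packed as key/value pairs inside one large file, so HDFS sees a few big blocks instead of thousands of tiny ones. A minimal sketch of the idea in Python (this is not the actual Hadoop SequenceFile binary format, which you would write via the Hadoop Java API; it only illustrates the filename-as-key, contents-as-value packing):

```python
import io
import struct

def pack(records):
    """Pack (name, payload) pairs into one buffer: the core idea behind
    sequence files -- many small files become one big file."""
    buf = io.BytesIO()
    for name, payload in records:
        key = name.encode()
        # length-prefixed key and value, big-endian unsigned ints
        buf.write(struct.pack(">II", len(key), len(payload)))
        buf.write(key)
        buf.write(payload)
    return buf.getvalue()

def unpack(blob):
    """Stream (name, payload) pairs back out of the packed buffer."""
    view, pos = memoryview(blob), 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from(">II", view, pos)
        pos += 8
        name = bytes(view[pos:pos + klen]).decode()
        pos += klen
        payload = bytes(view[pos:pos + vlen])
        pos += vlen
        yield name, payload

small_files = [("a.txt", b"alpha"), ("b.txt", b"beta")]
blob = pack(small_files)
```

A mapper then streams records out of the one big file instead of opening thousands of tiny ones.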
Don't have time to read the thread, but in case it has not been mentioned:
Unleash filecrusher!
https://github.com/edwardcapriolo/filecrush
On Sun, Jul 20, 2014 at 4:47 AM, Kilaru, Sambaiah
sambaiah_kil...@intuit.com wrote:
This is not place to discuss merits or demerits of MapR, Small
It isn’t? I don’t want to hijack the thread or anything, but it seems to me that
MapR is an implementation of Hadoop and this is a great place to discuss its
merits vis-à-vis the Hortonworks or Cloudera offerings.
A little bit more on topic: Every single thing I read or watch about Hadoop
says
Why isn't it appropriate to discuss vendor-specific topics too much on a
vendor-neutral Apache mailing list? Check out this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E
You can always
“Even if we kept the discussion to the mailing list's technical Hadoop usage
focus, any company/organization looking to use a distro is going to have to
consider the costs, support, platform, partner ecosystem, market share, company
strategy, etc.”
Yeah good point.
Adaryl Bob Wakefield, MBA
Spring Batch is used to process the files, which come in EDI, CSV, and XML
formats, and store them into Oracle after processing, but this is for a very
small division. Imagine invoices generated roughly by 5 million customers
every week from all stores plus from online purchases. Time to process
such
I had experience with MapR where small files are much worse. Agreed, MapR can
keep (only keep) small files better. Storing is not the answer;
you want to run the job, and what happens then?
A container stores files and the container gets replicated, which means one
container (of 256 MB or 128 MB, or whatever
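To put rough numbers on that locality problem: if many small files are packed into a few containers and each container is replicated to a fixed number of nodes, only those nodes hold local copies, however large the cluster is. A back-of-the-envelope sketch (the sizes and replication factor are assumptions for illustration, not MapR internals):

```python
import math

def datanodes_serving(files, file_mb, container_mb=256, replication=3):
    """How many containers the files need, and how many datanodes hold
    local copies (each container lives on `replication` nodes)."""
    containers = math.ceil(files * file_mb / container_mb)
    return containers, containers * replication

# 1,000 one-megabyte files fit in 4 containers, so at most 12 datanodes
# hold any of the data -- the rest of the cluster sits idle for local reads.
containers, nodes = datanodes_serving(files=1000, file_mb=1)
```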
Yeah, I’m sorry, I’m not talking about processing the files in Oracle. I mean
collect/store the invoices in Oracle, then flush them in a batch to Hadoop. This is
not real time, right? So you take your EDI, CSV, and XML from their sources. Store
them in Oracle. Once you have a decent size, flush them to
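That staging pattern — accumulate small records in the relational store, then flush one large batch to Hadoop once it reaches a useful size — can be sketched like this (the threshold and the in-memory buffer are assumptions standing in for the Oracle staging tables; a real flush would be a Sqoop export or an HDFS write):

```python
class InvoiceStager:
    """Buffer incoming invoices and flush them as one large batch once
    the buffer is big enough to be worth a single large HDFS file."""

    def __init__(self, flush_bytes=128 * 1024 * 1024):
        self.flush_bytes = flush_bytes
        self.buffer = []
        self.size = 0
        self.batches = []          # stands in for files landed on HDFS

    def add(self, invoice: bytes):
        self.buffer.append(invoice)
        self.size += len(invoice)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.batches.append(b"\n".join(self.buffer))
            self.buffer, self.size = [], 0

# Tiny threshold just to show the batching behaviour.
stager = InvoiceStager(flush_bytes=32)
for i in range(10):
    stager.add(b"invoice-%d" % i)
stager.flush()                     # flush the tail at end of the window
```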
In the old world, data cleaning used to be a large part of the data warehouse
load. Now that we’re working in a schemaless environment, I’m not sure where
data cleansing is supposed to take place. NoSQL sounds fun because
theoretically you just drop everything in but transactional systems that
Bob,
you don't have to wait for batch. Here is my project (under development)
where I am using Storm for continuous file processing,
https://github.com/markkerzner/3VEed
Mark
On Sun, Jul 20, 2014 at 1:31 PM, Adaryl Bob Wakefield, MBA
adaryl.wakefi...@hotmail.com wrote:
Yeah I’m sorry I’m
That’s an interesting use case for Storm. Usually people talk about Storm in
terms of processing things like Twitter or events like web logs. I've never seen it
used for processing files, especially EDI files, where they usually come in as
groups of transactions instead of atomic events.
I am assuming you meant the batch jobs that are/were used in the old world for
data cleansing.
As far as I understand there is no hard and fast rule for it; it depends on the
functional and system requirements of the use case.
It is also dependent on the technology being used and how it manages
I am a new user of WebHDFS; my Hadoop version is 2.4 (Hortonworks 2.1). I
am getting an exception when running the command below:
curl -i -L "http://192.168.1.115:50075/webhdfs/v1/data/test1.txt?op=OPEN"
HTTP/1.1 400 Bad Request
Cache-Control: no-cache
Expires: Sat, 19 Jul 2014 03:50:58 GMT
Date:
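One likely cause of the 400 above is the port: OPEN requests normally go to the NameNode's WebHDFS port (50070 by default in Hadoop 2.x), which answers with a redirect carrying extra parameters to a DataNode; hitting a DataNode's 50075 directly without those parameters can be rejected. A sketch of building the URL (this is a probable cause, not a confirmed diagnosis; host and path are taken from the message):

```python
from urllib.parse import urlencode

def webhdfs_open_url(host, path, port=50070):
    """Build a WebHDFS OPEN URL against the NameNode; `curl -i -L`
    then follows the redirect to a DataNode."""
    query = urlencode({"op": "OPEN"})
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)

url = webhdfs_open_url("192.168.1.115", "/data/test1.txt")
# Then run: curl -i -L "<url>" -- with plain ASCII double quotes; smart
# quotes pasted from an email client also break curl requests.
```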
Thanks for the reply, Arpit.
Yes, we need to do this regularly. The original requirement is that
we want to do RAID (which is based on Reed–Solomon erasure codes) on our HDFS
cluster. When a block is corrupted or missing, the degraded read needs
quick recovery of the block. We are considering how
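The recovery step being described — rebuilding a lost block from the surviving blocks plus parity — is easiest to see with single-parity XOR, the simplest degenerate case of an erasure code (real HDFS-RAID uses full Reed–Solomon with several parity blocks; this sketch only handles one erasure):

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)            # stored alongside the data blocks

# Block 1 is lost: XOR of everything that survives rebuilds it, which is
# exactly the "degraded read" path -- reconstruct on the fly, serve the read.
recovered = xor_blocks([data[0], data[2], parity])
```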