Good points, sir, especially the second one. How would the splits be generated?
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Tue, Feb 19, 2013 at 11:04 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> I don't know of any input format that will do this out of the box. But it
> should not be that hard to write one. There are two big issues here.
>
> 1. The data you are reading from the API really needs to be static, or
>    you could get some very odd inconsistencies. For example, a node dies
>    after a map task has finished and not all of the reducers got the data,
>    so the map task is rerun and some of the reducers have some old data,
>    and some of the reducers have new data. This is the main reason to
>    download the data before processing it. You can work around this by
>    using the input format to run a map-only job that then writes the data
>    out to a file before processing it the rest of the way.
> 2. You need a good way to partition the data from the API. This can
>    be difficult unless the REST API provides a logical way to split this up.
>
> --Bobby
>
> From: Yaron Gonen <yaron.go...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, February 19, 2013 4:49 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: InputFormat for some REST api
>
> Hi,
> Do you know of any InputFormat implemented for some REST api provider?
> Usually when one needs to process data that is accessible only by REST,
> one should try to download the data first somehow, but what if you cannot
> download it?
>
> thanks
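
On the second point quoted above (partitioning the data), here is a rough sketch of how splits might be planned. It assumes the REST API can report a total record count and supports offset/limit paging, which the thread does not confirm. `RestSplitPlanner`, `RangeSplit`, and the `offset`/`limit` query parameters are hypothetical names used only for illustration, and the sketch uses plain Java with no Hadoop classes so it can be run on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: divides a known number of records into contiguous ranges.
final class RestSplitPlanner {

    /** One contiguous range of records, playing the role of an InputSplit. */
    static final class RangeSplit {
        final long offset;
        final long length;

        RangeSplit(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        /** Assumed paging parameters; a real API may name these differently. */
        String toQuery() {
            return "?offset=" + offset + "&limit=" + length;
        }
    }

    /**
     * Splits totalRecords into at most desiredSplits ranges whose sizes
     * differ by at most one record. Empty ranges are never emitted.
     */
    static List<RangeSplit> planSplits(long totalRecords, int desiredSplits) {
        if (totalRecords < 0 || desiredSplits <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and desiredSplits > 0");
        }
        List<RangeSplit> splits = new ArrayList<>();
        long base = totalRecords / desiredSplits;
        long remainder = totalRecords % desiredSplits;
        long offset = 0;
        for (int i = 0; i < desiredSplits; i++) {
            // The first `remainder` splits each take one extra record.
            long length = base + (i < remainder ? 1 : 0);
            if (length == 0) {
                break; // every later split would be empty as well
            }
            splits.add(new RangeSplit(offset, length));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the query each map task would issue for 10 records in 3 splits.
        for (RangeSplit split : planSplits(10, 3)) {
            System.out.println(split.toQuery());
        }
    }
}
```

In an actual job, the same arithmetic would presumably live in a custom `InputFormat.getSplits(JobContext)`, with each range wrapped in an `InputSplit` and a `RecordReader` fetching only its own range. This only gives consistent results if the data behind the API stays static while the job runs, as the first point warns.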