Thanks Sandy for the excellent explanation. I hadn't thought about the loss of data locality.
Regards,
Rahul

On Mon, Apr 1, 2013 at 11:29 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Hi Rahul,
>
> I don't think saving the stream for later use would work - I was just
> suggesting that if only some aggregate statistics needed to be calculated,
> they could be calculated at read time instead of in the mapper. Nothing
> requires a Writable to contain all the data that it reads.
>
> That's a good point that you can pass the locations of the files. A
> drawback of this is that Hadoop attempts to co-locate mappers with where
> their input data is stored, and this approach would negate the locality
> advantage.
>
> 200 MB is not too small a file for Hadoop. A typical HDFS block size is
> 64 MB or 128 MB, so a file that's larger than that is not unreasonable.
>
> -Sandy
>
>
> On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <
> rahul.rec....@gmail.com> wrote:
>
>> Sorry for the multiple replies.
>>
>> There is one more thing that can be done (I guess) for streaming the
>> values rather than constructing the whole object itself. We can store the
>> value in HDFS as a file and have its location as the value of the mapper.
>> The mapper can open a stream using the location specified.
>>
>> Not sure if a 200 MB file would qualify as a small file wrt Hadoop, or if
>> too many 200 MB files would have any impact on the NN.
>>
>> Thanks,
>> Rahul
>>
>>
>>
>> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
>> rahul.rec....@gmail.com> wrote:
>>
>>> Hi Sandy,
>>>
>>> I am also new to Hadoop and have a question here.
>>> A Writable does get a DataInput stream so that objects can be
>>> constructed from the byte stream.
>>> Are you suggesting saving the stream for later use? But later we cannot
>>> ascertain the state of the stream.
>>> For a large value, I think we can actually take the useful part and
>>> emit it from the mapper; we might also have a custom input format to
>>> do this so that the large value doesn't even reach the mapper.
>>>
>>> Am I missing anything here?
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>>
>>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm having a problem streaming individual key-value pairs of 200 MB to
>>>> 1 GB from a MapFile.
>>>> I need to stream the large value to an output stream instead of reading
>>>> the entire value before processing, because it potentially uses too much
>>>> memory.
>>>>
>>>> I read the API for MapFile; next(WritableComparable key, Writable
>>>> val) does not return an input stream.
>>>>
>>>> How can I accomplish this?
>>>>
>>>> Thanks,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>
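Sandy's point that "nothing requires a Writable to contain all the data that it reads" can be sketched with plain JDK streams. The class and field names below are illustrative, not from the thread; a real implementation would implement org.apache.hadoop.io.Writable, but the readFields pattern is the same: consume the serialized bytes and retain only the aggregates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch: a "value" that reads its serialized form but keeps
// only aggregate statistics (count, sum, max) instead of the raw records.
public class AggregatingValue {
    long count, sum, max;

    // Mirrors Writable.readFields(DataInput): each record is read and
    // immediately discarded, so memory use stays constant however large
    // the serialized value is.
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();          // number of longs that follow
        count = n;
        sum = 0;
        max = Long.MIN_VALUE;
        for (int i = 0; i < n; i++) {
            long v = in.readLong();
            sum += v;
            if (v > max) max = v;
        }
    }

    public static void main(String[] args) throws IOException {
        // Serialize a small value the same way a writer would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        long[] data = {3, 9, 4};
        out.writeInt(data.length);
        for (long v : data) out.writeLong(v);

        AggregatingValue val = new AggregatingValue();
        val.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(val.count + " " + val.sum + " " + val.max); // 3 16 9
    }
}
```

This only helps when the needed statistics are known at read time; if the mapper must see the raw bytes, the approach does not apply.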
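Rahul's location-as-value idea can also be sketched with the JDK alone. The example below uses the local filesystem via java.nio.file so it is runnable as-is; on a cluster the open call would instead be FileSystem.open(new Path(location)) from the Hadoop FS API. All names here are illustrative. As Sandy notes above, the trade-off is that the scheduler can no longer place the mapper near the actual bytes.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: the mapper's value is a path string, not the payload.
// The "map" logic opens a stream at that location and processes the bytes
// a chunk at a time, never holding the whole value in memory.
public class LocationValueMapper {

    // Stands in for map(key, value): 'location' is the value the mapper
    // receives; here it just counts the bytes as a placeholder for real work.
    static long byteCount(String location) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Path.of(location)))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;   // process one buffer at a time
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large value stored as its own file.
        Path tmp = Files.createTempFile("value", ".bin");
        Files.write(tmp, new byte[100_000]);
        System.out.println(byteCount(tmp.toString())); // 100000
        Files.delete(tmp);
    }
}
```

The thread's other caveat applies too: many such per-value files means many NameNode entries, so this pattern is best for values that are at least on the order of an HDFS block.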