Thanks Sandy for the excellent explanation. I hadn't thought about the loss of data locality.
Regards,
Rahul

On Mon, Apr 1, 2013 at 11:29 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Hi Rahul,
>
> I don't think saving the stream for later use would work - I was just
> suggesting that if only some aggregate statistics needed to be calculated,
> they could be calculated at read time instead of in the mapper. Nothing
> requires a Writable to contain all the data that it reads.
>
> That's a good point that you can pass the locations of the files. A
> drawback of this is that Hadoop attempts to co-locate mappers with where
> their input data is stored, and this approach would negate the locality
> advantage.
>
> 200 MB is not too small a file for Hadoop. A typical HDFS block size is
> 64 MB or 128 MB, so a file that's larger than that is not unreasonable.
>
> -Sandy
>
>
> On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <
> rahul.rec....@gmail.com> wrote:
>
>> Sorry for the multiple replies.
>>
>> There is one more thing that can be done (I guess) for streaming the
>> values rather than constructing the whole object itself. We can store the
>> value in HDFS as a file and have its location as the value of the mapper.
>> The mapper can open a stream using the location specified.
>>
>> Not sure if a 200 MB file would qualify as a small file wrt Hadoop, or if
>> too many 200 MB files would have any impact on the NN.
>>
>> Thanks,
>> Rahul
>>
>>
>>
>> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
>> rahul.rec....@gmail.com> wrote:
>>
>>> Hi Sandy,
>>>
>>> I am also new to Hadoop and have a question here.
>>> A Writable does get a DataInput stream so that objects can be
>>> constructed from the byte stream.
>>> Are you suggesting saving the stream for later use? But later we cannot
>>> ascertain the state of the stream.
>>> For a large value, I think we can actually take the useful part and
>>> emit it from the mapper; we might also have a custom input format to
>>> do this so that the large value doesn't even reach the mapper.
>>>
>>> Am I missing anything here?
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>>
>>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm having a problem streaming individual key-value pairs of 200 MB to
>>>> 1 GB from a MapFile.
>>>> I need to stream the large value to an output stream instead of reading
>>>> the entire value before processing, because it potentially uses too much
>>>> memory.
>>>>
>>>> I read the API for MapFile; next(WritableComparable key, Writable
>>>> val) does not return an input stream.
>>>>
>>>> How can I accomplish this?
>>>>
>>>> Thanks,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>
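Sandy's point that "nothing requires a Writable to contain all the data that it reads" can be sketched with plain JDK streams. The class and field names below are illustrative, not from the thread; a real implementation would implement org.apache.hadoop.io.Writable, but the readFields pattern is the same: consume the serialized bytes and retain only the aggregates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch: a "value" that reads its serialized form but keeps
// only aggregate statistics (count, sum, max) instead of the raw records.
public class AggregatingValue {
    long count, sum, max;

    // Mirrors Writable.readFields(DataInput): each record is read and
    // immediately discarded, so memory use stays constant however large
    // the serialized value is.
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();          // number of longs that follow
        count = n;
        sum = 0;
        max = Long.MIN_VALUE;
        for (int i = 0; i < n; i++) {
            long v = in.readLong();
            sum += v;
            if (v > max) max = v;
        }
    }

    public static void main(String[] args) throws IOException {
        // Serialize a small value the same way a writer would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        long[] data = {3, 9, 4};
        out.writeInt(data.length);
        for (long v : data) out.writeLong(v);

        AggregatingValue val = new AggregatingValue();
        val.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(val.count + " " + val.sum + " " + val.max); // 3 16 9
    }
}
```

This only helps when the needed statistics are known at read time; if the mapper must see the raw bytes, the approach does not apply.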
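Rahul's location-as-value idea can also be sketched with the JDK alone. The example below uses the local filesystem via java.nio.file so it is runnable as-is; on a cluster the open call would instead be FileSystem.open(new Path(location)) from the Hadoop FS API. All names here are illustrative. As Sandy notes above, the trade-off is that the scheduler can no longer place the mapper near the actual bytes.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: the mapper's value is a path string, not the payload.
// The "map" logic opens a stream at that location and processes the bytes
// a chunk at a time, never holding the whole value in memory.
public class LocationValueMapper {

    // Stands in for map(key, value): 'location' is the value the mapper
    // receives; here it just counts the bytes as a placeholder for real work.
    static long byteCount(String location) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Path.of(location)))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;   // process one buffer at a time
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large value stored as its own file.
        Path tmp = Files.createTempFile("value", ".bin");
        Files.write(tmp, new byte[100_000]);
        System.out.println(byteCount(tmp.toString())); // 100000
        Files.delete(tmp);
    }
}
```

The thread's other caveat applies too: many such per-value files means many NameNode entries, so this pattern is best for values that are at least on the order of an HDFS block.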