How to manage large records in MapReduce

2011-01-06 Thread Jérôme Thièvre
Hi,

we are currently using Hadoop (version 0.20.2) to manage some web archiving
processes, like full-text indexing, and it works very well with small records
that contain HTML.
Now we would like to work with other types of web data, like videos. This
kind of data can be really large, and of course these records don't fit in
memory.

Is it possible to manage records whose content resides on disk rather than
in memory?
One possibility would be to implement a Writable that reads its content from
a DataInput but doesn't load it into memory; instead, it would copy that
content to a temporary file on the local file system and allow that content
to be streamed back through an InputStream (an InputStreamWritable).
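
A rough sketch of what I have in mind (only a sketch: it assumes the
serialized record is length-prefixed, the class name is the one from my
description, and the buffer size is arbitrary):

import java.io.*;
import org.apache.hadoop.io.Writable;

public class InputStreamWritable implements Writable {

    private File tempFile;

    @Override
    public void readFields(DataInput in) throws IOException {
        // Assumes the payload length was written before the payload itself.
        long length = in.readLong();
        tempFile = File.createTempFile("record-", ".bin");
        tempFile.deleteOnExit();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(tempFile));
        try {
            // Spool to local disk in bounded chunks; the full record is
            // never held in memory.
            byte[] buffer = new byte[64 * 1024];
            long remaining = length;
            while (remaining > 0) {
                int chunk = (int) Math.min(buffer.length, remaining);
                in.readFully(buffer, 0, chunk);
                out.write(buffer, 0, chunk);
                remaining -= chunk;
            }
        } finally {
            out.close();
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Mirror image: stream the temp file back out, length first.
        out.writeLong(tempFile.length());
        InputStream in = new BufferedInputStream(new FileInputStream(tempFile));
        try {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            in.close();
        }
    }

    // Streams the spooled content back without loading it into memory.
    public InputStream getInputStream() throws IOException {
        return new BufferedInputStream(new FileInputStream(tempFile));
    }
}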

Has anybody tested a similar approach? If not, do you think this method
could cause any big problems (ones that impact performance)?

Thanks,

Jérôme Thièvre


Re: How to manage large records in MapReduce

2011-01-07 Thread Sonal Goyal
Jerome,

You can take a look at FileStreamInputFormat at
https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

This provides an input stream per file. In our case, we are using the input
stream to load data into the database directly. Maybe you can use this or a
similar approach for working with your videos.
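
For reference, here is a minimal sketch in the same spirit (this is not the
actual hiho code, just an illustration against the Hadoop 0.20 "mapreduce"
API, with illustrative class names): one unsplittable split per file, whose
single record value is an open stream on that file.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class StreamPerFileInputFormat
        extends FileInputFormat<Text, FSDataInputStream> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each file is handled whole, by one mapper
    }

    @Override
    public RecordReader<Text, FSDataInputStream> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new StreamRecordReader();
    }

    static class StreamRecordReader
            extends RecordReader<Text, FSDataInputStream> {

        private Path path;
        private FSDataInputStream stream;
        private boolean consumed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            path = ((FileSplit) split).getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            stream = fs.open(path); // the mapper reads from this stream directly
        }

        @Override
        public boolean nextKeyValue() {
            // Exactly one record per file: the (path, stream) pair.
            if (consumed) return false;
            consumed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return new Text(path.toString()); }

        @Override
        public FSDataInputStream getCurrentValue() { return stream; }

        @Override
        public float getProgress() { return consumed ? 1.0f : 0.0f; }

        @Override
        public void close() throws IOException {
            if (stream != null) stream.close();
        }
    }
}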

HTH

Thanks and Regards,
Sonal
Connect Hadoop with databases,
Salesforce, FTP servers and others 
Nube Technologies


Re: How to manage large records in MapReduce

2011-01-07 Thread Jérôme Thièvre INA
Hi Sonal,

thank you, I have just implemented a solution similar to yours (without
copying to a temporary file, as suggested in my initial post), and it seems
to work.
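
The map side then looks something like this (a simplified, hypothetical
mapper against a stream-per-file input format like the one sketched above;
counting bytes stands in for the real processing):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class VideoSizeMapper
        extends Mapper<Text, FSDataInputStream, Text, LongWritable> {

    @Override
    protected void map(Text path, FSDataInputStream stream, Context context)
            throws IOException, InterruptedException {
        // Consume the file in bounded chunks; nothing larger than the
        // buffer is ever held in memory.
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int read;
        while ((read = stream.read(buffer)) != -1) {
            total += read; // real per-record processing would go here
        }
        context.write(path, new LongWritable(total));
    }
}
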
Best Regards,

Jérôme



Re: How to manage large records in MapReduce

2011-01-25 Thread lei
Hi Jerome,

I have a similar problem to yours. Would you please share more details about
your solution?

Thanks,
Lei