Hi,

It is wiser to store the files in DFS rather than in a database: a
database is meant for structured data, or data with a schema, and plain
flat files on a single machine also do not scale well for large data
sets. DFS provides out-of-the-box replication for fault tolerance, and
on top of that the MapReduce framework can run directly over DFS to do
large-volume data processing. Hadoop also has some interesting
subprojects such as Hive and Pig. You can simply write your pages into
DFS without bothering with the MapReduce stuff at all, but with
MapReduce you can get the processing done much faster.
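
As a rough illustration (my own sketch, not Nutch code; the class name,
path and page contents below are made up), crawled pages could be
appended into a single SequenceFile on DFS, keyed by URL, rather than
written out as millions of tiny HTML files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PageStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up fs.default.name
    FileSystem fs = FileSystem.get(conf);      // DFS when run on the cluster
    Path out = new Path("/crawl/pages/part-00000");  // made-up path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    // in a real crawler the key/value would come from the fetch loop
    writer.append(new Text("http://example.com/"),
                  new Text("<html>...page contents...</html>"));
    writer.close();
  }
}

One big SequenceFile (or a MapFile, which is what Nutch itself uses) is
much friendlier to the namenode than lots of small files.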

And in your case, since you use your own crawler and want to feed your
crawled pages into Solr, I think you can borrow ideas from Nutch and
leverage the power of MapReduce in your own crawler, as Dennis said.
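
For example (again just a sketch with made-up names, using the old
mapred API, and assuming the pages are stored as URL -> content Text
pairs as above), a small MapReduce job over that data might count
fetched pages per host:

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class HostCount {
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    public void map(Text url, Text page,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      try {
        // emit (host, 1) for every stored page
        out.collect(new Text(new URL(url.toString()).getHost()),
                    new LongWritable(1));
      } catch (Exception e) {
        // skip malformed URLs
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text host, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(host, new LongWritable(sum));  // total pages per host
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(HostCount.class);
    job.setJobName("host-count");
    job.setInputFormat(SequenceFileInputFormat.class);  // URL -> content pairs
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(job, new Path("/crawl/pages"));
    FileOutputFormat.setOutputPath(job, new Path("/crawl/hostcounts"));
    JobClient.runJob(job);
  }
}

The same map-then-reduce pattern (mapper parses or extracts, reducer
aggregates) is how Nutch runs its own parsing and indexing steps, and it
is also a natural place to prepare documents before sending them to Solr.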

good luck

yanky



2009/4/2, ram_sj <rpachaiyap...@gmail.com>:
>
> It's a nice explanation of the Hadoop/Nutch data handling capability.
>
> Thanks for taking the time to answer me.
>
> ram
>
>
>
>
> Dennis Kubes-2 wrote:
>>
>>
>>
>> ram_sj wrote:
>>> Hi,
>>>
>>> I'm trying to provide search functionality for our website using Apache
>>> Solr. We have an in-house developed crawler which conveniently provides a
>>> few features we require.
>>>
>>> My question is: the current crawler program tries to save all the data
>>> into the database. Is it a good approach to save all the crawler data
>>> into a database, or to leave it as some sort of flat file (XML/HTML)? We
>>> expect that our data will grow rapidly. Assume that my next step is to
>>> import all the data from the database into the Solr index.
>>
>> With the Nutch crawler the webpage contents are held in a MapFile (aka a
>> binary database), as it is assumed they will be processed using Hadoop
>> MapReduce and DFS.  With DFS the size of the file doesn't matter.  Case
>> in point, we had some MapFiles that were > 250G in size for a single
>> file.  You can always write an MR job to pull the content out into a
>> flat file if that is better for your application.
>>
>> For your in-house crawler, saving information in a database will work up
>> to a given size, usually 30-50M pages depending on your database size.
>> Beyond that, the processing time pulling it in and out of a relational
>> database becomes too much to be efficient.  Working with data sizes in
>> this range is really what Hadoop and MR were made for.  If you are
>> keeping it in your relational database and you still want to index with
>> Nutch, you would need to write a conversion program to convert from the
>> database to Nutch segments.  From there the other programs should work.
>> Note that I don't recommend this approach, just giving it as an example.
>>
>> In terms of putting the content into Solr, the new Nutch-Solr
>> integration functionality should be able to handle that directly from
>> Nutch segments during indexing.
>>
>> Dennis
>>
>>>
>>> Any suggestion would be helpful and appreciated.
>>>
>>> Thanks
>>> Ram
>>
>>
>
