Lots of things will "work"; the question is all about what you're doing, specifically. I avoid trolling with phrases like "MySQL can't scale" (unless I know I can get a funny response). MySQL works and scales wonderfully for a specific set of problems, is "more than good enough" for most problems, and will make your life needlessly difficult for some others.
If you post some larger insights into what you want to warehouse from your crawl data, and what you plan to do with it, I can try to give some deeper feedback on how to approach it. But really, nothing too awful can come from putting it into SQL and picking up your own set of lessons. It may well be good enough and have just the right level of convenience for whoever is using it. There's no real "right" or "wrong" answer, which is what makes some of this stuff a real PITA. Sometimes it'd be nice if someone told me what tool to use--so I could move on with my life and solve the nonsense I was supposed to. It's all still very new right now--but Solr (and thus Lucene) has a fairly established track record in indexing/cataloguing heavily de-normalized internet sludge.

Scott Gonyea

On Tue, Oct 26, 2010 at 10:14 PM, xiao yang <yangxiao9...@gmail.com> wrote:
> Hi, Scott,
>
> I agree with you on the uselessness of the row-locking and transactional integrity features. But we can reduce the overhead by reading data in blocks. I mean read many rows (like 1K, or more) at a time and process them in memory. Do you think that will work?
>
> Thanks!
> Xiao
>
> On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <m...@sgonyea.com> wrote:
>> Not that it's guaranteed to be of "next to no value," but really, you've probably already lost pages just crawling them. Server / network errors, for example, take the integrity question and make it a cost-benefit one. Do you recrawl a bunch? At different times? Different geographies?
>>
>> Row locking is reasonably nice, but that begs other questions. It can easily be solved one of two ways: put your data in Solr, and persist your efforts in both places, Solr and an SQL backend. If you're using Riak (or Cassandra), you allow document collisions to exist and reconcile them within your application.
>>
>> It sounds complex, but it's actually quite trivial to implement.
>>
>> Scott
>>
>> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <m...@sgonyea.com> wrote:
>>> I love relational databases, but their many features are (in my opinion) wasted on what you find in Nutch. Row-locking and transactional integrity are great for lots of applications, but they become a whole lot of overhead when they're of next to no value to whatever you're doing.
>>>
>>> RE: counting URLs: have you looked at Solr's facets, etc.? I use them like they're going out of style--and they're very powerful.
>>>
>>> For my application, Solr *is* my database. Nutch crawls data, stores it somewhere, then picks it back up and drops it in Solr. All of my crawl data sits in Solr. I actively report on stats from Solr, as well as make updates to the content that's stored. Lots of fields / boolean attributes sit in the schema.
>>>
>>> As the user works through the app, their changes get pushed back into Solr. Then, when they next hit "Search," results disappear / move around as they had organized them.
>>>
>>> Scott
>>>
>>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <yangxiao9...@gmail.com> wrote:
>>>> Hi, Scott,
>>>>
>>>> Thanks for your reply.
>>>> I'm curious about the reason why using a database is awful.
>>>> Here is my requirement: we have two developers who want to do some processing and analysis work on the crawled data. If the data is stored in a database, we can easily share it, thanks to the well-defined data models. What's more, the analysis results can also be easily stored back into the database by just adding a few fields.
>>>> For example, I need to know the average number of URLs in one site. In a database, a single SQL query will do. If I want to extract and store the main part of web pages, I can't easily modify Nutch's data structures. Even in Solr, it's difficult and inefficient to iterate through the data set.
>>>> The crawled data is structured, so why not use a database?
>>>>
>>>> Thanks!
>>>> Xiao
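To make the "single SQL will do" point above concrete, here's a rough sketch in plain JDBC. The table and column names (a hypothetical pages table with a host column) and the connection details are made up; adjust them to whatever schema you actually land on. The fetch-size call is there for the read-in-blocks question: it only asks the driver to pull rows in chunks, and how literally that is honored is driver-specific (MySQL needs extra settings to truly stream).

  // Rough sketch only: hypothetical pages(url, host, ...) table and
  // placeholder connection details. Not Nutch code.
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class AvgUrlsPerSite {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://localhost/crawl", "user", "password");
               Statement stmt = conn.createStatement()) {
              // Ask the driver to fetch rows in blocks rather than all at once;
              // actual behavior is driver-specific.
              stmt.setFetchSize(1000);
              // One aggregate query instead of iterating the whole data set
              // in application code.
              ResultSet rs = stmt.executeQuery(
                  "SELECT AVG(cnt) FROM ("
                  + " SELECT host, COUNT(*) AS cnt FROM pages GROUP BY host"
                  + ") AS per_site");
              if (rs.next()) {
                  System.out.println("Average URLs per site: " + rs.getDouble(1));
              }
          }
      }
  }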
>>>>
>>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <m...@sgonyea.com> wrote:
>>>>> Use Solr? At its core, Solr is a document database. Using a relational database to warehouse your crawl data is generally an awful idea. I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>>
>>>>> I liken crawl data to sludge. Don't try to normalize it. Know what you want to get from it, and expose that data the best way possible. If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool. Amazingly so.
>>>>>
>>>>> That said, you also have another very good choice. Take a look at Riak Search. They hijacked many core elements of Solr, which I applaud, and it is compatible with Solr's HTTP interface. In effect, you can point Nutch's solr-index job at a Riak Search node instead and put your data there.
>>>>>
>>>>> The other nice thing: Riak is a (self-described) "mini-hadoop." So you can search across the Solr indexes that it's built on top of, or you can throw MapReduce jobs at Riak and perform some very detailed analytics.
>>>>>
>>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>>
>>>>> Scott Gonyea
>>>>>
>>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>>
>>>>>> Hi, guys,
>>>>>>
>>>>>> Nutch has its own data format for CrawlDB and LinkDB, which are difficult to manage and share among applications.
>>>>>> Are there any web crawlers based on a relational database?
>>>>>> I can see that Nutch is trying to use HBase for storage, but why not use a relational database instead? We can use partitioning to solve the scalability problem.
>>>>>>
>>>>>> Thanks!
>>>>>> Xiao
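And to make the facet-counting suggestion from upthread concrete: since Riak Search speaks Solr's HTTP interface, the same sort of query should work against either. Here's a rough sketch of counting documents per host with a facet query; the core URL and the "host" field are assumptions about your setup and schema, so adjust to taste.

  // Rough sketch only: placeholder Solr (or Riak Search) URL and an assumed
  // "host" field in the index schema.
  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class UrlsPerHostFacet {
      public static void main(String[] args) throws Exception {
          // rows=0: we only want the per-host counts, not the documents.
          URL url = new URL("http://localhost:8983/solr/select"
                  + "?q=*:*&rows=0&facet=true&facet.field=host&facet.limit=20&wt=json");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          BufferedReader in = new BufferedReader(
                  new InputStreamReader(conn.getInputStream(), "UTF-8"));
          try {
              String line;
              while ((line = in.readLine()) != null) {
                  System.out.println(line); // JSON response with one count per host
              }
          } finally {
              in.close();
          }
      }
  }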