Hi Dru,
> From: Dru Jensen <[EMAIL PROTECTED]>
> Subject: Re: Unknown Scanner Exception
> To: [email protected]
> Date: Tuesday, August 12, 2008, 12:22 PM
> Hi Andy,
>
> I am pulling HTML from different web pages and storing it
> in HBase. I tried to use Heritrix and Nutch but they don't
> have Bigtable support (yet) and I don't need to index, I
> just need to store them for archiving purposes.
>
> I think it would be an excellent example or use case since
> it seems I'm not the only one running into these issues.
Well, you and I will certainly run into the same issues because
we're doing very similar things.
I wrote a "crawler" that can be distributed out as a map task
and that uses a DHT to synchronize activities. (We could perhaps
have used HBase for the synchronization, but the DHT also
underpins an HBase (and Hadoop) service monitoring
infrastructure, so it needed to be there anyhow.) The DHT is
custom also, but that's another story...
> I am storing the results in a column family in the same
> table I am scanning. Maybe I should use a different table to
> store the results?
Using a separate table to store the results will reduce
contention.
Another possibility is to run an MR job ahead of time to build
a worklist in DFS and avoid TableMap entirely. That would also
let you split the work across more maps: it doesn't make sense
to have more mappers than table regions when scanning, but a
file split doesn't carry the same I/O considerations.
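A worklist in DFS can be as simple as one URL per line; a line-oriented input format can then split the file into however many maps you want, independent of region count. A minimal sketch of the worklist file side (the class and file names here are illustrative, not code from either of our crawlers):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Writes and reads a plain-text worklist, one URL per line. A
// line-oriented InputFormat can split such a file into an arbitrary
// number of map tasks, unlike a table scan, which is bounded by the
// number of regions.
public class WorklistWriter {
    public static void write(Path file, Collection<String> urls) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            for (String url : urls) {
                out.write(url);
                out.newLine();
            }
        }
    }

    public static List<String> read(Path file) throws IOException {
        return Files.readAllLines(file);
    }
}
```

In a real job the write side would run as the worklist-building MR pass and the file would live in DFS rather than on the local filesystem.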
> My next big challenge is performance. It took 18 hours to
> pull 8000 pages and the task never completed. It launched 4
> MR tasks. Not sure if I got a lock on the table that
> wouldn't release or what happened. I am going to add more
> logging and try to track down what is causing the slowness.
I'd be interested to hear if logging turns up anything.
In general, crawling, especially if you are recursively
following links (are you?), can take a long time... Remote
servers are often quite slow. I set socket and connection
timeouts for commons-httpclient and retry on failure. Also,
an iterative worklist approach beats recursion: add discovered
URLs to the worklist, have them picked up the next time the MR
job runs, and rerun the job until the worklist is empty.
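The retry and iterative-worklist ideas look roughly like this in plain Java (this is a sketch, not the actual crawler code; the `Fetcher` interface stands in for a commons-httpclient call configured with socket and connection timeouts):

```java
import java.util.*;
import java.util.concurrent.Callable;

// Iterative crawl loop: drain the current worklist, add newly discovered
// URLs back onto it, and repeat until nothing new turns up. retry() wraps
// the fetch in a bounded retry loop so a slow or flaky remote server
// doesn't kill the task.
public class CrawlLoop {
    // Retries a call up to 'attempts' times, rethrowing the last failure.
    static <T> T retry(Callable<T> call, int attempts) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;   // e.g. a socket timeout; try again
            }
        }
        throw last;
    }

    // 'fetch' returns the URLs discovered on a page; 'seen' keeps the
    // loop from revisiting pages, so recursion depth is never an issue.
    static Set<String> crawl(Collection<String> seeds, Fetcher fetch) throws Exception {
        Set<String> seen = new HashSet<>();
        Deque<String> worklist = new ArrayDeque<>(seeds);
        while (!worklist.isEmpty()) {
            String url = worklist.poll();
            if (!seen.add(url)) continue;       // already fetched
            for (String found : retry(() -> fetch.fetch(url), 3)) {
                if (!seen.contains(found)) worklist.add(found);
            }
        }
        return seen;
    }

    interface Fetcher {
        List<String> fetch(String url) throws Exception;
    }
}
```

In the MR setting the in-memory deque becomes the worklist file in DFS: each job run drains one worklist and writes the next, and you rerun until the file comes out empty.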
> Is it better to commit during the reduce task or inside the
> map task?
Inside the map task, definitely. If you commit in the reduce,
a job failure at the map stage forces you to redo everything
buffered in the collector. Potentially a lot of wasted effort.
Hope this helps,
- Andy