Dru Jensen wrote:
Hi Andy,
I am pulling HTML from different web pages and storing it in HBase. I
tried to use Heritrix and Nutch, but they don't have Bigtable support
(yet) and I don't need to index; I just need to store the pages for
archiving purposes.
So now, instead, you are crawling into HDFS and then uploading from there?
My next big challenge is performance. It took 18 hours to pull 8000
pages and the task never completed.
It took 18 hours to run your MR job? That's way too long. Or was that to pull
down 8k pages with the crawler?
It launched 4 MR tasks. Not sure if I got a lock on the table that
wouldn't release or what happened. I am going to add more logging and
try to track down what is causing the slowness.
If you are on 0.2.0, you might have been running into HBASE-820. Were your MR jobs
retried? What error did the failed tasks report? You might try a task per (W)ARC
file (if you were crawling with Heritrix).
I am storing the results in a column family in the same table I am
scanning. Maybe I should use a different table to store the results?
Same table is fine if you are storing results into a different column family.
Is it better to commit during the reduce task or inside the map task?
See http://wiki.apache.org/hadoop/Hbase/MapReduce for some pointers. In
particular, the last paragraph in 'Hbase as MapReduce job data source and sink'.
In earlier versions, I was using the IdentityTableReducer, but if the map
task failed, I would lose all results up to that point, which (after
running for 18 hrs) made me want to change career paths.
Smile.
Did the task not retry?
Out of interest, what kind of archiving are you doing?
St.Ack
Dru
On Aug 12, 2008, at 10:45 AM, Andrew Purtell wrote:
Dru,
My USE (UnknownScannerException) issues with TableMap were also related to HTTP
transactions in the map taking too long. Might make for a useful
design note. I'd be curious to know more details about what you
are trying to accomplish if you are willing to share them...
- Andy
From: Dru Jensen <[EMAIL PROTECTED]>
Subject: Re: Unknown Scanner Exception
To: [email protected]
Date: Tuesday, August 12, 2008, 10:00 AM
J-D and Andy,
This seems to solve the problem. I thought I had set this
parameter before but realized I set the "master" lease time
instead of the "region server" lease time.
[...]
The MR task makes HTTP calls, so I also needed to set the
timeout on the call to make sure it doesn't take longer than
the ping back to the server.
[...]
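
For readers following along, here is a minimal sketch of the idea in the last paragraph: bound each HTTP call made inside the map task so it cannot outlast the region server lease. This is not the code from the original message. The class name BoundedFetch, the 60-second lease value, and the way the budget is split between connect and read timeouts are all assumptions made for illustration; only the java.net calls are standard.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: fetches a page with timeouts derived from the scanner lease.
final class BoundedFetch {
    // ASSUMPTION: region server lease time in milliseconds; use the value you configured.
    static final int LEASE_PERIOD_MS = 60_000;

    // Opens (but does not yet connect) a connection whose connect and read
    // timeouts together use at most half of the lease, leaving headroom for
    // the rest of the map() call.
    static HttpURLConnection open(String url, int leaseMs) throws IOException {
        int budget = leaseMs / 2;
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(budget / 4);        // time allowed to establish the connection
        conn.setReadTimeout(budget - budget / 4);  // max time blocked on any single read
        return conn;
    }

    // Downloads the body; a SocketTimeoutException (an IOException) is thrown
    // if either timeout expires, so the caller can skip or retry the page.
    static byte[] fetch(String url, int leaseMs) throws IOException {
        HttpURLConnection conn = open(url, leaseMs);
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        } finally {
            conn.disconnect();
        }
    }
}
```

Inside a map task this would be called as BoundedFetch.fetch(pageUrl, BoundedFetch.LEASE_PERIOD_MS). Note that the read timeout limits each blocking read rather than the whole download, so a server that trickles data slowly could still exceed the budget; a hard overall deadline would need a separate timer or worker thread.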