On Thu, Jun 27, 2013 at 12:24 AM, Tony Mullins <tonymullins...@gmail.com> wrote:

> I am grateful for the help the community is giving me, and I wouldn't be
> able to do it without their help.
>
> When I was using Cassandra, it only created a single 'webpage' table; whether
> I ran my jobs without a crawlId (directly from Eclipse) or with a crawlId, it
> always used the same 'webpage' table.
> This is not the case with HBase, as HBase creates a table like
> 'crawlId_webpage'. So what I was asking is: is it possible to achieve the
> same behavior (Cassandra's) with HBase, i.e. to make HBase create only a
> single 'webpage' table even if I give a crawlId to my bin/crawl script?
>

You can customize your bin/crawl script to get that done. Currently it
passes the crawlId argument to the Nutch commands. You can check the
usage of those commands and figure out whether they accept "-all".
AFAIK, the fetch and parse commands have an "-all" param which you can use.
Updatedb does not need it, as by default it works over all batches.
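For illustration, a trimmed crawl loop along those lines might look like the
sketch below. This is a rough adaptation based on the thread, not a drop-in
replacement for bin/crawl: exact option names vary between Nutch 2.x
releases (check each command's usage output), and the seed path and -topN
value are placeholders.

```shell
# Sketch: run the batch commands without a crawlId so the backend uses the
# plain 'webpage' table, and use "-all" where a batch id would otherwise be
# passed. Verify option names with "bin/nutch <command>" on your release.

bin/nutch inject urls/seed.txt   # no -crawlId => no 'crawlId_' table prefix
bin/nutch generate -topN 50
bin/nutch fetch -all             # fetch all generated batches
bin/nutch parse -all             # parse all fetched batches
bin/nutch updatedb               # works over all batches by default
```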

And I think this log is generated due to the same issue I mentioned above:
> "Keyclass and nameclass match but mismatching table names  mappingfile
> schema is 'webpage' vs actual schema 'C11_webpage' , assuming they are the
> same."
>

I have no clue what this is about. I will be looking into this in the
coming days.

>
> And what do you meant by the "status of URLs" ?
>

Those indicate the status of the URL. [0] is a shameless plug of my answer
on Stack Overflow, which explains what each status stands for.

These are the logs when I run my job for the first time (Inject ->
> generate -> fetch -> parse -> DBUpdate) and for 2 or 3 depth levels (
> generate -> fetch -> parse -> DBUpdate).
>
> I always get these
> *status:    2 (status_fetched)*
> fetchTime:    0
> prevFetchTime:    0
> fetchInterval:    0
> retriesSinceFetch:    0
> modifiedTime:    0
> prevModifiedTime:    0
> protocolStatus:    (null)
>
>
> Thanks again for your help.
> Tony.
>

[0]  :
http://stackoverflow.com/questions/16853155/where-can-i-find-documentation-about-nutch-status-codes/16869165#16869165

>
>
>
> On Thu, Jun 27, 2013 at 2:33 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
> > On Wed, Jun 26, 2013 at 4:30 AM, Tony Mullins <tonymullins...@gmail.com
> > >wrote:
> >
> > >
> > > Is it possible to crawl with a crawlId but have HBase create only the
> > > 'webpage' table without the crawlId prefix, just like Cassandra does?
> > >
> >
> > I can't understand this question Tony.
> >
> >
> > >
> > > And my other problems of the DBUpdateJob's exception on some random
> > > URLs and the repeating/mixed HTML of all URLs present in seed.txt are
> > > also resolved (disappeared) with the HBase backend.
> > >
> >
> > Good
> >
> >
> > > Am I supposed to get proper values here, or are these the expected
> > > output in the ParseFilter plugin?
> > >
> >
> > What is the status of the URLs which have the null or 0 values for the
> > fields you posted?
> >
> >
> >
> > > PS. Now I am getting correct HTML in ParseFilter with the HBase backend.
> > >
> >
> > Good
>
