Has this been solved? If http.content.limit has not been increased in nutch-site.xml, Nutch truncates each fetched page at the default limit of 65536 bytes, so the full content is never stored or indexed in Solr.
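Something along these lines in runtime/local/conf/nutch-site.xml should cover it (the value here is only an example; -1 disables truncation entirely, or you can set a byte limit larger than your biggest transcript page):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>

Note that pages fetched before the change were stored truncated, so you will need to re-crawl and re-index for the full text to show up in Solr.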
On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun <ccalh...@aip.org> wrote:

> I'm still having trouble. I've set a Windows environment variable,
> NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . I now have
> my urls and crawl directories in that C:\Apache\nutch-1.3\runtime\local
> folder. But I'm still not crawling files later on my urls list, and
> apparently I can't search for words or phrases toward the end of any of my
> documents. Am I misremembering that there was a total file size value
> somewhere in Nutch or Solr that needs to be increased?
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Wednesday, July 20, 2011 5:23 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch not indexing full collection
>
> Hi Chip,
>
> I would try running your scripts after setting the environment variable
> $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME
>
> On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun <ccalh...@aip.org> wrote:
>
> > I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml,
> > and I'm pretty sure that's the correct file. I run my commands while
> > in $NUTCH_HOME/ , which means all of my commands begin with
> > "runtime/local/bin/nutch..." . That means my urls directory is
> > $NUTCH_HOME/urls/ and my crawl directory ends up being
> > $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and
> > so forth), but it does seem to at least be getting my urlfilters from
> > $NUTCH_HOME/runtime/local/conf/ .
> >
> > I get no output when I try runtime/local/bin/nutch readdb -stats , so
> > that's weird.
> >
> > I dimly recall there being a total index size value somewhere in Nutch
> > or Solr which has to be increased, but I can no longer find any
> > reference to it.
> >
> > Chip
> >
> > -----Original Message-----
> > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> > Sent: Wednesday, July 20, 2011 10:06 AM
> > To: user@nutch.apache.org
> > Subject: Re: Nutch not indexing full collection
> >
> > I'd have suspected db.max.outlinks.per.page but you seem to have set
> > it up correctly. Are you running Nutch in runtime/local? In which case
> > you modified nutch-site.xml in runtime/local/conf, right?
> >
> > nutch readdb -stats will give you the total number of pages known etc.
> >
> > Julien
> >
> > On 20 July 2011 14:51, Chip Calhoun <ccalh...@aip.org> wrote:
> >
> > > Hi,
> > >
> > > I'm using Nutch 1.3 to crawl a section of our website, and it
> > > doesn't seem to crawl the entire thing. I'm probably missing
> > > something simple, so I hope somebody can help me.
> > >
> > > My urls/nutch file contains a single URL:
> > > http://www.aip.org/history/ohilist/transcripts.html , which is an
> > > alphabetical listing of other pages. It looks like the indexer
> > > stops partway down this page, meaning that entries later in the
> > > alphabet aren't indexed.
> > >
> > > My nutch-site.xml has the following content:
> > >
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > > <!-- Put site-specific property overrides in this file. -->
> > > <configuration>
> > >   <property>
> > >     <name>http.agent.name</name>
> > >     <value>OHI Spider</value>
> > >   </property>
> > >   <property>
> > >     <name>db.max.outlinks.per.page</name>
> > >     <value>-1</value>
> > >     <description>The maximum number of outlinks that we'll process for
> > >     a page. If this value is nonnegative (>=0), at most
> > >     db.max.outlinks.per.page outlinks will be processed for a page;
> > >     otherwise, all outlinks will be processed.
> > >     </description>
> > >   </property>
> > > </configuration>
> > >
> > > My regex-urlfilter.txt and crawl-urlfilter.txt both include the
> > > following, which should allow access to everything I want:
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
> > > # skip everything else
> > > -.
> > >
> > > I've crawled with the following command:
> > >
> > > runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
> > >
> > > Note that since we don't have NutchBean anymore, I can't tell
> > > whether this is actually a Nutch problem or whether something is
> > > failing when I port to Solr. What am I missing?
> > >
> > > Thanks,
> > > Chip
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
>
> --
> *Lewis*

--
*Lewis*