-----Original message-----
> From:Teague James <teag...@insystechinc.com>
> Sent: Thursday 16th January 2014 20:23
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> Okay. I had used that previously and I just tried it again. The following 
> generated no errors:
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
> -dir crawl/segments/
> Solr is still not getting an anchor field and the outlinks are not appearing 
> in the index anywhere else.
> To be sure I deleted the crawl directory and did a fresh crawl using:
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Then
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
> -dir crawl/segments/
> No errors, but no anchor fields or outlinks. One thing in the response from 
> the crawl that I found interesting was a line that said:
> LinkDb: internal links will be ignored.

Good catch! That is likely the problem. 

> What does that mean?

  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
So change the property, rebuild the linkdb and try reindexing once again :)
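
For reference, here is a minimal sketch of those steps as a shell script. It
assumes the property in question is db.ignore.internal.links and that you have
already set it to false in conf/nutch-site.xml. The crawl directory and Solr
URL are the ones from your commands above. The run helper and the DRY_RUN
variable are only illustrative: by default the script prints the commands
instead of running them.

```shell
#!/bin/sh
# Sketch only: rebuild the LinkDb and reindex after changing the property.
# Assumption: db.ignore.internal.links is already false in conf/nutch-site.xml.
# DRY_RUN=1 (the default here) prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
CRAWL_DIR="crawl"
SOLR_URL="http://localhost/solr/"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Remove the old LinkDb so it is rebuilt with internal links included.
run rm -r "$CRAWL_DIR/linkdb"
# Invert the links of all segments into a fresh LinkDb.
run bin/nutch invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
# Reindex, pointing the indexer at the LinkDb via -linkdb.
run bin/nutch solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" -linkdb "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
```

Run it from the Nutch directory with DRY_RUN=0 once the printed commands look
right for your setup.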

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Thursday, January 16, 2014 11:08 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params 
> k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] 
> [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> You must point to the linkdb via the -linkdb parameter. 
> -----Original message-----
> > From:Teague James <teag...@insystechinc.com>
> > Sent: Thursday 16th January 2014 16:57
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Okay. I changed my solrindex to this:
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
> > crawl/segments/20140115143147
> > 
> > I got the same errors:
> > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > Along with a Java stacktrace
> > 
> > Those linkdb folders are not being created.
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 10:44 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Hi - you cannot use wildcards for segments. You need to give one segment or 
> > a -dir segments_dir. Check the usage of your indexer command. 
> >  
> > -----Original message-----
> > > From:Teague James <teag...@insystechinc.com>
> > > Sent: Thursday 16th January 2014 16:43
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > > 
> > > Hello Markus,
> > > 
> > > I do get a linkdb folder in the crawl folder, but Nutch creates it 
> > > automatically at the time I execute the command. 
> > > I just tried to use solrindex against yesterday's crawl and did not get 
> > > any errors, but did not get the anchor field or any of the outlinks. I 
> > > used this command:
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > > crawl/linkdb crawl/segments/*
> > > 
> > > I then tried:
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > > This produced the following errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > Along with a Java stacktrace
> > > 
> > > So I tried invertlinks as you had previously suggested. No errors, but 
> > > the above missing directories were not created. Running the same solrindex 
> > > command as above produced the same errors.
> > > 
> > > When/How are the missing directories supposed to be created?
> > > 
> > > I really appreciate the help! Thank you very much!
> > > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Thursday, January 16, 2014 5:45 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > > 
> > >  
> > > -----Original message-----
> > > > From:Teague James <teag...@insystechinc.com>
> > > > Sent: Wednesday 15th January 2014 22:01
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Indexing URLs from websites
> > > > 
> > > > I am still unsuccessful in getting this to work. My expectation is 
> > > > that the index-anchor plugin should produce values for the field 
> > > > anchor. However this field is not showing up in my Solr index no matter 
> > > > what I try.
> > > > 
> > > > Here's what I have in my nutch-site.xml for plugins:
> > > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)</value>
> > > > 
> > > > I am using the schema-solr4.xml from the Nutch package and I added 
> > > > the _version_ field
> > > > 
> > > > Here's the command I'm running:
> > > > bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
> > > > 
> > > > The fields that Solr returns are:
> > > > Content, title, segment, boost, digest, tstamp, id, url, and 
> > > > _version_
> > > > 
> > > > Note that the url field is the url of the page being indexed and 
> > > > not the
> > > > url(s) of the documents that may be outlinks on that page. It is 
> > > > the outlinks that I am trying to get into the index.
> > > > 
> > > > What am I missing? I also tried using the invertlinks command that 
> > > > Markus suggested, but that did not work either, though I do 
> > > > appreciate the suggestion.
> > > 
> > > That did get you a LinkDB, right? You need to call solrindex with the 
> > > linkdb's location as one of the arguments; only then does Nutch know about 
> > > it and use the data contained in the LinkDB, together with the 
> > > index-anchor plugin, to write the anchor field in your Solr index.
> > > 
> > > > 
> > > > Any help is appreciated! Thanks!
> > > > 
> > > > <Markus Jelsma> Wrote:
> > > > You need to use the invertlinks command to build a database with 
> > > > docs with inlinks and anchors. Then use the index-anchor plugin 
> > > > when indexing. Then you will have a multivalued field with anchors 
> > > > pointing to your document.
> > > > 
> > > > <Teague James> Wrote:
> > > > I am trying to index a website that contains links to documents 
> > > > such as PDF, Word, etc. The intent is to be able to store the URLs 
> > > > for the links to the documents.
> > > > 
> > > > For example, when indexing www.example.com which has links on the 
> > > > page like "Example Document" which points to 
> > > > www.example.com/docs/example.pdf, I want Solr to store the text of 
> > > > the link, "Example Document", and the URL for the link, 
> > > > "www.example.com/docs/example.pdf" in separate fields. I've tried 
> > > > using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the 
> > > > page content, but I am not getting the URLs from the links. There 
> > > > are no document type restrictions in Nutch for PDF or Word. Any 
> > > > suggestions on how I can accomplish this? Should I use a different 
> > > > method than Nutch for crawling the site?
> > > > 
> > > > I appreciate any help on this!
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
