Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params 
k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] 
[-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]

You must point to the linkdb via the -linkdb parameter. 
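For example, with the paths from your command below, that would be:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140115143147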
 
-----Original message-----
> From:Teague James <teag...@insystechinc.com>
> Sent: Thursday 16th January 2014 16:57
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Okay. I changed my solrindex to this:
> 
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
> crawl/segments/20140115143147
> 
> I got the same errors:
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
> exist: file:/.../crawl/linkdb/crawl_fetch
> Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> Input path does not exist: file:/.../crawl/linkdb/parse_data 
> Input path does not exist: file:/.../crawl/linkdb/parse_text 
> Along with a Java stacktrace
> 
> Those linkdb folders are not being created.
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Thursday, January 16, 2014 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Hi - you cannot use wildcards for segments. You need to give one segment or a 
> -dir segments_dir. Check the usage of your indexer command. 
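> For example, to index everything under the segments directory you could use the -dir form:
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments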
>  
> -----Original message-----
> > From:Teague James <teag...@insystechinc.com>
> > Sent: Thursday 16th January 2014 16:43
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Hello Markus,
> > 
> > I do get a linkdb folder in the crawl folder - it is created automatically 
> > by Nutch at the time I execute the command. I just tried solrindex against 
> > yesterday's crawl and did not get any errors, but I did not get the anchor 
> > field or any of the outlinks. I used this command:
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb crawl/segments/*
> > 
> > I then tried:
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
> > crawl/segments/*
> > This produced the following errors:
> > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > Along with a Java stacktrace
> > 
> > So I tried invertlinks as you had previously suggested. There were no errors, 
> > but the missing directories listed above were not created, and running the 
> > same solrindex command shown above again produced the same errors. 
> > 
> > When/How are the missing directories supposed to be created?
> > 
> > I really appreciate the help! Thank you very much!
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 5:45 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> >  
> > -----Original message-----
> > > From:Teague James <teag...@insystechinc.com>
> > > Sent: Wednesday 15th January 2014 22:01
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Indexing URLs from websites
> > > 
> > > I am still unsuccessful in getting this to work. My expectation is 
> > > that the index-anchor plugin should produce values for the field 
> > > anchor. However, this field is not showing up in my Solr index no matter 
> > > what I try.
> > > 
> > > Here's what I have in my nutch-site.xml for plugins:
> > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > 
> > > I am using the schema-solr4.xml from the Nutch package and I added 
> > > the _version_ field
> > > 
> > > Here's the command I'm running:
> > > bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
> > > 
> > > The fields that Solr returns are:
> > > content, title, segment, boost, digest, tstamp, id, url, and 
> > > _version_
> > > 
> > > Note that the url field is the url of the page being indexed and not 
> > > the
> > > url(s) of the documents that may be outlinks on that page. It is the 
> > > outlinks that I am trying to get into the index.
> > > 
> > > What am I missing? I also tried using the invertlinks command that 
> > > Markus suggested, but that did not work either, though I do 
> > > appreciate the suggestion.
> > 
> > That did get you a LinkDB, right? You need to call solrindex and pass the 
> > linkdb's location as part of the arguments; only then does Nutch know about 
> > it and use the data contained in the LinkDB, together with the 
> > index-anchor plugin, to write the anchor field into your Solr index.
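> > If you want to check that the LinkDB actually contains inlink and anchor 
> > data, you could dump it with something like:
> > bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump
> > (linkdb_dump is just an example output directory.)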
> > 
> > > 
> > > Any help is appreciated! Thanks!
> > > 
> > > <Markus Jelsma> Wrote:
> > > You need to use the invertlinks command to build a database of documents 
> > > with their inlinks and anchors. Then use the index-anchor plugin when 
> > > indexing. You will then have a multivalued field with the anchors pointing 
> > > to your document.
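> > > For example, using the crawl layout from elsewhere in this thread:
> > > bin/nutch invertlinks crawl/linkdb -dir crawl/segments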
> > > 
> > > <Teague James> Wrote:
> > > I am trying to index a website that contains links to documents such 
> > > as PDF, Word, etc. The intent is to be able to store the URLs for 
> > > the links to the documents.
> > > 
> > > For example, when indexing www.example.com which has links on the 
> > > page like "Example Document" which points to 
> > > www.example.com/docs/example.pdf, I want Solr to store the text of 
> > > the link, "Example Document", and the URL for the link, 
> > > "www.example.com/docs/example.pdf" in separate fields. I've tried 
> > > using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the 
> > > page content, but I am not getting the URLs from the links. There 
> > > are no document type restrictions in Nutch for PDF or Word. Any 
> > > suggestions on how I can accomplish this? Should I use a different method 
> > > than Nutch for crawling the site?
> > > 
> > > I appreciate any help on this!
> > > 
> > > 
> > > 
> > 
> > 
> 
> 
