Hi - you cannot use wildcards for segments. You need to give a single segment, or a
segments directory via -dir segments_dir. Check the usage of your indexer command.
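For example, assuming the crawl/ layout used later in this thread and a Nutch 1.x indexer that accepts -dir, the two valid forms would look roughly like this (the segment name is a hypothetical example, not one from this thread):

```shell
# Index every segment at once by passing the segments directory via -dir
# rather than a shell wildcard:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb \
  -linkdb crawl/linkdb \
  -dir crawl/segments

# Or index a single, explicitly named segment:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb \
  -linkdb crawl/linkdb \
  crawl/segments/20140116120000   # hypothetical segment name
```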
-----Original message-----
> From:Teague James <teag...@insystechinc.com>
> Sent: Thursday 16th January 2014 16:43
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Hello Markus,
>
> I do get a linkdb folder in the crawl folder - but it is created automatically
> by Nutch at the time I execute the crawl command. I just tried to use solrindex
> against yesterday's crawl and did not get any errors, but I also did not get
> the anchor field or any of the outlinks. I used this command:
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb
> crawl/segments/*
>
> I then tried:
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb
> crawl/segments/*
> This produced the following errors:
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/.../crawl/linkdb/crawl_fetch
> Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> Input path does not exist: file:/.../crawl/linkdb/parse_data
> Input path does not exist: file:/.../crawl/linkdb/parse_text
> Along with a Java stack trace.
>
> So I tried invertlinks, as you had previously suggested. No errors, but the
> missing directories above were not created. Running the same solrindex command
> as above afterwards produced the same errors.
>
> When/How are the missing directories supposed to be created?
>
> I really appreciate the help! Thank you very much!
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, January 16, 2014 5:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
>
> -----Original message-----
> > From:Teague James <teag...@insystechinc.com>
> > Sent: Wednesday 15th January 2014 22:01
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing URLs from websites
> >
> > I am still unsuccessful in getting this to work. My expectation is
> > that the index-anchor plugin should produce values for the field
> > anchor. However, this field is not showing up in my Solr index no matter
> > what I try.
> >
> > Here's what I have in my nutch-site.xml for plugins:
> > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-basic|site|optic|urlnormalizer-(pass|reges|basic)</value>
> >
> > I am using the schema-solr4.xml from the Nutch package, and I added the
> > _version_ field.
> >
> > Here's the command I'm running:
> > bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
> >
> > The fields that Solr returns are:
> > content, title, segment, boost, digest, tstamp, id, url, and _version_
> >
> > Note that the url field is the url of the page being indexed and not
> > the
> > url(s) of the documents that may be outlinks on that page. It is the
> > outlinks that I am trying to get into the index.
> >
> > What am I missing? I also tried using the invertlinks command that
> > Markus suggested, but that did not work either, though I do appreciate
> > the suggestion.
>
> That did get you a LinkDB, right? You need to call solrindex and pass the
> linkdb's location as part of the arguments; only then does Nutch know about it
> and use the data contained in the LinkDB, together with the index-anchor
> plugin, to write the anchor field to your Solr index.
>
> >
> > Any help is appreciated! Thanks!
> >
> > <Markus Jelsma> Wrote:
> > You need to use the invertlinks command to build a database with docs
> > with inlinks and anchors. Then use the index-anchor plugin when
> > indexing. Then you will have a multivalued field with anchors pointing to
> > your document.
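A minimal sketch of that sequence (invertlinks first, then indexing with the linkdb passed explicitly), assuming a crawl/ layout like the one used elsewhere in this thread; the segment name is a hypothetical example:

```shell
# Build the LinkDB of inlinks and anchor text from all fetched segments.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Index one segment, passing the LinkDB so that the index-anchor plugin
# (enabled in plugin.includes) can populate the multivalued anchor field.
bin/nutch solrindex http://localhost/solr/ crawl/crawldb \
  -linkdb crawl/linkdb \
  crawl/segments/20140115220100   # hypothetical segment name
```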
> >
> > <Teague James> Wrote:
> > I am trying to index a website that contains links to documents such
> > as PDF, Word, etc. The intent is to be able to store the URLs for the
> > links to the documents.
> >
> > For example, when indexing www.example.com which has links on the page
> > like "Example Document" which points to
> > www.example.com/docs/example.pdf, I want Solr to store the text of the
> > link, "Example Document", and the URL for the link,
> > "www.example.com/docs/example.pdf" in separate fields. I've tried
> > using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
> > content, but I am not getting the URLs from the links. There are no
> > document type restrictions in Nutch for PDF or Word. Any suggestions
> > on how I can accomplish this? Should I use a different method than Nutch
> > for crawling the site?
> >
> > I appreciate any help on this!
> >
> >
> >
>
>