Re: Indexing Solr with the web crawler

Erlend Garåsen Thu, 20 Jan 2011 06:51:03 -0800

On 20.01.11 15.21, Karl Wright wrote:

Hi Erlend,


Hi Karl,

Thank you for replying and for your comments. It's much appreciated.

(1) The best way to find out what ManifoldCF thinks it is doing is to
look at the Simple History report in the UI.


It says:

01-20-2011 15:14:18.914  document ingest (solr_indexer)  http://ridder.uio.no/  500  588  9    lazy loading error
01-20-2011 15:14:18.800  fetch                           http://ridder.uio.no/  200  588  103

01-20-2011 15:13:18.581  document ingest (solr_indexer)  http://ridder.uio.no/  500  588  16   lazy loading error
01-20-2011 15:13:18.448  fetch                           http://ridder.uio.no/  200  588  111

(2) The Web Connector in ManifoldCF does not have the ability, at this
time, to extract links from Word docs, pdfs, etc., but Solr can
extract *content* from these documents if you configure it to use
Tika.  The document is sent to Solr in binary form, and Tika extracts
whatever metadata it can find.  ManifoldCF does not get involved in
that at all.  Usually, setting up Solr with anonymous fields is the
way to go in this case.

Thanks for clarifying. I can try to configure Solr to parse these
documents. Nutch did a good job, except that it cannot detect whether a
document was modified in order to send an update/delete command to
Solr. That function is crucial for us.

I'm unsure what you mean by anonymous fields in Solr. Can't I define
the fields I need in schema.xml as I want? I have created duplicate
fields for title and content in order to use different stemmers (I need
to support English and Norwegian). In Nutch there is a simple
configuration file for mapping fields from Nutch to Solr.
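
For illustration, the field mapping asked about above can also be
expressed on the Solr side, as request parameters to Solr's extracting
handler (/update/extract). The sketch below only builds such a request
URL. It is not how ManifoldCF constructs its own requests, and the base
URL, the "attr_" prefix and the "content_no" field name are
hypothetical placeholders.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, field_map=None, unknown_prefix="attr_"):
    """Build a URL for Solr's extracting handler (/update/extract).

    literal.id  sets the document id explicitly.
    uprefix     prefixes Tika fields that are not defined in schema.xml,
                so a matching dynamic field (e.g. attr_*) can catch them.
    fmap.X=Y    renames the Tika-extracted field X to the Solr field Y.
    All concrete names passed in here are assumptions for illustration.
    """
    params = [
        ("literal.id", doc_id),
        ("uprefix", unknown_prefix),
    ]
    for tika_name, solr_name in sorted((field_map or {}).items()):
        params.append((f"fmap.{tika_name}", solr_name))
    return f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and Norwegian content field.
    print(build_extract_url(
        "http://localhost:8983/solr",
        "http://ridder.uio.no/",
        {"content": "content_no"},
    ))
```

Filling a second, differently stemmed copy of the same text (English
and Norwegian) would normally still be declared in schema.xml itself,
for example with copyField; the parameters above only rename or prefix
what Tika extracts.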

If this is an open site, I'll crawl it here myself momentarily and let
you know what I find.

Please do that. It's just my workstation with an Apache server running.
It's open.


BTW, I think I have set things up correctly for the crawler:
Seeds: http://ridder.uio.no/

Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts
matching seeds")
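
As a quick sanity check of the inclusion pattern above, the following
sketch tests a few URLs against it. It uses Python's re module as a
stand-in and assumes the crawler applies the expression with ordinary
regular-expression semantics; the test URLs other than the seed are
made up. Note that the unescaped dots match any character, so the
pattern is slightly looser than it looks.

```python
import re

# Pattern exactly as entered under "Inclusions".
LOOSE = re.compile(r"^http://ridder.uio.no/.*")
# Stricter variant with the dots escaped (a suggestion, not from the job setup).
STRICT = re.compile(r"^http://ridder\.uio\.no/")


def is_included(url, pattern=LOOSE):
    """Return True if the URL matches the given inclusion pattern."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://ridder.uio.no/",
        "http://ridder.uio.no/docs/a.pdf",
        "http://www.uio.no/",
        "http://ridderxuio.no/",  # matches LOOSE because "." is a wildcard
    ):
        print(url, is_included(url), is_included(url, STRICT))
```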

I haven't filled out the "expiration interval (if continuous)" field
under the scheduling folder. Is this the reason why ManifoldCF is
recrawling the page every minute?


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
