On Monday, March 17, 2003, at 11:47 PM, Upayavira wrote:
I have built a site which I want to index with Lucene.
I am using the create-index.xsp file in the $COCOON-ROOT/search directory to
build my index.
I have added the following to cocoon.xconf:
<cocoon-crawler logger="core.search.crawler">
  <exclude>.*/search/.*</exclude>
  <link-view-query>cocoon-view=lucene-links</link-view-query>
</cocoon-crawler>
<lucene-xml-indexer logger="core.search.lucene">
  <store-fields>body</store-fields>
  <content-view-query>cocoon-view=lucene-content</content-view-query>
</lucene-xml-indexer>
This all looks fine. My exclude string looks like this, though:
<exclude>.*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*/search/.*,.*/easy/.*</exclude>
I believe that as soon as you specify an exclude string, the default exclusions (for images etc.) are no longer applied, so you have to list them yourself.
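To check an exclude string before running the crawler, a rough sketch like the following can help. It is written in Python purely as an illustration and is not the crawler's own code: it assumes the value is split on commas and that each pattern is matched as a regular expression against the URL. The helper names and the localhost URLs are hypothetical.

```python
import re

def build_excludes(exclude_value):
    """Split a comma-separated <exclude> value into compiled regexes.

    Assumption: the crawler treats each comma-separated item as one regex.
    """
    return [re.compile(p.strip()) for p in exclude_value.split(",") if p.strip()]

def is_excluded(url, patterns):
    """Return True if any exclude pattern matches the URL."""
    return any(p.search(url) for p in patterns)

EXCLUDE = r".*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*/search/.*,.*/easy/.*"
patterns = build_excludes(EXCLUDE)

# Hypothetical URLs, only to show which ones would be skipped.
for url in [
    "http://localhost:8888/index.html",
    "http://localhost:8888/images/logo.png",
    "http://localhost:8888/search/create-index.xsp",
]:
    print(url, "->", "excluded" if is_excluded(url, patterns) else "crawled")
```

Note that a pattern ending in `$` will not match a URL that carries a query string after the file extension.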
I've set up a view lucene-links which works, giving back just links from a page.
I've set up a view lucene-content just giving back the content. The content is like:
<page>
<links>....list of links</links>
<body>... the body content ...</body>
</page>
I have had it partially working (indexing both links and body), but now whenever I
run create-index, it fails with a Cannot parse!: org.xml.sax.SAXParseException:
Premature end of file.
Any ideas what I might be doing wrong?
I had problems like this; it turned out to be pages that did not return valid XML. Look in your logs to see whether indexing stops on a particular URL.
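If the logs do not make the culprit obvious, one way to narrow it down is to request the content view of each page yourself and try to parse it. The following is a minimal sketch in Python, not part of Cocoon; the function names and the example URLs are hypothetical, and it assumes the view is selected by appending cocoon-view=lucene-content to the URL, as in the configuration above. An empty response is reported as a parse error, which is the same situation a Java parser describes as "Premature end of file".

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def fetch(url):
    """Return the raw bytes served at url."""
    with urlopen(url) as response:
        return response.read()

def parse_error(data):
    """Return None if data is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(data)
        return None
    except ET.ParseError as exc:
        return str(exc)

def find_bad_pages(urls, fetch=fetch, view="cocoon-view=lucene-content"):
    """Map each URL whose content view is not well-formed XML to its error."""
    bad = {}
    for url in urls:
        separator = "&" if "?" in url else "?"
        error = parse_error(fetch(url + separator + view))
        if error is not None:
            bad[url] = error
    return bad

# Demonstration with canned responses instead of a live server.
canned = {
    "http://localhost:8888/ok.html?cocoon-view=lucene-content":
        b"<page><body>fine</body></page>",
    "http://localhost:8888/empty.html?cocoon-view=lucene-content": b"",
}
print(find_bad_pages(
    ["http://localhost:8888/ok.html", "http://localhost:8888/empty.html"],
    fetch=canned.__getitem__,
))
```

Against a running site, call find_bad_pages with the list of page URLs and leave the fetch argument at its default.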
I also found that I no longer needed to provide more memory once I stripped unneeded tags from the 'content' XML being indexed.
My content for indexing looks like this:
<body>
  <title>title gets stored, then displayed with hit</title>
  <summary>summary gets stored, then displayed with hit</summary>
  all of my body content with tags stripped out
</body>
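In Cocoon this flattening would normally be done by a stylesheet in the pipeline that serves the content view. Purely to illustrate the shape of the transformation, here is a small Python sketch; the function names and the sample page are hypothetical. It keeps title and summary as elements and reduces everything else to plain text.

```python
import xml.etree.ElementTree as ET

def text_outside(elem, skip):
    """Collect text from elem's subtree, ignoring elements whose tag is in skip."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        if child.tag not in skip:
            parts.extend(text_outside(child, skip))
        if child.tail:
            parts.append(child.tail)
    return parts

def flatten(parts):
    """Join text pieces with spaces and collapse runs of whitespace."""
    return " ".join(" ".join(parts).split())

def slim_body(page_xml, keep=("title", "summary")):
    """Build a <body> holding the kept elements plus all other text, tags removed."""
    source = ET.fromstring(page_xml)
    body = ET.Element("body")
    last = None
    for name in keep:
        found = source.find(".//" + name)
        if found is not None:
            last = ET.SubElement(body, name)
            last.text = flatten(found.itertext())
    text = flatten(text_outside(source, keep))
    if last is None:
        body.text = text
    else:
        last.tail = text
    return ET.tostring(body, encoding="unicode")

# Hypothetical input page, only to show the effect.
sample = (
    "<page><title>Home</title><summary>Intro <b>text</b></summary>"
    "<div><p>Hello <em>big</em> world</p><p>Bye</p></div></page>"
)
print(slim_body(sample))
```

Text pieces are joined with spaces so that words from adjacent elements do not run together; the cost is that a word split by inline markup would be indexed as two tokens.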
Hope this helps.
Regards,
Jeremy