On Monday, March 17, 2003, at 11:47 PM, Upayavira wrote:


I have built a site which I want to index with Lucene.

I am using the create-index.xsp file in the $COCOON-ROOT/search directory to
build my index.


I have added the following to cocoon.xconf:

  <cocoon-crawler logger="core.search.crawler">
    <exclude>.*/search/.*</exclude>
    <link-view-query>cocoon-view=lucene-links</link-view-query>
  </cocoon-crawler>

  <lucene-xml-indexer logger="core.search.lucene">
    <store-fields>body</store-fields>
    <content-view-query>cocoon-view=lucene-content</content-view-query>
  </lucene-xml-indexer>


This all looks fine. My exclude string looks like this, though:

<exclude>.*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*/search/.*,.*/easy/.*</exclude>

I believe that as soon as you specify an exclude string, the default excludes (for images etc.) are no longer used.
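
If you want to sanity-check an exclude string before a crawl, something like the following rough Python sketch may help. It only approximates what the crawler does: splitting on commas and matching each pattern from the start of the URL are my assumptions, Python's regex dialect may differ from the one Cocoon uses, and `is_excluded` and the localhost URLs are made-up names for illustration.

```python
import re

# Patterns copied from the exclude string above. The comma-splitting and the
# matching rule below are assumptions, not Cocoon's actual implementation.
EXCLUDE = r".*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*/search/.*"


def is_excluded(url, exclude=EXCLUDE):
    """Return True if any comma-separated pattern matches the URL."""
    patterns = [p.strip() for p in exclude.split(",") if p.strip()]
    return any(re.match(p, url) for p in patterns)


if __name__ == "__main__":
    # Hypothetical URLs, just to show which ones would be skipped.
    for url in (
        "http://localhost:8888/images/logo.png",
        "http://localhost:8888/search/create-index.xsp",
        "http://localhost:8888/docs/index.html",
    ):
        print(url, "excluded" if is_excluded(url) else "crawled")
```

Running it against a handful of your own URLs should show quickly whether images and the search pages are being skipped as intended.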

I've set up a view lucene-links which works, giving back just links from a page.
I've set up a view lucene-content just giving back the content. The content is like:
<page>
<links>....list of links</links>
<body>... the body content ...</body>
</page>


I have had it partially working (indexing both links and body), but now whenever I
run create-index, it fails with a Cannot parse!: org.xml.sax.SAXParseException:
Premature end of file.


Any ideas what I might be doing wrong?

I had problems like this; it turned out to be pages that did not return valid XML. Look in your logs to see whether indexing stops on a particular URL.
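
If the logs don't point at one page, you could also test the content view of each suspect page yourself. Here is a minimal Python sketch of that idea, not anything Cocoon ships: `is_well_formed` and `check_url` are hypothetical helper names, and the view query is simply the one from the cocoon.xconf above. An empty or truncated response fails the parse, which is roughly the situation a "Premature end of file" error describes.

```python
import xml.sax
from urllib.request import urlopen


def is_well_formed(xml_bytes):
    """Return (True, "") if the bytes parse as XML, else (False, reason)."""
    try:
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
        return True, ""
    except xml.sax.SAXParseException as exc:
        return False, str(exc)


def check_url(url, view_query="cocoon-view=lucene-content"):
    """Fetch a page's content view and report whether it is well-formed.

    The view query is taken from the configuration above; adjust it if
    your view has a different name.
    """
    separator = "&" if "?" in url else "?"
    with urlopen(url + separator + view_query) as response:
        return is_well_formed(response.read())
```

Calling `check_url` on each page the crawler visits, and printing the ones that return `False`, should narrow the failure down to a specific URL.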


I also found that I could avoid having to give the indexer more memory by stripping unneeded tags from the 'content' XML being indexed.

My content for indexing looks like this:

<body>
        <title>title gets stored, then displayed with hit</title>
        <summary>summary gets stored, then displayed with hit</summary>
        all of my body content with tags stripped out
</body>
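
In my setup the stripping happens in the view's pipeline, but the idea itself is simple enough to sketch in Python. This is only an illustration under my own assumptions: `build_index_body` is a made-up name, and the title and summary are passed in separately rather than extracted from the page.

```python
import xml.etree.ElementTree as ET


def build_index_body(title, summary, content_xml):
    """Return a <body> document holding title, summary and tag-free text."""
    source = ET.fromstring(content_xml)
    # itertext() yields every text node in document order, dropping the tags;
    # split/join then collapses runs of whitespace.
    plain_text = " ".join(" ".join(source.itertext()).split())

    body = ET.Element("body")
    title_el = ET.SubElement(body, "title")
    title_el.text = title
    summary_el = ET.SubElement(body, "summary")
    summary_el.text = summary
    # Text that follows the last child element is stored in that child's tail.
    summary_el.tail = plain_text
    return ET.tostring(body, encoding="unicode")
```

For example, `build_index_body("T", "S", "<div><p>Hello <b>world</b></p><p>again</p></div>")` gives a single `<body>` element whose only markup is the title and summary, which is the shape shown above.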

Hope this helps

regards Jeremy

