Lucene and SAX

2005-10-25 Thread Malcolm Clark
Hi again, I am desperately asking for aid! I have used the sandbox demo to parse the INEX collection. The problem is that it points to a volume file which references 50 other XML articles. Lucene treats this as only one document. Is there any method I'm overlooking that halts after each r

Re: Lucene and SAX

2005-10-25 Thread Grant Ingersoll
I am not familiar with the INEX collection; could you post a sample? Malcolm Clark wrote: Hi again, I am desperately asking for aid! I have used the sandbox demo to parse the INEX collection. The problem is that it points to a volume file which references 50 other XML articles. Lucene only tre

Re: Lucene and SAX

2005-10-25 Thread Malcolm
It's XML like this. It has 120-ish volumes with references to 12,107 articles, which look like the one below: A1003 10.1041/A1003s-1995 IEEE Annals of the History of Computing 1058-6180/95/$4.00 © 1995 IEEE Vol. 17, No. 1 Spring 1995 pp. 3-3 About this Issue pp. 3-3 J.A.N. Lee Editor‐in‐Chief The firs

Re: Lucene and SAX

2005-10-25 Thread Grant Ingersoll
From what I can see, you are only passing volume.xml to your parser. If I understand your code and questions correctly, the Volume file simply points to the actual articles that you want to parse. Seems like you need to parse the Volume file, get the name/location of the article file and then

Re: Lucene and SAX

2005-10-25 Thread Malcolm
Hi Grant, A highly shortened version of the volume is like below. ]> IEEE Annals of the History of Computing Spring 1995 (Vol. 17, No. 1) Published by the IEEE Computer Society About this Issue &A1003; Comments, Queries, and Debate &A1004; Articles &A1006;

Re: Lucene and SAX

2005-10-25 Thread Malcolm
I'm not in any way an expert, in fact far from it, but when I try to reference each article separately the parser complains about entities, as the XML articles are not well-formed. Thanks, MC

Re: Lucene and SAX

2005-10-25 Thread Grant Ingersoll
Sounds like you need to make your articles XML or stop trying to use an XML parser to process the file, whichever is easier for you. I don't think your issues are Lucene related. I think you need to get a better handle on the XML processing. As I suggested on your Digester thread before, I w

Re: Lucene and SAX

2005-10-31 Thread Karl Øie
Hi there Malcolm! I can't see any place in your source where you add the document id of the document you are parsing. startDocument() should at least add a sys-id field for the XML document being parsed; public void startDocument() { mDocument = new Document(); mDocument.add(new Field(
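
Karl's snippet is cut off in the archive, so the following is only a rough sketch of the idea, not his code. It assumes the system id can be taken from the SAX Locator, and it uses a plain Map as a stand-in for the Lucene Document so that it runs with the JDK alone. The class name SysIdHandler and the helper parse() are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the system id of the XML source as a field.
class SysIdHandler extends DefaultHandler {
    private Locator locator;
    // Stand-in for a Lucene Document (assumption: field name "sys-id" as in the message).
    private Map<String, String> fields;

    @Override
    public void setDocumentLocator(Locator locator) {
        // The parser hands over the locator before startDocument() is called.
        this.locator = locator;
    }

    @Override
    public void startDocument() {
        fields = new LinkedHashMap<>();
        String sysId = (locator != null) ? locator.getSystemId() : null;
        // With Lucene, this is where the sys-id field would be added to the Document.
        fields.put("sys-id", sysId != null ? sysId : "unknown");
    }

    Map<String, String> getFields() {
        return fields;
    }

    // Parses an XML string that is labelled with the given system id.
    static Map<String, String> parse(String xml, String systemId) throws Exception {
        InputSource in = new InputSource(new StringReader(xml));
        in.setSystemId(systemId);
        SysIdHandler handler = new SysIdHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.getFields();
    }
}
```

Storing the source location this way makes it possible to tell afterwards which file each indexed document came from.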

Re: Lucene and Sax

2005-10-31 Thread MALCOLM CLARK
Grant, Thanks for your help with the problem I was experiencing. I split it all down and realised the problem was the location of the IndexWriting (it was not in the correct place within the SAX processing) and also because of some poor error handling on my part. Kind thanks, Malcolm
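
The message does not show the corrected code. As a hedged illustration of what "the correct place within the SAX processing" might look like, the sketch below finishes one record each time an article element closes, so a volume that expands to many articles yields many records rather than one. The element name "article" and the class name ArticleSplitter are assumptions, and a List of strings stands in for the Lucene index.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: emits one record per <article> element (assumed element name).
class ArticleSplitter extends DefaultHandler {
    private final List<String> articles = new ArrayList<>();
    private StringBuilder current; // text of the article being read, or null outside one

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("article".equals(qName)) {
            current = new StringBuilder(); // start a fresh record for each article
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("article".equals(qName)) {
            // With Lucene, a call such as IndexWriter.addDocument(doc) would go here,
            // once per article, rather than once at the end of the whole parse.
            articles.add(current.toString().trim());
            current = null;
        }
    }

    // Parses the XML string and returns the text of each article in document order.
    static List<String> split(String xml) throws Exception {
        ArticleSplitter handler = new ArticleSplitter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.articles;
    }
}
```

This sketch does not handle nested article elements; it only shows where the per-document write would sit.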

Re: Lucene and SAX

2005-10-31 Thread MALCOLM CLARK
Grant, Thanks for your tips. I have considered DOM processing, but it seemed to take a hell of a long time to process all the documents (12,125).

Re: Lucene and Sax

2005-10-31 Thread MALCOLM CLARK
Karl, Thanks for your tips. I have considered DOM processing, but it seemed to take a hell of a long time to process all the documents (12,125). Malcolm Clark