Re: MultiFieldQueryParser default operator

2003-10-30 Thread Michael Giles
This would be great to get fixed (I think I emailed a similar question a month or so ago). If MultiFieldQueryParser is being mucked with, the constructor should be updated to take an array of fields instead of the single field it takes currently. The code snippet below is actually passing the

Re: Exotic format indexing?

2003-10-30 Thread Ryan Ackley
> Finally, a while back, somebody on this list mentioned quite a > different approach: simply read the raw binary document and go fishing > for what looks like text. I would like to try that :) I have tried that approach and it works OK. You end up with a bunch of junk in with the useful stuff. It

Re: Exotic format indexing?

2003-10-30 Thread petite_abeille
On Oct 30, 2003, at 20:48, Ben Litchfield wrote: Unfortunately, it is not quite so easy. I am not sure about Word documents The raw text is visible. but PDFs usually have their contents compressed Yep. PDF is really an image format ;) so a raw "fishing" around for text would be pointless. That'
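The "fishing" idea discussed in this thread can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `TextFisher` and `fish` are hypothetical names, not part of Lucene or of any code posted to the list, and the sketch only recognizes printable ASCII, much like the Unix `strings` tool. It shows why the approach yields junk alongside useful text: any run of printable bytes is kept, whether or not it is real content.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): collects runs of printable ASCII
// from arbitrary bytes, similar in spirit to the Unix `strings` tool.
final class TextFisher {

    // Returns every run of at least minLength printable ASCII characters.
    static List<String> fish(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;                      // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {          // printable ASCII range
                current.append((char) c);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());  // keep runs that are long enough
                }
                current.setLength(0);              // a non-text byte ends the run
            }
        }
        if (current.length() >= minLength) {
            runs.add(current.toString());          // flush a run that reaches the end
        }
        return runs;
    }

    public static void main(String[] args) {
        byte[] raw = {0, 1, 'H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd',
                      (byte) 0xFF, 'a', 'b', 0, 'L', 'u', 'c', 'e', 'n', 'e'};
        System.out.println(fish(raw, 4));
    }
}
```

The extracted runs could then be concatenated and handed to an analyzer. Raising `minLength` trades recall for less noise.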

RE: Exotic format indexing?

2003-10-30 Thread Chong, Herb
Word documents with FastSave enabled contain the original document and then deltas to the document until the deltas exceed a certain size and then they are merged back into the document. That means that unless you run the deltas, you won't know what the actual final contents are. Herb

Re: 182 file formats for lucene!!! was: Re: Exotic format indexing?

2003-10-30 Thread petite_abeille
Hi Stefan, On Oct 30, 2003, at 21:02, Stefan Groschupf wrote: just to let you know, I have implemented a plugin for the Nutch project that can parse 182 file formats, including MS Office. I simply use OpenOffice and its available Java API. Yes, I saw that. Great work :) Unfortunately, using Op

182 file formats for lucene!!! was: Re: Exotic format indexing?

2003-10-30 Thread Stefan Groschupf
Hi there, just to let you know, I have implemented a plugin for the Nutch project that can parse 182 file formats, including MS Office. I simply use OpenOffice and its available Java API. It is really straightforward to use. You can find some info and a link to the open source code here: http://so

Re: Exotic format indexing?

2003-10-30 Thread Ben Litchfield
Unfortunately, it is not quite so easy. I am not sure about Word documents but PDFs usually have their contents compressed so a raw "fishing" around for text would be pointless. Your best bet is to use a package like the one from textmining.org that handles various formats for you. Ben On Thu,
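The point about compression can be illustrated with the JDK alone. The sketch below is not PDF parsing: it assumes only that the content has been run through Deflate (the algorithm commonly used for compressed PDF streams) and checks that a phrase present in the plain bytes can no longer be found by a raw byte search once the bytes are compressed. `CompressedStreamDemo` and its methods are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows that raw byte search fails on Deflate output.
final class CompressedStreamDemo {

    // Compresses the input with java.util.zip.Deflater at the default level.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses Deflate data; wraps the checked exception for brevity.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break;                         // truncated input: stop instead of spinning
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid Deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Naive byte-level substring search, which is all that "fishing" can do.
    static boolean contains(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = "Lucene indexes text. Lucene searches text."
                .getBytes(StandardCharsets.US_ASCII);
        byte[] needle = "Lucene indexes".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(text);
        System.out.println("found in plain bytes:      " + contains(text, needle));
        System.out.println("found in compressed bytes: " + contains(packed, needle));
        System.out.println("found after inflating:     " + contains(inflate(packed), needle));
    }
}
```

The text only becomes searchable again after decompression, which is why a format-aware extractor is needed for such files.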

Exotic format indexing?

2003-10-30 Thread petite_abeille
Hello, Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a popular question on this list... The traditional approach seems to be to try to find some kind of format specific reader to properly extract the textual part of such documents for indexing. The drawback of such an appro

Re: term counts during indexing

2003-10-30 Thread Peter Keegan
As I understand it, the field text is being tokenized by the analyzer when IndexWriter.addDocument is called. At this point, the tokens are indexed and/or stored. Would it be possible for 'addDocument' to save and make the _actual_ counts of 'tokens stored' and 'tokens indexed' available in either

Re: Term out of order.

2003-10-30 Thread petite_abeille
On Oct 30, 2003, at 13:36, Pasha Bizhan wrote: I think that it's a problem of the Java version of Lucene, because all core algorithms of Lucene and Lucene.Net are identical. Talking of which... it appears... that... something... is... wrong... somewhere... This definitely needs some additional invest

Re: MultiFieldQueryParser default operator

2003-10-30 Thread Maurice Coyle
thanks otis, it was on the lucene-dev list (to which i'm not subscribed). the link to the message containing bernhard's solution is below http://www.mail-archive.com/[EMAIL PROTECTED]/msg03863.html happy days. - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lu

Re: MultiFieldQueryParser default operator

2003-10-30 Thread Erik Hatcher
It was posted on lucene-dev, not lucene-user. I've pasted it below. I will be fixing this at some point in the near future based on this fix and other related ones needed. Erik On Thursday, October 30, 2003, at 09:31 AM, Otis Gospodnetic wrote: I believe a person just sent an email with a so

Re: MultiFieldQueryParser default operator

2003-10-30 Thread Otis Gospodnetic
I believe a person just sent an email with a solution yesterday or the day before. Look for a message with MultiFieldQueryParser in its Subject. Otis --- Maurice Coyle <[EMAIL PROTECTED]> wrote: > are there any plans to implement some sort of > MultiFieldQueryParser.setOperator(int) method so fo

MultiFieldQueryParser default operator

2003-10-30 Thread Maurice Coyle
Are there any plans to implement some sort of MultiFieldQueryParser.setOperator(int) method so folks can search using AND by default? Or has anyone worked around the lack of such a method and managed to search over multiple fields using a default-AND query?
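One possible workaround, sketched here as an assumption and not as the fix later posted to lucene-dev, is to expand the user's terms by hand into a query string in which every term is required but may match in any of the fields. The class `MultiFieldAndQuery` and its `expand` method are hypothetical. The output relies only on standard Lucene query syntax (`+` for required clauses, parentheses for grouping, `field:term`), and the sketch does not escape special characters or handle phrases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: rewrites "a b" over fields {f1, f2} as
// "+(f1:a f2:a) +(f1:b f2:b)", i.e. every term must match in some field.
final class MultiFieldAndQuery {

    static String expand(String userQuery, String[] fields) {
        List<String> clauses = new ArrayList<>();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;                          // blank input produces no clauses
            }
            StringBuilder clause = new StringBuilder("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) {
                    clause.append(' ');            // optional alternatives inside the group
                }
                clause.append(fields[i]).append(':').append(term);
            }
            clause.append(')');
            clauses.add(clause.toString());
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(expand("apache lucene", new String[] {"title", "contents"}));
    }
}
```

The resulting string would then be passed to an ordinary single-field query parser. Whether this matches the semantics wanted from a default-AND MultiFieldQueryParser is a judgment call for the application.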

Re: Indexing txt-files

2003-10-30 Thread Günter Kukies
Hi, is it possible to update the API doc to say that a BufferedReader is a must? Who is responsible for closing the InputStream? Does doc.add() do the close? Günter - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, Octobe

RE: Term out of order.

2003-10-30 Thread Pasha Bizhan
Hi, > From: Victor Hadianto [mailto:[EMAIL PROTECTED] > > I have been searching on the mailing list for this specific > issue, but to no avail. Does this occur on the Java version of Lucene? I think that it's a problem of the Java version of Lucene, because all core algorithms of Lucene and Luce

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
Strange. FileReader works fine in my java.net article code. On Thursday, October 30, 2003, at 06:35 AM, Otis Gospodnetic wrote: This is from the Lucene demo, included in the Lucene distribution. Look at the FileDocument class: // Add the contents of the file to a field named "contents". Use a Text //

Re: Term out of order.

2003-10-30 Thread Otis Gospodnetic
I was answering the "is this the problem in the Java version of Lucene or just .Net implementation" part of the question. Otis (sometimes answers emails very very late at night) Gospodnetic --- Terry Steichen <[EMAIL PROTECTED]> wrote: > What kind of response is this? (e.g. "apparently so.") I

Re: Indexing txt-files

2003-10-30 Thread Otis Gospodnetic
This is from the Lucene demo, included in the Lucene distribution. Look at the FileDocument class: // Add the contents of the file to a field named "contents". Use a Text // field, specifying a Reader, so that the text of the file is tokenized. // ?? why doesn't FileReader work here ?? FileInput

Re: lucene indexing and searching engine performance

2003-10-30 Thread Otis Gospodnetic
Look at the Benchmarks page on Lucene's site. It is not complete (heh, it can never be complete), but it will give you some ideas about Lucene's performance. Feel free to submit your benchmarks, using this template: http://jakarta.apache.org/lucene/docs/benchmarktemplate.xml Thank you, Otis -

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
I'm not an expert on Readers/InputStreams, but it sounds like you're dealing with a bug related to your usage of them and not Lucene. Have a look at my Lucene Intro article where I use a FileReader. Try a simple test using something like that, eliminating as many variables as you can. Erik

Re: Indexing txt-files

2003-10-30 Thread Günter Kukies
Yes, I know that it is indexed and the contents are not stored. That is what I want. But that means that I can search the index and get back the Lucene document as a hit result with all the other fields (date, file location, ...). So my problem is that I don't get back the Lucene document. Maybe I nee

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
Also, referring to my article may help - the code is designed to index text files: http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html On Thursday, October 30, 2003, at 02:40 AM, Günter Kukies wrote: Hello, I want to add a Text field to a LUCENE Document. I checked the index wit

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
Field.Text(String, Reader) is an unstored field. It is indexed, but the contents are not stored in the index. If you want the contents stored, use Field.Text(String,String) Erik On Thursday, October 30, 2003, at 02:40 AM, Günter Kukies wrote: Hello, I want to add a Text field to a LUCENE

lucene indexing and searching engine performance

2003-10-30 Thread Alex Aw Seat Kiong
Hi Doug Cutting! That's really very helpful, thanks to Doug. I'm researching the performance of Lucene's indexing and searching engine. So, are you able to give me more details on 1. searching >But if you > need to search two million 2kB documents on a 500Mhz Pentium with 128MB of >

Indexing txt-files

2003-10-30 Thread Günter Kukies
Hello, I want to add a Text field to a Lucene Document. I checked the index with Luke, but I don't get any results when searching the contents field. The test.txt is a simple ASCII file. SimpleAnalyzer is used on both sides, search and index. Here are the relevant code snippets: File file = n