This would be great to get fixed (I think I emailed a similar question a
month or so ago). If MultiFieldQueryParser is being mucked with, the
constructor should be updated to take an array of fields instead of the
single field it takes currently. The code snippet below is actually
passing the
> Finally, a while back, somebody on this list mentioned quite a
> different approach: simply read the raw binary document and go fishing
> for what looks like text. I would like to try that :)
I have tried that approach and it works ok. You end up with a bunch of junk
in with the useful stuff. It
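For anyone curious, that "fishing" approach can be sketched in a few lines of plain Java: scan the raw bytes and keep only runs of printable characters above a minimum length. The class name and the run-length threshold here are my own choices, not anything from Lucene.

```java
import java.nio.charset.StandardCharsets;

public class TextFisher {
    // Keep runs of printable ASCII at least minRun characters long;
    // shorter runs are treated as binary junk and dropped.
    public static String fishText(byte[] raw, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : raw) {
            char c = (char) (b & 0xFF);
            boolean printable = (c >= 0x20 && c < 0x7F) || c == '\n' || c == '\t';
            if (printable) {
                run.append(c);
            } else {
                if (run.length() >= minRun) out.append(run).append(' ');
                run.setLength(0);
            }
        }
        if (run.length() >= minRun) out.append(run).append(' ');
        return out.toString().trim();
    }

    public static void main(String[] args) {
        byte[] fake = "\u0001\u0002Hello world\u0000ab\u0000useful text here\u0007"
                .getBytes(StandardCharsets.ISO_8859_1);
        // Short run "ab" is dropped; the longer runs survive.
        System.out.println(fishText(fake, 4));
    }
}
```

As noted above, you still end up with junk mixed in with the useful text, so a higher threshold trades recall for cleanliness.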
On Oct 30, 2003, at 20:48, Ben Litchfield wrote:
Unfortunately, it is not quite so easy. I am not sure about Word
documents
The raw text is visible.
but PDFs usually have their contents compressed
Yep. PDF is really an image format ;)
so a raw
"fishing" around for text would be pointless.
That'
Word documents with FastSave enabled contain the original document and then deltas to
the document until the deltas exceed a certain size, and then they are merged back into
the document. That means that unless you run the deltas, you won't know what the
actual final contents are.
Herb
Hi Stefan,
On Oct 30, 2003, at 21:02, Stefan Groschupf wrote:
just to let you know, I had implemented a plugin for the Nutch project
that can parse 182 file formats, including MS Office.
I simply use OpenOffice and its available Java API.
Yes, I saw that. Great work :)
Unfortunately, using Op
Hi there,
just to let you know, I had implemented a plugin for the Nutch project
that can parse 182 file formats, including MS Office.
I simply use OpenOffice and its available Java API.
It is really straightforward to use.
Found some info and a link to the open source code here:
http://so
Unfortunately, it is not quite so easy. I am not sure about Word
documents but PDFs usually have their contents compressed so a raw
"fishing" around for text would be pointless. Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.
Ben
On Thu,
Hello,
Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
popular question on this list...
The traditional approach seems to be to try to find some kind of format
specific reader to properly extract the textual part of such documents
for indexing. The drawback of such an appro
As I understand it, the field text is being tokenized by the analyzer when
IndexWriter.addDocument is called. At this point, the tokens are indexed
and/or stored. Would it be possible for 'addDocument' to save and make the
_actual_ counts of 'tokens stored' and 'tokens indexed' available in either
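While waiting for such counts to be exposed, one can at least predict them by mimicking the analyzer outside of Lucene. The sketch below is a pure-Java imitation of SimpleAnalyzer's tokenization (maximal runs of letters, lowercased); the class and method names are mine, and this is an approximation, not Lucene's own code path.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCounter {
    // Mimics SimpleAnalyzer: a token is a maximal run of letters, lowercased.
    // Digits and punctuation act as separators and are discarded.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        List<String> toks = tokenize("The quick, Brown Fox #42 jumps!");
        System.out.println(toks.size() + " tokens: " + toks);
    }
}
```

Running the same text through this before calling addDocument gives a token count to compare against index statistics.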
On Oct 30, 2003, at 13:36, Pasha Bizhan wrote:
I think that it's a problem of the Java version of Lucene as well,
because all core algorithms of Lucene and Lucene.Net are identical.
Talking of which... it appears... that... something... is... wrong...
somewhere...
This definitely needs some additional invest
Thanks Otis, it was on the lucene-dev list (to which I'm not subscribed).
The link to the message containing Bernhard's solution is below:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03863.html
happy days.
- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lu
It was posted on lucene-dev, not lucene-user. I've pasted it below.
I will be fixing this at some point in the near future based on this
fix and other related ones needed.
Erik
On Thursday, October 30, 2003, at 09:31 AM, Otis Gospodnetic wrote:
I believe a person just sent an email with a so
I believe a person just sent an email with a solution yesterday or the
day before. Look for a message with MultiFieldQueryParser in its
Subject.
Otis
--- Maurice Coyle <[EMAIL PROTECTED]> wrote:
> are there any plans to implement some sort of
> MultiFieldQueryParser.setOperator(int) method so fo
are there any plans to implement some sort of
MultiFieldQueryParser.setOperator(int) method so folk can search using AND
by default?
or has anyone worked around the lack of such a method and managed to search
over multiple fields using a default-AND query?
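One workaround (a sketch, not anything built into Lucene's API; the class and method names are mine): expand the user's query into a string where each term is required but may match in any field, then hand that string to the ordinary QueryParser. This assumes a simple whitespace-separated query with no phrases or field prefixes.

```java
public class MultiFieldAndQuery {
    // Build "+(f1:t1 f2:t1) +(f1:t2 f2:t2) ..." so every term is required
    // (AND semantics) but each term may match in any of the given fields.
    public static String buildAndQuery(String[] fields, String userQuery) {
        StringBuilder sb = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(fields[i]).append(':').append(term);
            }
            sb.append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The resulting string would then be passed to QueryParser.parse(...).
        System.out.println(buildAndQuery(new String[]{"title", "body"}, "lucene index"));
    }
}
```

The generated string for fields {title, body} and query "lucene index" is `+(title:lucene body:lucene) +(title:index body:index)`, which is the usual "all terms required, any field" semantics people want from a default-AND multi-field search.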
---
Hi,
is it possible to update the API doc to state that a BufferedReader is a must?
Who is responsible for closing the InputStream? Does doc.add() do the close?
Günter
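On the closing question, the safe assumption is that the code which opens the Reader also closes it, rather than relying on the consumer to do so. A defensive pattern in plain Java (nothing Lucene-specific; the class and method names are mine) looks like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderOwnership {
    // Whoever opens the Reader closes it in a finally block; the consumer
    // (e.g. doc.add(...) / writer.addDocument(...)) is not assumed to.
    public static String consume(Reader raw) throws IOException {
        BufferedReader reader = new BufferedReader(raw);
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            return sb.toString();
        } finally {
            reader.close(); // caller-side close, runs even on exceptions
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(consume(new StringReader("hello")));
    }
}
```

The same try/finally shape applies when the Reader wraps a FileInputStream: close the outermost Reader and it closes the underlying stream.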
- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, Octobe
Hi,
> From: Victor Hadianto [mailto:[EMAIL PROTECTED]
>
> I have been searching on the mailing list for this specific
> issue, but to no avail. Does this occur on the Java version of Lucene?
I think that it's a problem of the Java version of Lucene,
because all core algorithms of Lucene and Luce
Strange. FileReader works fine in my java.net article code.
On Thursday, October 30, 2003, at 06:35 AM, Otis Gospodnetic wrote:
This is from Lucene demo, included in Lucene distribution.
Look at FileDocument class:
// Add the contents of the file to a field named "contents". Use a Text
//
I was answering the "is this the problem in the Java version of Lucene
or just .Net implementation" part of the question.
Otis (sometimes answers emails very very late at night) Gospodnetic
--- Terry Steichen <[EMAIL PROTECTED]> wrote:
> What kind of response is this? (e.g. "apparently so.") I
This is from Lucene demo, included in Lucene distribution.
Look at FileDocument class:
// Add the contents of the file to a field named "contents". Use a Text
// field, specifying a Reader, so that the text of the file is tokenized.
// ?? why doesn't FileReader work here ??
FileInput
Look at the Benchmarks page on Lucene's site.
It is not complete (heh, it can never be complete), but it will give
you some ideas about Lucene's performance.
Feel free to submit your benchmarks, using this template:
http://jakarta.apache.org/lucene/docs/benchmarktemplate.xml
Thank you,
Otis
-
I'm not an expert on Readers/InputStreams, but it sounds like you're
dealing with a bug related to your usages of them and not Lucene. Have
a look at my Lucene Intro article where I use a FileReader. Try a
simple test using something like that eliminating as many variables as
you can.
Erik
Yes, I know that it is indexed and the contents are not stored. That is what
I want. But that means that I can search the index and I get back the
Lucene document as a hit result with all the other fields (date,
file-location, ...).
So my problem is that I don't get back the LUCENE-Document. Maybe I nee
Also, referring to my article may help - the code is designed to index
text files:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
On Thursday, October 30, 2003, at 02:40 AM, Günter Kukies wrote:
Hello,
I want to add a Text field to a LUCENE Document. I checked the index
wit
Field.Text(String, Reader) is an unstored field. It is indexed, but
the contents are not stored in the index.
If you want the contents stored, use Field.Text(String,String)
Erik
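If the contents should be stored as well, one approach (a sketch; the helper name is mine) is to read the file fully into a String first and pass that to the Field.Text(String, String) form instead of handing over a Reader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadFully {
    // Slurp a Reader into a String, so the result can be passed to the
    // Field.Text(String, String) form, which both indexes and stores.
    public static String readFully(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String contents = readFully(new StringReader("file contents here"));
        System.out.println(contents.length());
        // then, with Lucene: doc.add(Field.Text("contents", contents));
    }
}
```

The trade-off is memory: the whole file is held as a String, which is fine for typical documents but worth avoiding for very large files.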
On Thursday, October 30, 2003, at 02:40 AM, Günter Kukies wrote:
Hello,
I want to add a Text field to a LUCENE
Hi Doug Cutting!
That's really very helpful, thanks Doug.
I'm doing performance research on the speed of Lucene's indexing and
searching engine.
So, are you able to give me more details on
1. searching
>But if you
> need to search two million 2kB documents on a 500Mhz Pentium with 128MB of
>
Hello,
I want to add a Text field to a LUCENE Document. I checked the index with LUKE, but I
don't get any results for searches in the contents field. The test.txt is a simple
ASCII file. SimpleAnalyzer is used on both sides, search and index.
Here are the relevant code snippets:
File file = n