AW: Parsing MSWord

2008-11-12 Thread Sertic Mirko, Bedag
Hi, you can also use a tool called "antiword" to extract the text from a .doc file, and then give the text to Lucene. See here: http://en.wikipedia.org/wiki/Antiword Regards, Mirko -Original Message- From: dipesh [mailto:[EMAIL PROTECTED] Sent: Wednesday, 12 November 2008 04:

Re: Parsing MSWord

2008-11-12 Thread Alexander Aristov
Antiword would be hard to inject into Nutch as it is not Java based. It will require native calls. Alexander 2008/11/12 Sertic Mirko, Bedag <[EMAIL PROTECTED]> > Hi > > You can also use a tool called "antiword" to extract the text from a .doc > file, and then > give the text to lucene. > > See he

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Michael McCandless
Any takers for pulling a patch together...? Mike Mark Miller wrote: +1 - Mark On Nov 12, 2008, at 4:50 AM, Michael McCandless <[EMAIL PROTECTED] > wrote: I think we really should open up a non-static way to choose a different FSDirectory impl? EG maybe add optional Class to FSDir

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Mark Miller
+1 - Mark On Nov 12, 2008, at 4:50 AM, Michael McCandless <[EMAIL PROTECTED] > wrote: I think we really should open up a non-static way to choose a different FSDirectory impl? EG maybe add optional Class to FSDirectory.getDirectory? Or maybe give NIOFSDirectory a public ctor? Or s

Re: 1:n queries again

2008-11-12 Thread Stefan Trcek
On Wednesday 12 November 2008 14:58:53 Christian Reuschling wrote: > In order to offer some simple 1:n matching, currently we create > several, counted attributes and expand our queries that we search > inside each attribute, e.g.: I use one attribute (Field) multiple times. Stefan -

Re: 1:n queries again

2008-11-12 Thread Erick Erickson
It's entirely unclear to me whether facets could help, since I haven't used them. I've seen these mentioned on the SOLR user list, so it may bear investigating. To expand on Stefan's point: I think his solution will work for you quite well, but there are a couple of tricks. The first thing to und

Re: 1:n queries again

2008-11-12 Thread Christian Reuschling
But this is not the same - Lucene makes it transparent for you whether you have one or several field entries for one attribute. The behaviour will be the same in both of these cases: Lucene document entry: attName: "term1 term2" attName: "term3 term4" or attName: "term1 term2 term3 term4" For th

Re: AW: Parsing MSWord

2008-11-12 Thread Donna L Gresh
Check out POI; that's what I use http://poi.apache.org/ "Sertic Mirko, Bedag" <[EMAIL PROTECTED]> wrote on 11/12/2008 03:25:47 AM: > Hi > > You can also use a tool called "antiword" to extract the text from a > .doc file, and then > give the text to lucene. > > See here : http://en.wikipedia

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Michael McCandless
I think we really should open up a non-static way to choose a different FSDirectory impl? EG maybe add optional Class to FSDirectory.getDirectory? Or maybe give NIOFSDirectory a public ctor? Or something? Mike Mark Miller wrote: Mark Miller wrote: That's a good point, and points out

Re: 1:n queries again

2008-11-12 Thread Otis Gospodnetic
Christian, If I understand your situation correctly, you should look at sloppy phrases and at the Span family of queries. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Christian Reuschling <[EMAIL PROTECTED]> To: java-user@lucene.apache

Re: AW: Parsing MSWord

2008-11-12 Thread Otis Gospodnetic
Or Tika, Lucene's cousin: http://incubator.apache.org/tika/ (which uses POI under the hood, but goes beyond MS Word parsing) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Donna L Gresh <[EMAIL PROTECTED]> To: java-user@lucene.apache.or

Re: 1:n queries again

2008-11-12 Thread Christian Reuschling
Hello Erick, thank you very much for this interesting idea - but I'm not sure that the SpanQuery will cover every aspect I'm looking for. I think the shortcoming is that in the case of a PhraseQuery (and I think also in the case of the SpanQuery, but I'm not sure about that yet), every term must appear inside th

Re: 1:n queries again

2008-11-12 Thread Erick Erickson
Note that the SpanQuery family are Querys, so they can be used as clauses of a BooleanQuery just fine. Making this work will be exciting... I'm having trouble understanding the use case. I don't understand how the user can make sense of this, but then it may well be unique to your problem sp

Lucene implementation/performance question

2008-11-12 Thread Greg Shackles
I hope this isn't a dumb question or anything, I'm fairly new to Lucene so I've been picking it up as I go pretty much. Without going into too much detail, I need to store pages of text, and for each word on each page, store detailed information about it. To do this, I have 2 indexes: 1) pages:

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Mark Miller
I'm thinking about it, so if someone else doesn't get something together before I have some free time... It's just not clear to me at the moment how best to do it. Michael McCandless wrote: Any takers for pulling a patch together...? Mike Mark Miller wrote: +1 - Mark On Nov 12, 2008, at

1:n queries again

2008-11-12 Thread Christian Reuschling
Hello Friends, In order to offer some simple 1:n matching, currently we create several, counted attributes and expand our queries that we search inside each attribute, e.g.: Query 'attName:myTerm' => Query 'attName1:myTerm attName2:myTerm' This is not the fastest way, and sometimes not easy to
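The query expansion described above can be sketched in plain Java. This is only an illustration of the string rewriting, assuming the counted fields are named attName1 through attNameN; the class and method names are hypothetical and not part of Lucene.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not a Lucene class): expands one field query over counted fields. */
final class CountedFieldExpander {

    /**
     * Turns field:term into "field1:term field2:term ... fieldN:term".
     * Assumes the counted attributes are numbered from 1 to count.
     */
    static String expand(String field, String term, int count) {
        StringJoiner joined = new StringJoiner(" ");
        for (int i = 1; i <= count; i++) {
            joined.add(field + i + ":" + term);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Mirrors the example from the message, with two counted attributes.
        System.out.println(expand("attName", "myTerm", 2));
    }
}
```

The number of clauses grows with the number of counted attributes, which is consistent with the remark that this is not the fastest approach.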

Re: Lucene implementation/performance question

2008-11-12 Thread Erick Erickson
If I may suggest, could you expand upon what you're trying to accomplish? Why do you care about the detailed information about each word? The reason I'm suggesting this is "the XY problem". That is, people often ask for details about a specific approach when what they really need is a different app

Re: Lucene implementation/performance question

2008-11-12 Thread Greg Shackles
Hi Erick, Thanks for the response, sorry that I was somewhat vague in the reasoning for my implementation in the first post. I should have mentioned that the word details are not details of the Lucene document, but are attributes about the word that I am storing. Some examples are position on th

Re: Lucene implementation/performance question

2008-11-12 Thread Mark Miller
If you're new to Lucene, this might be a little much (and maybe I do not fully understand the problem), but you might try: Add the attributes to the words in a payload with a PayloadAnalyzer. Do searching as normal. Use the new PayloadSpanUtil class to get the payloads for the matching words. (T

Re: Lucene implementation/performance question

2008-11-12 Thread Greg Shackles
Hey Mark, This sounds very interesting. Is there any documentation or examples I could see? I did a quick search but didn't really find much. It might just be that I don't know how payloads work in Lucene, but I'm not sure how I would see this actually doing what I need. My reasoning is this..

Re: Lucene implementation/performance question

2008-11-12 Thread Mark Miller
Here is a great PowerPoint on payloads from Michael Busch: www.us.apachecon.com/us2007/downloads/AdvancedIndexing*Lucene*.ppt. Essentially, you can store metadata at each term position, so it's an excellent place to store attributes of the term - they are very fast to load, efficient, etc. Yo

How to get the terms within 5 words of another term?

2008-11-12 Thread Sven
Hi everyone, I have a term "foo" and I want to count all the occurrences of all the terms that are within 5 words of "foo" in all the documents which contain "foo". For simplicity's sake, this is only for a single field. So if I have 3 documents (each with a single field) that look like this:

Re: Lucene implementation/performance question

2008-11-12 Thread Mark Miller
Greg Shackles wrote: Thanks! This all actually sounds promising, I just want to make sure I'm thinking about this correctly. Does this make sense? Indexing process: 1) Get list of all words for a page and their attributes, stored in some sort of data structure 2) Concatenate the text from tho

How to get the terms within 5 words of another term?

2008-11-12 Thread Sven
Hi everyone, I have a term "foo" and I want to count all the occurrences of all the terms that are within 5 words of "foo" in all the documents which contain "foo". For simplicity's sake, this is only for a single field. So if I have 3 documents (each with a single field) that look like this: Onc
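To make the requested count concrete, here is a minimal plain-Java sketch that works on raw strings rather than on a Lucene index. It assumes whitespace tokenization, lower-casing, and that a position close to several occurrences of the target is counted only once; the class and method names are hypothetical, and the message itself does not say how overlapping windows should be handled.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration (no Lucene involved): counts terms near a target term. */
final class NearTermCounter {

    /**
     * Counts every token that lies within 'window' positions of an occurrence
     * of 'target' (given in lower case). Documents without the target add nothing.
     */
    static Map<String, Integer> countNear(List<String> docs, String target, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            String[] tokens = doc.trim().toLowerCase().split("\\s+");
            // Mark each position that falls inside the window of some target occurrence.
            boolean[] near = new boolean[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) {
                    continue;
                }
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    near[j] = true;
                }
            }
            // Count marked positions once each, skipping the target itself.
            for (int j = 0; j < tokens.length; j++) {
                if (near[j] && !tokens[j].equals(target)) {
                    counts.merge(tokens[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Calling `NearTermCounter.countNear(docs, "foo", 5)` would give the counts for the window size named in the question. Against a real index the token positions would have to come from the index itself rather than from re-splitting the text.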

About counting term hits

2008-11-12 Thread Fco. Mario Barcala
Hello: I am new to Lucene and I am testing a few things with it. I can retrieve the number of documents which satisfy a query, but I can't find how to obtain the number of terms which match it. For example, if I search for the word "house", I want to obtain the number of times the word occurs (
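The difference between the two numbers can be shown with a small plain-Java sketch. It does not use Lucene at all: documents are plain strings, tokenization is by whitespace, and the class and method names are hypothetical.

```java
import java.util.List;

/** Hypothetical illustration (no Lucene involved): matching documents vs. term occurrences. */
final class TermHitCounter {

    /** Number of times 'term' (lower case) occurs as a whole token in one document. */
    static int occurrencesIn(String doc, String term) {
        int hits = 0;
        for (String token : doc.trim().toLowerCase().split("\\s+")) {
            if (token.equals(term)) {
                hits++;
            }
        }
        return hits;
    }

    /** Number of documents containing the term at least once. */
    static int documentCount(List<String> docs, String term) {
        int matching = 0;
        for (String doc : docs) {
            if (occurrencesIn(doc, term) > 0) {
                matching++;
            }
        }
        return matching;
    }

    /** Total number of occurrences of the term across all documents. */
    static int totalOccurrences(List<String> docs, String term) {
        int total = 0;
        for (String doc : docs) {
            total += occurrencesIn(doc, term);
        }
        return total;
    }
}
```

For the documents "house on the hill", "house next to house" and "garden", the term "house" matches two documents but occurs three times in total.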

Re: Lucene implementation/performance question

2008-11-12 Thread Greg Shackles
> > Right, sounds like you have it spot on. That second * from 3 looks like a > possible tricky part. I agree that it will be the tricky part but I think as long as I'm careful with counting as I iterate through it should be ok (I probably just doomed myself by saying that...) Right...you'd do i

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Dmitri Bichko
From the user perspective: a public constructor would be the most obvious, and would be consistent with RAMDirectory. Dmitri On Wed, Nov 12, 2008 at 4:50 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > I think we really should open up a non-static way to choose a different > FSDirectory im

Re: Lucene implementation/performance question

2008-11-12 Thread Greg Shackles
Thanks! This all actually sounds promising, I just want to make sure I'm thinking about this correctly. Does this make sense? Indexing process: 1) Get list of all words for a page and their attributes, stored in some sort of data structure 2) Concatenate the text from those words (space separat

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Dmitri Bichko
Nice! At 8 threads nio-shared catches up with ram-shared. Here's the complete table:

    fs-thread  nio-thread  ram-thread  fs-shared  nio-shared  ram-shared
1   71877      70461       54739       73986      72155       61595
2   34949      34945       26735       43719      33019       28935
3

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Chris Hostetter
: From the user perspective: a public constructor would be the most : obvious, and would be consistent with RAMDirectory. A lot of the cases where system properties are currently used can't really be solved this way because the client isn't the one constructing the object. SegmentReader's IMP

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Yonik Seeley
On Wed, Nov 12, 2008 at 5:00 PM, Chris Hostetter <[EMAIL PROTECTED]> wrote: > since the choice of FSDirectory variant is largely going to be based on OS, > I can't think of any reason why a static setter method wouldn't be good > enough in this particular case. https://issues.apache.org/jira/browse

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Michael McCandless
Good! In fact now we see similar slowness with nio-thread vs nio-shared as we see for RAM-thread vs RAM-shared. Ie, for both RAM and NIO you get better performance sharing a single reader than reader-per-thread. This is odd -- I would have expected that with infinite RAM reader-per- thr

Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Michael McCandless
Also, choosing by OS is the one reason we can think of now, but that doesn't mean there aren't other reasons. EG, who knows -- maybe for small indexes NIO doesn't help but for large ones it does (just an example) and so you'd want a non-static choice. Mike Yonik Seeley wrote: On Wed, Nov 12, 20

Re: How to get the terms within 5 words of another term?

2008-11-12 Thread dipesh
You might want to look at the TermPositionVector. For it to work I think the term vectors themselves have to be stored with the option TermVector.YES. Regards, Dipesh On Thu, Nov 13, 2008 at 4:26 AM, Sven <[EMAIL PROTECTED]> wrote: > Hi everyone, > > I have a term "foo" and I want to count all the o

Re: About counting term hits

2008-11-12 Thread dipesh
Yes, it's quite possible. 1. You need to create the term you want to search for, e.g. Term term = new Term("yourfield","yourword"); 2. Then create a TermDocs enum. TermDocs provides an interface for enumerating <document, frequency> pairs for a term. TermDocs t = new FilterIndexReader(IndexReader.open("youindex")).termDocs(

Using AND with MultiFieldQueryParser

2008-11-12 Thread Rafael Cunha de Almeida
Hello, I used an Analyzer which removes stopwords when indexing, then I wanted to do an AND search using MultiFieldQueryParser. So I did this: word1 AND stopword AND word2 I thought the stopword would be ignored by the searcher (I use the same Analyzer to index and search). But instead, I