Re: positional token info

2003-10-21 Thread Tatu Saloranta
On Tuesday 21 October 2003 17:31, Otis Gospodnetic wrote:
> > It does seem handy to avoid exact phrase matches on "phone boy" when
> > a
> > stop word is removed though, so patching StopFilter to put in the
> > missing positions seems reasonable to me currently.  Any objections
> > to that?
>
> So "phone boy" would match documents containing "phone the boy"?  That

Hmmh. WWGD (What Would Google Do)? :-)

> doesn't sound right to me, as it assumes what the user is trying to do.
>  Wouldn't it be better to allow the user to decide what he wants?
> (i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
> boy"~2 also returns documents containing "phone the boy").

As long as phrase queries work appropriately with proximity modifiers, one
alternative (from an app standpoint) would be to:

(a) Tokenize stop words out, adding a skip value; either one per stop word,
  or one per non-empty sequence of stop words ("top of the world" might
  make sense to tokenize as "top - world", "-" signifying a 'hole')
(b) With phrase queries, first do an exact match.
(c) If the number of matches is "too low" (whatever the definition of low is),
  use a phrase query match with a slop of 2 instead.

The tricky part would be doing the same for combination queries, where it's
not easy to check matches for individual query components.

Perhaps it'd be possible to create Yet Another Query object that would,
given a threshold, do one or two searches (as described above), to allow
for self-adjusting behaviour?
Or perhaps there should be a container query that executes an ordered
sequence of sub-queries until one returns a "good enough" set of matches,
then returns that set (or the last results, if there are no good matches);
the above-mentioned "sloppy if need be" phrase query would then just be a
special case?
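A minimal sketch of that fallback logic (hypothetical names, not a Lucene API; the two searches are just passed in as callbacks returning matching doc ids):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the "sloppy if need be" idea: run the exact phrase search
// first; only if it returns fewer than `threshold` hits, run the sloppy
// variant and use its results instead.
class FallbackSearch {
    interface Search {
        List<Integer> run();  // returns matching document ids
    }

    static List<Integer> searchWithFallback(Search exact, Search sloppy, int threshold) {
        List<Integer> hits = exact.run();
        if (hits.size() >= threshold) {
            return hits;  // exact matches are "good enough"
        }
        List<Integer> sloppyHits = sloppy.run();
        // fall back only if the sloppy query actually found more
        return sloppyHits.size() > hits.size() ? sloppyHits : hits;
    }
}
```

A container query executing an ordered list of sub-queries would generalize this to more than two stages.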

-+ Tatu +-


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: positional token info

2003-10-21 Thread Otis Gospodnetic
> It does seem handy to avoid exact phrase matches on "phone boy" when
> a 
> stop word is removed though, so patching StopFilter to put in the 
> missing positions seems reasonable to me currently.  Any objections
> to that?

So "phone boy" would match documents containing "phone the boy"?  That
doesn't sound right to me, as it assumes what the user is trying to do.
 Wouldn't it be better to allow the user to decide what he wants?
(i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
boy"~2 also returns documents containing "phone the boy").

Sorry if I'm misunderstanding something, long day, plus 1:30 AM.

Otis


__
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com




Re: positional token info

2003-10-21 Thread Erik Hatcher
On Tuesday, October 21, 2003, at 12:53  PM, Doug Cutting wrote:
If however you want "phone the boy" to match "phone X boy" where X is 
any word, then PhraseQuery would have to be extended.  It's actually a 
pretty simple extension.  Each term in a PhraseQuery corresponds to a 
PhrasePositions object.  The 'offset' field within this is the 
position of the term in the phrase.  If you construct the phrase 
positions for a two-term phrase so that the first has offset=0 and the 
second offset=2, then you'll get this sort of matching.  So all that's 
needed is a new method PhraseQuery.add(Term term, int offset), and for 
these offsets to be stored so that they can be used when building 
PhrasePositions.  Would this be a useful feature?
My questions were really academic in nature, about position increments 
and how they relate to searching.  I definitely agree (and who could 
argue?) with Nutch and Google!  Removing stop words is not a good thing, 
but smart handling of pervasive terms is important, as you have 
implemented in Nutch when not doing phrase queries and in how the 
bi-gram stuff works.

It does seem handy to avoid exact phrase matches on "phone boy" when a 
stop word is removed though, so patching StopFilter to put in the 
missing positions seems reasonable to me currently.  Any objections to 
that?
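A toy model of that patched StopFilter behaviour (plain Java, not the actual Lucene TokenStream API): each surviving token's position increment is 1 plus the number of stop words removed just before it, shown here as term:increment pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of a stop filter that preserves positions: removed stop
// words are recorded as a bumped position increment on the next kept
// token, leaving a "hole" in the position sequence.
class PositionalStopFilter {
    static List<String> filter(String[] tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        int skipped = 0;
        for (String t : tokens) {
            if (stopWords.contains(t)) {
                skipped++;  // drop the token but remember the hole
                continue;
            }
            out.add(t + ":" + (1 + skipped));
            skipped = 0;
        }
        return out;
    }
}
```

With this, "phone the boy" indexes "phone" and "boy" two positions apart, so the exact phrase "phone boy" no longer matches.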

	Erik



Re: positional token info

2003-10-21 Thread Otis Gospodnetic
I think "phone the boy" query should match exactly that, and not "phone
X boy", nor "phone boy".  To me, entering a query as a phrase query
means that the user wants to find documents with _exactly_ that
sequence of terms.

If you know that your users will be entering phrases with stop words,
then stop words should not be thrown out before indexing.

If users are really interested in terms "phone" and "boy", they should
use +phone +boy.

If they are okay with finding documents that contain the term "phone"
followed by the term "boy", even if "boy" is not the very next term
after "phone", they can use the slop factor options.
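For a two-term phrase, the slop semantics can be sketched like this (a simplification; Lucene's real sloppy matching also permits term reordering at a higher slop cost):

```java
// Simplified two-term slop check: an exact phrase expects the second
// term exactly one position after the first; slop allows extra
// displacement between them.
class SlopSketch {
    static boolean matches(int posFirst, int posSecond, int slop) {
        return posSecond > posFirst && (posSecond - posFirst - 1) <= slop;
    }
}
```

In "phone the boy", "phone" is at position 0 and "boy" at position 2, so "phone boy" (slop 0) misses but "phone boy"~2 hits.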

If I understand http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730
correctly, the included patch ensures that "phone boy" does not match
"phone the boy", but I am not sure about the other way around.

Otis



--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Erik Hatcher wrote:
> > Just for fun, I've written a simple stop filter that bumps the
> position 
> > increments to account for the stop words removed:
> > 
> > But its practically impossible to formulate a Query that can take 
> > advantage of this.  A PhraseQuery, because Terms don't have
> positional 
> > info (only the transient tokens), only works using a slop factor
> which 
> > doesn't guarantee an exact match like I'm after.  A
> PhrasePrefixQuery 
> > won't work any better as there is no way to add in a "blank" term
> to 
> > indicate a missing position.
> 
> The PhraseQuery code predates the setPositionIncrement feature.
> 
> You can use your filter to find phrases that don't contain stop
> words, 
> e.g., when your filter is used, a query for the phrase "phone boy"
> won't 
> match "phone the boy", as it would with the normal stop filter, but a
> 
> query for "phone the boy" would also only match "phone boy".
> 
> One workaround is to simply not use a stop list.  Then "phone boy"
> will 
> only match "phone boy", and "phone the boy" will only match "phone
> the 
> boy", and not "phone a boy" too.  One can write a query parser which 
> removes stop words unless they're in phrases.  This is what Nutch and
> 
> Google do.
> 
> If however you want "phone the boy" to match "phone X boy" where X is
> 
> any word, then PhraseQuery would have to be extended.  It's actually
> a 
> pretty simple extension.  Each term in a PhraseQuery corresponds to a
> 
> PhrasePositions object.  The 'offset' field within this is the
> position 
> of the term in the phrase.  If you construct the phrase positions for
> a 
> two-term phrase so that the first has offset=0 and the second
> offset=2, 
> then you'll get this sort of matching.  So all that's needed is a new
> 
> method PhraseQuery.add(Term term, int offset), and for these offsets
> to 
> be stored so that they can be used when building PhrasePositions. 
> Would 
> this be a useful feature?
> 
> Doug
> 
> 






RE: Lucene on Windows

2003-10-21 Thread Stephane Vaucher
Hi Tate (didn't know you were lurking on the list),

I've found that it's often not very clear what truly affects performance. 
Doing batch indexes with a data set of 250,000 docs (with 10 fields each) 
on a machine with 2 GB of DDR400 RAM, I tested a few merge factors and 
found that 50 seemed optimal, and even then performance wasn't much 
better than with a merge factor of 20. Nowadays there can be so many 
hidden optimisations by HDs and OSs that it's often worth testing with 
each configuration used.

sv

On Tue, 21 Oct 2003, Tate Avery wrote:

> Doug,
> 
> Re: high merge factor.  I was building test indexes; writing out 300 segments 
> of 300 docs each and merging them every 90,000 docs kept the 'merging' time 
> down to a minimum (for my slowish HD).
> 
> I was assuming that 11 of these large merges during the indexing of 1,000,000 docs 
> (plus a final optimize) would be faster than 10,000 little merges if the mergeFactor 
> was set to 10 (for the same corpus).
> 
> Maybe this is not the case.
> 
> 
> 
> 
> Tate
> 
> 
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: October 21, 2003 12:37 PM
> To: Lucene Users List
> Subject: Re: Lucene on Windows
> 
> 
> Tate Avery wrote:
> > You might have trouble with "too many open files" if you set your mergeFactor too 
> > high.  For example, on my Win2k, I can go up to mergeFactor=300 (or so).  At 400 I 
> > get a too many open files error.  Note: the default mergeFactor of 10 should give 
> > no trouble.
> 
> Please note that it is never recommended that you set mergeFactor 
> anywhere near this high.  I don't know why folks do this.  It really 
> doesn't make indexing much faster, and it makes searching slower if you 
> don't optimize.  It's a bad idea.  The default setting of 10 works 
> pretty well.  I've also had good experience setting it as high as 50 on 
> big batch indexing runs, but do not recommend setting it much higher 
> than that.  Even then, this can cause problems if you need to use 
> several indexes at once, or you have lots of fields.
> 
> Doug
> 
> 
> 
> 
> 





RE: Lucene on Windows

2003-10-21 Thread Tate Avery
Doug,

Re: high merge factor.  I was building test indexes; writing out 300 segments of 
300 docs each and merging them every 90,000 docs kept the 'merging' time down to 
a minimum (for my slowish HD).

I was assuming that 11 of these large merges during the indexing of 1,000,000 docs 
(plus a final optimize) would be faster than 10,000 little merges if the mergeFactor 
was set to 10 (for the same corpus).

Maybe this is not the case.




Tate


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: October 21, 2003 12:37 PM
To: Lucene Users List
Subject: Re: Lucene on Windows


Tate Avery wrote:
> You might have trouble with "too many open files" if you set your mergeFactor too 
> high.  For example, on my Win2k, I can go up to mergeFactor=300 (or so).  At 400 I 
> get a too many open files error.  Note: the default mergeFactor of 10 should give no 
> trouble.

Please note that it is never recommended that you set mergeFactor 
anywhere near this high.  I don't know why folks do this.  It really 
doesn't make indexing much faster, and it makes searching slower if you 
don't optimize.  It's a bad idea.  The default setting of 10 works 
pretty well.  I've also had good experience setting it as high as 50 on 
big batch indexing runs, but do not recommend setting it much higher 
than that.  Even then, this can cause problems if you need to use 
several indexes at once, or you have lots of fields.

Doug







Re: positional token info

2003-10-21 Thread Doug Cutting
Erik Hatcher wrote:
Just for fun, I've written a simple stop filter that bumps the position 
increments to account for the stop words removed:

But it's practically impossible to formulate a Query that can take 
advantage of this.  A PhraseQuery, because Terms don't have positional 
info (only the transient tokens), only works using a slop factor which 
doesn't guarantee an exact match like I'm after.  A PhrasePrefixQuery 
won't work any better as there is no way to add in a "blank" term to 
indicate a missing position.
The PhraseQuery code predates the setPositionIncrement feature.

You can use your filter to find phrases that don't contain stop words, 
e.g., when your filter is used, a query for the phrase "phone boy" won't 
match "phone the boy", as it would with the normal stop filter, but a 
query for "phone the boy" would also only match "phone boy".

One workaround is to simply not use a stop list.  Then "phone boy" will 
only match "phone boy", and "phone the boy" will only match "phone the 
boy", and not "phone a boy" too.  One can write a query parser which 
removes stop words unless they're in phrases.  This is what Nutch and 
Google do.

If however you want "phone the boy" to match "phone X boy" where X is 
any word, then PhraseQuery would have to be extended.  It's actually a 
pretty simple extension.  Each term in a PhraseQuery corresponds to a 
PhrasePositions object.  The 'offset' field within this is the position 
of the term in the phrase.  If you construct the phrase positions for a 
two-term phrase so that the first has offset=0 and the second offset=2, 
then you'll get this sort of matching.  So all that's needed is a new 
method PhraseQuery.add(Term term, int offset), and for these offsets to 
be stored so that they can be used when building PhrasePositions.  Would 
this be a useful feature?
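The proposed extension can be sketched outside Lucene as follows (hypothetical code, assuming a map from each term to its positions in one document): a phrase with per-term offsets matches at start position p when every term occurs at p + offset.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of phrase matching with explicit per-term offsets, i.e. the
// effect of a hypothetical PhraseQuery.add(Term term, int offset):
// "phone" at offset 0 and "boy" at offset 2 match "phone X boy" for
// any word X.
class OffsetPhraseMatcher {
    // Build term -> positions from a token array (one position per token).
    static Map<String, Set<Integer>> index(String[] tokens) {
        Map<String, Set<Integer>> positions = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            positions.computeIfAbsent(tokens[i], k -> new HashSet<>()).add(i);
        }
        return positions;
    }

    static boolean matches(Map<String, Set<Integer>> positions,
                           String[] terms, int[] offsets) {
        Set<Integer> starts = null;  // candidate phrase start positions
        for (int i = 0; i < terms.length; i++) {
            Set<Integer> termPositions = positions.get(terms[i]);
            if (termPositions == null) return false;
            // positions where the phrase would start if term i is here
            Set<Integer> candidates = new TreeSet<>();
            for (int p : termPositions) candidates.add(p - offsets[i]);
            if (starts == null) starts = candidates;
            else starts.retainAll(candidates);
            if (starts.isEmpty()) return false;
        }
        return true;
    }
}
```

With document "phone the boy", terms {"phone", "boy"} at offsets {0, 2} match, while offsets {0, 1} (an exact adjacent phrase) do not.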

Doug



Weird NPE in RAMInputStream when merging indices

2003-10-21 Thread petite_abeille
Hello,

What could cause such weird exception?

RAMInputStream.<init>: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.lucene.store.RAMInputStream.<init>(RAMDirectory.java:217)
at org.apache.lucene.store.RAMDirectory.openFile(RAMDirectory.java:182)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116)
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:378)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:298)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:313)

I don't know if this is a one-off, as I cannot reproduce the problem 
nor have I seen it before, but I thought I might as well ask.

This is triggered by merging a RAMDirectory into a FSDirectory. Looking 
at the RAMDirectory source code, this exception seems to indicate that 
the file argument to the RAMInputStream constructor is null... how 
could that ever happen?

Here is the code which triggers this weirdness:

this.writer().addIndexes( new Directory[] { aRamDirectory } );

The RAM writer is checked before invoking this code to make sure there 
is some content in the RAM directory:

aRamWriter.docCount() > 0

This has been working very reliably since the dawn of time, so I'm a 
little bit at a loss as to how to diagnose this weird exception...

Any ideas?

Thanks.

Cheers,

PA.



Re: Lucene on Windows

2003-10-21 Thread Doug Cutting
Tate Avery wrote:
You might have trouble with "too many open files" if you set your mergeFactor too high.  For example, on my Win2k, I can go up to mergeFactor=300 (or so).  At 400 I get a too many open files error.  Note: the default mergeFactor of 10 should give no trouble.
Please note that it is never recommended that you set mergeFactor 
anywhere near this high.  I don't know why folks do this.  It really 
doesn't make indexing much faster, and it makes searching slower if you 
don't optimize.  It's a bad idea.  The default setting of 10 works 
pretty well.  I've also had good experience setting it as high as 50 on 
big batch indexing runs, but do not recommend setting it much higher 
than that.  Even then, this can cause problems if you need to use 
several indexes at once, or you have lots of fields.
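A back-of-the-envelope model (my own illustration, not Lucene's actual merge code) of why a higher mergeFactor means fewer, larger merges: if each added document starts as a one-doc segment and every mergeFactor same-sized segments get merged, the number of merge operations falls roughly as docs/(mergeFactor-1).

```java
// Rough merge-count model: docs/m merges of one-doc segments, docs/m^2
// merges at the next level, and so on. Not Lucene's real algorithm,
// just an illustration of the indexing-time trade-off.
class MergeCount {
    static long countMerges(long docs, int mergeFactor) {
        long merges = 0;
        for (long segSize = mergeFactor; segSize <= docs; segSize *= mergeFactor) {
            merges += docs / segSize;
        }
        return merges;
    }
}
```

For 1,000,000 docs this gives about 111,111 merges at mergeFactor 10 versus about 3,344 at 300; the catch, as noted above, is open file handles and slower searching on an unoptimized index.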

Doug



Expression Extractions

2003-10-21 Thread MOYSE Gilles (Cetelem)
I've found something about expression extraction (the ability, when one word
and another appear frequently side-by-side, to detect that they form an
expression): http://www.miv.t.u-tokyo.ac.jp/papers/matsuoFLAIRS03.pdf

Gilles Moyse


Re: Hierarchical document

2003-10-21 Thread Peter Keegan
One way to implement hierarchical documents is through the use of
predefined phrases. Consider the 2 hierarchies:

1. Kids_and_Teens/Computers/Software/Games
2. Computers/Software/Freeware

When indexing a document belonging to (1), add these terms in consecutive
order (position increment = 1): "dir:Top dir:Kids_and_Teens dir:Computers
dir:Software dir:Games dir:Bottom"

For documents belonging to (2), add: "dir:Top dir:Computers dir:Software
dir:Bottom"

The terms "dir:Top" and "dir:Bottom" can be used to anchor a query
to a specific portion of the hierarchy.

Thus, a query containing the phrase "dir:Computers dir:Software" would
match documents in both (1) and (2) (and perhaps others), but a query for
"dir:Top dir:Kids_and_Teens dir:Computers dir:Software" would target only
'Computers/Software' documents under the 'Kids_and_Teens' top-level
directory.  (The PhraseQuery 'slop factor' would be set to 0.)
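The indexing side of this scheme can be sketched with a small helper (hypothetical code, not from the thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Build the anchored term sequence for one category path, e.g.
// ["Computers", "Software", "Freeware"] ->
// [dir:Top, dir:Computers, dir:Software, dir:Freeware, dir:Bottom]
class DirTerms {
    static List<String> terms(String[] path) {
        List<String> out = new ArrayList<>();
        out.add("dir:Top");  // anchor for queries rooted at the top
        for (String component : path) out.add("dir:" + component);
        out.add("dir:Bottom");  // anchor for queries pinned to a leaf
        return out;
    }
}
```

These terms would then be added to the document's "dir" field in consecutive positions, so a zero-slop PhraseQuery over them selects the desired slice of the hierarchy.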

Peter

- Original Message - 
From: "Tatu Saloranta" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 20, 2003 8:24 PM
Subject: Re: Hierarchical document


> On Monday 20 October 2003 10:31, Erik Hatcher wrote:
> > On Monday, October 20, 2003, at 11:06  AM, Tom Howe wrote:
> > There is not a more "Lucene" way to do this - it's really up to you to
> > be creative with this.  I'm sure there are folks that have implemented
> > something along these lines on top of Lucene.  In fact, I have a
> > particular interest in doing so at some point myself.  This is very
> > similar to the object-relational issues surrounding relational
> > databases - turning a pretty flat structure into an object graph.
> > There are several ideas that could be explored by playing tricks with
> > fields, such as giving them a hierarchical naming structure and
> > querying at the level you like (think Field.Keyword and PrefixQuery,
> > for example), and using a field to indicate type and narrowing queries
> > to documents of the desired type.
> >
> > I'm interested to see what others have done in this area, or what ideas
> > emerge about how to accomplish this.
>
> I'm planning to do something similar. In my case the problem is a bit simpler;
> documents have associated products, and products form a hierarchy.
> Searches should be able to match not only direct matches (searching
> product, article associated with product), but also indirect ones via
> membership (product member of a product group, matching group).
> Product hierarchy also has variable depth.
>
> To do searches using non-leaf hierarchy items (groups), all actual product
> items/groups associated with docs are expanded to full ids when
> indexing (i.e. they contain the path from the root, up to and including
> the node, each node component having its own unique id).
> Thus, when searching for an intermediate node (product grouping),
> match occurs since that node id is part of path to products that are in
> the group (either directly or as members of sub-groups).
>
> Since no such path is stored (directly) in the database, this also allows
> me to do queries that would be impossible to do in the database (I could
> add similar path/full id fields for search purposes, of course). Thus, the
> Lucene index is optimized for searching purposes, and the database
> structure for editing and retrieval of data.
>
> Another thing to keep in mind is that at least for metadata it may make
> sense to use a specialized analyzer, one that allows tokenizing using
> specific ids to store ids as separate tokens, instead of using some
> standard plain text analyzer. This way it is possible to separate ids
> from textual words (by using prefixes, for example, "@1253" or "#13945");
> this allows for accurate matching based on the identity of associated
> metadata selections.
>
> -+ Tatu +-
>
>
>





Re: positional token info

2003-10-21 Thread Steve Rowe
Erik,

I've submitted a patch (bug #23730) very similar to yours, in response 
to a request to fix phrases matching where they should not:

<http://mail-archive.com/[EMAIL PROTECTED]/msg04349.html>

Bug #23730:
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730>
> But, how would you actually *use* an index that was indexed with the
> holes noted by > 1 position increments?
As the lucene-user email linked above notes, setting the position 
increment prevents false phrase matches.

Steve Rowe

Erik Hatcher wrote:
On Tuesday, October 21, 2003, at 03:36  AM, Pierrick Brihaye wrote:

The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the 
same word.


Right.  Like I said, I recognize the benefits of using a position 
increment of 0.

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?


Who knows? It may be interesting to keep track of the *presence* of 
"empty words", e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional 
reduction to "sky blue" is maybe over-simplistic for some cases...


But, how would you actually *use* an index that was indexed with the 
holes noted by > 1 position increments?

Erik




Re: positional token info

2003-10-21 Thread Erik Hatcher
On Tuesday, October 21, 2003, at 03:36  AM, Pierrick Brihaye wrote:
The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the 
same word.
Right.  Like I said, I recognize the benefits of using a position 
increment of 0.

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?
Who knows? It may be interesting to keep track of the *presence* of 
"empty words", e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional 
reduction to "sky blue" is maybe over-simplistic for some cases...
But, how would you actually *use* an index that was indexed with the 
holes noted by > 1 position increments?

	Erik



Re: Lucene on Windows

2003-10-21 Thread Otis Gospodnetic
A very rough and simple 'add a single document to the index' test shows
that the Compound Index is marginally slower than the traditional one.
I did not test searching.

Otis

--- Eric Jain <[EMAIL PROTECTED]> wrote:
> > The CVS version of Lucene has a patch that allows one to use a
> > 'Compound Index' instead of the traditional one.  This reduces the
> > number of open files.  For more info, see/make the Javadocs for
> > IndexWriter.
> 
> Interesting option. Do you have a rough idea of what the performance
> impact of using this setting is?
> 
> --
> Eric Jain
> 
> 
> 






how to count stop words

2003-10-21 Thread le Na
Hi all,
For my job, in the indexing stage, I would like to keep stop words such as "the", 
"with", "of", "by", etc. as normal words. I did this by instantiating a 
StandardAnalyzer object (in the IndexHTML program) with an empty stop word array, 
but it seems to me that IndexHTML still strips out the stop words. Any help would 
be appreciated.
 
Thanks
 
T Le




Compound expression extraction

2003-10-21 Thread MOYSE Gilles (Cetelem)
Hi.

I'm trying to extract expressions from the terms' position information, i.e.,
if two words appear frequently side-by-side, then we can consider that the
two words are really one. For instance, 'Object' and 'Oriented' appear
side-by-side 9 times out of 10. That allows us to define a new expression,
'Object_Oriented'.
Does anyone know the statistical method to detect such expressions?
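One simple statistic (my suggestion, not something stated in this thread; the FLAIRS paper linked earlier uses a more elaborate co-occurrence measure) is the fraction of occurrences of the first word that are immediately followed by the second, i.e. the "9 times out of 10" figure:

```java
// Adjacency ratio: of all occurrences of `first`, how often is it
// immediately followed by `second`? A ratio near 1 over enough data
// suggests the pair behaves as a single expression.
class Collocation {
    static double adjacencyRatio(String[] tokens, String first, String second) {
        int firstCount = 0, pairCount = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            firstCount++;
            if (i + 1 < tokens.length && tokens[i + 1].equals(second)) pairCount++;
        }
        return firstCount == 0 ? 0.0 : (double) pairCount / firstCount;
    }
}
```

In practice one would also require a minimum absolute count before trusting the ratio, since rare words produce noisy estimates.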

Thanks.

Gilles Moyse

-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2003 09:24
To: Lucene Users List
Subject: Re: Lucene on Windows


> The CVS version of Lucene has a patch that allows one to use a
> 'Compound Index' instead of the traditional one.  This reduces the
> number of open files.  For more info, see/make the Javadocs for
> IndexWriter.

Interesting option. Do you have a rough idea of what the performance
impact of using this setting is?

--
Eric Jain




Re: positional token info

2003-10-21 Thread Pierrick Brihaye
Hi,

Erik Hatcher wrote:

Is anyone doing anything interesting with the Token.setPositionIncrement 
during analysis?
I think so :-) Well... my Arabic analyzer is based on this functionality.

The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the same 
word.
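As a toy illustration (plain Java, not the analyzer itself; the stem names are hypothetical), the emitted tokens can be modelled as term:positionIncrement pairs, with every stem after the first getting increment 0 so they all occupy a single position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Model of stacking alternative stems at a single position: the first
// stem advances the position by 1, the rest get increment 0 and thus
// share its position in the index.
class StemStack {
    static List<String> tokens(List<String> stems) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < stems.size(); i++) {
            out.add(stems.get(i) + ":" + (i == 0 ? 1 : 0));
        }
        return out;
    }
}
```

A phrase query over such an index then matches if any of the stacked stems lines up with the adjacent terms.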

But it's practically impossible to formulate a Query that can take 
advantage of this.  A PhraseQuery, because Terms don't have positional 
info (only the transient tokens)
Correct!

I've made a dirty patch for the QueryParser which is able to handle 
tokens with positionIncrement equal to 0 or 1 (see bug #23307). It still 
needs some work, but it fits my needs :-)

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?
Who knows? It may be interesting to keep track of the *presence* of 
"empty words", e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional reduction 
to "sky blue" is maybe over-simplistic for some cases...

Well, just an idea.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]


Re: Lucene on Windows

2003-10-21 Thread Eric Jain
> The CVS version of Lucene has a patch that allows one to use a
> 'Compound Index' instead of the traditional one.  This reduces the
> number of open files.  For more info, see/make the Javadocs for
> IndexWriter.

Interesting option. Do you have a rough idea of what the performance
impact of using this setting is?

--
Eric Jain

