Re: new TokenStream api Question

2009-04-28 Thread Michael Busch

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to 
overcome one drawback: with the variables now distributed over various 
Attribute classes (vs. being in a single class Token previously), 
cloning a "Token" (i.e. calling captureState()) is more expensive. This 
slows down the CachingTokenFilter and Tee/Sink-TokenStreams.


So I was thinking about introducing interfaces for each of the 
Attributes. E.g. OffsetAttribute would then be an interface with all 
current methods, and OffsetAttributeImpl would be its implementation. 
The user would still use the API in exactly the same way as now, that is, 
by calling e.g. addAttribute(OffsetAttribute.class), and the code takes 
care of instantiating the right class. However, there would then also be 
an API to pass in an actual instance, and this API would use reflection 
to find all interfaces that the instance implements. All of those 
interfaces that extend the Attribute interface would be added to the 
AttributeSource map, with the instance as the value.


Then the Token class would implement all six attribute interfaces. An 
expert user could decide to pass in a Token instance instead of calling 
addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
Then the attribute source would only contain a single instance that 
needs to be cloned in captureState(), making cloning much faster. And a 
(probably also expert) user could even implement an own class that 
implements exactly the necessary interfaces (maybe only 3 of the 6 
provided), and make cloning faster than it is even with the old 
Token-based API.
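
For concreteness, a rough sketch of what that could look like follows; all of 
the names here (OffsetAttributeImpl, addAttributeImpl, the map layout) are 
illustrative assumptions about the proposal, not committed API:

import java.util.HashMap;
import java.util.Map;

// Sketch only: interface/impl split plus reflection-based registration.
interface Attribute { }

interface OffsetAttribute extends Attribute {
  int startOffset();
  int endOffset();
  void setOffset(int start, int end);
}

class OffsetAttributeImpl implements OffsetAttribute {
  private int start, end;
  public int startOffset() { return start; }
  public int endOffset() { return end; }
  public void setOffset(int start, int end) { this.start = start; this.end = end; }
}

class AttributeSourceSketch {
  // several attribute interfaces may map to one shared instance
  private final Map<Class<? extends Attribute>, Attribute> attributes =
      new HashMap<Class<? extends Attribute>, Attribute>();

  @SuppressWarnings("unchecked")
  void addAttributeImpl(Attribute impl) {
    for (Class<?> iface : impl.getClass().getInterfaces()) {
      if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
        attributes.put((Class<? extends Attribute>) iface, impl);
      }
    }
    // a real implementation would also have to walk superclasses and superinterfaces
  }
}

A Token instance implementing all six interfaces would then be registered once 
and end up as the single value behind every key, so captureState() only has to 
clone one object.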


And of course, in your case you could also just create a different 
implementation of such an interface, right? I think what's nice about 
this change is that it doesn't make the TokenStream API more complicated 
to use, and the indexing pipeline still uses it the same way, yet it's 
more extensible for expert users and makes it possible to achieve the 
same or even better cloning performance.


I will open a new Jira issue for this soon. But I'd be happy to hear 
feedback about the proposed changes, and especially if you think these 
changes would help you for your usecase.


-Michael

On 4/27/09 1:49 PM, eks dev wrote:

Should I create a patch with something like this?

With "Expert" javadoc, and explanation what is this good for should be a nice 
addition to Attribute cases.
Practically, it would enable specialization of "hard linked" Attributes like 
TermAttribute.

The only preconditions are:

- "Specialized Attribute" must extend one of the "hard linked" ones, and 
provide class of it
- Must implement default constructor
- should extend by not introducing state (big majority of cases) (not to break 
captureState())

The last one could be relaxed, I guess, but I am not yet 100% familiar with this 
code.

Use cases for this are along the lines of my example: smaller, easier user code 
and better performance (mainly token filters)



- Original Message 
   

From: Uwe Schindler
To: java-dev@lucene.apache.org
Sent: Sunday, 26 April, 2009 23:03:06
Subject: RE: new TokenStream api Question

There is one problem: if you extend TermAttribute, the class is different
(which is the key in the attributes list). So when you initialize the
TokenStream and do a

YourClass termAtt = (YourClass) addAttribute(YourClass.class)

...you create a new attribute. So one possibility would be to also specify
the instance and save the attribute by class (as key), but with your
instance. If you are the first one that creates the attribute (if it is a
token stream and not a filter it is ok, you will be the first when adding
the attribute in the ctor), everything is ok. Register the attribute by
yourself (maybe we should add a specialized addAttribute that can specify
an instance as default)?:

YourClass termAtt = new YourClass();
attributes.put(TermAttribute.class, termAtt);

In this case, for the indexer it is a standard TermAttribute, but you can
do more with it.
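
For illustration, such a specialized addAttribute could look roughly like this
(the two-argument signature is an assumption, not existing API; "attributes"
stands for the AttributeSource's internal class-to-instance map):

import java.util.HashMap;
import java.util.Map;

// Sketch only: register a caller-supplied instance under a standard key.
class AttributeRegistrationSketch {
  private final Map attributes = new HashMap();

  public Object addAttribute(Class key, Object defaultInstance) {
    Object att = attributes.get(key);
    if (att == null) {
      att = defaultInstance;       // the first registration wins
      attributes.put(key, att);
    }
    return att;                    // later callers get the already-registered instance
  }
}

// Usage in a TokenStream constructor:
//   YourClass termAtt = (YourClass) addAttribute(TermAttribute.class, new YourClass());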

Replacing TermAttribute by an own class is not possible, as the indexer will
get a ClassCastException when using the instance retrieved with
getAttribute(TermAttribute.class).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 

-Original Message-
From: eks dev [mailto:eks...@yahoo.co.uk]
Sent: Sunday, April 26, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: new TokenStream api Question


I am just looking into the new TermAttribute usage and wonder what would be
the best way to implement a PrefixFilter that filters out Terms
that have some prefix,

something like this, where '-' represents my prefix:

   public final boolean incrementToken() throws IOException {
     // the first word we found
     while (input.incrementToken()) {
       int len = termAtt.termLength();

       if (len > 0 && termAtt.termBuffer()[0] != '-') // only length > 0 and non LFs
         return true;
   // note: el
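
A self-contained sketch of such a prefix-dropping filter could look like this
(assuming the new TokenStream/TermAttribute API from trunk; the class name and
the '-' prefix are just illustrative):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class PrefixDroppingFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public PrefixDroppingFilter(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public final boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      int len = termAtt.termLength();
      if (len > 0 && termAtt.termBuffer()[0] != '-') {
        return true;               // keep this token
      }
      // otherwise drop the token and keep pulling from the input
    }
    return false;                  // input exhausted
  }
}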

Re: RangeQuery and getTerm

2009-04-28 Thread Michael McCandless
On Tue, Apr 28, 2009 at 2:38 AM, Uwe Schindler  wrote:

> Why not deprecate getTerm() in MultiTermQuery, remove the field in
> MultiTermQuery and all related occurrences? The field and methods are then
> *not* deprecated and sensibly implemented in Fuzzy*.

+1

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Fwd: Build failed in Hudson: Lucene-trunk #810

2009-04-28 Thread Michael McCandless
Hmm -- this failed because the host "downloads.osafoundation.org"
fails to resolve.  The contrib/db tests need to download the Berkeley
DB JARs from here.

Andi any idea what's up w/ that?  Do we need to set a different
download location?

Mike

-- Forwarded message --
From: Apache Hudson Server 
Date: Mon, Apr 27, 2009 at 10:14 PM
Subject: Build failed in Hudson: Lucene-trunk #810
To: java-dev@lucene.apache.org


See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/810/changes

Changes:

[mikemccand] LUCENE-1615: remove some more deprecated uses of Fieldable.omitTf

[mikemccand] remove redundant CHANGES entries from trunk if they are
already covered in 2.4.1

--
[...truncated 2887 lines...]
compile-test:
    [echo] Building benchmark...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-demo:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile-demo:

compile-highlighter:
    [echo] Building highlighter...

build-memory:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile:

check-files:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
   [javac] Compiling 9 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
   [javac] Note:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/quality/TestQualityRun.java
 uses or overrides a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.
    [copy] Copying 2 files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test

build-artifacts-and-tests:
    [echo] Building collation...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-misc:
    [echo] Building misc...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java
   [javac] Compiling 16 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java
   [javac] Note: Some input files use or override a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.

compile:

init:

clover.setup:

clover.info:

clover:

compile-core:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java
   [javac] Compiling 4 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java
   [javac] Note: Some input files use or override a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.

jar-core:
     [jar] Building jar:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/lucene-collation-2.4-SNAPSHOT.jar

jar:

compile-test:
    [echo] Building collation...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-misc:
    [echo] Building misc...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:

compile:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test
   [javac] Compiling 5 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test
   [javac] Note:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java
 uses or overrides a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.

build-artifacts-and-tests:

bdb:
    [echo] Building bdb...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

contrib-build.init:

get-db-jar:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib
     [get] Getting: http://downloads.osafoundation.org/db/db-4.7.25.jar
     [get] To: 
http://hudson.zones.apache.org/hudson/job/Lucene-

[jira] Assigned: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1619:
--

Assignee: Michael McCandless

> TermAttribute.termLength() optimization
> ---
>
> Key: LUCENE-1619
> URL: https://issues.apache.org/jira/browse/LUCENE-1619
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Trivial
> Attachments: LUCENE-1619.patch
>
>
>public int termLength() {
>  initTermBuffer(); // This patch removes this method call 
>  return termLength;
>}
> I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
> could be wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703537#action_12703537
 ] 

Michael McCandless commented on LUCENE-1619:


Indeed it seems unnecessary -- I'll commit.  Thanks Eks!

> TermAttribute.termLength() optimization
> ---
>
> Key: LUCENE-1619
> URL: https://issues.apache.org/jira/browse/LUCENE-1619
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Trivial
> Attachments: LUCENE-1619.patch
>
>
>public int termLength() {
>  initTermBuffer(); // This patch removes this method call 
>  return termLength;
>}
> I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
> could be wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: new TokenStream api Question

2009-04-28 Thread Uwe Schindler
Haha, isn't it funny, the same idea came to me on Sunday afternoon after I
answered Eks Dev. But I threw it away, because interfaces are not
liked here. :-)

 

This new interface approach may also save us from needing these useNewAPI() calls,
as the old TokenStream methods could easily be implemented/wrapped using the
standard Token instance, too. About the "interface problem": we do not have
to think about interface extensions in the future. If one needs a new attribute
member, he can just invent a new Attribute and add it (like ShiftAttribute
in TrieRange). An interface, once defined, does not need to be changed
anymore.

 

The new API then needs some "factory" to generate the attribute instances,
e.g. if one adds all 4 attributes (term, posincr, offset, type), only one
instance must be created and all of the interface mappings point to this
instance. Do you have an idea how to implement this? It should be
extensible, so each TokenStream can register its own factory, but maybe
with some sensible default etc.
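
One possible shape for such a factory, just to make the idea concrete (all
names are assumptions, and it presumes Token implements the standard attribute
interfaces as proposed):

// Sketch only, not existing API.
interface AttributeFactory {
  Attribute createAttributeInstance(Class attInterface);
}

// Default factory: one shared Token backs term, posIncr, offset and type,
// so captureState() has a single object to clone.
class SingleTokenAttributeFactory implements AttributeFactory {
  private final Token shared = new Token();

  public Attribute createAttributeInstance(Class attInterface) {
    if (attInterface.isInstance(shared)) {
      return shared;
    }
    throw new IllegalArgumentException("no default implementation for " + attInterface.getName());
  }
}

A TokenStream could register its own factory in its constructor and only
override the attributes it wants to specialize.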

 

+1

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  _  

From: Michael Busch [mailto:busch...@gmail.com] 
Sent: Tuesday, April 28, 2009 10:23 AM
To: java-dev@lucene.apache.org
Subject: Re: new TokenStream api Question

 

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to
overcome one drawback: with the variables now distributed over various
Attribute classes (vs. being in a single class Token previously), cloning a
"Token" (i.e. calling captureState()) is more expensive. This slows down the
CachingTokenFilter and Tee/Sink-TokenStreams.

So I was thinking about introducing interfaces for each of the Attributes.
E.g. OffsetAttribute would then be an interface with all current methods,
and OffsetAttributeImpl would be its implementation. The user would still
use the API in exactly the same way as now, that is be e.g. calling
addAttribute(OffsetAttribute.class), and the code takes care of
instantiating the right class. However, there would then also be an API to
pass in an actual instance, and this API would use reflection to find all
interfaces that the instances implements. All of those interfaces that
extend the Attribute interface would be added to the AttributeSource map,
with the instance as the value.

Then the Token class would implement all six attribute interfaces. An expert
user could decide to pass in a Token instance instead of calling
addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
Then the attribute source would only contain a single instance that needs to
be cloned in captureState(), making cloning much faster. And a (probably
also expert) user could even implement an own class that implements exactly
the necessary interfaces (maybe only 3 of the 6 provided), and make cloning
faster than it is even with the old Token-based API.

And of course also in your case could you just create a different
implementation of such an interface, right? I think what's nice about this
change is that it doesn't make it more complicated to use the TokenStream
API, and the indexing pipeline still uses it the same way too, yet it's more
extensible more expert users and possible to achieve the same or even better
cloning performance.

I will open a new Jira issue for this soon. But I'd be happy to hear
feedback about the proposed changes, and especially if you think these
changes would help you for your usecase.

-Michael

On 4/27/09 1:49 PM, eks dev wrote: 

Should I create a patch with something like this? 
 
With "Expert" javadoc, and explanation what is this good for should be a
nice addition to Attribute cases.
Practically, it would enable specialization of "hard linked" Attributes like
TermAttribute. 
 
The only preconditions are: 
 
- "Specialized Attribute" must extend one of the "hard linked" ones, and
provide class of it
- Must implement default constructor 
- should extend by not introducing state (big majority of cases) (not to
break captureState())
 
The last one could be relaxed i guess, but I am not yet 100% familiar with
this code.
 
Use cases for this are along the lines of my example, smaller, easier user
code and performance (token filters mainly)
 
 
 
- Original Message 
  

From: Uwe Schindler   
To: java-dev@lucene.apache.org
Sent: Sunday, 26 April, 2009 23:03:06
Subject: RE: new TokenStream api Question
 
There is one problem: if you extend TermAttribute, the class is different
(which is the key in the attributes list). So when you initialize the
TokenStream and do a
 
YourClass termAtt = (YourClass) addAttribute(YourClass.class)
 
...you create a new attribute. So one possibility would be to also specify
the instance and save the attribute by class (as key), but with your
instance. If you are the first one that creates the attribute (if it is a
token stream and not a filter it is ok, you will be the first, it adding the
att

[jira] Resolved: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1619.


   Resolution: Fixed
Fix Version/s: 2.9

> TermAttribute.termLength() optimization
> ---
>
> Key: LUCENE-1619
> URL: https://issues.apache.org/jira/browse/LUCENE-1619
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1619.patch
>
>
>public int termLength() {
>  initTermBuffer(); // This patch removes this method call 
>  return termLength;
>}
> I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
> could be wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-28 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703540#action_12703540
 ] 

Shai Erera commented on LUCENE-1593:


bq. I think I'd lean towards the 12 impls now. They are tiny classes.

If we resolve everything else, that should not hold us back. I'll do it.

bq. We can mull it over some more... sleep on it.

Ok sleeping did help. Originally I thought that the difference between our 
thinking is that you think that PQ should know how to construct a sentinel 
object, while I thought the code which uses PQ should know that. Now I realize 
both are true - the code which uses PQ, or at least instantiates PQ, already 
*knows* how to create those sentinel objects, since it determines which PQ impl 
to instantiate. I forgot for a moment that PQ is not a concrete class, and 
anyone using it should create his own specialized PQ, or reuse an existing one, 
but anyway that specialized PQ should know how to create the sentinel objects 
and compare them to real objects.

So I'm ok with it - I'll make the following changes, as you suggest:
# Add a protected getSentinelObject() which returns null. Use it in PQ.init() 
to fill the queue if it's not null.
# Make the necessary changes to HitQueue.
# Remove the addSentinelObjects from PQ and the code from TSDC.

BTW, we should be aware that this means anyone using HitQueue needs to know 
that upon initialization it's filled with sentinel objects, and that its size() 
will be maxSize etc. Since HQ is package private I don't have a problem with 
it. Generally speaking, the code which instantiates a PQ and the code that uses 
it must be in sync ... i.e., if I instantiate a PQ and pass it to some other 
code which just receives a PQ and adds elements to it, that code should not 
rely on size() being smaller or anything. I don't feel it complicates things 
... and anyway someone can always create a PQ impl which receives a boolean 
that determines whether sentinel objects should be created or not and if not 
return null in its getSentinelObject().
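
For concreteness, a rough sketch of the getSentinelObject() change inside 
PriorityQueue could be (assumed names, not the final patch):

// Default: no pre-filling, existing subclasses keep their current behavior.
protected Object getSentinelObject() {
  return null;
}

protected final void initialize(int maxSize) {
  size = 0;
  this.maxSize = maxSize;
  heap = new Object[maxSize + 1];      // slot 0 is unused
  Object sentinel = getSentinelObject();
  if (sentinel != null) {
    heap[1] = sentinel;
    for (int i = 2; i <= maxSize; i++) {
      heap[i] = getSentinelObject();   // a distinct sentinel per slot
    }
    size = maxSize;                    // the queue reports as full from the start
  }
}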

bq. Maybe we should add a "docsInOrder()" method to Scorer?

I'm not sure that will solve it .. BS2 consults its allowDocsOutOfOrder only if 
score(Collector) is called, which it then instantiates a BS and delegates the 
score(Collector) to. So suppose that BS.docsInOrder return false, what will BS2 
return? Remember that it may be used by IndexSearcher in two modes: (1) without 
a filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo(). So 
it cannot consult its own allowDocsOutOfOrder (even though it gets it as a 
parameter) since depending on how it will be used, the answer is different.
BTW, IndexSearch.doSearch creates the Scorer, but already receives the 
Collector as argument, therefore at this point it's too late to make any 
decisions regarding orderness of docs, no?

There are few issues that we need to solve:
# A user can set BooleanQuery.setAllowDocsOutOfOrder today, which may trigger 
BS2.score(Collector) to instantiate BS, which may screw up the Collector's 
logic if it assumes in-order documents. IndexSearcher creates the Collector 
before it knows whether BQ is used or not so it cannot do any intelligent 
checks. I see two possible solutions, which only 1 of them may be implemented 
now and the other in 3.0:
## Add docsInOrder to Weight (it's an interface, therefore just in 3.0), since 
that seems to allow IS to check if the current query may visit documents 
out-of-order.
##* Actually, maybe we add it to Query, which is abstract, and in IS we do 
weight.getQuery().docsInOrder()?
## In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create 
out-of-order collectors. That might impact performance if there are no BQ 
clauses, but I assume it is not used much? And this doesn't break back-compat 
since that's the only way to instantiate an out-of-order Scorer today (besides 
creating your own).
# Someone can create his own Collector, but will have no way to know if the 
docs will be sent in-order or not. Whatever we do, we have to allow people to 
correctly 'predict' the behavior of their Collectors, that's why I like the BQ 
static setting of that variant. The user is the one that sets it to true, so 
he/she should know that and create their appropriate Collector instance.
#* On the other hand, if we choose to add that information to Query, those 
Collectors may not have that information in hand when they are instantiated ...

So I'm torn here. Adding that information to Query will solve it for those that 
use the convenient search methods (i.e., those that don't receive a Collector), 
but provide their own Query impl, since if we add a default impl to Query which 
returns false (i.e., out-of-order), it should not change the behavior for them. 
And if they always return docs in-order, they can override it to return true.
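
As a sketch, the default on Query could be as simple as this (scoresDocsInOrder 
is the assumed name from the discussion, not committed API):

// In org.apache.lucene.search.Query (sketch):
/**
 * Expert: returns true if scorers created for this query visit documents in
 * strictly increasing docID order. Defaults to false so existing Query
 * implementations are treated conservatively (out-of-order) for back-compat.
 */
public boolean scoresDocsInOrder() {
  return false;
}

// IndexSearcher, or any code that builds its own Collector, could then do
//   boolean inOrder = weight.getQuery().scoresDocsInOrder();
// and pick the in-order or out-of-order collector variant accordingly.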

About those that pass in Collector ... 

ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Shai Erera
Hi

I think I've hit a bug in ConcurrentMergeScheduler, but I'd like those who
are more familiar with the code to review it. I ran
TestStressSort.testSort() and started to get AIOOB exceptions from
MergeThread, the CPU spiked to 98-100% and did not end for a couple of
minutes, until I was able to regain control and kill the process (looks like
an infinite loop).

To reproduce it, all you need is to add the following line to
PQ.initialize(): size = maxSize, and then you'll get the aforementioned
exceptions. I did it accidentally, but I'm sure there's a way to reproduce
it with a JUnit test or something so that it will happen consistently.

When I debug-traced the test, I noticed that MergeThreads are just spawned
forever. The reason is this: in CMS.merge(IndexWriter) there's a 'while
(true)' loop which does 'while (mergeThreadCount() >= maxThreadCount)' and
if false just spawns a new MergeThread. On the other hand, in
MergeThread.run there's a try-finally which executes whatever it needs to
execute and in the finally block removes this thread from the list of
threads. That causes CMS to spawn a new thread, which will hit another
exception, remove itself from the queue and CMS will spawn a new thread.
That puts the code into an infinite loop.

That sounds like a bug to me ... I think that if MergeThread hits any
exception, the merge should fail? Anyway, the exception is added to an
exceptions List, which is a private member of CMS but is never checked by
CMS. Perhaps merge(IndexWriter) should check whether the exceptions list is not
empty and fail the merge in such a case?

Anyway, I'll fix PQ's code now to continue my work, but if you want to
reproduce it, it's as easy as adding size = maxSize to initialize() and run
TestStressSort.

I don't mind opening an issue and fixing it (though I'm not sure what the
fix should be at the moment, but I'll figure it out), but it will have to wait,
so if you know the code and can put a patch together quickly, don't wait up
for me :)

Shai


[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703570#action_12703570
 ] 

Michael McCandless commented on LUCENE-1593:


bq. Ok sleeping did help.

OK...good morning (afternoon)!

bq. BTW, we should be aware that this means anyone using HitQueue needs to know 
that upon initialization it's filled with sentinel objects, and that its size() 
will be maxSize etc. Since HQ is package private I don't have a problem with it.

Good point -- can you update HitQueue's javadocs stating this new
behavior?

bq. BTW, IndexSearch.doSearch creates the Scorer, but already receives the 
Collector as argument, therefore at this point it's too late to make any 
decisions regarding orderness of docs, no?

Urgh yeah right.  So docsInOrder() should be added to Weight or Query
I guess.  Maybe name it "scoresDocsInOrder()"?

bq. Add docsInOrder to Weight (it's an interface, therefore just in 3.0)

Aside: maybe for 3.0 we should do a hard cutover of any remaining
interfaces, like Weight, Fieldable (if we don't replace it), etc. to
abstract classes?

bq. Remember that it may be used by IndexSearcher in two modes: (1) without a 
filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo().

I'd really love to find a way to make this more explicit.  EG you ask
the Weight for a topScorer() vs a scorer(), or something.  Clearly the
caller of .scorer() knows full well how the instance will be used (top
or not)... we keep struggling with BS/2 because this information is
not explicit now.

This would also enable BQ.scorer() to directly return a BS vs BS2,
rather than the awkward BS2 wrapping a BS internally in its
score(Collector) method.

So how about adding a "topScorer()" method, that defaults to scorer()?
(Ugh, we can't do that until 3.0 since Weight is an interface).

But actually: the thing calling scoresDocsInOrder will in fact only be
calling that method if it intends to use the scorer as a toplevel
scorer (ie, it will call scorer.score(Collector), not
scorer.next/skipTo)?  So BQ.scoresDocsInOrder would first check if
it's gonna defer to BS (it's a simple OR query) and then check BS's
static setting.

bq. In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create 
out-of-order collectors. That might impact performance if there are no BQ 
clauses, but I assume it is not used much? And this doesn't break back-compat 
since that's the only way to instantiate an out-of-order Scorer today (besides 
creating your own).

This back compat break worries me a bit.  EG maybe Solr has some
scorers that run out-of-order?

Also this is not really a clean solution: sure, it's only BQ today
that does out-of-order scoring, but I can see others doing it in the
future.  I can also see making out-of-order-scoring more common and in
fact the default (again) for BQ, since it does give good performance
gains.  Maybe other Query scorers should use it too.

So I think I'd prefer the "add scoresDocsOutOfOrder" to Query.  Note
that this must be called on the rewritten Query.

bq. And since they can always create the Query and only then create the 
Collector, if we add that info to Query they should have enough information at 
hand to create the proper Collector instance.

Right, adding the method to Query gives expert users the tools needed
to make their own efficient collectors.

bq. If we do add it to Query, then I'd like to deprecate BQ's static setter and 
getter of that attribute and provide a docsInOrder() impl, but we need to 
resolve how it will know whether it will use BS or BS2.

OK, you mean make this a non-static setter/getter?  I would actually
prefer to default it to "true", as well.  (It's better performance).
But that should wait for 3.0 (be sure to open a "for 3.0" followon
issue for this one, too!).

BTW, even though BS does score docs out of order, it still visits docs
in order "by bucket".  Meaning when visiting docs, you know the docIDs
are always greater than the last bucket's docIDs.

This gives us a source of optimization: if we hold onto the "bottom
value in the pqueue as of the last bucket", and then we can first
compare that bottom value (with no tie breaking by docID needed) and
if that competes, then we do the "true, current bottomValue with
breaking tie by docID" comparison to see if it makes it into the
queue.  But I'm not sure how we'd cleanly model this "docIDs are in
order by bucket" case of the scorer, and take advantage of that during
collection.  I think it'd require extending the FieldComparator API
somehow, eg "get me a BottomValueComparator instance as of right
now".

This can also be the basis for a separate strong optimization, which
is down in the TermScorer for a BooleanScorer/2, skip a docID if its
field value is not competitive.  This is a bigger change (way outside
the scope of this issue, and in fact more related to LUCENE-1536).


> Optimizat

[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-28 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703588#action_12703588
 ] 

Shai Erera commented on LUCENE-1593:


bq. Good point - can you update HitQueue's javadocs stating this new behavior?

I actually prefer to add a boolean prePopulate to HitQueue's ctor. Since it's 
package-private it should be ok and I noticed it is used in several classes 
like MultiSearcher, ParallelMultiSearcher which rely on its size() value. They 
just collect the results from the different searchers. I don't think they 
should be optimized to pre-populate the queue, so instead of changing their 
code to not rely on size(), I prefer them to init HQ with prePopulate=false.

If changing the ctor is not accepted, then perhaps add another ctor which 
accepts that argument?
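
A sketch of that ctor flag, building on the getSentinelObject() idea discussed 
above (names and details are assumptions, not the final patch):

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.util.PriorityQueue;

final class HitQueue extends PriorityQueue {
  private final boolean prePopulate;

  HitQueue(int size, boolean prePopulate) {
    this.prePopulate = prePopulate;    // must be set before initialize() runs
    initialize(size);
  }

  // MultiSearcher-style callers pass prePopulate=false and can keep relying on size().
  protected Object getSentinelObject() {
    return prePopulate ? new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY) : null;
  }

  protected boolean lessThan(Object a, Object b) {
    ScoreDoc hitA = (ScoreDoc) a;
    ScoreDoc hitB = (ScoreDoc) b;
    if (hitA.score == hitB.score) {
      return hitA.doc > hitB.doc;      // ties broken by docID; a larger docID is "less"
    }
    return hitA.score < hitB.score;
  }
}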

Regarding the rest, I need to read them carefully before I'll comment.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: new TokenStream api Question

2009-04-28 Thread eks dev
Hi Michael,
Sure, the Interfaces are a solution to this. They define what Lucene core expects 
from these entities and give people the freedom to provide any implementation 
they wish. E.g. users that do not need Offset information can just provide a 
dummy implementation that returns constants... 

The only problem with Interfaces is the back-compatibility curse :)

But!
The Offset attribute is a simple enough entity that I do not believe there will 
ever be a need to change its interface.
Term is just char[] with offset/length, the same.

Having really simple concepts behind them (and keeping them simple) makes 
Interfaces possible... I see no danger. But as said, the concepts behind them must 
remain simple.
  

And by the way, I like the new API.  

Cheers, Eks




From: Michael Busch 
To: java-dev@lucene.apache.org
Sent: Tuesday, 28 April, 2009 10:22:45
Subject: Re: new TokenStream api Question

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to overcome 
one drawback: with the variables now distributed over various Attribute classes 
(vs. being in a single class Token previously), cloning a "Token" (i.e. calling 
captureState()) is more expensive. This slows down the CachingTokenFilter and 
Tee/Sink-TokenStreams.

So I was thinking about introducing interfaces for each of the Attributes. E.g. 
OffsetAttribute would then be an interface with all current methods, and 
OffsetAttributeImpl would be its implementation. The user would still use the 
API in exactly the same way as now, that is be e.g. calling 
addAttribute(OffsetAttribute.class), and the code takes care of instantiating 
the right class. However, there would then also be an API to pass in an actual 
instance, and this API would use reflection to find all interfaces that the 
instances implements. All of those interfaces that extend the Attribute 
interface would be added to the AttributeSource map, with the instance as the 
value.

Then the Token class would implement all six attribute interfaces. An expert 
user could decide to pass in a Token instance instead of calling 
addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
Then the attribute source would only contain a single instance that needs to be 
cloned in captureState(), making cloning much faster. And a (probably also 
expert) user could even implement an own class that implements exactly the 
necessary interfaces (maybe only 3 of the 6 provided), and make cloning faster 
than it is even with the old Token-based API.

And of course also in your case could you just create a different 
implementation of such an interface, right? I think what's nice about this 
change is that it doesn't make it more complicated to use the TokenStream API, 
and the indexing pipeline still uses it the same way too, yet it's more 
extensible more expert users and possible to achieve the same or even better 
cloning performance.

I will open a new Jira issue for this soon. But I'd be happy to hear feedback 
about the proposed changes, and especially if you think these changes would 
help you for your usecase.

-Michael

On 4/27/09 1:49 PM, eks dev wrote: 
Should I create a patch with something like this?

With "Expert" javadoc, and explanation what is this good for should be a nice
addition to Attribute cases.
Practically, it would enable specialization of "hard linked" Attributes like
TermAttribute.

The only preconditions are:

- "Specialized Attribute" must extend one of the "hard linked" ones, and provide
class of it
- Must implement default constructor
- should extend by not introducing state (big majority of cases) (not to break
captureState())

The last one could be relaxed i guess, but I am not yet 100% familiar with this
code.

Use cases for this are along the lines of my example, smaller, easier user code
and performance (token filters mainly)

- Original Message 
From: Uwe Schindler
To: java-dev@lucene.apache.org
Sent: Sunday, 26 April, 2009 23:03:06
Subject: RE: new TokenStream api Question

There is one problem: if you extend TermAttribute, the class is different
(which is the key in the attributes list). So when you initialize the
TokenStream and do a

YourClass termAtt = (YourClass) addAttribute(YourClass.class)

...you create a new attribute. So one possibility would be to also specify
the instance and save the attribute by class (as key), but with your
instance. If you are the first one that creates the attribute (if it is a
token stream and not a filter it is ok, you will be the first, it adding the
attribute in the ctor), everything is ok. Register the attribute by yourself
(maybe we should add a specialized addAttribute, that can specify a instance
as default)?:

YourClass termAtt = new YourClass();
attributes.put(TermAttribute.class, termAtt);

In this case, for the indexer it is a standard TermAttribute, but you can
more with it.

Rep

[jira] Commented: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-28 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703543#action_12703543
 ] 

Eks Dev commented on LUCENE-1619:
-

thanks Mike

> TermAttribute.termLength() optimization
> ---
>
> Key: LUCENE-1619
> URL: https://issues.apache.org/jira/browse/LUCENE-1619
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1619.patch
>
>
>public int termLength() {
>  initTermBuffer(); // This patch removes this method call 
>  return termLength;
>}
> I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
> could be wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: new TokenStream api Question

2009-04-28 Thread Michael McCandless
This sounds like a good change!

Then we'd un-deprecate Token?  We could in fact then fix all core
tokenizers to use Tokens again.

I think given how simple these interfaces would be, it's an OK
situation to use interfaces? (Ie we disregard the normal back-compat
curse with interfaces).

Mike

On Tue, Apr 28, 2009 at 4:22 AM, Michael Busch  wrote:
> Hi Eks Dev,
>
> I actually started experimenting with changing the new API slightly to
> overcome one drawback: with the variables now distributed over various
> Attribute classes (vs. being in a single class Token previously), cloning a
> "Token" (i.e. calling captureState()) is more expensive. This slows down the
> CachingTokenFilter and Tee/Sink-TokenStreams.
>
> So I was thinking about introducing interfaces for each of the Attributes.
> E.g. OffsetAttribute would then be an interface with all current methods,
> and OffsetAttributeImpl would be its implementation. The user would still
> use the API in exactly the same way as now, that is be e.g. calling
> addAttribute(OffsetAttribute.class), and the code takes care of
> instantiating the right class. However, there would then also be an API to
> pass in an actual instance, and this API would use reflection to find all
> interfaces that the instances implements. All of those interfaces that
> extend the Attribute interface would be added to the AttributeSource map,
> with the instance as the value.
>
> Then the Token class would implement all six attribute interfaces. An expert
> user could decide to pass in a Token instance instead of calling
> addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
> Then the attribute source would only contain a single instance that needs to
> be cloned in captureState(), making cloning much faster. And a (probably
> also expert) user could even implement an own class that implements exactly
> the necessary interfaces (maybe only 3 of the 6 provided), and make cloning
> faster than it is even with the old Token-based API.
>
> And of course also in your case could you just create a different
> implementation of such an interface, right? I think what's nice about this
> change is that it doesn't make it more complicated to use the TokenStream
> API, and the indexing pipeline still uses it the same way too, yet it's more
> extensible more expert users and possible to achieve the same or even better
> cloning performance.
>
> I will open a new Jira issue for this soon. But I'd be happy to hear
> feedback about the proposed changes, and especially if you think these
> changes would help you for your usecase.
>
> -Michael
>
> On 4/27/09 1:49 PM, eks dev wrote:
>
> Should I create a patch with something like this?
>
> With "Expert" javadoc, and explanation what is this good for should be a
> nice addition to Attribute cases.
> Practically, it would enable specialization of "hard linked" Attributes like
> TermAttribute.
>
> The only preconditions are:
>
> - "Specialized Attribute" must extend one of the "hard linked" ones, and
> provide class of it
> - Must implement default constructor
> - should extend by not introducing state (big majority of cases) (not to
> break captureState())
>
> The last one could be relaxed i guess, but I am not yet 100% familiar with
> this code.
>
> Use cases for this are along the lines of my example, smaller, easier user
> code and performance (token filters mainly)
>
>
>
> - Original Message 
>
>
> From: Uwe Schindler 
> To: java-dev@lucene.apache.org
> Sent: Sunday, 26 April, 2009 23:03:06
> Subject: RE: new TokenStream api Question
>
> There is one problem: if you extend TermAttribute, the class is different
> (which is the key in the attributes list). So when you initialize the
> TokenStream and do a
>
> YourClass termAtt = (YourClass) addAttribute(YourClass.class)
>
> ...you create a new attribute. So one possibility would be to also specify
> the instance and save the attribute by class (as key), but with your
> instance. If you are the first one that creates the attribute (if it is a
> token stream and not a filter it is ok, you will be the first, it adding the
> attribute in the ctor), everything is ok. Register the attribute by yourself
> (maybe we should add a specialized addAttribute, that can specify a instance
> as default)?:
>
> YourClass termAtt = new YourClass();
> attributes.put(TermAttribute.class, termAtt);
>
> In this case, for the indexer it is a standard TermAttribute, but you can
> more with it.
>
> Replacing TermAttribute by an own class is not possible, as the indexer will
> get a ClassCastException when using the instance retrieved with
> getAttribute(TermAttribute.class).
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>
> -Original Message-
> From: eks dev [mailto:eks...@yahoo.co.uk]
> Sent: Sunday, April 26, 2009 10:39 PM
> To: java-dev@lucene.apache.org
> Subject: new TokenStream api Question
>
>
> I am j

[jira] Created: (LUCENE-1620) How to index and Search the special characters as well as non-english characters like danish Å,ø,etc

2009-04-28 Thread uday kumar maddigatla (JIRA)
How to index and Search the special characters as well as non-english 
characters like danish  Å,ø,etc
-

 Key: LUCENE-1620
 URL: https://issues.apache.org/jira/browse/LUCENE-1620
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index, QueryParser, Search
Affects Versions: 2.4.1
 Environment: windows xp,jdk 6
Reporter: uday kumar maddigatla


Hi 

I just started to use Lucene. I found one thing unusual. A few of my documents 
contain english and xml elements as well as danish elements.

I just used StandardAnalyzer to index the documents and I found that this 
analyzer is not recognizing the special characters as well as the danish elements.

So how can I use Lucene in order to index the special characters like '(', ')', 
':' etc. and also danish elements like Å, ø, etc.?



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: RangeQuery and getTerm

2009-04-28 Thread Mark Miller
Okay, I agree - best would be to lose the method that does not make 
sense for all multiterm queries.


I'll work on deprecating it and moving getTerm up to the sub queries 
that it makes sense for.


- Mark

Uwe Schindler wrote:

During my implementation of trie range, I was always wondering why
MultiTermQuery has this method. It seems to be a relic from the past. The
term is only used in Fuzzy* (as far as I have seen).

Why not deprecate getTerm() in MultiTermQuery, remove the field in
MultiTermQuery and all related occurrences? The field and methods are then
*not* deprecated and sensibly implemented in Fuzzy*.

In my opinion, the MultiTermQuery should only provide the functionality to
handle FilteredTermEnums and everything else should be left to the
implementor.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, April 28, 2009 3:28 AM
To: java-dev@lucene.apache.org
Subject: RangeQuery and getTerm

RangeQuery is based on two terms rather than one, and currently returns
null from getTerm.

This can lead to less than obvious null pointer exceptions. I'd almost
prefer to throw UnsupportedOperationException.

However, returning null allows you to still use getTerm on
MultiTermQuery and do a null check in the RangeQuery case. Not sure how
valuable that really is though.

Thoughts?

--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

  



--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-28 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703613#action_12703613
 ] 

Shai Erera commented on LUCENE-1593:


bq. But actually: the thing calling scoresDocsInOrder will in fact only be 
calling that method if it intends to use the scorer as a toplevel scorer

Are you sure? The way I understand it IndexSearcher will call 
weight.getQuery().scoresDocInOrder() in the search methods that create a 
Collector, in order to know whether to create an "in-order" Collector or 
"out-of-order" Collector. At this point it does not know whether it will use 
the scorer as a top-level or not. Unless we duplicate the logic of doSearch 
into those methods (i.e. if there is a filter know it'll be used as a top-level 
Collector), but I really don't like to do that.

I still think there are two issues here that need to be addressed separately:
# Allowing IS as well as any Collector-creating code to create the right 
Collector instance - in/out-of order. That is achievable by adding 
scoresDocsInOrder() to Query, defaulting to false (for back-compat) and 
override in all Query implementations, where it makes sense. For BQ I think it 
should remain false, with a TODO to change in 3.0 (see second bullet).
# Clearly separate between BS and BS2, i.e. have BW create one of them 
explicitly without wrapping or anything. That is achievable, I think, by adding 
topScorer() to Weight and call it from IS. Then in BW we do whatever 
BS2.scorer(Collector) does today, hopefully we can inline it in BW. But that 
can happen only in 3.0. We then change scoresDocsInOrder to return false only 
if BQ was set to return docs out of order and there are 0 required 
scorers and < 32 prohibited scorers (the same logic as in BS2.score(Collector)).

BTW, #2 above does not mean we cannot optimize initCountingSumScorer - if we 
add start() to DISI, then in BS2 we can override it to initialize CSS, and 
call start() from IS.doSearch before it starts iterating. In 
score(Collector) it will check if it's initialized only once, so it should be 
ok?

What do you think?

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene 2.9 status (to port to Lucene.Net)

2009-04-28 Thread Uwe Schindler
Hi Mike,

> This is great feedback on the new Collector API, Uwe.  Thanks!

- Likewise.

> It's awesome that you no longer have to warm your searchers... but be
> careful when a large segment merge commits.

I know this, but in our case (e.g. creating an SQL IN list, collecting
measurement parameters from the documents) the warming is not really needed;
it would only be a problem if it happened very often (the index is updated every
20 minutes) and had to reload the whole field cache (which takes 3-5 seconds on
our machine). So a large merge taking 1-2 seconds for cache reloading is no
problem (the users have the same problem with sorted results). If our index
gets bigger, I will add warming in my search/cache implementation after
reopening; for that it would be nice to have the list of reopened segments
(I think there was an issue about it, or is there an implementation?).
In our case, most of the time is taken by the query in the SQL data warehouse
afterwards, so one additional second for building the SQL query is not much.
 
> Did you hit any snags/problems/etc. that we should fix before releasing
> 2.9?

Until now, I have not seen any further problems. What I had seen before is
already implemented in Lucene, thanks to our active issue communication and all
these issues :-)

I am still waiting for the step of moving trie (and also the new automaton
regex query) to core, and for the modularization (hopefully before 2.9, so we do
not create new APIs that change/deprecate later).

Uwe

> Mike
> 
> On Sun, Apr 26, 2009 at 9:54 AM, Uwe Schindler  wrote:
> > Some status update:
> >
> >> > George, did you mean LUCENE-1516 below?  (LUCENE-1313 is a further
> >> > improvement to near real-time search that's still being iterated on).
> >> >
> >> > In general I would say 2.9 seems to be in rather active development
> >> still
> >> > ;)
> >> >
> >> > I too would love to hear about production/beta use of 2.9.  George
> >> > maybe you should re-ask on java-user?
> >>
> >> Here! I updated www.pangaea.de to Lucene-trunk today (because of
> >> incomplete
> >> hashcode in TrieRangeQuery)... Works perfect, but I do not use the
> >> realtime
> >> parts. And 10 days before the same, no problems :-)
> >>
> >> Currently I rewrite parts of my code to Collector to go away from
> >> HitCollector (without score, so optimizations)! The reopen() and
> sorting
> >> is
> >> fine, almost no time is consumed for sorted searches after reopening
> >> indexes
> >> every 20 minutes with just some new and small segments with changed
> >> documents. No extra warming is needed.
> >
> > I rewrote my collectors now to use the new API. Even through the number
> of
> > methods to overwrite in the new collector is 3 instead of 1, the code
> got
> > shorter (because the collect methods now can throw IOExceptions,
> great!!!).
> > What is also perfect is the way how to use a FieldCache: Just retrieve
> the
> > FieldCache array (e.g. getInts()) in the setNextReader() method and use
> the
> > value array in the collect() method with the docid as index. Now I am
> able
> > to e.g. retrieve cached values even after an index reopen without
> warming
> > (same with sort). In the past you had to use a cache array for the whole
> > index. The docBase is not used in my code, as I directly access the
> index
> > readers. So users now have both possibilities: use the supplied reader
> or
> > use the docBase as index offset into the searcher/main reader. Really
> cool!
> >
> > The overhead of score calculation can be left out, if not needed, also
> cool!
> >
> > One of my collectors is used retrieve the database ids (integers) for
> > building up a SQL "IN (...)" from the field cache based on the collected
> > hits. In the past this was very complicated, because FieldCache was slow
> > after reopening and getting stored fields (the ids) is also very slow
> (inner
> > search loop). Now it's just 10 lines of code and no score is involved.
> >
> > The new code is working now in production at PANGAEA.
> >
> >> Another change to be done here is Field.Store.COMPRESS and replace by
> >> manually compressed binary stored fields, but this is only to get rid
> of
> >> the
> >> deprecated warnings. But this cannot be done without complete
> reindexing.
> >>
> >> Uwe
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-04-28 Thread uday kumar maddigatla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703598#action_12703598
 ] 

uday kumar maddigatla commented on LUCENE-1488:
---

Hi,

I am facing the same problem too. My document contains English as well as
Danish elements.

I tried to use this analyzer, and when I do I get this error:

Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.lucene.analysis.icu.ICUAnalyzer.tokenStream(ICUAnalyzer.java:74)
    at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:48)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:117)
    at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:743)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1918)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
    at com.IndexFiles.indexDocs(IndexFiles.java:87)
    at com.IndexFiles.indexDocs(IndexFiles.java:80)
    at com.IndexFiles.main(IndexFiles.java:57)
Caused by: java.lang.IllegalArgumentException: Error 66063 at line 2 column 17
    at com.ibm.icu.text.RBBIRuleScanner.error(RBBIRuleScanner.java:505)
    at com.ibm.icu.text.RBBIRuleScanner.scanSet(RBBIRuleScanner.java:1047)
    at com.ibm.icu.text.RBBIRuleScanner.doParseActions(RBBIRuleScanner.java:484)
    at com.ibm.icu.text.RBBIRuleScanner.parse(RBBIRuleScanner.java:912)
    at com.ibm.icu.text.RBBIRuleBuilder.compileRules(RBBIRuleBuilder.java:298)
    at com.ibm.icu.text.RuleBasedBreakIterator.compileRules(RuleBasedBreakIterator.java:316)
    at com.ibm.icu.text.RuleBasedBreakIterator.(RuleBasedBreakIterator.java:71)
    at org.apache.lucene.analysis.icu.ICUBreakIterator.(ICUBreakIterator.java:53)
    at org.apache.lucene.analysis.icu.ICUBreakIterator.(ICUBreakIterator.java:45)
    at org.apache.lucene.analysis.icu.ICUTokenizer.(ICUTokenizer.java:58)
    ... 12 more

Please help me with this.

> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert
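
For readers following along, a minimal sketch of the underlying idea (plain ICU4J,
not the attached ICUAnalyzer patch): ICU4J's default word BreakIterator already follows
the Unicode word-break properties (UAX #29), so boundaries come out right for non-Latin
scripts and for decomposed accents.

{noformat}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class IcuWordBreakDemo {
  public static void main(String[] args) {
    // Latin with a combining accent plus Thai (hypothetical sample text)
    String text = "cafe\u0301 \u0e04\u0e27\u0e32\u0e21\u0e2a\u0e38\u0e02 test";
    BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
      String token = text.substring(start, end);
      if (token.trim().length() > 0) { // skip whitespace "tokens"
        System.out.println(token);
      }
    }
  }
}
{noformat}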

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Michael McCandless
On Tue, Apr 28, 2009 at 6:09 AM, Shai Erera  wrote:
> Hi
>
> I think I've hit a bug in ConcurrentMergeScheduler, but I'd like those who
> are more familiar with the code to review it. I ran
> TestStressSort.testSort() and started to get AIOOB exceptions from
> MergeThread, the CPU spiked to 98-100% and did not end for a couple of
> minutes, until I was able to regain control and kill the process (looks like
> an infinite loop).
>
> To reproduce it, all you need to do is add the following line to
> PQ.initialize(): size = maxSize, and then you'll get the aforementioned
> exceptions. I did it accidentally, but I'm sure there's a way to reproduce
> it with a JUnit test or something so that it will happen consistently.

It sounds like this caused every merge to always hit an exception while merging?

> When I debug-traced the test, I noticed that MergeThreads are just spawned
> forever. The reason is this: In CMS.merge(IndexWriter) there's a 'while
> (true)' loop which does 'while (mergeThreadCount() >= maxThreadCount)' and
> if false just spawns a new MergeThread. On the other hand, in
> MergeThread.run there's a try-finally which executes whatever it needs to
> execute and in the finally block removes this thread from the list of
> threads. That causes CMS to spawn a new thread, which will hit another
> exception, remove itself from the queue and CMS will spawn a new thread.
> That puts the code into an infinite loop.

Unfortunately, this is tricky to fix correctly, because
IW/MergeScheduler knows so little about what actually failed.

Right now, if a merge hits an exception, IW simply undoes everything
it did (removes partial files, allows new merges to try merging the
segments again, etc).

If it was some transient IO error, or say a transient disk full
situation, retrying the merge seems good.  But if it's some
[temporary] bug in Lucene, and every merge will always hit an
exception, then retrying is hopeless.  Likewise a corrupt index, a
disk full that won't clear up, sudden permission denied errors on
opening new files, etc., retrying is hopeless.

> That sounds like a bug to me ... I think that if MergeThread hits any
> exception, the merge should fail? Anyway, the exception is added to an
> exceptions List, which is a private member of CMS but is never checked by
> CMS. Perhaps merge(IndexWriter) should check if the exceptions list is not
> empty and fail the merge in that case?

Actually the merge does "fail", and IW "undoes" its changes, but then
MergePolicy is free to pick merges again, and in turn picks the very
same merge.

The exceptions list is normally only checked during Lucene's unit
tests, but apps could check it as well.

> Anyway, I'll fix PQ's code now to continue my work, but if you want to
> reproduce it, it's as easy as adding size = maxSize to initialize() and run
> TestStressSort.
>
> I don't mind to open an issue and fix it (though I'm not sure what should
> the fix be at the moment, but I'll figure it out), but it will have to wait,
> so if you know the code and can put a patch together quickly, don't wait up
> for me :)

I think maybe the best/simplest fix is to simply sleep for a bit (250
msec?) on hitting an exception while merging?  This way CPU won't be
pegged, you won't suddenly see zillions of exceptions streaming by,
etc.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703600#action_12703600
 ] 

Michael McCandless commented on LUCENE-1593:


bq. I actually prefer to add a boolean prePopulate to HitQueue's ctor.

+1
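
For illustration only, a self-contained toy version of the sentinel idea being agreed
to here (hypothetical names; not the actual HitQueue patch): the queue starts out filled
to capacity with entries scored Float.NEGATIVE_INFINITY, so the hot collect path never
needs a null check or a size check -- every real hit just competes against the current
worst entry.

{noformat}
import java.util.Comparator;
import java.util.PriorityQueue;

public class SentinelQueueDemo {

  static final class Hit {
    final int doc;
    final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
  }

  public static void main(String[] args) {
    final int maxSize = 3;
    PriorityQueue<Hit> pq = new PriorityQueue<Hit>(maxSize, new Comparator<Hit>() {
      public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
    });

    // pre-populate with sentinels that any real hit will beat
    for (int i = 0; i < maxSize; i++) {
      pq.add(new Hit(-1, Float.NEGATIVE_INFINITY));
    }

    float[] scores = { 0.3f, 2.0f, 0.1f, 1.5f, 0.7f };
    for (int doc = 0; doc < scores.length; doc++) {
      if (scores[doc] > pq.peek().score) { // single comparison, no null/size checks
        pq.poll();                         // drop the current worst (maybe a sentinel)
        pq.add(new Hit(doc, scores[doc]));
      }
    }

    while (!pq.isEmpty()) {
      Hit h = pq.poll();
      System.out.println("doc=" + h.doc + " score=" + h.score);
    }
  }
}
{noformat}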

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery

2009-04-28 Thread Mark Miller (JIRA)
deprecate term and getTerm in MultiTermQuery


 Key: LUCENE-1621
 URL: https://issues.apache.org/jira/browse/LUCENE-1621
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9


This means moving getTerm and term up to sub classes as appropriate and 
reimplementing equals, hashcode as appropriate in sub classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Shai Erera
Every merge hit the exception, yes.

And actually, the exceptions list is not used anywhere besides MT adding the
exception to the list. That's why I was curious why it's there.

I still think we should protect against this case somehow, because even if it hits a
disk-full exception, there's no point continuing to run infinitely? So maybe
before spawning the next thread, check the exceptions list and, if it goes
over a certain threshold (10?), fail?

On Tue, Apr 28, 2009 at 3:23 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Tue, Apr 28, 2009 at 6:09 AM, Shai Erera  wrote:
> > Hi
> >
> > I think I've hit a bug in ConcurrentMergeScheduler, but I'd like those
> who
> > are more familiar with the code to review it. I ran
> > TestStressSort.testSort() and started to get AIOOB exceptions from
> > MergeThread, the CPU spiked to 98-100% and did not end for a couple of
> > minutes, until I was able to regain control and kill the process (looks
> like
> > an infinite loop).
> >
> > To reproduce it, all you need to do is add the following line to
> > PQ.initialize(): size = maxSize, and then you'll get the aforementioned
> > exceptions. I did it accidentally, but I'm sure there's a way to reproduce
> > it with a JUnit test or something so that it will happen consistently.
>
> It sounds like this caused every merge to always hit an exception while
> merging?
>
> > When I debugged-trace the test, I noticed that MergeThread are just
> spawned
> > forever. The reason is this: In CMS.merge(IndexWriter) there's a 'while
> > (true)' loop which does 'while (mergeThreadCount() >= maxThreadCount)'
> and
> > if false just spawns a new MergeThread. On the other hand, in
> > MergeThread.run there's a try-finally which executes whatever it needs to
> > execute and in the finally block removes this thread from the list of
> > threads. That causes CMS to spawn a new thread, which will hit another
> > exception, remove itself from the queue and CMS will spawn a new thread.
> > That puts the code into an infinite loop.
>
> Unfortunately, this is tricky to fix correctly, because
> IW/MergeScheduler knows so little about what actually failed.
>
> Right now, if a merge hits an exception, IW simply undoes everything
> it did (removes partial files, allows new merges to try merging the
> segments again, etc).
>
> If it was some transient IO error, or say a transient disk full
> situation, retrying the merge seems good.  But if it's some
> [temporary] bug in Lucene, and every merge will always hit an
> exception, then retrying is hopeless.  Likewise a corrupt index, a
> disk full that won't clear up, sudden permission denied errors on
> opening new files, etc., retrying is hopeless.
>
> > That sounds like a bug to me ... I think that if MergeThread hits any
> > exception, the merge should fail? Anyway, the exception is added to an
> > exceptions List, which is a private member of CMS but is never checked by
> > CMS. Perhaps merge(IndexWriter) should check if the exceptions list is
> not
> > empty and fail the merge in such case?
>
> Actually the merge does "fail", and IW "undoes" its changes, but then
> MergePolicy is free to pick merges again, and in turn picks the very
> same merge.
>
> The exceptions list is normally only checked during Lucene's unit
> tests, but apps could check it as well.
>
> > Anyway, I'll fix PQ's code now to continue my work, but if you want to
> > reproduce it, it's as easy as adding size = maxSize to initialize() and
> run
> > TestStressSort.
> >
> > I don't mind to open an issue and fix it (though I'm not sure what should
> > the fix be at the moment, but I'll figure it out), but it will have to
> wait,
> > so if you know the code and can put a patch together quickly, don't wait
> up
> > for me :)
>
> I think maybe the best/simplest fix is to simply sleep for a bit (250
> msec?) on hitting an exception while merging?  This way CPU won't be
> pegged, you won't suddenly see zillions of exceptions streaming by,
> etc.
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: Lucene 2.9 status (to port to Lucene.Net)

2009-04-28 Thread Michael McCandless
On Tue, Apr 28, 2009 at 8:10 AM, Uwe Schindler  wrote:

>> It's awesome that you no longer have to warm your searchers... but be
>> careful when a large segment merge commits.
>
> I know this, but in our case (e.g. creating an SQL IN list, collecting
> measurement parameters from the documents) the warming is not really needed;
> it would only be a problem if it happened very often (the index is updated every
> 20 minutes) and had to reload the whole field cache (which takes 3-5 seconds on
> our machine). So a large merge taking 1-2 seconds for cache reloading is no
> problem (the users have the same problem with sorted results). If our index
> gets bigger, I will add warming in my search/cache implementation after
> reopening; for that it would be nice to have the list of reopened segments
> (I think there was an issue about it, or is there an implementation?).
> In our case, most of the time is taken by the query in the SQL data warehouse afterwards,
> so an additional second for building the SQL query is not much.

OK that's great.

>> Did you hit any snags/problems/etc. that we should fix before releasing
>> 2.9?
>
> Until now, I have not seen any further problems. What I had seen before is
> already implemented in Lucene thanks to our active issue communication and all
> these issues :-)

Tell me about it... hard to keep them all straight!  Lots of great
improvements in 2.9...

> I am still waiting for the step of moving trie (and also the new automaton
> regex query) to core, and for the modularization (hopefully before 2.9, so we do not
> create new APIs that change or get deprecated later).

+1

We need to do something about modularization / move trie to core before 2.9.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Michael McCandless
On Tue, Apr 28, 2009 at 8:28 AM, Shai Erera  wrote:
> Every merge hit the exception, yes.
>
> And actually, the exceptions list is not used anywhere besides MT adding the
> exception to the list. That's why I was curious why it's there.

It's there so "anyUnhandledExceptions" can be called; we could add a
getter so that an app could query the CMS to see if there were
exceptions?  But, an app can also subclass CMS & override
handleMergeException to do something.
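
A rough sketch of that subclassing approach, combined with the sleep-on-exception idea
from earlier in this thread (hypothetical subclass; the 250 msec pause is arbitrary):

import org.apache.lucene.index.ConcurrentMergeScheduler;

public class BackoffMergeScheduler extends ConcurrentMergeScheduler {
  protected void handleMergeException(Throwable exc) {
    // Back off briefly so repeated merge failures don't saturate the CPU
    // with immediately re-spawned merge threads; transient conditions
    // (e.g. disk full) get a chance to clear up.
    try {
      Thread.sleep(250);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    super.handleMergeException(exc); // keep the default behavior (record/rethrow)
  }
}

It would be installed on the writer with IndexWriter.setMergeScheduler(new BackoffMergeScheduler()).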

> I still think we should protect this case somehow, because even if it hits a
> disk-full exception, there's no point continuing to run infinitely?

The disk full could clear up, eg if something external was trying to
copy a massive file onto the same disk, IW could hit disk full, then
the copy would fail and remove the partially copied massive file, and
lots of space becomes available again.

> So maybe
> before spawning the next thread, check the exceptions list and if it goes
> over a certain threshold (10?) fail?

But what does "fail" mean?  Stop doing any merges forever, for this IW
instance?  That seems dangerous.  EG maybe the massive merge will
fail, but little merges can proceed.  Also, stopping merging like this
spooks me because you might see a few exceptions, but then they stop
and you think all is good but in fact way too many segments are piling
up.

We already "fail" now, in that the thread that was doing the merge
will throw an exception up to the JRE's default thread exception
handler.  It's just that we then let MergePolicy select that merge
again.

I like the "sleep on exception" because 1) you will continue to see
that exceptions are being thrown, but 2) it won't saturate your CPU.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery

2009-04-28 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1621:


Attachment: LUCENE-1621.patch

a quick first pass at this

> deprecate term and getTerm in MultiTermQuery
> 
>
> Key: LUCENE-1621
> URL: https://issues.apache.org/jira/browse/LUCENE-1621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1621.patch
>
>
> This means moving getTerm and term up to sub classes as appropriate and 
> reimplementing equals, hashcode as appropriate in sub classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery

2009-04-28 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1621:


Component/s: Search

> deprecate term and getTerm in MultiTermQuery
> 
>
> Key: LUCENE-1621
> URL: https://issues.apache.org/jira/browse/LUCENE-1621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1621.patch
>
>
> This means moving getTerm and term up to sub classes as appropriate and 
> reimplementing equals, hashcode as appropriate in sub classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703630#action_12703630
 ] 

Tim Smith commented on LUCENE-1618:
---

{quote}
You mean an opened IndexOutput would write its output to two (or more) 
different places? So you could "write through" a RAMDir down to an FSDir? (This 
way both the RAMDir and FSDir have a copy of the index).
{quote}

Yes, so if you register more than one directory for "index files", then the 
IndexOutput for the directory would dispatch to an IndexOutput for both sub-directories.
The IndexInput would then only be opened on the "primary" directory (for 
instance, the RAM directory).

This will allow extremely fast searches, with the persistence of a backing 
FSDirectory.

Coupled with having a separate set of directories for the "Stored Documents", this 
then allows:
* RAM directory search speed
* All changes persisted to disk
* Documents stored (and retrieved from disk, or optionally retrieved from RAM)
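
A rough sketch of the kind of "tee" output described here (assumes the 2.9-era
IndexOutput abstract methods; a complete solution would also need a wrapping Directory
and careful error handling):

{noformat}
import java.io.IOException;

import org.apache.lucene.store.IndexOutput;

// Every write goes to a primary output (e.g. a RAMDirectory file) and a
// secondary output (e.g. the backing FSDirectory file); reads would then
// be served from the primary directory only.
public class TeeIndexOutput extends IndexOutput {
  private final IndexOutput primary;
  private final IndexOutput secondary;

  public TeeIndexOutput(IndexOutput primary, IndexOutput secondary) {
    this.primary = primary;
    this.secondary = secondary;
  }

  public void writeByte(byte b) throws IOException {
    primary.writeByte(b);
    secondary.writeByte(b);
  }

  public void writeBytes(byte[] b, int offset, int length) throws IOException {
    primary.writeBytes(b, offset, length);
    secondary.writeBytes(b, offset, length);
  }

  public void flush() throws IOException {
    primary.flush();
    secondary.flush();
  }

  public void close() throws IOException {
    primary.close();
    secondary.close();
  }

  public long getFilePointer() {
    return primary.getFilePointer(); // both outputs are kept in lock-step
  }

  public void seek(long pos) throws IOException {
    primary.seek(pos);
    secondary.seek(pos);
  }

  public long length() throws IOException {
    return primary.length();
  }
}
{noformat}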


> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Shai Erera
>
> It's there so "anyUnhandledExceptions" can be called;
>

I will check the code again, but I remember that after commenting it out, the
only compile errors I saw were from MergeThread adding the exception ...
Perhaps I'm missing something, so I'll re-check the code.

I understand your point now - merging is an internal process of IW,
therefore there's no real user to notify of errors (e.g., even if IW knew
there was an error, what would it do exactly?), and I guess continuing to try to
execute the merges is reasonable (while throwing the exceptions further, in the
hope that the user code will catch them and do something with them).

BTW, what will happen if I encounter such exceptions, thrown
repeatedly, and shut down the JVM? I guess the index will not be in a corrupt
state, right? The next time I open it, it should be in a state prior to
the merge, or at least prior to the merge that failed?

I think that sleeping in case of exceptions makes sense .. in case of IO
errors that are temporary, this will not spawn threads endlessly, and
sleeping will give an opportunity for the IO problem to resolve. In case of
bugs, which are supposed to be detected during test time, it should give the
developer a chance to kill the process relatively quickly.


On Tue, Apr 28, 2009 at 2:39 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Tue, Apr 28, 2009 at 8:28 AM, Shai Erera  wrote:
> > Every merge hit the exception, yes.
> >
> > And actually, the exceptions list is not used anywhere besides MT adding
> the
> > exception to the list. That's why I was curious why it's there.
>
> It's there so "anyUnhandledExceptions" can be called; we could add a
> getter so that an app could query the CMS to see if there were
> exceptions?  But, an app can also subclass CMS & override
> handleMergeException to do something.
>
> > I still think we should protect this case somehow, because even if it
> hits a
> > disk-full exception, there's no point continuing to run infinitely?
>
> The disk full could clear up, eg if something external was trying to
> copy a massive file onto the same disk, IW could hit disk full, then
> the copy would fail and remove the partially copied massive file, and
> lots of space becomes available again.
>
> > So maybe
> > before spawning the next thread, check the exceptions list and if it goes
> > over a certain threshold (10?) fail?
>
> But what does "fail" mean?  Stop doing any merges forever, for this IW
> instance?  That seems dangerous.  EG maybe the massive merge will
> fail, but little merges can proceed.  Also, stopping merging like this
> spooks me because you might see a few exceptions, but then they stop
> and you think all is good but in fact way too many segments are piling
> up.
>
> We already "fail" now, in that the thread that was doing the merge
> will throw an exception up to the JRE's default thread exception
> handler.  It's just that we then let MergePolicy select that merge
> again.
>
> I like the "sleep on exception" because 1) you will continue to see
> that exceptions are being thrown, but 2) it won't saturate your CPU.
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-04-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703645#action_12703645
 ] 

Robert Muir commented on LUCENE-1488:
-

What version of ICU4J are you using? It needs to be >= 4.0.
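
A quick way to check which ICU4J is actually on the classpath (a small sketch;
VersionInfo.ICU_VERSION is ICU4J's own version constant):

{noformat}
import com.ibm.icu.util.VersionInfo;

public class IcuVersionCheck {
  public static void main(String[] args) {
    // The ICUAnalyzer patch needs ICU4J 4.0 or later.
    System.out.println("ICU4J version: " + VersionInfo.ICU_VERSION);
  }
}
{noformat}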

> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Michael McCandless
On Tue, Apr 28, 2009 at 9:27 AM, Shai Erera  wrote:
>> It's there so "anyUnhandledExceptions" can be called;
>
> I will check the code again, but I remember that after commenting it, the
> only compile errors I saw were from MergeThread adding the exception ...
> perhaps I'm missing something, so I'll re-check the code.

The oal.util.LuceneTestCase wraps each test so that if any unhandled
CMS exceptions happened during a test, the test fails.

> I understand your point now - merging is an internal process to IW,
> therefore there's no real user to notify on errors (e.g., even if IW knew
> there is an error, what would it do exactly?), and I guess keep trying to
> execute the merges is reasonable (while throwing the exceptions further - in
> hope that the user code will catch those and do something with them).

OK.

> BTW, what will happen if I encounter such exceptions, that are thrown
> repeatedly, and shutdown the JVM? I guess the index will not be in a corrupt
> state, right? The next time I'll open it, it should be in a state prior to
> the merge, or at least prior to the merge that failed?

The index will never be corrupt (except for bugs ;) ) -- it will
contain whatever the last commit was (which is typically well before
the merges started).  EG w/ autoCommit=false, the index as of when you
last called commit (or as of when you opened it, if you have not
called commit) will remain intact on an abnormal shutdown.

> I think that sleeping in case of exceptions makes sense .. in case of IO
> errors that are temporary, this will not spawn threads endlessly, and
> sleeping will give an opportunity for the IO problem to resolve. In case of
> bugs, which are supposed to be detected during test time, it should give the
> developer a chance to kill the process relatively quickly.

OK I'll commit this.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703651#action_12703651
 ] 

Michael McCandless commented on LUCENE-1618:


Neat.  This is sounding like one cool Directory...

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703656#action_12703656
 ] 

Earwin Burrfoot commented on LUCENE-1618:
-

bq. You mean an opened IndexOutput would write its output to two (or more) 
different places?
Except that the best way is to write directly to the FSDir IndexOutput and, when it is 
closed, read the file back into memory.
That way, if the FSDir IndexOutput hits an exception while writing, you don't have to jump 
through hoops to keep your RAMDir in a consistent state (we had real trouble 
when some files were 'written' to the RAMDir but failed to persist in the FSDir).
Also, when reading the file back you already know its exact size and can 
allocate an appropriately sized buffer, saving on resizing (my draft impl) / chunking 
(Lucene's current impl) overhead.
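
A sketch of that read-back step (hypothetical helper, kept deliberately simple; the
attached MemoryCachedDirectory is the actual sample): once the FSDirectory file is
closed, copy it into the RAM directory using a single buffer of the known size.

{noformat}
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

public final class ReadBackHelper {
  private ReadBackHelper() {}

  // Copies a finished file from fsDir into ramDir (sketch: assumes the
  // file fits into a single int-sized buffer).
  public static void readBack(Directory fsDir, Directory ramDir, String fileName)
      throws IOException {
    IndexInput in = fsDir.openInput(fileName);
    try {
      IndexOutput out = ramDir.createOutput(fileName);
      try {
        byte[] buffer = new byte[(int) in.length()]; // exact size is known up front
        in.readBytes(buffer, 0, buffer.length);
        out.writeBytes(buffer, 0, buffer.length);
      } finally {
        out.close();
      }
    } finally {
      in.close();
    }
  }
}
{noformat}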

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1618:


Attachment: MemoryCachedDirectory.java

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703658#action_12703658
 ] 

Yonik Seeley commented on LUCENE-1618:
--

As it relates to near real time, the search speed of the RAM directory in 
relation to FSDirectory seems unimportant (what is this diff anyway?) - the 
FSDirectory will be much larger and that is where the bulk of the search time 
will be.

It seems like the main benefit of RAMDirectory for NRT is faster creation time 
(no need to create on-disk files, write them, then sync them), right?  Actually 
the sync is only needed if a new segments file will be written... but there 
still may be synchronous metadata operations for open-write-close of a file, 
depending on the FS?


> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-28 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703660#action_12703660
 ] 

Marvin Humphrey commented on LUCENE-1614:
-

> nudge doesn't sound like it changes anything, but just "touches".

If you say so.  In Lucy, I expect we'll use "next" and "advance".  

> if distinct method names is what we're after

Yes, that's the idea.  These two methods are very different from each other.
The official definition of skipTo() has many subtle gotchas.  Just because
they both move the iterator forward doesn't mean they do the same thing, and
it is cumbersome and taxing to have to differentiate between methods using
long-form signatures in the midst of standard prose.

There's no good reason to conflate these two methods, just as there's no 
good reason why we should be forced to write "search(Collector)" instead
of "collect()" or "collectHits()".

> I prefer nextDoc() and skipToDoc() or advance() for the latter. 

IMO, "advance" more accurately describes what that method does than either
"skipTo" or "skipToDoc".  The problem is that if you're on doc 10, then
skipToDoc(10) doesn't, in fact, skip to doc 10 as the method name implies --
it takes you to at least doc 11.  Furthermore, "advance" reinforces that you
can only seek forwards.


> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly
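
A toy illustration of the proposed shape (names taken from the discussion above; the
real patch may differ): nextDoc()/advance(int) return the doc id they land on, or -1
when the iterator is exhausted, so callers no longer need a separate doc() call after
each move.

{noformat}
public class DocIdIteratorDemo {

  // A trivial iterator over a sorted array of doc ids.
  static class ArrayDocIdIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdIterator(int[] docs) { this.docs = docs; }

    // Advances by one; returns the new doc id, or -1 if there are no more docs.
    int nextDoc() {
      return ++pos < docs.length ? docs[pos] : -1;
    }

    // Advances to the first doc id >= target; returns it, or -1 if exhausted.
    int advance(int target) {
      int doc;
      while ((doc = nextDoc()) >= 0 && doc < target) {
        // keep skipping
      }
      return doc;
    }
  }

  public static void main(String[] args) {
    ArrayDocIdIterator it = new ArrayDocIdIterator(new int[] { 2, 5, 9, 17 });
    int doc;
    while ((doc = it.nextDoc()) >= 0) { // the iteration idiom proposed in the issue
      System.out.println("doc " + doc);
    }
  }
}
{noformat}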

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703666#action_12703666
 ] 

Earwin Burrfoot commented on LUCENE-1618:
-

bq. what is this diff anyway?
That's not a diff; I gave a sample of the write-through RAM directory Tim and Mike 
were speaking about.

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703667#action_12703667
 ] 

Michael McCandless commented on LUCENE-1593:


bq. The way I understand it IndexSearcher will call 
weight.getQuery().scoresDocInOrder() in the search methods that create a 
Collector, in order to know whether to create an "in-order" Collector or 
"out-of-order" Collector. At this point it does not know whether it will use 
the scorer as a top-level or not. Unless we duplicate the logic of doSearch 
into those methods (i.e. if there is a filter know it'll be used as a top-level 
Collector), but I really don't like to do that.

Yeah you're right, it is in two separate places today.

Though since we are reworking how filters are applied, at that point
it may very well be in one place.

bq. Allowing IS as well as any Collector-creating code to create the right 
Collector instance - in/out-of order. That is achievable by adding 
scoresDocsInOrder() to Query, defaulting to false (for back-compat) and 
override in all Query implementations, where it makes sense. For BQ I think it 
should remain false, with a TODO to change in 3.0 (see second bullet).

OK let's tentatively move forwards with Query.scoresDocsInOrder.

bq. Clearly separate between BS and BS2, i.e. have BW create one of them 
explicitly without wrapping or anything. That is achievable, I think, by adding 
topScorer() to Weight and call it from IS. Then in BW we do whatever 
BS2.scorer(Collector) does today, hopefully we can inline it in BW. But that 
can happen only in 3.0. We then change scoresDocsInOrder to return false only 
if BQ was set to return docs out of order as well as there are 0 required 
scorers and < 32 prohibited scorers (the same logic as in BS2.score(Collector).

OK let's slate this for 3.0, then.

bq. BTW, #2 above does not mean we cannot optimize initCountingSumScorer - if 
we add start() to DISI then in BS2 we can override it to initialize CSS, and 
calling start() from IS.doSearch before it starts iterating. In 
score(Collector) it will check if it's initialized only once, so it should be 
ok?

OK let's move forwards with this too?

Phew!


> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-04-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703670#action_12703670
 ] 

Felipe Sánchez Martínez commented on LUCENE-1284:
-

Hi, 

I think that the fact that the tool relies on an external free/open-source 
package to pre-process the files to be indexed should not be an obstacle for 
the community to benefit from them; the world is pretty heterogeneous ;). 
Furthermore, the external tools are not required at search time. 

> Felipe, although Java equivalents of those command-line tools don't exist 
> currently, do you think one could implement them in Java (and release them 
> under ASL)? 

This year the Apertium project is in the Google Summer of Code. A student will 
port the lttoolbox package to Java. Note that the tool I am contributing also uses 
the Apertium tagger and that this tool will not be ported; fortunately the 
use of the tagger is optional.  The Java version of lttoolbox will be 
released under the GPL license; I am not sure whether they will agree to give it a 
dual license.

--
Felipe

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs to be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703676#action_12703676
 ] 

Yonik Seeley commented on LUCENE-1618:
--

bq.  That's not a diff

Sorry, by "diff" I meant the difference in search performance on a RAMDirectory 
vs NIOFSDirectory where the files are all cached by the OS.

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-28 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703677#action_12703677
 ] 

Marvin Humphrey commented on LUCENE-1614:
-

Further illustration...

Good method signature overloading, from IndexReader.java:

{noformat}
  public static boolean indexExists(String directory)

  public static boolean indexExists(File directory)

  public static boolean indexExists(Directory directory);
{noformat}

Bad method signature overloading, from Searcher.java:

{noformat}
  public Hits search(Query query, Filter filter, Sort sort)

  public TopFieldDocs search(Query query, Filter filter, int n, Sort sort)

  public void search(Query query, HitCollector results)
{noformat}

IMO, those three methods on Searcher should be named hits(), topFieldDocs(),
and collect(), rather than search(), search(), and search(), making code that
uses those methods more self documenting, and making it easier to discuss them
out of context.

For the same reasons, we should have different names for nextDoc() and
advance().


> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703683#action_12703683
 ] 

Michael McCandless commented on LUCENE-1618:


bq. by "diff" I meant the difference in search performance on a RAMDirectory vs 
NIOFSDirectory where the files are all cached by the OS.

It's a good question -- I haven't tested it directly.  I'd love to know too...

For an NRT writer using RAMDir for recently flushed tiny segments 
(LUCENE-1313), the gains are more about the speed of reading/writing many tiny 
files.  Probably we should try [somehow] to test this case, to see if 
LUCENE-1313 is even a worthwhile optimization.

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703684#action_12703684
 ] 

Earwin Burrfoot commented on LUCENE-1618:
-

bq. Sorry, by "diff" I meant the difference in search performance on a 
RAMDirectory vs NIOFSDirectory where the files are all cached by the OS.
Ah! :) It exists. Ranked by speed, directories are FSDirectory (native/sys 
calls), MMapDirectory (native), RAMDirectory (chunked), MemCachedDirectory (raw 
array access). But for the purposes of searching a small amount of 
freshly-indexed docs this difference is minuscule at best, me thinks.

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703686#action_12703686
 ] 

Michael McCandless commented on LUCENE-1313:


Yonik raised a good question on LUCENE-1618, which is what gains do we really 
expect to see by using RAMDir for the tiny recently flushed segments?

It would be nice if we could approximately measure this before putting more 
work into this issue -- if the gains are not "decent" this optimization may not 
be worthwhile.

Of course, we are talking about 100s of milliseconds for the turnaround time to 
add docs & open an NRT reader, so if the time for writing/opening many tiny 
files in RAMDir vs FSDir  differs by say 10s of msecs then we should pursue 
this.  We should also consider that the IO system may very well be quite busy 
(doing merge(s), backups, etc.) and that'd make it slower to have to create 
tiny files.

A simpler optimization might be to allow using CFS for tiny files (even when 
CFS is turned off), but build the CFS in RAM (ie, write tiny files first to 
RAMFiles, then make the CFS file on disk).  That might get most of the gains 
since the FSDir sees only one file created per tiny segment, not N.
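
For reference, a rough micro-benchmark sketch of the measurement suggested above:
time many tiny file writes/reads through the 2.9-era Directory API. The class
name, file count, file size and the /tmp path are made up; this only approximates
the RAMDir vs FSDir difference for tiny files, it does not benchmark NRT itself.

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMDirectory;

public class TinyFileBench {

  // Writes, re-reads and deletes many tiny files; returns elapsed millis.
  static long timeTinyFiles(Directory dir, int numFiles, int fileSize) throws Exception {
    byte[] data = new byte[fileSize];
    byte[] buffer = new byte[fileSize];
    long start = System.currentTimeMillis();
    for (int i = 0; i < numFiles; i++) {
      String name = "tiny" + i + ".dat";
      IndexOutput out = dir.createOutput(name);
      out.writeBytes(data, data.length);
      out.close();
      IndexInput in = dir.openInput(name);
      in.readBytes(buffer, 0, buffer.length);
      in.close();
      dir.deleteFile(name);
    }
    return System.currentTimeMillis() - start;
  }

  public static void main(String[] args) throws Exception {
    int numFiles = 1000;      // stand-in for many tiny flushed-segment files
    int fileSize = 4 * 1024;  // a few KB each
    System.out.println("RAMDirectory: "
        + timeTinyFiles(new RAMDirectory(), numFiles, fileSize) + " ms");
    System.out.println("FSDirectory:  "
        + timeTinyFiles(FSDirectory.getDirectory("/tmp/tinybench"), numFiles, fileSize) + " ms");
  }
}

A busy IO system (concurrent merges, backups) would of course widen the gap,
which is hard to capture in such an isolated run.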

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703695#action_12703695
 ] 

Yonik Seeley commented on LUCENE-1313:
--

bq. Yonik raised a good question on LUCENE-1618, which is what gains do we 
really expect to see by using RAMDir for the tiny recently flushed segments?

I raised it more because of the direction the discussion was veering (write 
through caching to a RAMDirectory, and RAMDirectory being faster to *search*).  
I do believe that RAMDirectory can probably improve NRT, but  it would be due 
to avoiding waiting for file open/write/close/open/read (as Mike also said)... 
and not any difference during IndexSearcher.search(), which should be 
irrelevant due to the relative size differences of the RAMDirectory and the 
FSDirectory.  Small file creation speeds will also be heavily dependent on the 
exact file system used.


> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery

2009-04-28 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703733#action_12703733
 ] 

Mark Harwood commented on LUCENE-1621:
--

While we're poking around in this area I'd like to point out the long-standing 
open issue in LUCENE-329.

Matching "Smyth" over "Smith" when doing a search for "Smith~" is just plain 
broken but this is what I see all the time with FuzzyQuery and it's default 
approach to IDF. I think we need to take the sort of logic in contrib's 
FuzzyLikeThisQuery to address this. 

> deprecate term and getTerm in MultiTermQuery
> 
>
> Key: LUCENE-1621
> URL: https://issues.apache.org/jira/browse/LUCENE-1621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1621.patch
>
>
> This means moving getTerm and term up to subclasses as appropriate and 
> reimplementing equals and hashCode as appropriate in those subclasses.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [Lucene-java Wiki] Update of "LuceneAtApacheConUs2009" by MichaelBusch

2009-04-28 Thread Jason Rutherglen
Michael,

I updated the wiki under "New Features in Lucene".  I can give a
presentation on realtime search in Lucene.

-J

On Mon, Apr 27, 2009 at 10:11 PM, Michael Busch  wrote:

> I'm happy to give more than one talk, on the other hand I don't want to
> prevent others from presenting. So if anyone likes to give similar talks to
> the ones I suggested, please let us know.
>
> -Michael
>
> On 4/27/09 10:07 PM, Apache Wiki wrote:
>
>> Dear Wiki user,
>>
>> You have subscribed to a wiki page or wiki category on "Lucene-java Wiki"
>> for change notification.
>>
>> The following page has been changed by MichaelBusch:
>> http://wiki.apache.org/jakarta-lucene/LuceneAtApacheConUs2009
>>
>>
>> --
>>Let's wait to fill this in until Concom provides us a list from the
>> regular CFP process.
>>
>>   = Possible Talks or Tutorials =
>> -  * Lucene Basics (Michael Busch)
>> +  * Lucene Basics (Michael Busch or others?)
>>* Intro to Solr (:  Hoss out of the box talk?)
>>* Intro to Nutch and/or Nutch Vertical Search (Andrzej Bialecki) (when
>> was the last time we had a Nutch talk? ''probably never...'')
>>* Mime Magic with Apache Tika (Jukka Zitting)
>> @@ -34, +34 @@
>>
>>
>>
>>
>> -  * New Features in Lucene (Michael Busch)
>> +  * New Features in Lucene (Michael Busch or others?)
>>* Advanced Lucene Indexing (Michael Busch)
>>* Building Intelligent Search Applications with the Lucene Ecosystem
>> (Grant Ingersoll)  - see abstract at bottom
>>* Solr Operations and Performance Tuning
>>
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Created: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2009-04-28 Thread Dawid Weiss (JIRA)
Multi-word synonym filter (synonym expansion at indexing time).
---

 Key: LUCENE-1622
 URL: https://issues.apache.org/jira/browse/LUCENE-1622
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Dawid Weiss
Priority: Minor
 Attachments: synonyms.patch

It would be useful to have a filter that provides support for indexing-time 
synonym expansion, especially for multi-word synonyms (with multi-word matching 
for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was 
able to identify (mentioned in the unit tests as well):

- if multi-word synonyms are indexed together with the original token stream 
(at overlapping positions), then a query for a partial synonym sequence (e.g., 
"big" in the synonym "big apple" for "new york city") causes the document to 
match;

- there are problems with highlighting the original document when a synonym is 
matched (see unit tests for an example),

- if the synonym is of different length than the original sequence of tokens to 
be matched, then phrase queries spanning the synonym and the original sequence 
boundary won't be found. Example "big apple" synonym for "new york city". A 
phrase query "big apple restaurants" won't match "new york city restaurants".

I am posting the patch that implements phrase synonyms as a token filter. This 
is not necessarily intended for immediate inclusion, but may provide a basis 
for many people to experiment and adjust to their own scenarios.
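
As a point of reference for the baseline behaviour described above (stacking a
synonym at the same position as the original token), here is a minimal
single-word synonym filter sketch against the 2.9 attribute-based TokenStream
API. The class name and the Map-based dictionary are illustrative only; the
multi-word problems listed above are exactly what such a filter does not solve.

import java.io.IOException;
import java.util.LinkedList;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Injects single-word synonyms at the same position (positionIncrement = 0).
public class SimpleSynonymFilter extends TokenFilter {
  private final Map<String, String[]> synonyms;              // term -> synonyms
  private final LinkedList<String> pending = new LinkedList<String>();
  private final TermAttribute termAtt;
  private final PositionIncrementAttribute posIncrAtt;

  public SimpleSynonymFilter(TokenStream input, Map<String, String[]> synonyms) {
    super(input);
    this.synonyms = synonyms;
    this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    this.posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit a queued synonym stacked on the previous token's position;
      // offsets and other attributes keep the original token's values.
      termAtt.setTermBuffer(pending.removeFirst());
      posIncrAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String[] syns = synonyms.get(termAtt.term());
    if (syns != null) {
      for (int i = 0; i < syns.length; i++) {
        pending.add(syns[i]);
      }
    }
    return true;
  }
}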

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2009-04-28 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-1622:


Attachment: synonyms.patch

Token filter implementing synonyms. Java 1.5 is required to compile it (I left 
generics for clarity; if folks really need 1.4 compatibility they can be easily 
removed of course).

> Multi-word synonym filter (synonym expansion at indexing time).
> ---
>
> Key: LUCENE-1622
> URL: https://issues.apache.org/jira/browse/LUCENE-1622
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time 
> synonym expansion, especially for multi-word synonyms (with multi-word 
> matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I 
> was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream 
> (at overlapping positions), then a query for a partial synonym sequence 
> (e.g., "big" in the synonym "big apple" for "new york city") causes the 
> document to match;
> - there are problems with highlighting the original document when synonym is 
> matched (see unit tests for an example),
> - if the synonym is of different length than the original sequence of tokens 
> to be matched, then phrase queries spanning the synonym and the original 
> sequence boundary won't be found. Example "big apple" synonym for "new york 
> city". A phrase query "big apple restaurants" won't match "new york city 
> restaurants".
> I am posting the patch that implements phrase synonyms as a token filter. 
> This is not necessarily intended for immediate inclusion, but may provide a 
> basis for many people to experiment and adjust to their own scenarios.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Synonym filter with support for phrases?

2009-04-28 Thread Dawid Weiss


Apologies for the delay, guys. I tried to solve certain issues that didn't pop 
up in my application (as Kirill said, the problem is indeed quite complex). I 
didn't find all the answers I had been looking for, but nonetheless -- the patch 
that works for my needs is in JIRA. I would be really interested in something 
that does a better job (see the unit tests -- there are certain comments I made 
about the current functionality and its shortcomings), but I couldn't figure out 
a way to do it better.


Dawid

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2009-04-28 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703790#action_12703790
 ] 

Earwin Burrfoot edited comment on LUCENE-1622 at 4/28/09 11:50 AM:
---

I'll briefly restate my experiences mentioned on the list.

* Injecting a "synonym group id" token instead of all tokens for all synonyms in 
the group is a big win for index size and saves you from spurious matches on 
"big". It also plays better with highlighting (I still had to rewrite it to 
handle all corner cases).
* Properly handling multiword synonyms only on the index side is impossible; you 
have to dabble in query rewriting (even then low-probability corner cases 
exist, and you might find extra docs).
* Query expansion is the only absolutely clear way to have multiword synonyms 
with current Lucene, but it is impractical on any adequate synonym dictionary.
* There is a possible change to the way Lucene indexes tokens+positions to 
enable fully proper multiword synonyms (with index+query rewrite approach) - 
adding a notion of 'length' or 'span' to a token, this length should play 
together with positionIncrement when calculating distance between tokens in 
phrase/spannear queries.

  was (Author: earwin):
I'll shortly cite my experiences mentioned on the list.

* Injecting "synonym group id" token instead of all tokens for all synonyms in 
group is a big win with index size and saves you from matching for "big". It 
also plays better with highlighting (still had to rewrite it to handle all 
corner cases).
* Properly handling multiword synonyms only on index-side is impossible, you 
have to dabble in query rewriting (even then low-probability corner cases 
exist, and you might find extra docs).
* Query expansion is the only absolutely clear way to have multiword synonyms 
with current Lucene, but it is impractical on any adequate synonym dictionary.
* There is a possible change to the way Lucene indexes tokens+positions to 
enable fully proper multiword synonyms - adding a notion of 'length' or 'span' 
to a token, this length should play together with positionIncrement when 
calculating distance between tokens in phrase/spannear queries.
  
> Multi-word synonym filter (synonym expansion at indexing time).
> ---
>
> Key: LUCENE-1622
> URL: https://issues.apache.org/jira/browse/LUCENE-1622
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time 
> synonym expansion, especially for multi-word synonyms (with multi-word 
> matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I 
> was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream 
> (at overlapping positions), then a query for a partial synonym sequence 
> (e.g., "big" in the synonym "big apple" for "new york city") causes the 
> document to match;
> - there are problems with highlighting the original document when synonym is 
> matched (see unit tests for an example),
> - if the synonym is of different length than the original sequence of tokens 
> to be matched, then phrase queries spanning the synonym and the original 
> sequence boundary won't be found. Example "big apple" synonym for "new york 
> city". A phrase query "big apple restaurants" won't match "new york city 
> restaurants".
> I am posting the patch that implements phrase synonyms as a token filter. 
> This is not necessarily intended for immediate inclusion, but may provide a 
> basis for many people to experiment and adjust to their own scenarios.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2009-04-28 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703790#action_12703790
 ] 

Earwin Burrfoot commented on LUCENE-1622:
-

I'll briefly restate my experiences mentioned on the list.

* Injecting a "synonym group id" token instead of all tokens for all synonyms in 
the group is a big win for index size and saves you from spurious matches on 
"big". It also plays better with highlighting (I still had to rewrite it to 
handle all corner cases).
* Properly handling multiword synonyms only on the index side is impossible; you 
have to dabble in query rewriting (even then low-probability corner cases 
exist, and you might find extra docs).
* Query expansion is the only absolutely clear way to have multiword synonyms 
with current Lucene, but it is impractical on any adequate synonym dictionary.
* There is a possible change to the way Lucene indexes tokens+positions to 
enable fully proper multiword synonyms - adding a notion of 'length' or 'span' 
to a token, this length should play together with positionIncrement when 
calculating distance between tokens in phrase/spannear queries.

> Multi-word synonym filter (synonym expansion at indexing time).
> ---
>
> Key: LUCENE-1622
> URL: https://issues.apache.org/jira/browse/LUCENE-1622
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time 
> synonym expansion, especially for multi-word synonyms (with multi-word 
> matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I 
> was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream 
> (at overlapping positions), then a query for a partial synonym sequence 
> (e.g., "big" in the synonym "big apple" for "new york city") causes the 
> document to match;
> - there are problems with highlighting the original document when synonym is 
> matched (see unit tests for an example),
> - if the synonym is of different length than the original sequence of tokens 
> to be matched, then phrase queries spanning the synonym and the original 
> sequence boundary won't be found. Example "big apple" synonym for "new york 
> city". A phrase query "big apple restaurants" won't match "new york city 
> restaurants".
> I am posting the patch that implements phrase synonyms as a token filter. 
> This is not necessarily intended for immediate inclusion, but may provide a 
> basis for many people to experiment and adjust to their own scenarios.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-28 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: LUCENE-1606.patch

removed use of multitermquery's getTerm()

equals/hashCode are defined based upon the field and the language accepted by 
the FSM, i.e. a regex query of AB.*C equals() a wildcard query of AB*C because 
they accept the same language.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
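
For anyone who wants to poke at the approach, a tiny standalone sketch of the
DFA part using the BRICS automaton library (dk.brics.automaton). The regex and
terms are made up, and this only shows acceptance checking and per-character
stepping, not the term-seeking enumeration itself.

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class BricsDfaDemo {
  public static void main(String[] args) {
    // Compile a regular expression with no constant prefix into a DFA.
    Automaton a = new RegExp("AB[0-9]*C").toAutomaton();
    RunAutomaton dfa = new RunAutomaton(a);

    // Whole-term acceptance, as the filter's accept/reject test would use.
    System.out.println(dfa.run("AB12C"));  // true
    System.out.println(dfa.run("AB12"));   // false

    // Stepping character by character is what lets the enumeration keep the
    // longest "OK" prefix of a term and then generate the next string to seek to.
    String term = "AB7X";
    int state = dfa.getInitialState();
    int i = 0;
    for (; i < term.length(); i++) {
      int next = dfa.step(state, term.charAt(i));
      if (next == -1) {
        break;  // entered a reject state; term.substring(0, i) was still OK
      }
      state = next;
    }
    System.out.println("OK prefix length: " + i);  // 3, i.e. "AB7"
  }
}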

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Shai Erera
I hope that I don't make a complete fool of myself, but I'm talking about
this:

  private List exceptions = new ArrayList();

and this (MergeThread.run()):

  synchronized(ConcurrentMergeScheduler.this) {
exceptions.add(exc);
  }

Nothing seems to read this exceptions list, anywhere. That's what confused
me in the first place - it looks as if at some point saving those exceptions
was for a reason, but not anymore?

I see that you already fixed CMS to sleep for 250 ms (I'd add a few lines that
explain why we do it) - thanks!

I wonder if we should remove this exceptions list? It's only accessed if an
exception is thrown, and therefore does not have any impact on performance
or anything (even though it syncs on CMS), but it's just confusing.

Shai

On Tue, Apr 28, 2009 at 4:59 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Tue, Apr 28, 2009 at 9:27 AM, Shai Erera  wrote:
> >> It's there so "anyUnhandledExceptions" can be called;
> >
> > I will check the code again, but I remember that after commenting it, the
> > only compile errors I saw were from MergeThread adding the exception ...
> > perhaps I'm missing something, so I'll re-check the code.
>
> The oal.util.LuceneTestCase wraps each test so that if any unhandled
> CMS exceptions happened during a test, the test fails.
>
> > I understand your point now - merging is an internal process to IW,
> > therefore there's no real user to notify on errors (e.g., even if IW knew
> > there is an error, what would it do exactly?), and I guess keep trying to
> > execute the merges is reasonable (while throwing the exceptions further -
> in
> > hope that the user code will catch those and do something with them).
>
> OK.
>
> > BTW, what will happen if I encounter such exceptions, that are thrown
> > repeatedly, and shutdown the JVM? I guess the index will not be in a
> corrupt
> > state, right? The next time I'll open it, it should be in a state prior
> to
> > the merge, or at least prior to the merge that failed?
>
> The index will never be corrupt (except for bugs ;) ) -- it will
> contain whatever the last commit was (which is typically well before
> the merges started).  EG w/ autoCommit=false, the index as of when you
> last called commit (or as of when you opened it, if you have not
> called commit) will remain intact on an abnormal shutdown.
>
> > I think that sleeping in case of exceptions makes sense .. in case of IO
> > errors that are temporary, this will not spawn threads endlessly, and
> > sleeping will give an opportunity for the IO problem to resolve. In case
> of
> > bugs, which are supposed to be detected during test time, it should give
> the
> > developer a chance to kill the process relatively quickly.
>
> OK I'll commit this.
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Created: (LUCENE-1623) Back-compat break with non-ascii field names

2009-04-28 Thread Michael McCandless (JIRA)
Back-compat break with non-ascii field names


 Key: LUCENE-1623
 URL: https://issues.apache.org/jira/browse/LUCENE-1623
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1, 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9


If a field name contains non-ascii characters in a 2.3.x index, then
on upgrade to 2.4.x unexpected problems are hit.  It's possible to hit
a "read past EOF" IOException; it's also possible to not hit an
exception but get an incorrect field name.

This was caused by LUCENE-510, because the FieldInfos (*.fnm) file is
not properly versioned.

Spinoff from http://www.nabble.com/Read-past-EOF-td23276171.html


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1623) Back-compat break with non-ascii field names

2009-04-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1623:
---

Attachment: LUCENE-1623.patch

Attached patch.  I plan to commit in a day or two, and back-port to
2.4.x branch.

I updated the back compat test to show the failure, and also
separately added 2.4 cases to the back-compat test.


> Back-compat break with non-ascii field names
> 
>
> Key: LUCENE-1623
> URL: https://issues.apache.org/jira/browse/LUCENE-1623
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4, 2.4.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1623.patch
>
>
> If a field name contains non-ascii characters in a 2.3.x index, then
> on upgrade to 2.4.x unexpected problems are hit.  It's possible to hit
> a "read past EOF" IOException; it's also possible to not hit an
> exception but get an incorrect field name.
> This was caused by LUCENE-510, because the FieldInfos (*.fnm) file is
> not properly versioned.
> Spinoff from http://www.nabble.com/Read-past-EOF-td23276171.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1617) Add "testpackage" to common-build.xml

2009-04-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1617.


Resolution: Fixed

Thanks Shai!

> Add "testpackage" to common-build.xml
> -
>
> Key: LUCENE-1617
> URL: https://issues.apache.org/jira/browse/LUCENE-1617
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1617.patch, LUCENE-1617.patch, LUCENE-1617.patch
>
>
> One can define "testcase" to execute just one test class, which is 
> convenient. However, I didn't notice any equivalent for testing a whole 
> package. I find it convenient to be able to test packages rather than test 
> cases because often it is not so clear which test class to run.
> The following patch allows one to "ant test -Dtestpackage=search" (for example) 
> and run all tests under the \*/search/\* packages in core, contrib and tags, 
> or do "ant test-core -Dtestpackage=search" and execute similarly just for 
> core, or do "ant test-core -Dtestpackage=lucene/search/function" and run all 
> the tests under \*/lucene/search/function/\* (just in case there is another 
> o.a.l.something.search.function package out there which we want to exclude).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



ReadOnlyMultiSegmentReader bitset id vs doc id

2009-04-28 Thread patrick o'leary
hey

I've got a filter that's storing document id's with a geo distance for
spatial lucene using a bitset position for doc id,
However, with a MultiSegmentReader that's no longer going to work.

What's the most appropriate way to go from bitset position to doc id now?

Thanks
Patrick


[jira] Resolved: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1604.


Resolution: Fixed

Thanks Shon!

> Stop creating huge arrays to represent the absence of field norms
> -
>
> Key: LUCENE-1604
> URL: https://issues.apache.org/jira/browse/LUCENE-1604
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
>Reporter: Shon Vella
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1604.patch, LUCENE-1604.patch, LUCENE-1604.patch, 
> LUCENE-1604.patch
>
>
> Creating and keeping around huge arrays that hold a constant value is very 
> inefficient both from a heap usage standpoint and from a locality of 
> reference standpoint. It would be much more efficient to use null to 
> represent a missing norms table.
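
For callers, the practical difference is just a null check instead of indexing
into a synthetic all-ones array. A minimal sketch (the helper class name is
made up; it assumes the 2.9-era IndexReader.norms(String) and
Similarity.decodeNorm(byte) APIs):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public final class NormsUtil {

  // Returns the decoded norm for a doc, treating a missing norms table as 1.0f.
  public static float norm(IndexReader reader, String field, int docId) throws IOException {
    byte[] norms = reader.norms(field);  // may be null instead of a huge constant array
    if (norms == null) {
      return 1.0f;
    }
    return Similarity.decodeNorm(norms[docId]);
  }
}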

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: ReadOnlyMultiSegmentReader bitset id vs doc id

2009-04-28 Thread Uwe Schindler
What is the problem exactly? Maybe you use the new Collector API, where the
search is done for each segment, so caching does not work correctly?

 

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  _  

From: patrick o'leary [mailto:pj...@pjaol.com] 
Sent: Tuesday, April 28, 2009 10:31 PM
To: java-dev@lucene.apache.org
Subject: ReadOnlyMultiSegmentReader bitset id vs doc id

 

hey

I've got a filter that's storing document id's with a geo distance for
spatial lucene using a bitset position for doc id,
However with a MultiSegmentReader that's no longer going to working.

What's the most appropriate way to go from bitset position to doc id now?

Thanks
Patrick



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Michael McCandless
On Tue, Apr 28, 2009 at 4:00 PM, Shai Erera  wrote:
> I hope that I don't make a complete fool of myself, but I'm talking about
> this:
>
>   private List exceptions = new ArrayList();
>
> and this (MergeThread.run()):
>
>   synchronized(ConcurrentMergeScheduler.this) {
>     exceptions.add(exc);
>   }
>
> Nothing seems to read this exceptions list, anywhere. That's what confused
> me in the first place - it looks as if at some point saving those exceptions
> was for a reason, but not anymore?

Whoa, you're right!  This is completely dead code.  I will remove.
Thanks for persisting ;)

> I see that you already fixed CMS to sleep for 250 ms (I'd add few lines that
> explain why we do it) - thanks !

OK will do.

> I wonder if we should remove this exceptions list? It's only accessed if an
> exception is thrown, and therefore does not have any impact on performance
> or anything (even though it syncs on CMS), but it's just confusing.

Yup I'll remove it.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ReadOnlyMultiSegmentReader bitset id vs doc id

2009-04-28 Thread Mark Miller
You might check out this Solr exchange : 
http://www.lucidimagination.com/search/document/b2ccc68ca834129/lucene_2_9_migration_issues_multireader_vs_indexreader_document_ids


There are a few suggestions throughout.


--
- Mark

http://www.lucidimagination.com



Uwe Schindler wrote:


What is the problem exactly? Maybe you use the new Collector API, 
where the search is done for each segment, so caching does not work 
correctly?


 


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



*From:* patrick o'leary [mailto:pj...@pjaol.com]
*Sent:* Tuesday, April 28, 2009 10:31 PM
*To:* java-dev@lucene.apache.org
*Subject:* ReadOnlyMultiSegmentReader bitset id vs doc id

 


hey

I've got a filter that's storing document id's with a geo distance for 
spatial lucene using a bitset position for doc id,

However with a MultiSegmentReader that's no longer going to working.

What's the most appropriate way to go from bitset position to doc id now?

Thanks
Patrick







-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1616.


Resolution: Fixed

Thanks Eks!

> add one setter for start and end offset to OffsetAttribute
> --
>
> Key: LUCENE-1616
> URL: https://issues.apache.org/jira/browse/LUCENE-1616
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
> LUCENE-1616.patch
>
>
> add OffsetAttribute. setOffset(startOffset, endOffset);
> trivial change, no JUnit needed
> Changed CharTokenizer to use it
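
A tiny sketch of the new setter in use inside a filter (the class name is made
up; it assumes the 2.9 attribute-based TokenStream API):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Shifts every token's offsets by a fixed amount, using the single setter.
public class OffsetShiftFilter extends TokenFilter {
  private final OffsetAttribute offsetAtt;
  private final int shift;

  public OffsetShiftFilter(TokenStream input, int shift) {
    super(input);
    this.shift = shift;
    this.offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // One call instead of separate start/end setters.
    offsetAtt.setOffset(offsetAtt.startOffset() + shift, offsetAtt.endOffset() + shift);
    return true;
  }
}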

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ReadOnlyMultiSegmentReader bitset id vs doc id

2009-04-28 Thread patrick o'leary
Think I may have found it, it was multiple runs of the filter, one for each
segment reader, I was generating a new map to hold distances each time. So
only the distances from the
last segment reader were stored.

Currently it looks like those segmented searches are done serially - well, in
Solr they are.
I presume the end goal is to make them multi-threaded?
I'll need to make my map synchronized.


On Tue, Apr 28, 2009 at 4:42 PM, Uwe Schindler  wrote:

>  What is the problem exactly? Maybe you use the new Collector API, where
> the search is done for each segment, so caching does not work correctly?
>
>
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>   --
>
> *From:* patrick o'leary [mailto:pj...@pjaol.com]
> *Sent:* Tuesday, April 28, 2009 10:31 PM
> *To:* java-dev@lucene.apache.org
> *Subject:* ReadOnlyMultiSegmentReader bitset id vs doc id
>
>
>
> hey
>
> I've got a filter that's storing document id's with a geo distance for
> spatial lucene using a bitset position for doc id,
> However with a MultiSegmentReader that's no longer going to working.
>
> What's the most appropriate way to go from bitset position to doc id now?
>
> Thanks
> Patrick
>


[jira] Resolved: (LUCENE-1620) How to index and search special characters as well as non-English characters like Danish Å, ø, etc.

2009-04-28 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved LUCENE-1620.
--

Resolution: Invalid

Uday: please subscribe to the java-user mailing list and post your questions 
about using Lucene to that list -- Jira is not a discussion forum.

> How to index and search special characters as well as non-English 
> characters like Danish Å, ø, etc.
> -
>
> Key: LUCENE-1620
> URL: https://issues.apache.org/jira/browse/LUCENE-1620
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Index, QueryParser, Search
>Affects Versions: 2.4.1
> Environment: windows xp,jdk 6
>Reporter: uday kumar maddigatla
> Attachments: IndexFiles.java, SearchFiles.java
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi 
> I just started to use Lucene and found one thing unusual. A few of my documents 
> contain English and XML elements as well as Danish elements.
> I used StandardAnalyzer to index the documents and found that this 
> analyzer does not recognize the special characters or the Danish characters.
> So how can I use Lucene to index special characters like '(', 
> ')', ':', etc. and also Danish characters like Å, ø, etc.?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703850#action_12703850
 ] 

Jason Rutherglen commented on LUCENE-1618:
--

{quote}For an NRT writer using RAMDir for recently flushed tiny
segments (LUCENE-1313), the gains are more about the speed of
reading/writing many tiny files. Probably we should try
[somehow] to test this case, to see if LUCENE-1313 is even a
worthwhile optimization.{quote}

True, a test would be good - how many files per second would it
produce?

When testing realtime search with the .del files (which are created
in large numbers before LUCENE-1516), the slowdown was quite dramatic,
as these are not sequential writes, which means the disk head can
move each time. Couple that with merges going on, which completely
tie up the IO, and I think it's hard for small file
writes not to slow down with a rapidly updating index. 

An index that is being updated rapidly presumably would be
performing merges more often to remove deletes. 

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703853#action_12703853
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote}EG when RAM is full, we want to quickly flush it to disk
as a single segment. Merging with disk segments only makes that
flush slower?{quote}

I assume it's ok for the IW.mergescheduler to be used which may
not immediately perform the merge to disk (in the case of
ConcurrentMergeScheduler)? When implementing using
addIndexesNoOptimize (which blocks) I realized we probably don't
want blocking to occur because that means shutting down the
updates. 

Also, a random thought: it seems like ConcurrentMergeScheduler
works great for RAMDir merging, but how does it compare with
SerialMS on an FSDirectory? It seems like it shouldn't be too much
faster given the IO sequential access bottleneck?
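
If someone wants to try that comparison, swapping schedulers is a one-liner on
the writer. A minimal sketch (class name, directory path and analyzer choice
are just placeholders; 2.4/2.9-era IndexWriter API assumed):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SchedulerCompare {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("/tmp/index");  // placeholder path
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);

    writer.setMergeScheduler(new SerialMergeScheduler());        // merges run on the calling thread
    // writer.setMergeScheduler(new ConcurrentMergeScheduler());  // the default: background merge threads

    // ... add/update documents, time the run, then repeat with the other scheduler ...
    writer.close();
  }
}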

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703855#action_12703855
 ] 

Jason Rutherglen commented on LUCENE-1618:
--

{quote}One downside to this approach is it's brittle - whenever
we change file extensions you'd have to "know" to fix this
Directory.{quote}

True, I don't think we can expect the user to pass in the
correct FileSwitchDirectory (with the attendant file
extensions); we can make the particular implementation of
Directory we use to solve this problem internal to IW. Meaning
the writer can pass through the real directory calls to FSD, and
handle the RAMDir calls on its own. 
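
To make the brittleness concrete, the routing decision itself is trivial. Here
is a sketch of just the extension-based selection (the class name is made up
and this is not a full Directory implementation); the hard-coded extension list
is exactly the part that breaks whenever file extensions change.

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.store.Directory;

// Routes file names to one of two delegate directories by file extension.
public class ExtensionRouter {
  private final Directory primary;     // e.g. an FSDirectory
  private final Directory secondary;   // e.g. a RAMDirectory for tiny files
  private final Set<String> secondaryExtensions = new HashSet<String>();

  public ExtensionRouter(Directory primary, Directory secondary, String[] extensions) {
    this.primary = primary;
    this.secondary = secondary;
    for (int i = 0; i < extensions.length; i++) {
      secondaryExtensions.add(extensions[i]);
    }
  }

  // Picks the delegate for a file name based on its extension.
  public Directory pick(String fileName) {
    int dot = fileName.lastIndexOf('.');
    String ext = dot == -1 ? "" : fileName.substring(dot + 1);
    return secondaryExtensions.contains(ext) ? secondary : primary;
  }
}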

> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Fwd: Build failed in Hudson: Lucene-trunk #810

2009-04-28 Thread Andi Vajda


On Tue, 28 Apr 2009, Michael McCandless wrote:


Hmm -- this failed because the host "downloads.osafoundation.org"
fails to resolve.  The contrib/db tests need to download the Berkeley
DB JARs from here.

Andi any idea what's up w/ that?  Do we need to set a different
download location?


It should be back now. Sorry about the delay in responding, mail was down 
too. OSAF's colo had a power outage, it seems, and some machines wouldn't 
come back up.


Andi..



Mike

-- Forwarded message --
From: Apache Hudson Server 
Date: Mon, Apr 27, 2009 at 10:14 PM
Subject: Build failed in Hudson: Lucene-trunk #810
To: java-dev@lucene.apache.org


See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/810/changes

Changes:

[mikemccand] LUCENE-1615: remove some more deprecated uses of Fieldable.omitTf

[mikemccand] remove redundant CHANGES entries from trunk if they are
already covered in 2.4.1

--
[...truncated 2887 lines...]
compile-test:
    [echo] Building benchmark...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-demo:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile-demo:

compile-highlighter:
    [echo] Building highlighter...

build-memory:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile:

check-files:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
   [javac] Compiling 9 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
   [javac] Note:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/quality/TestQualityRun.java
 uses or overrides a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.
    [copy] Copying 2 files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test

build-artifacts-and-tests:
    [echo] Building collation...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-misc:
    [echo] Building misc...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java
   [javac] Compiling 16 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java
   [javac] Note: Some input files use or override a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.

compile:

init:

clover.setup:

clover.info:

clover:

compile-core:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java
   [javac] Compiling 4 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java
   [javac] Note: Some input files use or override a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.

jar-core:
     [jar] Building jar:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/lucene-collation-2.4-SNAPSHOT.jar

jar:

compile-test:
    [echo] Building collation...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-misc:
    [echo] Building misc...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:

compile:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
   [mkdir] Created dir:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test
   [javac] Compiling 5 source files to
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test
   [javac] Note:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java
 uses or overrides a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.

build-artifacts-and-tests:

bdb:
    [echo] Building bdb...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

contrib-build.init:

get-db-jar:
   [mkdir] Creat

Re: ReadOnlyMultiSegmentReader bitset id vs doc id

2009-04-28 Thread Mark Miller
I'm not sure that we could parallelize it. Currently, it's a serial 
process (as you say) - the queue collects across readers by adjusting 
the values in the queue to sort correctly against the current reader. 
That approach doesn't appear easily parallelized.


patrick o'leary wrote:
Think I may have found it, it was multiple runs of the filter, one for 
each segment reader, I was generating a new map to hold distances each 
time. So only the distances from the

last segment reader were stored.

Currently it looks like those segmented searches are done serially, 
well in solr they are-

I presume the end goal is to make them multi-threaded ?
I'll need to make my map synchronized


On Tue, Apr 28, 2009 at 4:42 PM, Uwe Schindler > wrote:


What is the problem exactly? Maybe you use the new Collector API,
where the search is done for each segment, so caching does not
work correctly?

 


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de 



*From:* patrick o'leary [mailto:pj...@pjaol.com
]
*Sent:* Tuesday, April 28, 2009 10:31 PM
*To:* java-dev@lucene.apache.org 
*Subject:* ReadOnlyMultiSegmentReader bitset id vs doc id

 


hey

I've got a filter that's storing document id's with a geo distance
for spatial lucene using a bitset position for doc id,
However with a MultiSegmentReader that's no longer going to working.

What's the most appropriate way to go from bitset position to doc
id now?

Thanks
Patrick





--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-28 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1618:
-

Attachment: LUCENE-1618.patch

Implementation of the FileSwitchDirectory. It's nice this works
so elegantly with the existing Lucene APIs.

The test case makes sure the fdt and fdx files are written to
the FSDirectory based on their file extensions. I feel that
LUCENE-1313 will depend on this and I'll implement LUCENE-1313
with this patch in mind. I'm not sure how we ensure there are no
file name collisions between the real dir and FSD? Because IW is
managing the creation of the segment names, I don't think we
need to worry about this.





> Allow setting the IndexWriter docstore to be a different directory
> --
>
> Key: LUCENE-1618
> URL: https://issues.apache.org/jira/browse/LUCENE-1618
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1618.patch, MemoryCachedDirectory.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add an IndexWriter.setDocStoreDirectory method that allows doc
> stores to be placed in a different directory than the IW default
> dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ConcurrentMergeScheduler may spawn MergeThreads forever

2009-04-28 Thread Shai Erera
Thanks !

On Tue, Apr 28, 2009 at 11:48 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Tue, Apr 28, 2009 at 4:00 PM, Shai Erera  wrote:
> > I hope that I don't make a complete fool of myself, but I'm talking about
> > this:
> >
> >   private List exceptions = new ArrayList();
> >
> > and this (MergeThread.run()):
> >
> >   synchronized(ConcurrentMergeScheduler.this) {
> > exceptions.add(exc);
> >   }
> >
> > Nothing seems to read this exceptions list, anywhere. That's what
> confused
> > me in the first place - it looks as if at some point saving those
> exceptions
> > was for a reason, but not anymore?
>
> Whoa, you're right!  This is completely dead code.  I will remove.
> Thanks for persisting ;)
>
> > I see that you already fixed CMS to sleep for 250 ms (I'd add few lines
> that
> > explain why we do it) - thanks !
>
> OK will do.
>
> > I wonder if we should remove this exceptions list? It's only accessed if
> an
> > exception is thrown, and therefore does not have any impact on
> performance
> > or anything (even though it syncs on CMS), but it's just confusing.
>
> Yup I'll remove it.
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Hudson build is back to normal: Lucene-trunk #811

2009-04-28 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/811/changes



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ReadOnlyMultiSegmentReader bitset id vs doc id

2009-04-28 Thread patrick o'leary
Ok finally with some pointers from Ryan, figured out the last problem.
So as a note to anyone else who might encounter the same problems with
multireader

A) Directories can contain multiple segments and a reader for those segments
B) Searches are replayed within each reader in a serial fashion **
C) If utilizing FieldCache / BitSet or anything related to document position
within a reader, and you need the docId:
   -- document id = (sum of previous readers' maxDocs) + bitset position

e.g.
// Accumulated maxDoc() of all previously visited segment readers,
// i.e. the doc id base of the current reader.
int docBase;

public DocIdSet getDocIdSet(IndexReader reader) throws IOException {

   OpenBitSet bitset = new OpenBitSet(reader.maxDoc());
   for (int i = 0; i < reader.maxDoc(); i++) {

      // ... filter stuff ...

      if (good) {
         // Bit positions are local to this segment reader...
         bitset.set(i);

         // ...but the top-level document id needs the previous readers' offset.
         int docId = i + docBase;
         // ... store distance etc. keyed by docId ...
      }
   }

   // Advance the base for the next segment reader.
   docBase += reader.maxDoc();
   return bitset;
}


K, works time for sleep

P


On Tue, Apr 28, 2009 at 5:44 PM, patrick o'leary  wrote:

> Think I may have found it, it was multiple runs of the filter, one for each
> segment reader, I was generating a new map to hold distances each time. So
> only the distances from the
> last segment reader were stored.
>
> Currently it looks like those segmented searches are done serially, well in
> solr they are-
> I presume the end goal is to make them multi-threaded ?
> I'll need to make my map synchronized
>
>
> On Tue, Apr 28, 2009 at 4:42 PM, Uwe Schindler  wrote:
>
>>  What is the problem exactly? Maybe you use the new Collector API, where
>> the search is done for each segment, so caching does not work correctly?
>>
>>
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>   --
>>
>> *From:* patrick o'leary [mailto:pj...@pjaol.com]
>> *Sent:* Tuesday, April 28, 2009 10:31 PM
>> *To:* java-dev@lucene.apache.org
>> *Subject:* ReadOnlyMultiSegmentReader bitset id vs doc id
>>
>>
>>
>> hey
>>
>> I've got a filter that's storing document id's with a geo distance for
>> spatial lucene using a bitset position for doc id,
>> However with a MultiSegmentReader that's no longer going to working.
>>
>> What's the most appropriate way to go from bitset position to doc id now?
>>
>> Thanks
>> Patrick
>>
>
>