[jira] Updated: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2455:
---

Attachment: LUCENE-2455_3x.patch

Patch addresses Mike's comments. I think this is ready to go in. I'd like to 
commit to 3x before trunk, because there are lots of changes here.
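
For reference, a minimal sketch of the post-patch usage (an illustration only - 
names follow the proposal quoted below, not necessarily the committed patch):

{code}
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddIndexesSketch {
  public static void main(String[] args) throws Exception {
    Directory target = FSDirectory.open(new File("target-index"));
    Directory source1 = FSDirectory.open(new File("source-index-1"));
    Directory source2 = FSDirectory.open(new File("source-index-2"));

    IndexWriter writer = new IndexWriter(target,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // optimize() is now entirely the caller's choice - run it on a source
    // index beforehand if a fully merged result matters to you.
    writer.addIndexes(source1); // was addIndexesNoOptimize; honors MP/MS

    IndexReader reader = IndexReader.open(source2, true);
    writer.addIndexes(reader);  // IR variant: no implicit optimize() anymore
    reader.close();

    writer.close();
  }
}
{code}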

> Some house cleaning in addIndexes*
> --
>
> Key: LUCENE-2455
> URL: https://issues.apache.org/jira/browse/LUCENE-2455
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
> LUCENE-2455_3x.patch, LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
> especially regarding when to invoke each. Also, addIndexes calls optimize() 
> at the beginning, but only on the target index. It also includes the 
> following jdoc statement, which, from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called at the beginning, not at the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   beforehand on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addDirectories is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) uses neither the MergeScheduler 
> nor the MergePolicy, but rather calls SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it in this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871023#action_12871023
 ] 

Shai Erera commented on LUCENE-2455:


bq. CFW's comment should be "make it 1 lower"

Right! I copied it from FieldsWriter, where the versions are kept as positive 
ints. Will post a patch shortly.
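
To spell that out (a hypothetical illustration - the constant names and values 
below are invented, not CFW's actual ones): with negative on-disk format 
versions, bumping the format means decrementing, hence "make it 1 lower".

{code}
// Invented names/values, for illustration only - not CFW's actual constants.
public class FormatVersions {
  // FieldsWriter-style positive versions are bumped by adding 1:
  static final int FIELDS_FORMAT_CURRENT = 2;  // next format would be 3
  // Negative versions are bumped by subtracting 1 - "make it 1 lower":
  static final int CFW_FORMAT_CURRENT = -1;    // next format would be -2
}
{code}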

> Some house cleaning in addIndexes*
> --
>
> Key: LUCENE-2455
> URL: https://issues.apache.org/jira/browse/LUCENE-2455
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
> LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
> especially regarding when to invoke each. Also, addIndexes calls optimize() 
> at the beginning, but only on the target index. It also includes the 
> following jdoc statement, which, from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called at the beginning, not at the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   beforehand on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addDirectories is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) uses neither the MergeScheduler 
> nor the MergePolicy, but rather calls SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it in this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1870) Binary Update Request (javabin) fails when the field type of a multivalued SolrInputDocument field is a Set (or any type that is identified as an instance of iterable)

2010-05-24 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1870:
-

Attachment: SOLR-1870.patch

Fixing JavaBinCodec to write a Collection as an array.
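
For anyone trying to reproduce this, a minimal sketch of the failing scenario 
(URL and field names are placeholders; SolrJ 1.4 API):

{code}
import java.util.Arrays;
import java.util.HashSet;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JavabinSetRepro {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.setRequestWriter(new BinaryRequestWriter()); // send updates as javabin

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    // A Set as a multivalued field value: fails over javabin, works over XML.
    doc.addField("tags", new HashSet<String>(Arrays.asList("a", "b")));
    server.add(doc);
    server.commit();
  }
}
{code}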

> Binary Update Request (javabin) fails when the field type of a multivalued 
> SolrInputDocument field is a Set (or any type that is identified as an 
> instance of iterable) 
> 
>
> Key: SOLR-1870
> URL: https://issues.apache.org/jira/browse/SOLR-1870
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java, update
>Affects Versions: 1.4
>Reporter: Prasanna Ranganathan
> Attachments: SOLR-1870-test.patch, SOLR-1870.patch, SOLR-1870.patch
>
>
> When the field type of a field in a SolrInputDocument is a Collection based 
> on the Set interface, the javabin update request fails. It works when sending 
> the document data over XML.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1870) Binary Update Request (javabin) fails when the field type of a multivalued SolrInputDocument field is a Set (or any type that is identified as an instance of iterable)

2010-05-24 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871001#action_12871001
 ] 

Noble Paul commented on SOLR-1870:
--

bq. "top level" there will be an Iterator of docs, so it overwrites that method 
in this way, ignorant of the possibility that there might be other purposes for 
an Iterator in the stream (like a Set of values for a field)

Iterator was created as a special type in the javabin codec so that items can 
be streamed. Any collection should have been written as a List of a specific 
size. When unmarshalled, the items always come out as a List unless we override 
readIterator. I guess a better fix would be to write a Collection as a List. 
I shall give a patch.
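
Something along these lines (a sketch of the idea only; method names assumed 
to match JavaBinCodec - the actual patch is authoritative):

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.common.util.JavaBinCodec;

public class CollectionAsListCodec extends JavaBinCodec {
  @Override
  public void writeVal(Object val) throws IOException {
    if (val instanceof Collection) {
      // Write any Collection as a sized List so it round-trips as a List...
      writeArray(new ArrayList<Object>((Collection<?>) val));
    } else {
      // ...and keep the streaming Iterator path for true Iterators only.
      super.writeVal(val);
    }
  }
}
{code}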

> Binary Update Request (javabin) fails when the field type of a multivalued 
> SolrInputDocument field is a Set (or any type that is identified as an 
> instance of iterable) 
> 
>
> Key: SOLR-1870
> URL: https://issues.apache.org/jira/browse/SOLR-1870
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java, update
>Affects Versions: 1.4
>Reporter: Prasanna Ranganathan
> Attachments: SOLR-1870-test.patch, SOLR-1870.patch
>
>
> When the field type of a field in a SolrInputDocument is a Collection based 
> on the Set interface, the javabin update request fails. It works when sending 
> the document data over XML.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



DocBuilder inefficiency?

2010-05-24 Thread Robert Zotter

I am looking into the collectDelta method in DocBuilder.java and I noticed 
that, to determine the deltaRemoveSet, it currently loops through the whole 
deltaSet for each deleted row (version 1.4.0, line 641).

Does anyone else agree that this is quite inefficient?

For delta-imports with a large deltaSet and deletedSet, I found a considerable 
improvement in speed if we just save all deleted keys in a set. Then we only 
have to loop through the deltaSet once to determine which rows should be 
removed, by checking whether the deleted-key set contains the delta row key.
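
Roughly like this (a sketch with assumed types - not DocBuilder's exact 
signatures):

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeltaRemoveSketch {
  // Collect the deleted keys once, then use O(1) membership tests instead
  // of rescanning deletedSet for every delta row.
  static void removeDeleted(List<Map<String, Object>> deltaSet,
                            List<Map<String, Object>> deletedSet,
                            String keyName) {
    Set<Object> deletedKeys = new HashSet<Object>();
    for (Map<String, Object> row : deletedSet) {
      deletedKeys.add(row.get(keyName));
    }
    for (Iterator<Map<String, Object>> it = deltaSet.iterator(); it.hasNext();) {
      if (deletedKeys.contains(it.next().get(keyName))) {
        it.remove();
      }
    }
  }
}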

Is this patch worthy?

- Robert Zotter
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DocBuilder-inefficiency-tp841272p841272.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1923) add caverphone to phoneticfilter

2010-05-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1923:
--

Attachment: SOLR-1923.patch

> add caverphone to phoneticfilter
> 
>
> Key: SOLR-1923
> URL: https://issues.apache.org/jira/browse/SOLR-1923
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.1
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1923.patch
>
>
> we upgraded commons-codec but didn't add this new one.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-1923) add caverphone to phoneticfilter

2010-05-24 Thread Robert Muir (JIRA)
add caverphone to phoneticfilter


 Key: SOLR-1923
 URL: https://issues.apache.org/jira/browse/SOLR-1923
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 3.1, 4.0


we upgraded commons-codec but didn't add this new one.
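
The new encoder can be exercised standalone (a sketch; the patch presumably 
wires it into the phonetic filter factory):

{code}
import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.Caverphone;

public class CaverphoneDemo {
  public static void main(String[] args) throws EncoderException {
    // Caverphone ships with the upgraded commons-codec:
    Caverphone caverphone = new Caverphone();
    System.out.println(caverphone.encode("Thompson")); // prints the phonetic key
  }
}
{code}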

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



NPE Within IndexWriter.optimize (Solr Trunk Nightly)

2010-05-24 Thread Chris Herron
Hi,

I'm using the latest nightly build of Solr (apache-solr-2010-05-24_08-05-13) 
and am repeatedly hitting a NullPointerException after calling delete, commit, 
optimize. Stack trace below. The index is ~20GB.

I'm not doing Lucene/Solr core development - I just figured this was a better 
place to ask, given that this is a nightly build.

Any observations that would help resolve this?

Thanks,

Chris

SEVERE: java.io.IOException: background merge hit exception: _gr5a:C127 
_gsbj:C486/3 _gsbk:C1 _gsbl:C1/1 _gsbm:C1 _gsbn:C1 _gsbo:C1 _gsbp:C1 _gsbq:C1 
_gssn:C69 into _gsss [optimize] [mergeDocStores]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2418)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2343)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:403)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
at 
org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1190)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:424)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:457)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:229)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:931)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:361)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:867)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:245)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
at org.eclipse.jetty.server.Server.handle(Server.java:337)
at 
org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:581)
at 
org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1005)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:560)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:222)
at 
org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:417)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:474)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:437)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at 
org.apache.lucene.index.codecs.preflex.TermInfosReader.seekEnum(TermInfosReader.java:224)
at 
org.apache.lucene.index.codecs.preflex.TermInfosReader.seekEnum(TermInfosReader.java:214)
at 
org.apache.lucene.index.codecs.preflex.PreFlexFields$PreTermsEnum.reset(PreFlexFields.java:251)
at 
org.apache.lucene.index.codecs.preflex.PreFlexFields$PreFlexFieldsEnum.terms(PreFlexFields.java:198)
at 
org.apache.lucene.index.MultiFieldsEnum.terms(MultiFieldsEnum.java:103)
at 
org.apache.lucene.index.codecs.FieldsConsumer.merge(FieldsConsumer.java:48)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:647)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:151)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4414)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4038)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407)

Exception in thread "Lucene Merge Thread #0" 
org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.NullPointerException
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run

[jira] Updated: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2380:
---

Attachment: LUCENE-2380.patch

New iteration attached.

I got Solr mostly cut over, at least for the immediate usage of 
FieldCache.getStrings/getStringIndex.

However, one Solr test (TestDistributedSearch) is still failing...
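
For anyone following along, the shape of the cutover is roughly this (the 
getTermBytes signature below is an assumption based on this issue's 
description; the attached patch is authoritative):

{code}
// Sketch only - getTermBytes' exact signature/return type is assumed.
void cutoverExample(IndexReader reader) throws IOException {
  String[] vals = FieldCache.DEFAULT.getStrings(reader, "f");      // 3.x style
  BytesRef[] terms = FieldCache.DEFAULT.getTermBytes(reader, "f"); // flex style
  // Terms stay opaque bytes, so US-ASCII content costs 1 byte per char, not 2.
}
{code}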

> Add FieldCache.getTermBytes, to load term data as byte[]
> 
>
> Key: LUCENE-2380
> URL: https://issues.apache.org/jira/browse/LUCENE-2380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2380.patch, LUCENE-2380.patch
>
>
> With flex, a term is now an opaque byte[] (typically a UTF-8 encoded Unicode 
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding 
> methods to load terms as native byte[], since in general they may not be 
> representable as String.  This should be quite a bit more RAM efficient too, 
> for US-ASCII content, since each character would then use 1 byte, not 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread karl.wright
The reason for this is simple.  LCF keeps track of which documents it has 
handed off to Solr, and has a fairly involved mechanism for making sure that 
every document LCF *thinks* got there actually did.  It even uses a mechanism 
akin to a 2-phase commit to make sure that its internal records and those of 
the downstream index are never out of sync.

Now, along comes Solr, and the system loses a good deal of its resilience, 
because there is a chance that somebody or something will kick Solr after a 
document (or a set of documents) has been transmitted to it, but LCF will have 
no awareness of this situation at all, and will thus never try to fix the 
problem on the next job run (or whatever).  So instead of automatic resilience, 
you get one of two possible solutions:

(1) Manual intervention.  Somebody has to manually inform LCF of the Solr 
hiccup, and LCF thus will have to invalidate all documents it ever sent to Solr 
(because it doesn't know which documents could have been affected).
(2) A Solr commit on every post.  This slows down LCF significantly, because 
each document post takes something like 10x as long to do.

Does this help?
Karl

-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Monday, May 24, 2010 4:40 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

Indexing a doc won't be as fast as raw disk IO. But you won't be doing 
just raw disk IO to guarantee acceptance. And that will have a cost and 
complexity that really makes me wonder if it's worth the speed advantage. 
For very large documents with complex analyzers...perhaps. But it's not 
going to be an easily implementable feature (if it's a true guarantee). 
And it's still got to involve logs and/or fsync and all that.

The reasoning for this is not ringing a bell - can you elaborate on the 
motivations?

Is this so that you can commit on every doc? Every few docs?

I can def see how this would be desirable in general, but just to be 
clear on your motivations.


- Mark

On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
> committing on every post.
>
> If your guess is correct, you are basically saying that adding a document to 
> an index in Solr/Lucene is just as fast as writing that file directly to the 
> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
> fast, but I have doubts that it can perform as well as raw disk I/O and still 
> manage to do anything in the way of document analysis or (heaven forbid) text 
> extraction.
>
>
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi guarantee acceptance -
> then index from the store on the commit.
>
> Another simpler idea if only the separation is important and not the
> performance - index to another side index, taking advantage of Lucene's
> current commit functionality, and then use addIndex to merge to the main
> index on commit.
>
> Just spit balling though.
>
> I think this would obviously need to be an optional mode.
>


-- 
- Mark

http://www.lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: .Net, Lucene and IKVM

2010-05-24 Thread Digy
This is an unresolved old topic.
http://www.mail-archive.com/lucene-net-u...@incubator.apache.org/msg00872.html

DIGY



-Original Message-
From: Andrzej Bialecki [mailto:a...@getopt.org] 
Sent: Tuesday, May 25, 2010 12:32 AM
To: dev@lucene.apache.org
Subject: .Net, Lucene and IKVM

Hi all,

I'm glad to report that I was able to compile Lucene branch_3x with a
recent snapshot of IKVM, and after trying out the Lucene demo apps both
the IndexFiles and SearchFiles applications appear to run flawlessly.
Environment is WinXP/SP2, .Net CLR 2.0, 3.0, 3.5, and IKVM downloaded
from http://www.frijters.net/ikvmbin-0.43.3790.zip.

Now, I'm going to crawl back under my Java rock ... I know nothing about
.Net, I just monkeyed around with the tools. If anyone's interested in
pursuing this further, be my guest.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



.Net, Lucene and IKVM

2010-05-24 Thread Andrzej Bialecki
Hi all,

I'm glad to report that I was able to compile Lucene branch_3x with a
recent snapshot of IKVM, and after trying out the Lucene demo apps both
the IndexFiles and SearchFiles applications appear to run flawlessly.
Environment is WinXP/SP2, .Net CLR 2.0, 3.0, 3.5, and IKVM downloaded
from http://www.frijters.net/ikvmbin-0.43.3790.zip.

Now, I'm going to crawl back under my Java rock ... I know nothing about
.Net, I just monkeyed around with the tools. If anyone's interested in
pursuing this further, be my guest.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2413) Consolidate all (Solr's & Lucene's) analyzers into modules/analysis

2010-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870846#action_12870846
 ] 

Robert Muir commented on LUCENE-2413:
-

By the way, one idea could be to make benchmark a module itself (the 
benchmarking module for all Lucene/Solr-related stuff).

I noticed Solr lacks a standard benchmarking suite, and at the same time more 
benchmarks are being created even for contribs/modules (highlighter, 
analyzers).


> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> ---
>
> Key: LUCENE-2413
> URL: https://issues.apache.org/jira/browse/LUCENE-2413
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Michael McCandless
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch, 
> LUCENE-2413_commongrams.patch, LUCENE-2413_folding.patch, 
> LUCENE-2413_htmlstrip.patch, LUCENE-2413_icu.patch, 
> LUCENE-2413_keep_hyphen_trim.patch, LUCENE-2413_keyword.patch, 
> LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch, 
> LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch, 
> LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch, 
> LUCENE-2413_teesink.patch, LUCENE-2413_test4.patch, 
> LUCENE-2413_testanalyzer.patch, LUCENE-2413_testanalyzer.patch, 
> LUCENE-2413_tests2.patch, LUCENE-2413_tests3.patch, LUCENE-2413_wdf.patch
>
>
> We've been wanting to do this for quite some time now...  I think, now that 
> Solr/Lucene are merged, and we're looking at opening an unstable line of 
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately 
> version the analyzers from which version of Solr/Lucene they use, possibly 
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from 
> the analysis API), but I don't think that issue needs to block this 
> consolidation.
> Once we do this, there is one place where our users can find all the 
> analyzers that Solr/Lucene provide.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Andrzej Bialecki as Lucene/Solr committer

2010-05-24 Thread Yonik Seeley
On Mon, May 24, 2010 at 5:33 AM, Michael McCandless
 wrote:
> I'm happy to announce that the PMC has accepted Andrzej Bialecki as
> Lucene/Solr committer!
>
> Welcome aboard Andrzej,

An enthusiastic, jet-lagged +1 ;-)

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2413) Consolidate all (Solr's & Lucene's) analyzers into modules/analysis

2010-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870843#action_12870843
 ] 

Robert Muir commented on LUCENE-2413:
-

{quote}
contrib/benchmark's NewShingleAnalyzerTask depends on modules' 
o.a.l.analysis.shingle.ShingleAnalyzerWrapper - causing a cyclic dependency 
between projects - e.g. when creating separate Eclipse projects for lucene and 
modules.
{quote}

Hi, it's not a cyclic dependency, as the analyzers module only depends on core 
Lucene. 

If you want to have separate projects, I would make the contribs separate, too, 
or put everything in one Eclipse project (this is what I prefer).


> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> ---
>
> Key: LUCENE-2413
> URL: https://issues.apache.org/jira/browse/LUCENE-2413
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Michael McCandless
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch, 
> LUCENE-2413_commongrams.patch, LUCENE-2413_folding.patch, 
> LUCENE-2413_htmlstrip.patch, LUCENE-2413_icu.patch, 
> LUCENE-2413_keep_hyphen_trim.patch, LUCENE-2413_keyword.patch, 
> LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch, 
> LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch, 
> LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch, 
> LUCENE-2413_teesink.patch, LUCENE-2413_test4.patch, 
> LUCENE-2413_testanalyzer.patch, LUCENE-2413_testanalyzer.patch, 
> LUCENE-2413_tests2.patch, LUCENE-2413_tests3.patch, LUCENE-2413_wdf.patch
>
>
> We've been wanting to do this for quite some time now...  I think, now that 
> Solr/Lucene are merged, and we're looking at opening an unstable line of 
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately 
> version the analyzers from which version of Solr/Lucene they use, possibly 
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from 
> the analysis API), but I don't think that issue needs to block this 
> consolidation.
> Once we do this, there is one place where our users can find all the 
> analyzers that Solr/Lucene provide.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2413) Consolidate all (Solr's & Lucene's) analyzers into modules/analysis

2010-05-24 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870842#action_12870842
 ] 

Doron Cohen commented on LUCENE-2413:
-

contrib/benchmark's NewShingleAnalyzerTask depends on modules' 
o.a.l.analysis.shingle.ShingleAnalyzerWrapper - causing a cyclic dependency 
between projects - e.g. when creating separate Eclipse projects for lucene and 
modules. 

> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> ---
>
> Key: LUCENE-2413
> URL: https://issues.apache.org/jira/browse/LUCENE-2413
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Michael McCandless
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch, 
> LUCENE-2413_commongrams.patch, LUCENE-2413_folding.patch, 
> LUCENE-2413_htmlstrip.patch, LUCENE-2413_icu.patch, 
> LUCENE-2413_keep_hyphen_trim.patch, LUCENE-2413_keyword.patch, 
> LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch, 
> LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch, 
> LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch, 
> LUCENE-2413_teesink.patch, LUCENE-2413_test4.patch, 
> LUCENE-2413_testanalyzer.patch, LUCENE-2413_testanalyzer.patch, 
> LUCENE-2413_tests2.patch, LUCENE-2413_tests3.patch, LUCENE-2413_wdf.patch
>
>
> We've been wanting to do this for quite some time now...  I think, now that 
> Solr/Lucene are merged, and we're looking at opening an unstable line of 
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately 
> version the analyzers from which version of Solr/Lucene they use, possibly 
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from 
> the analysis API), but I don't think that issue needs to block this 
> consolidation.
> Once we do this, there is one place where our users can find all the 
> analyzers that Solr/Lucene provide.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Mark Miller
Indexing a doc won't be as fast as raw disk IO. But you won't be doing 
just raw disk IO to guarantee acceptance. And that will have a cost and 
complexity that really makes me wonder if it's worth the speed advantage. 
For very large documents with complex analyzers...perhaps. But it's not 
going to be an easily implementable feature (if it's a true guarantee). 
And it's still got to involve logs and/or fsync and all that.


The reasoning for this is not ringing a bell - can you elaborate on the 
motivations?


Is this so that you can commit on every doc? Every few docs?

I can def see how this would be desirable in general, but just to be 
clear on your motivations.



- Mark

On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:

Hi Mark,

Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
committing on every post.

If your guess is correct, you are basically saying that adding a document to an 
index in Solr/Lucene is just as fast as writing that file directly to the disk. 
 Because, obviously, if we want guaranteed delivery, that's what we'd have to 
do.  But I think this is worth the experiment - Solr/Lucene may be fast, but I 
have doubts that it can perform as well as raw disk I/O and still manage to do 
anything in the way of document analysis or (heaven forbid) text extraction.



-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com]
Sent: Monday, May 24, 2010 3:33 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:

Hi all,
It seems to me that the "commit" logic in the Solr updateRequestHandler
(or wherever the logic is actually located) conflates two different
semantics. One semantic is what you need to do to make the index process
perform well. The other semantic is guaranteed atomicity of document
reception by Solr.
In particular, it would be nice to be able to post documents in such a
way that you can guarantee that the document is permanently in Solr's
queue, safe in the event of a Solr restart, etc., even if the document
has not yet been "committed".
This issue came up in the LCF talk that I gave, and I initially thought
that separating the two kinds of events would necessarily be an LCF
change, but the more I thought about it the more I realized that other
Solr indexing clients may also benefit from such a separation.
Does anyone agree? Where should this logic properly live?
Thanks,
Karl


It's an interesting idea - but I think you would likely pay a similar
cost to guarantee reception as you would to commit (also, I'm not sure
Lucene guarantees it - it works for consistency, but I'm not so sure it
achieves durability).

I can think of two things offhand -

Perhaps store the text and use fsync to quasi guarantee acceptance -
then index from the store on the commit.

Another simpler idea if only the separation is important and not the
performance - index to another side index, taking advantage of Lucene's
current commit functionality, and then use addIndex to merge to the main
index on commit.

Just spit balling though.

I think this would obviously need to be an optional mode.




--
- Mark

http://www.lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Simon Willnauer
Hi Karl,

What you are describing seems to be a good use case for something like
a message queue, where you push a document or record to a queue that
guarantees the queue's persistence. I look at this from a slightly
different perspective: in a distributed environment you would have to
guarantee delivery not just to a single Solr instance but to several,
or at least n, instances - but that is a different story.

From a Solr point of view this sounds like a need for a write-ahead
log that guarantees durability and atomicity. I like this idea, as it
might also solve lots of problems in distributed environments (Solr
Cloud) etc.
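
A minimal sketch of such an acceptance log (all names below are assumed; this 
is not Solr code): append the raw update and fsync before acknowledging, then 
index from the log at commit time.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class AcceptanceLog {
  private final FileOutputStream out;

  public AcceptanceLog(File path) throws IOException {
    this.out = new FileOutputStream(path, true); // append mode
  }

  /** Returns only once the update is durably on disk. */
  public synchronized void accept(byte[] update) throws IOException {
    out.write(update);
    out.flush();
    out.getFD().sync(); // fsync: survives a process/OS restart
  }

  // At commit time a separate reader would replay the log into the index,
  // then truncate it once the Lucene commit succeeds.
}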

Very interesting topic - we should investigate more in this direction.


simon


On Mon, May 24, 2010 at 10:03 PM,   wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
> committing on every post.
>
> If your guess is correct, you are basically saying that adding a document to 
> an index in Solr/Lucene is just as fast as writing that file directly to the 
> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
> fast, but I have doubts that it can perform as well as raw disk I/O and still 
> manage to do anything in the way of document analysis or (heaven forbid) text 
> extraction.
>
>
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi guarantee acceptance -
> then index from the store on the commit.
>
> Another simpler idea if only the separation is important and not the
> performance - index to another side index, taking advantage of Lucene's
> current commit functionality, and then use addIndex to merge to the main
> index on commit.
>
> Just spit balling though.
>
> I think this would obviously need to be an optional mode.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread karl.wright
Hi Mark,

Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
committing on every post.

If your guess is correct, you are basically saying that adding a document to an 
index in Solr/Lucene is just as fast as writing that file directly to the disk. 
 Because, obviously, if we want guaranteed delivery, that's what we'd have to 
do.  But I think this is worth the experiment - Solr/Lucene may be fast, but I 
have doubts that it can perform as well as raw disk I/O and still manage to do 
anything in the way of document analysis or (heaven forbid) text extraction.



-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Monday, May 24, 2010 3:33 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
> Hi all,
> It seems to me that the "commit" logic in the Solr updateRequestHandler
> (or wherever the logic is actually located) conflates two different
> semantics. One semantic is what you need to do to make the index process
> perform well. The other semantic is guaranteed atomicity of document
> reception by Solr.
> In particular, it would be nice to be able to post documents in such a
> way that you can guarantee that the document is permanently in Solr's
> queue, safe in the event of a Solr restart, etc., even if the document
> has not yet been "committed".
> This issue came up in the LCF talk that I gave, and I initially thought
> that separating the two kinds of events would necessarily be an LCF
> change, but the more I thought about it the more I realized that other
> Solr indexing clients may also benefit from such a separation.
> Does anyone agree? Where should this logic properly live?
> Thanks,
> Karl

It's an interesting idea - but I think you would likely pay a similar
cost to guarantee reception as you would to commit (also, I'm not sure 
Lucene guarantees it - it works for consistency, but I'm not so sure it 
achieves durability).

I can think of two things offhand -

Perhaps store the text and use fsync to quasi guarantee acceptance - 
then index from the store on the commit.

Another simpler idea if only the separation is important and not the 
performance - index to another side index, taking advantage of Lucene's 
current commit functionality, and then use addIndex to merge to the main 
index on commit.

Just spit balling though.

I think this would obviously need to be an optional mode.

-- 
- Mark

http://www.lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Mark Miller

On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:

Hi all,
It seems to me that the “commit” logic in the Solr updateRequestHandler
(or wherever the logic is actually located) conflates two different
semantics. One semantic is what you need to do to make the index process
perform well. The other semantic is guaranteed atomicity of document
reception by Solr.
In particular, it would be nice to be able to post documents in such a
way that you can guarantee that the document is permanently in Solr’s
queue, safe in the event of a Solr restart, etc., even if the document
has not yet been “committed”.
This issue came up in the LCF talk that I gave, and I initially thought
that separating the two kinds of events would necessarily be an LCF
change, but the more I thought about it the more I realized that other
Solr indexing clients may also benefit from such a separation.
Does anyone agree? Where should this logic properly live?
Thanks,
Karl


It's an interesting idea - but I think you would likely pay a similar
cost to guarantee reception as you would to commit (also, I'm not sure 
Lucene guarantees it - it works for consistency, but I'm not so sure it 
achieves durability).


I can think of two things offhand -

Perhaps store the text and use fsync to quasi guarantee acceptance - 
then index from the store on the commit.


Another simpler idea if only the separation is important and not the 
performance - index to another side index, taking advantage of Lucene's 
current commit functionality, and then use addIndex to merge to the main 
index on commit.
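
Spelled out, that side-index idea looks roughly like this (a sketch on the 3.x 
API; paths and analyzer are placeholders):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SideIndexSketch {
  public static void main(String[] args) throws Exception {
    Directory side = FSDirectory.open(new File("side-index"));
    Directory main = FSDirectory.open(new File("main-index"));
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

    // Incoming docs are committed to a small side index for durability:
    IndexWriter sideWriter = new IndexWriter(side, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
    // ... addDocument() calls for accepted docs go here ...
    sideWriter.commit();
    sideWriter.close();

    // On the "real" commit, merge the side index into the main one:
    IndexWriter mainWriter = new IndexWriter(main, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
    mainWriter.addIndexesNoOptimize(side);
    mainWriter.commit();
    mainWriter.close();
  }
}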


Just spit balling though.

I think this would obviously need to be an optional mode.

--
- Mark

http://www.lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2458:


Attachment: LUCENE-2458.patch

Updated patch that cuts over the remaining two query parsers: the flexible 
queryparser and the precedence queryparser.
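
For reference, the query-time PositionFilter workaround mentioned in the 
description below looks roughly like this (wiring assumed; a sketch only):

{code}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.position.PositionFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class QueryTimeAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String field, Reader reader) {
    TokenStream ts = new StandardTokenizer(Version.LUCENE_30, reader);
    // Position increments forced to 0, so the QueryParser sees the tokens
    // at one position and builds a BooleanQuery instead of a PhraseQuery.
    return new PositionFilter(ts);
  }
}
{code}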

> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Blocker
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2458.patch, LUCENE-2458.patch, LUCENE-2458.patch
>
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as some heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrase queries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is that we have bad 
> out-of-the-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



mingw /implib:foo.lib equivalent ?

2010-05-24 Thread Andi Vajda


 Hi Bill,

Would you know what the equivalent mingw gcc flag for MSVC's /implib:foo.lib 
flag is?
This overrides the default name and location that the linker uses to produce 
a DLL's import library.


I added some linking tricks on Windows and Linux to support the new 
--import functionality, and it seems that hardcoding /implib: is not going to 
work well on mingw - or will it?


Andi..


[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870761#action_12870761
 ] 

Shai Erera commented on LUCENE-2455:


I will document it in CHANGES under the API section. I think the migration 
guide format will need its own discussion, and I don't want to block this issue 
on it. When we've agreed on the format (people have made a few suggestions), I 
don't mind helping w/ porting everything relevant from CHANGES to that guide.

> Some house cleaning in addIndexes*
> --
>
> Key: LUCENE-2455
> URL: https://issues.apache.org/jira/browse/LUCENE-2455
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
> LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
> especially regarding when to invoke each. Also, addIndexes calls optimize() 
> at the beginning, but only on the target index. It also includes the 
> following jdoc statement, which, from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called at the beginning, not at the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   beforehand on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addDirectories is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) uses neither the MergeScheduler 
> nor the MergePolicy, but rather calls SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it in this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: TestBackwardsCompatibility

2010-05-24 Thread Shai Erera
Oops :z = 3x

Shai

On Monday, May 24, 2010, Shai Erera  wrote:
> So do we want to just remove the 1x indexes from :z and 2x from trunk?
> Or do we also want to remove the live migration code? How can one
> start with that? Are there constants to look for, for example?
>
> Shai
>
> On Monday, May 24, 2010, Mark Miller  wrote:
>> On 5/24/10 11:25 AM, Michael McCandless wrote:
>>
>> Yes, I think we can remove support for 1.9 indexes as of 3.0:
>>
>>      http://wiki.apache.org/lucene-java/BackwardsCompatibility
>>
>> So starting with 3.0 the oldest indexes we must support are those written by 
>> 2.0.
>>
>> Mike
>>
>> On Sun, May 23, 2010 at 12:56 AM, Shai Erera  wrote:
>>
>> Hi
>>
>> I'm working on adding support for addIndexes* in TestBackwardsCompatibility,
>> and I've noticed it still reads 1.9 indexes. Is that intentional? Shouldn't
>> 3x stop supporting 1.9?
>>
>> Shai
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>>
>> We really need to update that wiki page - mucho changes.
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870743#action_12870743
 ] 

Michael McCandless commented on LUCENE-2455:


bq. With that behind us, did someone start an API migration guide?

Not yet, I think?  Go for it!

> Some house cleaning in addIndexes*
> --
>
> Key: LUCENE-2455
> URL: https://issues.apache.org/jira/browse/LUCENE-2455
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
> LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
> especially regarding when to invoke each. Also, addIndexes calls optimize() 
> at the beginning, but only on the target index. It also includes the 
> following jdoc statement, which, from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called at the beginning, not at the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   beforehand on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addDirectories is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) uses neither the MergeScheduler 
> nor the MergePolicy, but rather calls SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it in this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870735#action_12870735
 ] 

Shai Erera commented on LUCENE-2455:


Ahh, I knew we must be talking past each other :). I assumed that the flex 
changes would be handled by the migration tool. If we have live migration for 
it, then I agree we should do live migration here.

With that behind us, did someone start an API migration guide? Since I'm 
removing addIndexesNoOptimize in favor of the new addIndexes, I wanted to 
document it somewhere. It's a tiny change, so perhaps it can go under the API 
Changes section in CHANGES?




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870724#action_12870724
 ] 

Michael McCandless commented on LUCENE-2455:


Sorry -- for each major release, I think it'll be either live
migration or offline migration, but not both.

So far for 4.0 we haven't had a structural change to the index format
major enough to make live migration too hard/risky, so I think we can
offer live migration for 4.0.

The biggest change was flex, but it has the preflex codec to read (not
write) the pre-4.0 format... so I think we can still offer live
migration for 4.0?




Re: TestBackwardsCompatibility

2010-05-24 Thread Shai Erera
So do we want to just remove the 1.x indexes from 3x and the 2.x ones from
trunk? Or do we also want to remove the live migration code? How would one
start with that? Are there constants to look for, for example?

Shai

On Monday, May 24, 2010, Mark Miller  wrote:
> On 5/24/10 11:25 AM, Michael McCandless wrote:
>
> Yes, I think we can remove support for 1.9 indexes as of 3.0:
>
>      http://wiki.apache.org/lucene-java/BackwardsCompatibility
>
> So starting with 3.0 the oldest index we must support are those written by 
> 2.0.
>
> Mike
>
> On Sun, May 23, 2010 at 12:56 AM, Shai Erera  wrote:
>
> Hi
>
> I'm working on adding support for addIndexes* in TestBackwardsCompatibility,
> and I've noticed it still reads 1.9 indexes. Is that intentional? Shouldn't
> 3x stop supporting 1.9?
>
> Shai
>
>
>
>
>
>
> We really need to update that wiki page - mucho changes.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870688#action_12870688
 ] 

Shai Erera commented on LUCENE-2455:


I'm not sure about the live migration, Mike. First, because all the problems 
I've mentioned about CodecUtils in 3x will apply to live migration of 3.x 
indexes in 4.0 code. Second, if everyone who upgrades to 4.0 will need to run 
the migration tool anyway, why do any work to support online migration? What's 
the benefit? Can you think of a case where someone upgrades to 4.0 w/o migrating 
his indexes (unless he reindexes of course, in which case there is no problem)?

I just think it's weird that we support online migration together w/ a 
migration tool. If we migrate the indexes w/ the tool to include the new format 
of CFS, then the online migration code won't ever run, right? And not doing 
this in the tool seems like a waste? I mean, the user already migrates his 
indexes, so why incur the cost of an additional online migration?




[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2010-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870633#action_12870633
 ] 

Robert Muir commented on LUCENE-1622:
-

{quote}
We'd then need an AutomatonWordQuery - the same idea as
AutomatonQuery, except at the word level not at the character level.
{quote}

This is a cool idea, and on the analysis side a word-level automaton is really 
the data structure I think we want for actually doing the multi-word synonym 
match efficiently (with minimal lookahead etc.).


> Multi-word synonym filter (synonym expansion at indexing time).
> ---
>
> Key: LUCENE-1622
> URL: https://issues.apache.org/jira/browse/LUCENE-1622
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time 
> synonym expansion, especially for multi-word synonyms (with multi-word 
> matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I 
> was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream 
> (at overlapping positions), then a query for a partial synonym sequence 
> (e.g., "big" in the synonym "big apple" for "new york city") causes the 
> document to match;
> - there are problems with highlighting the original document when synonym is 
> matched (see unit tests for an example),
> - if the synonym is of different length than the original sequence of tokens 
> to be matched, then phrase queries spanning the synonym and the original 
> sequence boundary won't be found. Example "big apple" synonym for "new york 
> city". A phrase query "big apple restaurants" won't match "new york city 
> restaurants".
> I am posting the patch that implements phrase synonyms as a token filter. 
> This is not necessarily intended for immediate inclusion, but may provide a 
> basis for many people to experiment and adjust to their own scenarios.




[jira] Commented: (SOLR-1852) enablePositionIncrements="true" can cause searches to fail when they are parsed as phrase queries

2010-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870631#action_12870631
 ] 

Robert Muir commented on SOLR-1852:
---

Also, Mark mentioned to me he had concerns about 'index back-compat'.

Obviously, if we fix the bug, we 'break' this in the sense that you now index 
with correct positions...

> enablePositionIncrements="true" can cause searches to fail when they are 
> parsed as phrase queries
> -
>
> Key: SOLR-1852
> URL: https://issues.apache.org/jira/browse/SOLR-1852
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Peter Wolanin
>Assignee: Robert Muir
> Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch
>
>
> Symptom: searching for a string like a domain name containing a '.', the Solr 
> 1.4 analyzer tells me that I will get a match, but when I enter the search 
> either in the client or directly in Solr, the search fails. 
> test string:  Identi.ca
> queries that fail:  IdentiCa, Identi.ca, Identi-ca
> query that matches: Identi ca
> schema in use is:
> http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1
> Screen shots:
> analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
> dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
> dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
> standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
> Whether or not the bug appears is determined by the surrounding text:
> "would be great to have support for Identi.ca on the follow block"
> fails to match "Identi.ca", but putting the content on its own or in another 
> sentence:
> "Support Identi.ca"
> the search matches.  Testing suggests the word "for" is the problem, and it 
> looks like the bug occurs when a stop word precedes a word that is split up 
> using the word delimiter filter.
> Setting enablePositionIncrements="false" in the stop filter and reindexing 
> causes the searches to match.
> According to Mark Miller in #solr, this bug appears to be fixed already in 
> Solr trunk, either due to the upgraded lucene or changes to the 
> WordDelimiterFactory




[jira] Commented: (SOLR-1852) enablePositionIncrements="true" can cause searches to fail when they are parsed as phrase queries

2010-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870628#action_12870628
 ] 

Robert Muir commented on SOLR-1852:
---

bq. now this has been in trunk longer, do you feel any more confident about a 
back port?

I feel more confident about the new implementation of WordDelimiterFilter, yes.

I suppose the question here is if the 1.5 branch is dead or not (no one seems 
to commit to it).




[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2010-05-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870629#action_12870629
 ] 

Uwe Schindler commented on LUCENE-1622:
---

In my opinion, we should also have a very simple and user-friendly QP like 
Google's: no syntax at all. Just tokenize the text with the Analyzer and create 
a TermQuery for each token. The only params to this QP would be the field name 
and the default Occur enum.

People should always create ranges and so on programmatically. Having this in a 
query parser is stupid. XMLQueryParser is good for this, or maybe we also get a 
JSON query parser (I have plans to create one similar to the XML Query Parser, 
maybe using the same builders). Mark Miller was talking about this for Solr, 
too.
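
For illustration, a minimal sketch of such a "no syntax" parser against the 
trunk analysis API (the parse method itself is hypothetical - nothing like it 
exists yet):

{code}
// Hedged sketch: tokenize the text and combine the resulting TermQuerys
// under a single Occur - no operators, no ranges, no escaping.
public static Query parse(String field, String text, Analyzer analyzer,
                          BooleanClause.Occur occur) throws IOException {
  BooleanQuery query = new BooleanQuery();
  TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  stream.reset();
  while (stream.incrementToken()) {
    query.add(new TermQuery(new Term(field, termAtt.toString())), occur);
  }
  stream.end();
  stream.close();
  return query;
}
{code}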




[jira] Commented: (SOLR-1852) enablePositionIncrements="true" can cause searches to fail when they are parsed as phrase queries

2010-05-24 Thread Peter Wolanin (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870624#action_12870624
 ] 

Peter Wolanin commented on SOLR-1852:
-

now this has been in trunk longer, do you feel any more confident about a back 
port?




Re: Welcome Andrzej Bialecki as Lucene/Solr committer

2010-05-24 Thread Simon Willnauer
Welcome Andrzej!

simon

On Mon, May 24, 2010 at 11:36 AM, Uwe Schindler  wrote:
> Welcome Andrzej! I am glad to have you finally on the Team :-)
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Monday, May 24, 2010 11:34 AM
>> To: dev@lucene.apache.org
>> Subject: Welcome Andrzej Bialecki as Lucene/Solr committer
>>
>> I'm happy to announce that the PMC has accepted Andrzej Bialecki as
>> Lucene/Solr committer!
>>
>> Welcome aboard Andrzej,
>>
>> Mike



[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870619#action_12870619
 ] 

Michael McCandless commented on LUCENE-1622:


bq. For other reasons, including this, we should start thinking about removing 
QueryParser's split-on-whitespace.

I think we should remove it!

The fact that QP does this whitespace pre-split means a SynFilter
(that applies to multiple words) is unusable with QP since the
analyzer sees only one word at a time from QP.

And, QP should be as language neutral as possible...





[jira] Updated: (LUCENE-2286) enable DefaultSimilarity.setDiscountOverlaps by default

2010-05-24 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-2286:
---

Fix Version/s: 3.1

According to CHANGES.txt, this fix is in branch_3x as well.

> enable DefaultSimilarity.setDiscountOverlaps by default
> ---
>
> Key: LUCENE-2286
> URL: https://issues.apache.org/jira/browse/LUCENE-2286
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2286.patch
>
>
> I think we should enable setDiscountOverlaps in DefaultSimilarity by default.
> If you are using synonyms or commongrams or a number of other 
> 0-posInc-term-injecting methods, these currently screw up your length 
> normalization.
> These terms have a position increment of zero, so they shouldn't count towards 
> the length of the document.
> I've done relevance tests with Persian showing the difference is significant, 
> and I think it's a big trap for anyone using synonyms, etc.: your relevance can 
> actually get worse if you don't flip this boolean flag.




[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene

2010-05-24 Thread Yuval Feinstein (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870605#action_12870605
 ] 

Yuval Feinstein commented on LUCENE-2091:
-

@Vinay - I have a suggestion, though I am unsure whether it will work. 
First, I would implement the BM25BooleanQuery and use it to create a 
QueryWrapperFilter qwf.
(See 
http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/search/QueryWrapperFilter.html)
Next, I would create a PhraseQuery and call search(phraseQuery, qwf, 50).
This way, the scorer will first look for matches for the BM25 query, and then 
look among them for matches for the phrase query.
Hope this is understandable.
-- Yuval 
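
For concreteness, a rough sketch of that flow (BM25BooleanQuery and its 
constructor are assumptions based on the attached patch; QueryWrapperFilter 
and the search(Query, Filter, int) overload are the stock Lucene API):

{code}
// Illustrative only - the BM25BooleanQuery constructor is a guess:
Query bm25 = new BM25BooleanQuery("body", "new york restaurants", analyzer);
Filter qwf = new QueryWrapperFilter(bm25);

PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("body", "new"));
phrase.add(new Term("body", "york"));

// The filter restricts candidates to BM25 matches; the phrase query then
// matches and scores within that subset:
TopDocs top = searcher.search(phrase, qwf, 50);
{code}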

> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Yuval Feinstein
>Priority: Minor
> Fix For: 4.0
>
> Attachments: BM25SimilarityProvider.java, LUCENE-2091.patch, 
> persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 




Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Peter Wolanin
We use autocommit with Solr and I've had this worry too - apparently
if you get a hard crash, Solr will roll back the not-yet-committed
docs.

I don't think it's happened more than once in a year, but still possible.

-Peter

On Mon, May 24, 2010 at 9:10 AM,   wrote:
> Hi all,
>
> It seems to me that the “commit” logic in the Solr updateRequestHandler (or
> wherever the logic is actually located) conflates two different semantics.
> One semantic is what you need to do to make the index process perform well.
> The other semantic is guaranteed atomicity of document reception by Solr.
>
> In particular, it would be nice to be able to post documents in such a way
> that you can guarantee that the document is permanently in Solr’s queue,
> safe in the event of a Solr restart, etc., even if the document has not yet
> been “committed”.
>
> This issue came up in the LCF talk that I gave, and I initially thought that
> separating the two kinds of events would necessarily be an LCF change, but
> the more I thought about it the more I realized that other Solr indexing
> clients may also benefit from such a separation.
>
> Does anyone agree?  Where should this logic properly live?
>
> Thanks,
> Karl
>
>
>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com




Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread karl.wright
Hi all,

It seems to me that the "commit" logic in the Solr updateRequestHandler (or 
wherever the logic is actually located) conflates two different semantics.  One 
semantic is what you need to do to make the index process perform well.  The 
other semantic is guaranteed atomicity of document reception by Solr.

In particular, it would be nice to be able to post documents in such a way that 
you can guarantee that the document is permanently in Solr's queue, safe in the 
event of a Solr restart, etc., even if the document has not yet been 
"committed".

This issue came up in the LCF talk that I gave, and I initially thought that 
separating the two kinds of events would necessarily be an LCF change, but the 
more I thought about it the more I realized that other Solr indexing clients 
may also benefit from such a separation.

Does anyone agree?  Where should this logic properly live?

Thanks,
Karl






Re: Sorting on Facet Fields

2010-05-24 Thread MitchK

I have talked to some people who need another sorting option as well.

At the moment, custom sorting of facets is not possible out of the box.

I don't need such a feature right now, but it would be nice to have.
If there is more interest on the mailing list, I will register at the
JIRA and open an issue for it.

The magic behind the current facet implementation is that it works with
DocSets - an unordered set of documents. If we translated the current
implementation into a DocList-based one and made the needed changes to the
FacetComponent (perhaps renaming it ExtendedFacetComponent), then we could
sort the facets in the same order as the documents they belong to.

For example:

Facet-Field: cat

The response of a query looks like: 
Doc1: Name: "My Name is Richard", Cat: "Comedy"
Doc2: Name: "The Empire Strikes Back", Cat: "Science-Fiction"
Doc3: Name: "Episode II - Attack of the Clones", Cat: "Science-Fiction"
Doc4: Name: "American Pie III", Cat: "Comedy"
...

The facet would look like:
"Comedy,
Science-Fiction"
because the Comedy facet occurs before the Science-Fiction facet in the
response list.

I think this would perform well.
Perhaps we could sum up the scores of the top-n results to fine-tune the
facet response (see the sketch below).
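
For illustration, a rough sketch of that idea in Solr terms (assuming a
stored, single-valued "cat" field; the variable names are illustrative, this
is not a patch):

LinkedHashSet<String> orderedFacets = new LinkedHashSet<String>();
DocIterator iter = docList.iterator();          // DocList keeps the sort order
while (iter.hasNext()) {
  Document doc = searcher.doc(iter.nextDoc());  // SolrIndexSearcher lookup
  String cat = doc.get("cat");
  if (cat != null) {
    orderedFacets.add(cat);                     // keeps first-seen order
  }
}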


One thing that keeps me away from doing such customization is that I am
relatively new to Java, and debugging such changes is extremely expensive for
me because I don't know how to do it with Eclipse (I am new to Eclipse,
too). If there are good tutorials that deal with such
topics, please feel free to share them. Until now I haven't found any
helpful ones.
The sooner I can help our community improve Solr productively, the
better. But that's something that does not belong to your topic, of course
:-).

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-on-Facet-Fields-tp838958p839580.html
Sent from the Solr - Dev mailing list archive at Nabble.com.




[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2010-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870593#action_12870593
 ] 

Robert Muir commented on LUCENE-1622:
-

bq. There are tricky tradeoffs of index time vs search time

The worst tradeoff of all is that users can't make it.

For other reasons, including this, we should start thinking about removing 
QueryParser's split-on-whitespace.





[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870588#action_12870588
 ] 

Michael McCandless commented on LUCENE-1622:


Here's the dev thread that led to this issue, for context:

  
http://www.lucidimagination.com/search/document/fde6d4b979481398/synonym_filter_with_support_for_phrases

I think the syn filter here takes generally the same approach as
Solr's (now moved to modules/analyzer in trunk) SynonymFilter, ie
overlapping words as the expanded synonyms unwind?  Are there salient
differences between the two?  Maybe we can merge them and get the best of
both worlds?

There are tricky tradeoffs of index time vs search time -- index time
is less flexible (you must re-index on changing them) but better
search perf (OR in a TermQuery instead of expanding to many
PhraseQuerys); index time is better scoring (the IDF is "true" if the
syn is a term in the index, vs PhraseQuery which necessarily
approximates, possibly badly).

There is also the controversial question of whether using manually
defined synonyms even helps relevance :) As Robert points out, doing
an iteration of feedback (take the top N docs, that match user's
query, extract their salient terms, and do a 2nd search expanded w/
those salient terms), sort of accomplishes something similar (and
perhaps better since it's not just synonyms but also uncovers
"relationships" like Barack Obama is a US president), but w/o the
manual effort of creating the synonyms.  And it's been shown to
improve relevance.

Still, I think Lucene should make index and query time expansion
feasible.  At the plumbing level we don't have a horse in that race :)

If you do index syns at index time, you really should just inject a
single syn token, representing any occurrence of a term/phrase that
this synonym accepts (and do the matching thing @ query time).  But,
then, as Earwin pointed out, Lucene is missing the notion of "span"
saying how many positions this term took up (we only encode the pos
incr, reflecting where this token begins relative to the last token's
beginning).

EG if "food place" is a syn for "restaurant", and you have a doc
"... a great food place in boston ...", and so you inject RESTAURANT (syn
group) "over" the phrase "food place", then an exact phrase query
won't work right -- you can't have "a great RESTAURANT in boston"
match.

One simple way to express this during analysis is as a new SpanAttr
(say), which expresses how many positions the token takes up.  We
could then index this, doing so efficiently for the default case
(span==1), and then in addition to getting the .nextPosition() you
could then also ask for .span() from DocsAndPositionsEnum.

But, generalizing this a bit, really we are indexing a graph, where
the nodes are positions and the edges are tokens connecting them.
With only posIncr & span, you restrict the nodes to be a single linear
chain; but if we generalize it, then nodes can be part of side
branches; eg the node in the middle of "food place" need not be a
"real" position if it were injected into a document / query containing
restaurant.  Hard boundaries (eg b/w sentences) would be more cleanly
represented here -- there would not even be an edge between the nodes.

We'd then need an AutomatonWordQuery -- the same idea as
AutomatonQuery, except at the word level not at the character level.
MultiPhraseQuery would then be a special case of AutomatonWordQuery.

Then analysis becomes the serializing of this graph... analysis would
have to flatten out the nodes into a single linear chain, and then
express the edges using position & span.  I think position would no
longer be a hard relative position.  EG when injecting "food place" (=
2 tokens) into the tokens that contain restaurant, both food and
restaurant would have the same start position, but food would have
span 1 and restaurant would have span 2.

(Sorry for the rambling... this is a complex topic!!).
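
To make the SpanAttr idea concrete, a hedged sketch of what such an attribute
and the resulting tokens could look like (nothing like this exists in the
analysis API today):

{code}
// Hypothetical attribute - purely illustrative:
public interface SpanAttribute extends Attribute {
  int getSpan();           // number of positions this token occupies (default 1)
  void setSpan(int span);
}

// Injecting RESTAURANT over "food place" in "... a great food place in ...":
//   "food"        posIncr=1, span=1
//   "RESTAURANT"  posIncr=0, span=2   (same start position as "food")
//   "place"       posIncr=1, span=1
{code}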




Re: TestBackwardsCompatibility

2010-05-24 Thread Mark Miller

On 5/24/10 11:25 AM, Michael McCandless wrote:

Yes, I think we can remove support for 1.9 indexes as of 3.0:

 http://wiki.apache.org/lucene-java/BackwardsCompatibility

So starting with 3.0 the oldest index we must support are those written by 2.0.

Mike

On Sun, May 23, 2010 at 12:56 AM, Shai Erera  wrote:

Hi

I'm working on adding support for addIndexes* in TestBackwardsCompatibility,
and I've noticed it still reads 1.9 indexes. Is that intentional? Shouldn't
3x stop supporting 1.9?

Shai






We really need to update that wiki page - mucho changes.

--
- Mark

http://www.lucidimagination.com




[jira] Updated: (LUCENE-2471) Supporting bulk copies in Directory

2010-05-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2471:
---

Fix Version/s: 3.1, 4.0

> Supporting bulk copies in Directory
> ---
>
> Key: LUCENE-2471
> URL: https://issues.apache.org/jira/browse/LUCENE-2471
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Earwin Burrfoot
> Fix For: 3.1, 4.0
>
>
> A method can be added to IndexOutput that accepts IndexInput, and writes 
> bytes using it as a source.
> This should be used for bulk-merge cases (offhand - norms, docstores?). Some 
> Directories can then override default impl and skip intermediate buffers 
> (NIO, MMap, RAM?).
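
A minimal sketch of what such a method might look like as a default
implementation on IndexOutput (the method name, signature and buffer size are
illustrative, not from a patch):

{code}
// Default impl copies through an intermediate buffer; NIO/MMap/RAM
// directories could override this and skip the copy.
public void copyBytes(IndexInput input, long numBytes) throws IOException {
  final byte[] buffer = new byte[4096];
  long remaining = numBytes;
  while (remaining > 0) {
    final int chunk = (int) Math.min(buffer.length, remaining);
    input.readBytes(buffer, 0, chunk);
    writeBytes(buffer, 0, chunk);   // IndexOutput.writeBytes
    remaining -= chunk;
  }
}
{code}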




[jira] Commented: (LUCENE-2471) Supporting bulk copies in Directory

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870560#action_12870560
 ] 

Michael McCandless commented on LUCENE-2471:


I think this issue makes sense, separate from LUCENE-2455?  Ie this issue is 
for bulk copying when you have IndexInput/Output already open (I don't think 
LUCENE-2455 covers this?).  Whereas LUCENE-2455 is operating on file names...




[jira] Commented: (LUCENE-2474) Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870559#action_12870559
 ] 

Michael McCandless commented on LUCENE-2474:


Should we rename this to "CloseEventListener"?  Ie, when an IR is closed it'll 
notify those subscribers who asked to find out?

Also, shouldn't the FieldCache's listener be created/installed from FieldCache, 
not from IR?  Ie, when FieldCache creates an entry it should at that point ask 
the reader to notify it when that reader is closed?
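
For concreteness, a sketch of the kind of hook being discussed (all names here
are hypothetical; the attached patch is authoritative):

{code}
// Hypothetical shape of the listener:
public interface CacheEvictionListener {
  // called once when the reader that owns this cache key is closed
  void onClose(Object fieldCacheKey);
}

// Installed by the cache owner (e.g. FieldCache, per the comment above)
// when it creates an entry for a reader:
reader.addCacheEvictionListener(listener);  // hypothetical hook on IndexReader
{code}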

> Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean 
> custom caches that use the IndexReader (getFieldCacheKey)
> 
>
> Key: LUCENE-2474
> URL: https://issues.apache.org/jira/browse/LUCENE-2474
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shay Banon
> Attachments: LUCENE-2474.patch
>
>
> Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean 
> custom caches that use the IndexReader (getFieldCacheKey).
> A spin-off of: https://issues.apache.org/jira/browse/LUCENE-2468. Basically, it 
> makes a lot of sense to cache things based on IndexReader#getFieldCacheKey; 
> even Lucene itself uses it, for example, with the CachingWrapperFilter. 
> FieldCache enjoys being called explicitly to purge its cache when possible 
> (which is tricky to know from the "outside", especially when using NRT - 
> reader attack of the clones).
> The provided patch allows to plug a CacheEvictionListener which will be 
> called when the cache should be purged for an IndexReader.




[jira] Commented: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870551#action_12870551
 ] 

Michael McCandless commented on LUCENE-2272:


Thanks Peter -- this looks important to fix.

The patch, confusingly, seems to recursively include itself!  Ie I see 
PNQ-patch1.txt in the patch with its own diff lines.  Strange.

Also, how come your patch removes generics / @Override / @lucene.experimental, 
etc.?


> PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
> ---
>
> Key: LUCENE-2272
> URL: https://issues.apache.org/jira/browse/LUCENE-2272
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Peter Keegan
>Assignee: Grant Ingersoll
> Attachments: payloadfunctin-patch.txt, PNQ-patch.txt, PNQ-patch1.txt
>
>
> The 'explain' method in PayloadNearSpanScorer assumes the 
> AveragePayloadFunction was used. This patch adds the 'explain' method to the 
> 'PayloadFunction' interface, where the Scorer can call it. Added unit tests 
> for 'explain' and for {Min,Max}PayloadFunction.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870549#action_12870549
 ] 

Michael McCandless commented on LUCENE-2455:


bq. Backwards support should be much easier there, because we will provide an 
index migration tool anyway, and so CFW/CFR can always assume they're reading 
the latest version (at least in 4.0).

Hmm I think we should do live migration for this (ie don't require a
migration tool to fix your index)?  This is trivial to do on the fly
right (ie as you've done in 3.x).

bq. CFW should probably use CodecUtils in trunk - it cannot be used in 3x 
because of how CFW works today - writing a VInt first, while CodecUtils assumes 
an Int. And I don't think it's healthy to do so much changes on 3x.

Hmm yeah because of the live migration I think CodecUtils is not
actually a fit here (trunk or 3x).





[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870548#action_12870548
 ] 

Michael McCandless commented on LUCENE-2455:


Patch looks great!  So awesome seeing all the -'s in IW.java!!  Keep it up :)

And it's great that you added 3.0 back compat case to
TestBackwardsCompatibility...

Some feedback:

  * Can you change the code to read an "int firstInt" instead of
version?  And make an explicit version constant (say "PRE_VERSION"), and
then check if version is PRE_VERSION in the code.  Ie, any tests
against version (eg version > 0) should be against constants
(version == PRE_VERSION), not against 0.  (Sketched below.)

  * CFW's comment should be "make it 1 lower" than the current one
right?  Ie, -2 is the next version?
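
For illustration, that first suggestion could look roughly like this (the
constant's value and the surrounding names are guesses; the patch itself is
authoritative):

{code}
// Hedged sketch, assuming versions are written as negative vints so they
// can't collide with the old leading (non-negative) count.
static final int PRE_VERSION = 0;   // illustrative sentinel

int firstInt = stream.readVInt();
final int version;
final int count;
if (firstInt < 0) {            // versioned file: firstInt is the version
  version = firstInt;
  count = stream.readVInt();   // the entry count follows the version
} else {                       // pre-versioning file: firstInt is the count
  version = PRE_VERSION;
  count = firstInt;
}
{code}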





RE: Welcome Andrzej Bialecki as Lucene/Solr committer

2010-05-24 Thread Uwe Schindler
Welcome Andrzej! I am glad to have you finally on the Team :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, May 24, 2010 11:34 AM
> To: dev@lucene.apache.org
> Subject: Welcome Andrzej Bialecki as Lucene/Solr committer
> 
> I'm happy to announce that the PMC has accepted Andrzej Bialecki as
> Lucene/Solr committer!
> 
> Welcome aboard Andrzej,
> 
> Mike
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
> commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Welcome Andrzej Bialecki as Lucene/Solr committer

2010-05-24 Thread Michael McCandless
I'm happy to announce that the PMC has accepted Andrzej Bialecki as
Lucene/Solr committer!

Welcome aboard Andrzej,

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: TestBackwardsCompatibility

2010-05-24 Thread Uwe Schindler
But as of 3.0.0 it still supports those indexes :-) So wanna remove in 3.1?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, May 24, 2010 11:26 AM
> To: dev@lucene.apache.org
> Subject: Re: TestBackwardsCompatibility
> 
> Yes, I think we can remove support for 1.9 indexes as of 3.0:
> 
> http://wiki.apache.org/lucene-java/BackwardsCompatibility
> 
> So starting with 3.0, the oldest indexes we must support are those written
> by 2.0.
> 
> Mike
> 
> On Sun, May 23, 2010 at 12:56 AM, Shai Erera  wrote:
> > Hi
> >
> > I'm working on adding support for addIndexes* in
> > TestBackwardsCompatibility, and I've noticed it still reads 1.9
> > indexes. Is that intentional? Shouldn't 3x stop supporting 1.9?
> >
> > Shai
> >
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
> commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: TestBackwardsCompatibility

2010-05-24 Thread Michael McCandless
Yes, I think we can remove support for 1.9 indexes as of 3.0:

http://wiki.apache.org/lucene-java/BackwardsCompatibility

So starting with 3.0, the oldest indexes we must support are those written by 2.0.
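
For example, a hypothetical check (not the actual
TestBackwardsCompatibility code -- the path and the exception
expectation are assumptions) would be that such an index now fails to
open:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch: verify that an index written by a
// no-longer-supported version fails to open rather than being misread.
public class OldFormatCheck {
  public static void main(String[] args) throws Exception {
    // "path/to/19.index" is an assumed location of a 1.9-format index.
    FSDirectory dir = FSDirectory.open(new File("path/to/19.index"));
    try {
      IndexReader r = IndexReader.open(dir);
      r.close();
      System.out.println("opened: format still supported");
    } catch (Exception e) {
      // Expected once support is removed (e.g. a CorruptIndexException
      // or similar format error -- the exact exception is an assumption).
      System.out.println("rejected: " + e);
    } finally {
      dir.close();
    }
  }
}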

Mike

On Sun, May 23, 2010 at 12:56 AM, Shai Erera  wrote:
> Hi
>
> I'm working on adding support for addIndexes* in TestBackwardsCompatibility,
> and I've noticed it still reads 1.9 indexes. Is that intentional? Shouldn't
> 3x stop supporting 1.9?
>
> Shai
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org