Re: Build failed in Hudson: Lucene-trunk #1187

2010-05-13 Thread Robert Muir
the problem is a logic bug (i.e. I have no clue how to really fix it
except to switch over to a UTF-8 sort order).

In converting the automaton to UTF-8/32, and trying to emulate the UTF-16
term dictionary order, the byte transition ranges (although sorted in
UTF-16 order) are themselves in UTF-8/32 order: e.g. a byte range of
0xe0-0xef is problematic during enumeration, since the 0xee-0xef
component should be "sorted last" in UTF-16 order.

I know a workaround until we switch over, but it's going to cause wasted
seeks at the least (it's just wrong).

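To make the mismatch concrete, here is a minimal, self-contained illustration (an editorial example, not Lucene code): a BMP code point whose UTF-8 lead byte is 0xEE compares above a supplementary code point in Java's UTF-16 code-unit order, but below it in unsigned UTF-8 byte order, because supplementary characters are represented by surrogates in 0xD800-0xDFFF.

import java.io.UnsupportedEncodingException;

public class SortOrderMismatch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String bmp  = "\uE000";                               // UTF-8 bytes: ee 80 80
        String supp = new String(Character.toChars(0x10000)); // UTF-8: f0 90 80 80, UTF-16: d800 dc00

        // UTF-16 code-unit order (String.compareTo): surrogates (d800-dfff)
        // sort below e000-ffff, so the supplementary char sorts FIRST.
        System.out.println("utf-16: bmp after supp? " + (bmp.compareTo(supp) > 0)); // true

        // UTF-8 byte order: lead byte 0xee < 0xf0, so the BMP char sorts FIRST.
        int lead1 = bmp.getBytes("UTF-8")[0] & 0xFF;  // 0xee
        int lead2 = supp.getBytes("UTF-8")[0] & 0xFF; // 0xf0
        System.out.println("utf-8:  bmp after supp? " + (lead1 > lead2));           // false
    }
}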

On Thu, May 13, 2010 at 11:12 PM, Apache Hudson Server
 wrote:
> See 
>
> Changes:
>
> [mikemccand] LUCENE-2393: add total TF tracking to HighFreqTerms tool
>
> [mikemccand] LUCENE-2459: fix FilterIndexReader to (by default) emulate flex 
> API on top of pre-flex API
>
> [mikemccand] LUCENE-2449: fix DBLRU cache to clone key when it promotes an 
> entry during lookup
>
> --
> [...truncated 13128 lines...]
>    [mkdir] Created dir: 
> 
>  [javadoc] Generating Javadoc
>  [javadoc] Javadoc execution
>  [javadoc] Loading source files for package org.apache.lucene.index.memory...
>  [javadoc] Constructing Javadoc information...
>  [javadoc] Standard Doclet version 1.5.0_22
>  [javadoc] Building tree for all the packages and classes...
>  [javadoc] Building index for all the packages and classes...
>  [javadoc] Building index for all classes...
>  [javadoc] Generating 
> 
>  [javadoc] Note: Custom tags that were not seen: @lucene.experimental, 
> @lucene.internal
>      [jar] Building jar: 
> 
>     [echo] Building misc...
>
> javadocs:
>    [mkdir] Created dir: 
> 
>  [javadoc] Generating Javadoc
>  [javadoc] Javadoc execution
>  [javadoc] Loading source files for package org.apache.lucene.index...
>  [javadoc] Loading source files for package org.apache.lucene.misc...
>  [javadoc] Constructing Javadoc information...
>  [javadoc] Standard Doclet version 1.5.0_22
>  [javadoc] Building tree for all the packages and classes...
>  [javadoc] 
> :43:
>  warning - Tag @link: reference not found: 
> IndexWriter#addIndexes(IndexReader[])
>  [javadoc] Building index for all the packages and classes...
>  [javadoc] Building index for all classes...
>  [javadoc] Generating 
> 
>  [javadoc] Note: Custom tags that were not seen: @lucene.internal
>  [javadoc] 1 warning
>      [jar] Building jar: 
> 
>     [echo] Building queries...
>
> javadocs:
>    [mkdir] Created dir: 
> 
>  [javadoc] Generating Javadoc
>  [javadoc] Javadoc execution
>  [javadoc] Loading source files for package org.apache.lucene.search...
>  [javadoc] Loading source files for package org.apache.lucene.search.regex...
>  [javadoc] Loading source files for package 
> org.apache.lucene.search.similar...
>  [javadoc] Constructing Javadoc information...
>  [javadoc] Standard Doclet version 1.5.0_22
>  [javadoc] Building tree for all the packages and classes...
>  [javadoc] Building index for all the packages and classes...
>  [javadoc] Building index for all classes...
>  [javadoc] Generating 
> 
>  [javadoc] Note: Custom tags that were not seen: @lucene.experimental, 
> @lucene.internal
>      [jar] Building jar: 
> 
>     [echo] Building queryparser...
>
> javadocs:
>    [mkdir] Created dir: 
> 
>  [javadoc] Generating Javadoc
>  [javadoc] Javadoc execution
>  [javadoc] Loading source files for package 
> org.apache.lucene.queryParser.analyzing...
>  [javadoc] Loading source files for package 
> org.apache.lucene.queryParser.complexPhrase...
>  [javadoc] Loading source files for package 
> org.apache.lucene.queryParser.core...
>  [javadoc] Loading source files fo

RE: Build failed in Hudson: Lucene-trunk #1187

2010-05-13 Thread Uwe Schindler
A new failure:

[junit] Testsuite: org.apache.lucene.search.TestRegexpRandom2
[junit] Testcase: testRegexps(org.apache.lucene.search.TestRegexpRandom2):  
FAILED
[junit] state=-1
[junit] junit.framework.AssertionFailedError: state=-1
[junit] at 
org.apache.lucene.search.AutomatonTermsEnum.setLinear(AutomatonTermsEnum.java:182)
[junit] at 
org.apache.lucene.search.AutomatonTermsEnum.nextSeekTerm(AutomatonTermsEnum.java:162)
[junit] at 
org.apache.lucene.search.FilteredTermsEnum.next(FilteredTermsEnum.java:200)
[junit] at 
org.apache.lucene.search.MultiTermQuery$BooleanQueryRewrite.collectTerms(MultiTermQuery.java:222)
[junit] at 
org.apache.lucene.search.MultiTermQuery$ConstantScoreAutoRewrite.rewrite(MultiTermQuery.java:568)
[junit] at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:755)
[junit] at 
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:270)
[junit] at org.apache.lucene.search.Query.weight(Query.java:100)
[junit] at 
org.apache.lucene.search.Searcher.createWeight(Searcher.java:147)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:98)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:108)
[junit] at 
org.apache.lucene.search.TestRegexpRandom2.assertSame(TestRegexpRandom2.java:135)
[junit] at 
org.apache.lucene.search.TestRegexpRandom2.__CLR2_6_34z034v1p9p(TestRegexpRandom2.java:117)
[junit] at 
org.apache.lucene.search.TestRegexpRandom2.testRegexps(TestRegexpRandom2.java:115)
[junit] at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:276)
[junit] 
[junit] 
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 3.403 sec
[junit] 
[junit] - Standard Output ---
[junit] NOTE: random seed of testcase 'testRegexps' was: 
-1754577092406875861
[junit] -  ---
[junit] TEST org.apache.lucene.search.TestRegexpRandom2 FAILED

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
> Sent: Friday, May 14, 2010 5:12 AM
> To: dev@lucene.apache.org
> Subject: Build failed in Hudson: Lucene-trunk #1187
> 
> See  trunk/1187/changes>
> 
> Changes:
> 
> [mikemccand] LUCENE-2393: add total TF tracking to HighFreqTerms tool
> 
> [mikemccand] LUCENE-2459: fix FilterIndexReader to (by default) emulate
> flex API on top of pre-flex API
> 
> [mikemccand] LUCENE-2449: fix DBLRU cache to clone key when it promotes
> an entry during lookup
> 
> --
> [...truncated 13128 lines...]
> [mkdir] Created dir:
>  trunk/ws/lucene/build/docs/api/contrib-memory>
>   [javadoc] Generating Javadoc
>   [javadoc] Javadoc execution
>   [javadoc] Loading source files for package
> org.apache.lucene.index.memory...
>   [javadoc] Constructing Javadoc information...
>   [javadoc] Standard Doclet version 1.5.0_22
>   [javadoc] Building tree for all the packages and classes...
>   [javadoc] Building index for all the packages and classes...
>   [javadoc] Building index for all classes...
>   [javadoc] Generating
>  trunk/ws/lucene/build/docs/api/contrib-memory/stylesheet.css...>
>   [javadoc] Note: Custom tags that were not seen:
> @lucene.experimental, @lucene.internal
>   [jar] Building jar:
>  trunk/ws/lucene/build/contrib/memory/lucene-memory-2010-05-14_02-03-41-
> javadoc.jar>
>  [echo] Building misc...
> 
> javadocs:
> [mkdir] Created dir:
>  trunk/ws/lucene/build/docs/api/contrib-misc>
>   [javadoc] Generating Javadoc
>   [javadoc] Javadoc execution
>   [javadoc] Loading source files for package org.apache.lucene.index...
>   [javadoc] Loading source files for package org.apache.lucene.misc...
>   [javadoc] Constructing Javadoc information...
>   [javadoc] Standard Doclet version 1.5.0_22
>   [javadoc] Building tree for all the packages and classes...
>   [javadoc]  trunk/ws/lucene/contrib/misc/src/java/org/apache/lucene/index/MultiPass
> IndexSplitter.java>:43: warning - Tag @link: reference not found:
> IndexWriter#addIndexes(IndexReader[])
>   [javadoc] Building index for all the packages and classes...
>   [javadoc] Building index for all classes...
>   [javadoc] Generating
>  trunk/ws/lucene/build/docs/api/contrib-misc/stylesheet.css...>
>   [javadoc] Note: Custom tags that were not seen:  @lucene.internal
>   [javadoc] 1 warning
>   [jar] Building jar:
> 

[jira] Commented: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867382#action_12867382
 ] 

Shai Erera commented on LUCENE-1585:


bq. Could we also use this for the term bytes itself?

I think you'd want to use the same approach, yes. But I'm not sure I want to 
reuse the same classes for that purpose, for several reasons:
* The classes have the word Payload all over the place - javadocs, names etc. 
And for a good reason IMO - that's what they do.
* One is expected to provide a different PP per Directory and Term, but to 
convert the NumericField terms I don't think one would use different PPs at 
all; i.e. a single TermsConverter / NumericFieldsTermConverter would be good 
for whatever Directory + Term.
* The sort of operation you suggest (converting terms) seems to be a one-time 
op -- when I migrate my indexes. PPP, on the other hand (at least in my case), 
will be used whenever I call addIndexes* so that I can process and rewrite the 
payloads of the incoming indexes.

So while both will do byte[] conversion, I think those are two separate tools. 
Yours should probably live in an o.a.l.migration package or something, because 
it will be relevant to index migration only. Or did I misunderstand you?

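For readers following the thread, here is a rough sketch of the PPP shape being discussed: a provider that hands out one payload processor per source Directory, which in turn hands out one processor per Term. The class nesting and method signatures below are editorial assumptions for illustration, not the committed API.

import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public abstract class PayloadProcessorProvider {

    /** Return the processor for payloads coming from the given source Directory. */
    public abstract DirPayloadProcessor getDirProcessor(Directory source);

    /** Per-Directory processor: picks a per-Term payload processor. */
    public static abstract class DirPayloadProcessor {
        /** Return null to leave this term's payloads untouched. */
        public abstract PayloadProcessor getProcessor(Term term);
    }

    /** Rewrites a single payload during merge / addIndexes. */
    public static abstract class PayloadProcessor {
        /** Process one payload in place and return its new length. */
        public abstract int processPayload(byte[] payload, int start, int length);
    }
}

With this shape, a TermsConverter for Uwe's use case would simply ignore the Directory and Term arguments and return the same converter everywhere; hooking the provider in would then presumably be a free setter on IndexWriter, as agreed later in this thread.
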
> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1585:
---

Attachment: LUCENE-1585_trunk.patch

Good idea, Mike!

I've added hasPayloadProcessorProvider to MergeState and used it in 
TermsConsumer.

> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Build failed in Hudson: Lucene-trunk #1187

2010-05-13 Thread Apache Hudson Server
See 

Changes:

[mikemccand] LUCENE-2393: add total TF tracking to HighFreqTerms tool

[mikemccand] LUCENE-2459: fix FilterIndexReader to (by default) emulate flex 
API on top of pre-flex API

[mikemccand] LUCENE-2449: fix DBLRU cache to clone key when it promotes an 
entry during lookup

--
[...truncated 13128 lines...]
[mkdir] Created dir: 

  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.index.memory...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_22
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 

  [javadoc] Note: Custom tags that were not seen:  @lucene.experimental, 
@lucene.internal
  [jar] Building jar: 

 [echo] Building misc...

javadocs:
[mkdir] Created dir: 

  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.index...
  [javadoc] Loading source files for package org.apache.lucene.misc...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_22
  [javadoc] Building tree for all the packages and classes...
  [javadoc] 
:43:
 warning - Tag @link: reference not found: IndexWriter#addIndexes(IndexReader[])
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 

  [javadoc] Note: Custom tags that were not seen:  @lucene.internal
  [javadoc] 1 warning
  [jar] Building jar: 

 [echo] Building queries...

javadocs:
[mkdir] Created dir: 

  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search...
  [javadoc] Loading source files for package org.apache.lucene.search.regex...
  [javadoc] Loading source files for package org.apache.lucene.search.similar...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_22
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 

  [javadoc] Note: Custom tags that were not seen:  @lucene.experimental, 
@lucene.internal
  [jar] Building jar: 

 [echo] Building queryparser...

javadocs:
[mkdir] Created dir: 

  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.analyzing...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.complexPhrase...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.builders...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.config...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.messages...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.nodes...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.parser...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.processors...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.core.util...
  [javadoc] Loading source files for package 
org.apache.lucene.queryParser.ext...
  [javadoc] Loading source files for package 
org.a

Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

2010-05-13 Thread Israel Ekpo
Correction,

I meant to list

https://issues.apache.org/jira/browse/LUCENE-2460
https://issues.apache.org/jira/browse/SOLR-1913



On Thu, May 13, 2010 at 10:13 PM, Israel Ekpo  wrote:

> I have created two ISSUES as new features
>
> https://issues.apache.org/jira/browse/LUCENE-1560
>
> https://issues.apache.org/jira/browse/SOLR-1913
>
> The first one is for the Lucene Filter.
>
> The second one is for the Solr QParserPlugin
>
> The source code and jar files are attached and the Solr plugin is available
> for use immediately.
>
>
>
>
> On Thu, May 13, 2010 at 6:42 PM, Andrzej Bialecki  wrote:
>
>> On 2010-05-13 23:27, Israel Ekpo wrote:
>> > Hello Lucene and Solr Community
>> >
>> > I have a custom org.apache.lucene.search.Filter that I would like to
>> > contribute to the Lucene and Solr projects.
>> >
>> > So I would need some direction as to how to create an ISSUE or submit a
>> > patch.
>> >
>> > It looks like there have been changes to the way this is done since the
>> > latest merge of the two projects (Lucene and Solr).
>> >
>> > Recently, some Solr users have been looking for a way to perform bitwise
>> > operations between an integer value and some fields in the Index
>> >
>> > So, I wrote a Solr QParser plugin to do this using a custom Lucene
>> Filter.
>> >
>> > This package makes it possible to filter results returned from a query
>> based
>> > on the results of a bitwise operation on an integer field in the
>> documents
>> > returned from the pre-constructed query.
>>
>> Hi,
>>
>> What a coincidence! :) I'm working on something very similar, only the
>> use case that I need to support is slightly different - I want to
>> support a ranked search based on a bitwise overlap of query value and
>> field value. That is, the number of differing bits would reduce the
>> score. This scenario occurs e.g. during near-duplicate detection that
>> uses fuzzy signatures, on document- or sentence levels.
>>
>> I'm going to submit my code early next week, it still needs some
>> polishing. I have two ways to execute this query, neither of which uses
>> filters at the moment:
>>
>> * method 1: during indexing the bits in the fields are turned into
>> on/off terms on the same field, and during search a BooleanQuery is
>> formed from the int value with the same terms. Scoring is courtesy of
>> BooleanScorer. This method supports only a single int value per field.
>>
>> * method 2, not yet complete - during indexing the bits are turned into
>> terms as before, but this method supports multiple int values per field:
>> terms that correspond to bitmasks on the same value are put at the same
>> positions. Then a specialized Query / Scorer traverses all 32 posting
>> lists in parallel, moving through all matching docs and scoring
>> according to how many terms matched at the same position.
>>
>> I wrapped this in a Solr FieldType, and instead of using a custom
>> QParser plugin I simply implemented FieldType.getFieldQuery().
>>
>> It would be great to work out a convenient user-level API for this
>> feature, both the scoring and the non-scoring case.
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
>
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>



-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

2010-05-13 Thread Israel Ekpo
I have created two ISSUES as new features

https://issues.apache.org/jira/browse/LUCENE-1560

https://issues.apache.org/jira/browse/SOLR-1913

The first one is for the Lucene Filter.

The second one is for the Solr QParserPlugin

The source code and jar files are attached and the Solr plugin is available
for use immediately.



On Thu, May 13, 2010 at 6:42 PM, Andrzej Bialecki  wrote:

> On 2010-05-13 23:27, Israel Ekpo wrote:
> > Hello Lucene and Solr Community
> >
> > I have a custom org.apache.lucene.search.Filter that I would like to
> > contribute to the Lucene and Solr projects.
> >
> > So I would need some direction as to how to create an ISSUE or submit a
> > patch.
> >
> > It looks like there have been changes to the way this is done since the
> > latest merge of the two projects (Lucene and Solr).
> >
> > Recently, some Solr users have been looking for a way to perform bitwise
> > operations between an integer value and some fields in the Index
> >
> > So, I wrote a Solr QParser plugin to do this using a custom Lucene
> Filter.
> >
> > This package makes it possible to filter results returned from a query
> based
> > on the results of a bitwise operation on an integer field in the
> documents
> > returned from the pre-constructed query.
>
> Hi,
>
> What a coincidence! :) I'm working on something very similar, only the
> use case that I need to support is slightly different - I want to
> support a ranked search based on a bitwise overlap of query value and
> field value. That is, the number of differing bits would reduce the
> score. This scenario occurs e.g. during near-duplicate detection that
> uses fuzzy signatures, on document- or sentence levels.
>
> I'm going to submit my code early next week, it still needs some
> polishing. I have two ways to execute this query, neither of which uses
> filters at the moment:
>
> * method 1: during indexing the bits in the fields are turned into
> on/off terms on the same field, and during search a BooleanQuery is
> formed from the int value with the same terms. Scoring is courtesy of
> BooleanScorer. This method supports only a single int value per field.
>
> * method 2, not yet complete - during indexing the bits are turned into
> terms as before, but this method supports multiple int values per field:
> terms that correspond to bitmasks on the same value are put at the same
> positions. Then a specialized Query / Scorer traverses all 32 posting
> lists in parallel, moving through all matching docs and scoring
> according to how many terms matched at the same position.
>
> I wrapped this in a Solr FieldType, and instead of using a custom
> QParser plugin I simply implemented FieldType.getFieldQuery().
>
> It would be great to work out a convenient user-level API for this
> feature, both the scoring and the non-scoring case.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


[jira] Updated: (SOLR-1913) QParserPlugin plugin for Search Results Filtering Based on Bitwise Operations on Integer Fields

2010-05-13 Thread Israel Ekpo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Israel Ekpo updated SOLR-1913:
--

Attachment: SOLR-1913.bitwise.tar.gz
bitwise_filter_plugin.jar

Attaching JAR file containing the QParserPlugin

To test out this plugin, simply copy the jar file containing the plugin classes 
into your $SOLR_HOME/lib directory and then
add the following to your solrconfig.xml file after the dismax request handler:

<queryParser name="bitwise" class="org.apache.solr.bitwise.BitwiseQueryParserPlugin" basedOn="dismax" />



Restart your servlet container.

> QParserPlugin plugin for Search Results Filtering Based on Bitwise Operations 
> on Integer Fields
> ---
>
> Key: SOLR-1913
> URL: https://issues.apache.org/jira/browse/SOLR-1913
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Israel Ekpo
> Fix For: 1.4, 1.5, 1.6, 3.1, 4.0
>
> Attachments: bitwise_filter_plugin.jar, SOLR-1913.bitwise.tar.gz
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> BitwiseQueryParserPlugin is a org.apache.solr.search.QParserPlugin that 
> allows 
> users to filter the documents returned from a query
> by performing bitwise operations between a particular integer field in the 
> index
> and the specified value.
> This Solr plugin is based on the BitwiseFilter in LUCENE-2460
> See https://issues.apache.org/jira/browse/LUCENE-2460 for more details
> This is the syntax for searching in Solr:
> http://localhost:8983/path/to/solr/select/?q={!bitwise field=fieldname 
> op=OPERATION_NAME source=sourcevalue negate=boolean}remainder of query
> Example :
> http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions 
> op=AND source=3 negate=true}state:FL
> The negate parameter is optional
> The field parameter is the name of the integer field
> The op parameter is the name of the operation; one of {AND, OR, XOR}
> The source parameter is the specified integer value
> The negate parameter is a boolean indicating whether or not to negate the 
> results of the bitwise operation
> To test out this plugin, simply copy the jar file containing the plugin 
> classes into your $SOLR_HOME/lib directory and then
> add the following to your solrconfig.xml file after the dismax request 
> handler:
> <queryParser name="bitwise" class="org.apache.solr.bitwise.BitwiseQueryParserPlugin" basedOn="dismax" />
> Restart your servlet container.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-1913) QParserPlugin plugin for Search Results Filtering Based on Bitwise Operations on Integer Fields

2010-05-13 Thread Israel Ekpo (JIRA)
QParserPlugin plugin for Search Results Filtering Based on Bitwise Operations 
on Integer Fields
---

 Key: SOLR-1913
 URL: https://issues.apache.org/jira/browse/SOLR-1913
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Israel Ekpo
 Fix For: 1.5, 1.6, 3.1, 4.0, 1.4



BitwiseQueryParserPlugin is a org.apache.solr.search.QParserPlugin that allows 
users to filter the documents returned from a query
by performing bitwise operations between a particular integer field in the index
and the specified value.

This Solr plugin is based on the BitwiseFilter in LUCENE-2460

See https://issues.apache.org/jira/browse/LUCENE-2460 for more details

This is the syntax for searching in Solr:

http://localhost:8983/path/to/solr/select/?q={!bitwise field=fieldname 
op=OPERATION_NAME source=sourcevalue negate=boolean}remainder of query

Example :

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions 
op=AND source=3 negate=true}state:FL

The negate parameter is optional

The field parameter is the name of the integer field
The op parameter is the name of the operation; one of {AND, OR, XOR}
The source parameter is the specified integer value
The negate parameter is a boolean indicating whether or not to negate the 
results of the bitwise operation

To test out this plugin, simply copy the jar file containing the plugin classes 
into your $SOLR_HOME/lib directory and then
add the following to your solrconfig.xml file after the dismax request handler:

<queryParser name="bitwise" class="org.apache.solr.bitwise.BitwiseQueryParserPlugin" basedOn="dismax" />



Restart your servlet container.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1912) Replication handler should offer more useful status messages, especially during fsync/commit/etc.

2010-05-13 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1912:
---

Attachment: SOLR-1912.patch

> Replication handler should offer more useful status messages, especially 
> during fsync/commit/etc.
> -
>
> Key: SOLR-1912
> URL: https://issues.apache.org/jira/browse/SOLR-1912
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (java)
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1912.patch
>
>
> If you go to the replication admin page 
> (http://server:port/solr/core/admin/replication/index.jsp) while replication 
> is in progress, then you'll see a "Current Replication Status" section, which 
> indicates how far along the replication download is, both overall and for the 
> current file. It's great to see this status info. However, the replication 
> admin page becomes misleading once the last file has been downloaded. In 
> particular, after all downloads are complete Solr 1.4 continues to display 
> things like this:
> {quote}
> Downloading File: _wv_1.del, Downloaded: 44 bytes / 44 bytes [100.0%] 
> {quote}
> until all the index copying, fsync-ing, committing, and so on are complete. 
> It gives the disconcerting impression that data transfer between master and 
> slaves has mysteriously stalled right at the end of a 44 byte download. In 
> case this is weird, let me mention that after a full replication I did just 
> now, Solr spent quite a while in SnapPuller.terminateAndWaitFsyncService(), 
> somewhere between many seconds and maybe 5 minutes.
> I propose that the admin page should offer more useful status messages while 
> fsync/etc. are going on. I offer an initial patch that does this. SnapPuller 
> is modified to always offer a human readable indication of the "current 
> operation", and this is displayed on the replication page. We also stop 
> showing progress indication for the "current file", except when there 
> actually is a file currently being downloaded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2460) Search Results Filtering Based on Bitwise Operations on Integer Fields

2010-05-13 Thread Israel Ekpo (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Israel Ekpo updated LUCENE-2460:


Attachment: LUCENE-2460-bitwise.tar.gz

Attaching the package containing the BitwiseFilter class

> Search Results Filtering Based on Bitwise Operations on Integer Fields
> --
>
> Key: LUCENE-2460
> URL: https://issues.apache.org/jira/browse/LUCENE-2460
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Israel Ekpo
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2460-bitwise.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> This package makes it possible to filter results returned from a query based 
> on the results of a bitwise operation on an integer field in the documents 
> returned from the pre-constructed query.
> You can perform three basic types of operations on these integer fields
> * BitwiseOperation.BITWISE_AND (bitwise AND)
> * BitwiseOperation.BITWISE_OR (bitwise inclusive OR)
> * BitwiseOperation.BITWISE_XOR (bitwise exclusive OR)
> You can also negate the results of these operations.
> For example, imagine there is an integer field in the index named "flags" 
> with a value of 8 (1000 in binary). The following results will be expected:
>1. A source value of 8 will match during a BitwiseOperation.BITWISE_AND 
> operation, with negate set to false.
>2. A source value of 4 will match during a BitwiseOperation.BITWISE_AND 
> operation, with negate set to true.
> The BitwiseFilter constructor accepts the following values
> * The name of the integer field (A string)
> * The BitwiseOperation object. Example BitwiseOperation.BITWISE_XOR
> * The source value (an integer)
> * A boolean value indicating whether or not to negate the results of the 
> operation
> * A pre-constructed org.apache.lucene.search.Query

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-1912) Replication handler should offer more useful status messages, especially during fsync/commit/etc.

2010-05-13 Thread Chris Harris (JIRA)
Replication handler should offer more useful status messages, especially during 
fsync/commit/etc.
-

 Key: SOLR-1912
 URL: https://issues.apache.org/jira/browse/SOLR-1912
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Affects Versions: 1.4
Reporter: Chris Harris
 Attachments: SOLR-1912.patch

If you go to the replication admin page 
(http://server:port/solr/core/admin/replication/index.jsp) while replication is 
in progress, then you'll see a "Current Replication Status" section, which 
indicates how far along the replication download is, both overall and for the 
current file. It's great to see this status info. However, the replication 
admin page becomes misleading once the last file has been downloaded. In 
particular, after all downloads are complete Solr 1.4 continues to display 
things like this:

{quote}
Downloading File: _wv_1.del, Downloaded: 44 bytes / 44 bytes [100.0%] 
{quote}

until all the index copying, fsync-ing, committing, and so on are complete. It 
gives the disconcerting impression that data transfer between master and slaves 
has mysteriously stalled right at the end of a 44 byte download. In case this 
is weird, let me mention that after a full replication I did just now, Solr 
spent quite a while in SnapPuller.terminateAndWaitFsyncService(), somewhere 
between many seconds and maybe 5 minutes.

I propose that the admin page should offer more useful status messages while 
fsync/etc. are going on. I offer an initial patch that does this. SnapPuller is 
modified to always offer a human readable indication of the "current 
operation", and this is displayed on the replication page. We also stop showing 
progress indication for the "current file", except when there actually is a 
file currently being downloaded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2460) Search Results Filtering Based on Bitwise Operations on Integer Fields

2010-05-13 Thread Israel Ekpo (JIRA)
Search Results Filtering Based on Bitwise Operations on Integer Fields
--

 Key: LUCENE-2460
 URL: https://issues.apache.org/jira/browse/LUCENE-2460
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: Israel Ekpo
 Fix For: 3.1, 4.0


This package makes it possible to filter results returned from a query based on 
the results of a bitwise operation on an integer field in the documents 
returned from the pre-constructed query.

You can perform three basic types of operations on these integer fields

* BitwiseOperation.BITWISE_AND (bitwise AND)
* BitwiseOperation.BITWISE_OR (bitwise inclusive OR)
* BitwiseOperation.BITWISE_XOR (bitwise exclusive OR)

You can also negate the results of these operations.

For example, imagine there is an integer field in the index named "flags" with 
a value of 8 (1000 in binary). The following results will be expected:

   1. A source value of 8 will match during a BitwiseOperation.BITWISE_AND 
operation, with negate set to false.
   2. A source value of 4 will match during a BitwiseOperation.BITWISE_AND 
operation, with negate set to true.

The BitwiseFilter constructor accepts the following values

* The name of the integer field (A string)
* The BitwiseOperation object. Example BitwiseOperation.BITWISE_XOR
* The source value (an integer)
* A boolean value indicating whether or not to negate the results of the 
operation
* A pre-constructed org.apache.lucene.search.Query

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

2010-05-13 Thread Andrzej Bialecki
On 2010-05-13 23:27, Israel Ekpo wrote:
> Hello Lucene and Solr Community
> 
> I have a custom org.apache.lucene.search.Filter that I would like to
> contribute to the Lucene and Solr projects.
> 
> So I would need some direction as to how to create an ISSUE or submit a
> patch.
> 
> It looks like there have been changes to the way this is done since the
> latest merge of the two projects (Lucene and Solr).
> 
> Recently, some Solr users have been looking for a way to perform bitwise
> operations between an integer value and some fields in the Index
> 
> So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.
> 
> This package makes it possible to filter results returned from a query based
> on the results of a bitwise operation on an integer field in the documents
> returned from the pre-constructed query.

Hi,

What a coincidence! :) I'm working on something very similar, only the
use case that I need to support is slightly different - I want to
support a ranked search based on a bitwise overlap of query value and
field value. That is, the number of differing bits would reduce the
score. This scenario occurs e.g. during near-duplicate detection that
uses fuzzy signatures, on document- or sentence levels.

I'm going to submit my code early next week, it still needs some
polishing. I have two ways to execute this query, neither of which uses
filters at the moment:

* method 1: during indexing the bits in the fields are turned into
on/off terms on the same field, and during search a BooleanQuery is
formed from the int value with the same terms. Scoring is courtesy of
BooleanScorer. This method supports only a single int value per field.

* method 2, not yet complete - during indexing the bits are turned into
terms as before, but this method supports multiple int values per field:
terms that correspond to bitmasks on the same value are put at the same
positions. Then a specialized Query / Scorer traverses all 32 posting
lists in parallel, moving through all matching docs and scoring
according to how many terms matched at the same position.

I wrapped this in a Solr FieldType, and instead of using a custom
QParser plugin I simply implemented FieldType.getFieldQuery().

It would be great to work out a convenient user-level API for this
feature, both the scoring and the non-scoring case.
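
Below is a rough sketch of "method 1" as described above (an editorial illustration under assumptions: the field naming and bit-to-term scheme are invented here, not Andrzej's actual code): each set bit of the int is indexed as its own term, and the query becomes one SHOULD clause per set bit, so BooleanScorer naturally rewards documents whose stored bits overlap the query value.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

class BitwiseOverlapSketch {

    // At index time: one term per set bit, e.g. value 5 -> terms "0" and "2".
    static void addBits(Document doc, String field, int value) {
        for (int bit = 0; bit < 32; bit++) {
            if ((value & (1 << bit)) != 0) {
                doc.add(new Field(field, Integer.toString(bit),
                        Field.Store.NO, Field.Index.NOT_ANALYZED));
            }
        }
    }

    // At search time: one SHOULD clause per set bit of the query value;
    // more matching bits means more matching clauses, hence a higher score.
    static BooleanQuery bitsQuery(String field, int value) {
        BooleanQuery q = new BooleanQuery();
        for (int bit = 0; bit < 32; bit++) {
            if ((value & (1 << bit)) != 0) {
                q.add(new TermQuery(new Term(field, Integer.toString(bit))),
                      BooleanClause.Occur.SHOULD);
            }
        }
        return q;
    }
}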

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Bitwise Operations on Integer Fields in Lucene and Solr Index

2010-05-13 Thread Israel Ekpo
Hello Lucene and Solr Community

I have a custom org.apache.lucene.search.Filter that I would like to
contribute to the Lucene and Solr projects.

So I would need some direction as to how to create an ISSUE or submit a
patch.

It looks like there have been changes to the way this is done since the
latest merge of the two projects (Lucene and Solr).

Recently, some Solr users have been looking for a way to perform bitwise
operations between an integer value and some fields in the Index

So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.

This package makes it possible to filter results returned from a query based
on the results of a bitwise operation on an integer field in the documents
returned from the pre-constructed query.

You can perform three basic types of operations on these integer fields

* BitwiseOperation.BITWISE_AND (bitwise AND)
* BitwiseOperation.BITWISE_OR (bitwise inclusive OR)
* BitwiseOperation.BITWISE_XOR (bitwise exclusive OR)

You can also negate the results of these operations.

For example, imagine there is an integer field in the index named "flags"
with a value of 8 (1000 in binary). The following results will be expected:

   1. A source value of 8 will match during a BitwiseOperation.BITWISE_AND
operation, with negate set to false.
   2. A source value of 4 will match during a BitwiseOperation.BITWISE_AND
operation, with negate set to true.
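
To make the arithmetic behind those two cases explicit (an editorial worked check, assuming the match rule the description implies: a non-negated operation matches when the bitwise result is non-zero):

public class NegateCheck {
    public static void main(String[] args) {
        int flags = 8; // binary 1000, the indexed value
        // case 1: 1000 AND 1000 = 1000 (non-zero) -> matches with negate=false
        System.out.println((flags & 8) != 0); // true
        // case 2: 1000 AND 0100 = 0000 (zero) -> matches only with negate=true
        System.out.println((flags & 4) == 0); // true
    }
}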

The BitwiseFilter constructor accepts the following values

* The name of the integer field (A string)
* The BitwiseOperation object. Example BitwiseOperation.BITWISE_XOR
* The source value (an integer)
* A boolean value indicating whether or not to negate the results of the
operation
* A pre-constructed org.apache.lucene.search.Query

Here is an example of how you would use it with Solr

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions
op=AND source=3 negate=true}state:FL

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions
op=AND source=3}state:FL

Here is an example of how you would use it with Lucene

public class BitwiseTestSearch extends BitwiseTestBase {

    public BitwiseTestSearch()
    {
    }

    public void search() throws IOException, ParseException
    {
        // isearcher, the *_KEY field names, setupSearch() and shutdown()
        // come from BitwiseTestBase
        setupSearch();

        // term
        Term t = new Term(COUNTRY_KEY, "us");

        // term query
        Query q = new TermQuery(t);

        // maximum number of documents to display
        int limit = 1000;

        int sourceValue = 0;

        boolean negate = false;

        // wrap the base query so only documents whose user_permissions field
        // passes the XOR test against sourceValue are returned
        BitwiseFilter bitwiseFilter = new BitwiseFilter(USER_PERMS_KEY,
            BitwiseOperation.BITWISE_XOR, sourceValue, negate, q);

        Query fq = new FilteredQuery(q, bitwiseFilter);

        ScoreDoc[] hits = isearcher.search(fq, null, limit).scoreDocs;

        BitwiseResultFilter resultFilter = bitwiseFilter.getResultFilter();

        for (int i = 0; i < hits.length; i++) {

            Document hitDoc = isearcher.doc(hits[i].doc);

            System.out.println(FIRST_NAME_KEY + " field has a value of " + hitDoc.get(FIRST_NAME_KEY));
            System.out.println(LAST_NAME_KEY + " field has a value of " + hitDoc.get(LAST_NAME_KEY));
            System.out.println(ACTIVE_KEY + " field has a value of " + hitDoc.get(ACTIVE_KEY));
            System.out.println(USER_PERMS_KEY + " field has a value of " + hitDoc.get(USER_PERMS_KEY));

            System.out.println("doc ID --> " + hits[i].doc);

            System.out.println("...");
        }

        System.out.println("sourceValue = " + sourceValue + ", operation = "
            + resultFilter.getOperation().getOperationName() + ", negate = " + negate);

        System.out.println("A total of " + hits.length + " documents were found from the search\n");

        shutdown();
    }

    public static void main(String args[]) throws IOException, ParseException
    {
        BitwiseTestSearch search = new BitwiseTestSearch();

        search.search();
    }
}

Any guidance would be highly appreciated.

Thanks.


-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


[jira] Commented: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867260#action_12867260
 ] 

Uwe Schindler commented on LUCENE-1585:
---

Just an idea:
Could we also use this for the term bytes itself? E.g. when converting 
NumericFields in our 4.0 index converter to use the full 8 bits? So we just 
process the old index and merge to the converted one? During that, all terms 
are converted using the processor?

> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867254#action_12867254
 ] 

Michael McCandless commented on LUCENE-1585:


Patches look good Shai.

One thing -- in TermsConsumer, it'd be nice to not step through the for loop 
checking for non-null dirPP, if the IW had no PPP?

> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1585:
---

Attachment: LUCENE-1585_trunk.patch

Updated trunk's patch.

All tests pass. I plan to commit this tomorrow.

> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch, LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1900) move Solr to flex APIs

2010-05-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867233#action_12867233
 ] 

Yonik Seeley commented on SOLR-1900:


Just committed a fix (r943994) so that getDocSet skips deleted docs.
This didn't seem to cause any issues because the generated sets are always 
intersected with other sets (like a base doc set) that do exclude deleted 
docs.

> move Solr to flex APIs
> --
>
> Key: SOLR-1900
> URL: https://issues.apache.org/jira/browse/SOLR-1900
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 4.0
>Reporter: Yonik Seeley
> Fix For: 4.0
>
> Attachments: SOLR-1900-facet_enum.patch, SOLR-1900-facet_enum.patch
>
>
> Solr should use flex APIs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-236) Field collapsing

2010-05-13 Thread Sergey Shinderuk (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867227#action_12867227
 ] 

Sergey Shinderuk commented on SOLR-236:
---

Finally, I applied SOLR-236.patch to rev 899572 (dated 2010-01-15) of the trunk 
and I get correct numFound values with collapsing enabled.

> Field collapsing
> 
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Emmanuel Keller
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.5
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
> field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
> quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
> SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1585:
---

Attachment: LUCENE-1585_3x.patch

setPPP moved from IWC to IW. I'll go update the trunk one now.

> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".
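
To make the proposal above concrete, here is a rough sketch of what a 
user-supplied merger could look like. The class name and signature are 
illustrative only, taken from the prose of the description rather than from 
any committed API:

{code}
// Hypothetical shape of the proposed extension point: subclass it to
// re-encode payloads while SegmentMerger copies them between segments.
public abstract class PayloadMerger {
  /** Convert one payload from the old encoding into the new one. */
  public abstract byte[] merge(String field, byte[] oldPayload);
}
{code}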

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1585) Allow to control how payloads are merged

2010-05-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867208#action_12867208
 ] 

Michael McCandless commented on LUCENE-1585:


OK so let's have it as a free setter on IW...

> Allow to control how payloads are merged
> 
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-05-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2393.


Fix Version/s: 4.0
   Resolution: Fixed

Thanks Tom!

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Fix For: 4.0
>
> Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, 
> LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a pair of command-line utilities that provide information on the 
> total number of occurrences of a term in a Lucene index. The first takes a 
> field name, term, and index directory and outputs the document frequency for 
> the term and the total number of occurrences of the term in the index (i.e. 
> the sum of the tf of the term for each document). The second reads the 
> index to determine the top N most frequent terms (by document frequency) and 
> then outputs a list of those terms along with the document frequency and the 
> total number of occurrences of the term. Both utilities are useful for 
> estimating the size of a term's entry in the *.prx files and the consequent 
> disk I/O demands. 
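
As a hedged usage sketch (the argument order here is a guess at the patch's 
usage string and may differ in the committed version):

{noformat}
java -cp lucene-core.jar:lucene-misc.jar \
     org.apache.lucene.misc.HighFreqTerms /path/to/index -t 100 title
{noformat}

With -t, the tool would also report the total term frequency next to the 
document frequency for each of the top 100 terms in the title field.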

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-1911) File descriptor leak while indexing, may cause index corruption

2010-05-13 Thread Simon Rosenthal (JIRA)
File descriptor leak while indexing, may cause index corruption
---

 Key: SOLR-1911
 URL: https://issues.apache.org/jira/browse/SOLR-1911
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 1.5
 Environment: Ubuntu Linux, Java build 1.6.0_16-b01
Solr Specification Version: 3.0.0.2010.05.12.16.17.46
Solr Implementation Version: 4.0-dev exported - simon - 2010-05-12 
16:17:46 -- built from updated trunk
Lucene Specification Version: 4.0-dev
Lucene Implementation Version: 4.0-dev exported - 2010-05-12 16:18:26
Current Time: Thu May 13 12:21:12 EDT 2010
Server Start Time: Thu May 13 11:45:41 EDT 2010
Reporter: Simon Rosenthal
Priority: Critical


While adding documents to an already existing index using this build, the 
number of open file descriptors increases dramatically until the per-process 
open-file limit (1024) is reached, at which point there are error messages in 
the log to that effect. If the server is restarted, the index may be corrupt.

The Solr log reports:

May 13, 2010 12:37:04 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start 
commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
May 13, 2010 12:37:04 PM 
org.apache.solr.update.DirectUpdateHandler2$CommitTracker run
SEVERE: auto commit error...
java.io.FileNotFoundException: /home/simon/rig2/solr/core1/data/index/_j2.nrm 
(Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:69)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:90)
at 
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:80)
at 
org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:67)
at 
org.apache.lucene.index.SegmentReader.openNorms(SegmentReader.java:1093)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:532)
at 
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:634)
at 
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:610)
at 
org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1012)
at 
org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4563)
at 
org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3775)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3623)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3614)
at 
org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1769)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1732)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1696)
at 
org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:230)
at 
org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:181)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
at 
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run(DirectUpdateHandler2.java:602)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
May 13, 2010 12:37:04 PM org.apache.solr.update.processor.LogUpdateProcessor 
finish
INFO: {} 0 1
May 13, 2010 12:37:04 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: directory '/home/simon/rig2/solr/core1/data/index' 
exists and is a directory, but cannot be listed: list() returned null
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:223)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:582)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:535)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:316)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1129)
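
For anyone trying to reproduce this, a quick way to watch the writer's 
descriptor usage on Linux while indexing (PID and index path are placeholders):

{noformat}
watch -n 5 'ls /proc/PID/fd | wc -l'
lsof -p PID | grep data/index | wc -l
{noformat}

If the counts climb steadily across commits instead of plateauing, the leak 
described above is in play.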
 

[jira] Resolved: (LUCENE-2459) FilterIndexReader doesn't work correctly with post-flex SegmentMerger

2010-05-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2459.


Fix Version/s: 4.0
   Resolution: Fixed

Thanks Andrzej!

> FilterIndexReader doesn't work correctly with post-flex SegmentMerger
> -
>
> Key: LUCENE-2459
> URL: https://issues.apache.org/jira/browse/LUCENE-2459
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
> Fix For: 4.0
>
> Attachments: FIRTest.patch, LUCENE-2459.patch
>
>
> IndexWriter.addIndexes(IndexReader...) internally uses SegmentMerger to add 
> data from input index readers. However, SegmentMerger uses the new post-flex 
> API to do this, which bypasses the pre-flex TermEnum/TermPositions API that 
> FilterIndexReader implements. As a result, filtering is not applied.
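
To illustrate the kind of reader this bug bites -- a minimal sketch against the 
pre-flex 3.x-style API (FilterTermEnum is the existing helper inner class; the 
term being hidden is made up):

{code}
import java.io.IOException;
import org.apache.lucene.index.*;

// Hides one term from enumeration via the pre-flex TermEnum API. Before
// this fix, IndexWriter.addIndexes() read the new flex fields() API and
// never hit this override, so the "secret" term leaked into the merged
// segment.
class HideTermReader extends FilterIndexReader {
  HideTermReader(IndexReader in) { super(in); }

  @Override
  public TermEnum terms() throws IOException {
    return new FilterTermEnum(in.terms()) {
      @Override
      public boolean next() throws IOException {
        while (super.next()) {
          if (!"secret".equals(term().text())) return true;
        }
        return false;
      }
    };
  }
}
{code}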

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2449) Improve random testing

2010-05-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2449.


Resolution: Fixed

> Improve random testing
> --
>
> Key: LUCENE-2449
> URL: https://issues.apache.org/jira/browse/LUCENE-2449
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2449-3x.patch, LUCENE-2449-trunk.txt, 
> LUCENE-2449.patch, LUCENE-2449.patch, LUCENE-2449.patch, LUCENE-2449.patch
>
>
> We have quite a few random tests, but there's no way to "crank" them.
> The idea here is to add a multiplier which can be increased via a sysprop. For 
> example, we could set this to something higher than 1 for Hudson.
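
A sketch of how a test could consume such a multiplier (the property name 
"tests.multiplier" is a guess at the convention, not necessarily what the 
patch uses):

{code}
// Scale iteration counts by a sysprop so a CI box can crank the tests up,
// e.g. ant test -Dtests.multiplier=10
int multiplier = Integer.parseInt(System.getProperty("tests.multiplier", "1"));
int numDocs = 1000 * multiplier;
for (int i = 0; i < numDocs; i++) {
  // ... index a randomly generated document ...
}
{code}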

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2459) FilterIndexReader doesn't work correctly with post-flex SegmentMerger

2010-05-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867165#action_12867165
 ] 

Michael McCandless commented on LUCENE-2459:


Thanks Andrzej -- this is indeed a bug.

Your fix (using the flex API on the pre-flex API emulation) is good for now, so 
I think we can commit it to trunk.  But we are going to remove the pre-flex APIs 
(and make the flex APIs abstract) soonish, at which point all FilterIndexReader 
impls on trunk will have to cut over to flex.

I'll commit shortly.

> FilterIndexReader doesn't work correctly with post-flex SegmentMerger
> -
>
> Key: LUCENE-2459
> URL: https://issues.apache.org/jira/browse/LUCENE-2459
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
> Attachments: FIRTest.patch, LUCENE-2459.patch
>
>
> IndexWriter.addIndexes(IndexReader...) internally uses SegmentMerger to add 
> data from input index readers. However, SegmentMerger uses the new post-flex 
> API to do this, which bypasses the pre-flex TermEnum/TermPositions API that 
> FilterIndexReader implements. As a result, filtering is not applied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-05-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867164#action_12867164
 ] 

Michael McCandless commented on LUCENE-2393:


Patch looks good, Tom!  I'll re-merge my small changes from the prior patch, add 
a CHANGES entry, and commit.

I don't think we need to upgrade to a command-line processing lib...

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, 
> LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a pair of command-line utilities that provide information on the 
> total number of occurrences of a term in a Lucene index. The first takes a 
> field name, term, and index directory and outputs the document frequency for 
> the term and the total number of occurrences of the term in the index (i.e. 
> the sum of the tf of the term for each document). The second reads the 
> index to determine the top N most frequent terms (by document frequency) and 
> then outputs a list of those terms along with the document frequency and the 
> total number of occurrences of the term. Both utilities are useful for 
> estimating the size of a term's entry in the *.prx files and the consequent 
> disk I/O demands. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2449) Improve random testing

2010-05-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2449:
---

Attachment: LUCENE-2449-3x.patch
LUCENE-2449-trunk.txt

Patch for trunk & 3x, doing the cloning correctly.

> Improve random testing
> --
>
> Key: LUCENE-2449
> URL: https://issues.apache.org/jira/browse/LUCENE-2449
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2449-3x.patch, LUCENE-2449-trunk.txt, 
> LUCENE-2449.patch, LUCENE-2449.patch, LUCENE-2449.patch, LUCENE-2449.patch
>
>
> We have quite a few random tests, but there's no way to "crank" them.
> The idea here is to add a multiplier which can be increased via a sysprop. For 
> example, we could set this to something higher than 1 for Hudson.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867151#action_12867151
 ] 

Robert Muir commented on LUCENE-2458:
-

{quote}
An attribute that says "these tokens go together" or "these tokens should be 
considered one unit" seems like nice generic functionality, and is unrelated to 
any specific language or search feature.
{quote}

No, if they are one unit for search, they are one token.

Instead, the tokenizer should be fixed so that they are one token, rather than 
making all languages suffer for the shortcomings of a crappy English tokenizer.


> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrasequeries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.
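
Since the description points to PositionFilter as the current workaround, here 
is a minimal sketch of the query-time wrapping it refers to, assuming the 3.x 
analysis API (the tokenizer/filter chain is illustrative):

{code}
import java.io.Reader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.position.PositionFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Query-time analyzer that collapses all tokens onto a single position,
// which stops QueryParser from auto-generating a phrase query.
class NoAutoPhraseAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String field, Reader reader) {
    TokenStream ts = new StandardTokenizer(Version.LUCENE_31, reader);
    ts = new LowerCaseFilter(Version.LUCENE_31, ts);
    return new PositionFilter(ts); // trailing tokens get posIncr = 0
  }
}
{code}

A parser would then be constructed with it, e.g. 
new QueryParser(Version.LUCENE_31, "text", new NoAutoPhraseAnalyzer()).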

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-2449) Improve random testing

2010-05-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-2449:


  Assignee: Michael McCandless

I didn't quite fix this correctly -- left out the clone() in DBLRU when 
promoting the entry!

> Improve random testing
> --
>
> Key: LUCENE-2449
> URL: https://issues.apache.org/jira/browse/LUCENE-2449
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2449.patch, LUCENE-2449.patch, LUCENE-2449.patch, 
> LUCENE-2449.patch
>
>
> We have quite a few random tests, but there's no way to "crank" them.
> The idea here is to add a multiplier which can be increased via a sysprop. For 
> example, we could set this to something higher than 1 for Hudson.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867149#action_12867149
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. We're conflating high level user syntax and the underlying implementation.

Then you have no problem if we form phrase queries for all adjacent English 
words, like we do for Chinese.

Perhaps then you will become aware of how wrong this is: a hack designed to 
make open compounds match hyphenated compounds in English, or whatever it is.

You are conflating English syntax and word formation into the query parser 
itself.


> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrasequeries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1163) Solr Explorer - A generic GWT client for Solr

2010-05-13 Thread Uri Boness (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867148#action_12867148
 ] 

Uri Boness commented on SOLR-1163:
--

@Peter

Try removing the trailing slash in the URL (I know; in the next version I'll 
make sure to handle these edge cases).

@Lance
Indeed, the limitation of the current version is that it uses cross-site 
scripting (JSONP) for the communication with Solr, which limits it to GET 
requests. In the upcoming release, since a server-side layer is added, this 
will be solved and all communication will be done using POST via a proxy 
servlet.

The next version is basically ready... the only things I still have to fix 
are some UI issues with IE and a few with Chrome.

> Solr Explorer - A generic GWT client for Solr
> -
>
> Key: SOLR-1163
> URL: https://issues.apache.org/jira/browse/SOLR-1163
> Project: Solr
>  Issue Type: New Feature
>  Components: web gui
>Affects Versions: 1.3
>Reporter: Uri Boness
> Attachments: graphics.zip, SOLR-1163.zip, SOLR-1163.zip, 
> solr-explorer.patch, solr-explorer.patch
>
>
> The attached patch is a GWT generic client for Solr. It is currently 
> standalone, meaning that once built, one can open the generated HTML file in 
> a browser and communicate with any deployed Solr. It is configured with its 
> own configuration file, where one can configure the Solr instance/core to 
> connect to. Since it's currently standalone and completely client side based, 
> it uses JSON with padding (cross-site scripting) to connect to remote Solr 
> servers. Some of the supported features:
> - Simple query search
> - Sorting - one can dynamically define new sort criteria
> - Search results are rendered very much like Google search results are 
> rendered. It is also possible to view all stored field values for every hit. 
> - Custom hit rendering - It is possible to show thumbnails (images) per hit 
> and also customize a view for a hit based on HTML templates
> - Faceting - one can dynamically define field and query facets via the UI. It 
> is also possible to pre-configure these facets in the configuration file.
> - Highlighting - you can dynamically configure highlighting. It can also be 
> pre-configured in the configuration file
> - Spellchecking - you can dynamically configure spell checking. Can also be 
> done in the configuration file. Supports collation. It is also possible to 
> send "build" and "reload" commands.
> - Data import handler - if used, it is possible to send a "full-import" and 
> "status" command ("delta-import" is not implemented yet, but it's easy to add)
> - Console - For development time, there's a small console which can help to 
> better understand what's going on behind the scenes. One can use it to:
> ** View the client logs
> ** Browse the Solr schema
> ** View a breakdown of the current search context
> ** View a breakdown of the query URL that is sent to Solr
> ** View the raw JSON response returned from Solr
> This client is actually a platform that can be greatly extended for more 
> things. The goal is to have a client where the explorer part is just one view 
> of it. Other future views include: Monitoring, Administration, Query Builder, 
> DataImportHandler configuration, and more...
> To get a better view of what's currently possible, we've set up a public 
> version of this client at: http://search.jteam.nl/explorer. This client is 
> configured with one Solr instance where crawled YouTube movies were indexed. 
> You can also check out a screencast for this deployed client: 
> http://search.jteam.nl/help
> The patch creates a new folder in the contrib directory. Since the patch 
> doesn't contain binaries, an additional zip file is provided that needs to be 
> extracted to add all the required graphics. This module is maven2 based and is 
> configured in such a way that all GWT related tools/libraries are 
> automatically downloaded when the module is compiled. One of the artifacts 
> of the build is a war file which can be deployed in any servlet container.
> NOTE: this client works best on WebKit-based browsers (for performance 
> reasons) but also works on Firefox and IE 7+. That said, it should be taken 
> into account that it is still under development.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867147#action_12867147
 ] 

Yonik Seeley commented on LUCENE-2458:
--

bq. This is why I like the token attr based solution

+1

Although I think it's more general than "de-compounding".
An attribute that says "these tokens go together" or "these tokens should be 
considered one unit" seems like nice generic functionality, and is unrelated to 
any specific language or search feature.


> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrasequeries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867143#action_12867143
 ] 

Yonik Seeley commented on LUCENE-2458:
--

bq. Instead the queryparser should only form phrasequeries when you use double 
quotes, just like the documentation says.

We're conflating high level user syntax and the underlying implementation.

'text:Ready' says "search for the word 'ready' in the field 'text'"... the fact 
that this ends up as an underlying term query of 'text:readi' (after 
lowercasing, stemming, etc.) is not incorrect; it's simply the closest match to 
what the user is asking for, given the details of analysis.  Likewise, a user 
query of 'text:ak-47' may end up as a phrase query of "ak 47" because that's 
the closest representation in the index (the user doesn't necessarily know that 
the analysis of the field splits on dashes).

Likewise, a user query of text:"foo bar" is a high-level way of saying "search 
for the word foo immediately followed by the word bar".  It is *not* saying 
"make a Lucene phrase query object with 2 terms".  Synonyms, common grams, or 
other analysis methods may in fact turn this into a single term query.
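
A concrete (hypothetical) before/after of the behavior being debated, assuming 
a field whose analyzer splits tokens on dashes:

{noformat}
text:ak-47       ->  PhraseQuery   text:"ak 47"    (analyzer split the token)
text:"foo bar"   ->  PhraseQuery   text:"foo bar"  (explicit quotes)
text:foobar      ->  TermQuery     text:foobar
{noformat}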

> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrasequeries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1834) Document level security

2010-05-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867135#action_12867135
 ] 

Karl Wright commented on SOLR-1834:
---

I've attached what I think is the correct code to structure the LCF security 
support as two plugins for this framework.  The first is a security provider, 
the second is a model.  In order to use this with LCF, you still need to set up 
a schema consistent with SOLR-1895, and you would also need the schema addition 
that this framework provides.

The SOLR-1834-with-LCF.patch file is an SVN diff against Solr trunk.  I needed 
to make a number of changes to build.xml to get it to work in the current trunk 
environment.  Also, I needed to comment out the @Override annotations for some 
reason - but still, everything looked good.


> Document level security
> ---
>
> Key: SOLR-1834
> URL: https://issues.apache.org/jira/browse/SOLR-1834
> Project: Solr
>  Issue Type: New Feature
>  Components: SearchComponents - other
>Affects Versions: 1.4
>Reporter: Anders Rask
> Attachments: html.rar, SOLR-1834-with-LCF.patch, SOLR-1834.patch
>
>
> Attached to this issue is a patch that includes a framework for enabling 
> document level security in Solr as a search component. I did this as a 
> Master's thesis project at Findwise in Stockholm, and Findwise has now decided 
> to contribute it back to the community. The component was developed in spring 
> 2009 and has been in use at a customer since the autumn of the same year.
> There is a simple demo application up at 
> http://demo.findwise.se:8880/SolrSecurity/ which also explains more about the 
> component and how to set it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1834) Document level security

2010-05-13 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated SOLR-1834:
--

Attachment: SOLR-1834-with-LCF.patch

> Document level security
> ---
>
> Key: SOLR-1834
> URL: https://issues.apache.org/jira/browse/SOLR-1834
> Project: Solr
>  Issue Type: New Feature
>  Components: SearchComponents - other
>Affects Versions: 1.4
>Reporter: Anders Rask
> Attachments: html.rar, SOLR-1834-with-LCF.patch, SOLR-1834.patch
>
>
> Attached to this issue is a patch that includes a framework for enabling 
> document level security in Solr as a search component. I did this as a 
> Master's thesis project at Findwise in Stockholm, and Findwise has now decided 
> to contribute it back to the community. The component was developed in spring 
> 2009 and has been in use at a customer since the autumn of the same year.
> There is a simple demo application up at 
> http://demo.findwise.se:8880/SolrSecurity/ which also explains more about the 
> component and how to set it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2459) FilterIndexReader doesn't work correctly with post-flex SegmentMerger

2010-05-13 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated LUCENE-2459:
--

Attachment: LUCENE-2459.patch

The test passes with this patch. I'm not completely sure it covers all cases, 
though.

> FilterIndexReader doesn't work correctly with post-flex SegmentMerger
> -
>
> Key: LUCENE-2459
> URL: https://issues.apache.org/jira/browse/LUCENE-2459
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
> Attachments: FIRTest.patch, LUCENE-2459.patch
>
>
> IndexWriter.addIndexes(IndexReader...) internally uses SegmentMerger to add 
> data from input index readers. However, SegmentMerger uses the new post-flex 
> API to do this, which bypasses the pre-flex TermEnum/TermPositions API that 
> FilterIndexReader implements. As a result, filtering is not applied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2459) FilterIndexReader doesn't work correctly with post-flex SegmentMerger

2010-05-13 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated LUCENE-2459:
--

Attachment: FIRTest.patch

Modified unit test to illustrate the problem.

> FilterIndexReader doesn't work correctly with post-flex SegmentMerger
> -
>
> Key: LUCENE-2459
> URL: https://issues.apache.org/jira/browse/LUCENE-2459
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Andrzej Bialecki 
> Attachments: FIRTest.patch
>
>
> IndexWriter.addIndexes(IndexReader...) internally uses SegmentMerger to add 
> data from input index readers. However, SegmentMerger uses the new post-flex 
> API to do this, which bypasses the pre-flex TermEnum/TermPositions API that 
> FilterIndexReader implements. As a result, filtering is not applied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2459) FilterIndexReader doesn't work correctly with post-flex SegmentMerger

2010-05-13 Thread Andrzej Bialecki (JIRA)
FilterIndexReader doesn't work correctly with post-flex SegmentMerger
-

 Key: LUCENE-2459
 URL: https://issues.apache.org/jira/browse/LUCENE-2459
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
 Attachments: FIRTest.patch

IndexWriter.addIndexes(IndexReader...) internally uses SegmentMerger to add 
data from input index readers. However, SegmentMerger uses the new post-flex 
API to do this, which bypasses the pre-flex TermEnum/TermPositions API that 
FilterIndexReader implements. As a result, filtering is not applied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867122#action_12867122
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. When StandardAnalyzer splits a part-number-like token, it should do so as 
well.

I don't think StandardAnalyzer should do any such thing. Maybe in some 
screwed-up search engine biased towards English, where analyzers have to work 
around it, EnglishAnalyzer would do this, but not StandardAnalyzer.

And now you see why this is no solution at all: we will only end up arguing 
about the toggle for this awful hack in more places!

Instead, the tokenizer used for English should tokenize English better, rather 
than hacking *the entire search engine* around it.

> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrasequeries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-236) Field collapsing

2010-05-13 Thread Sergey Shinderuk (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867121#action_12867121
 ] 

Sergey Shinderuk commented on SOLR-236:
---

@Claus
I faced the same issue. Did you find any solution or maybe a workaround?

When collapsing is enabled, numFound is equal to the number of rows requested 
and NOT the total number of distinct documents found.

I applied the latest SOLR-236-trunk.patch to the trunk checked out on the date 
of the patch, because patching the latest revision fails.
Am I doing something wrong?

I want to collapse near-duplicate documents in search results based on a 
document signature. But with this issue I can't paginate through the results, 
because I don't know how many there are.

Besides, an article at 
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ 
shows examples with the correct numFound returned. How can I get it working?

> Field collapsing
> 
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Emmanuel Keller
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.5
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
> field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
> quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
> SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site are collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many consecutive results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling corrections are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867117#action_12867117
 ] 

Uwe Schindler edited comment on LUCENE-2458 at 5/13/10 7:50 AM:


Sorry for intervening,

I am of the same opinion as Hoss:
A lot of people are accustomed to being able to create phrases in search 
engines by joining words with dashes (which StandardAnalyzer handles perfectly 
with the current query parser impl). As quotes are slower to type, I always 
use this approach to search for phrases in Google: this-is-a-phrase, which 
always works and brings results identical to "this is a phrase" (only the 
ranking is sometimes slightly different in Google).

So we should at least have some way to switch on the behavior that creates 
phrase queries out of multiple tokens with posIncr>0 -- but I am +1 on fixing 
the problem for non-whitespace languages like CJK. It's also broken, in my 
opinion, that QueryParser parses whitespace in its javacc grammar; this should 
be done by the analyzer (and not partly by the analyzer and partly by the QP 
grammar).

In addition: I'll just bring up non-compounds like product IDs again...

  was (Author: thetaphi):
Sorry for intervening,

I am of the same opinion as Hoss:
A lot of people are accustomed to being able to create phrases in search 
engines by joining words with dashes (which StandardAnalyzer handles perfectly 
with the current query parser impl). As quotes are slower to type, I always 
use this approach to search for phrases in Google: "this-is-a-phrase", which 
always works and brings results identical to "this is a phrase" (only the 
ranking is sometimes slightly different in Google).

So we should at least have some way to switch on the behavior that creates 
phrase queries out of multiple tokens with posIncr>0 -- but I am +1 on fixing 
the problem for non-whitespace languages like CJK. It's also broken, in my 
opinion, that QueryParser parses whitespace in its javacc grammar; this should 
be done by the analyzer (and not partly by the analyzer and partly by the QP 
grammar).
  
> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second-guess" the tokenizer and piece 
> back together things it shouldn't have split, but for large collections, doing 
> things like generating phrasequeries because StandardTokenizer split a 
> compound on a dash can cause serious performance problems. Instead, people 
> should analyze their text with the appropriate methods, and QueryParser should 
> only generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867117#action_12867117
 ] 

Uwe Schindler commented on LUCENE-2458:
---

Sorry for intervening,

I am of the same opinion as Hoss:
A lot of people are accustomed to being able to create phrases in search 
engines by joining words with dashes (which StandardAnalyzer handles perfectly 
with the current query parser impl). As quotes are slower to type, I always 
use this approach to search for phrases in Google: "this-is-a-phrase", which 
always works and brings results identical to "this is a phrase" (only the 
ranking is sometimes slightly different in Google).

So we should at least have some way to switch on the behavior that creates 
phrase queries out of multiple tokens with posIncr>0 -- but I am +1 on fixing 
the problem for non-whitespace languages like CJK. It's also broken, in my 
opinion, that QueryParser parses whitespace in its javacc grammar; this should 
be done by the analyzer (and not partly by the analyzer and partly by the QP 
grammar).

> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In 
> these cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as some heuristic to "second guess" the tokenizer and piece 
> back things it shouldn't have split, but for large collections, doing things 
> like generating phrasequeries because StandardTokenizer split a compound on a 
> dash can cause serious performance problems. Instead people should analyze 
> their text with the appropriate methods, and QueryParser should only generate 
> phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.
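
For reference, a minimal sketch of the PositionFilter workaround mentioned 
above, assuming the contrib PositionFilter class and some CJK-capable analyzer 
as the delegate:

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.position.PositionFilter;

/** Query-time wrapper: flattens all position increments after the first
 *  token to 0, so QueryParser builds a BooleanQuery of the terms instead
 *  of an implicit PhraseQuery. */
public final class NoPhraseAnalyzer extends Analyzer {
  private final Analyzer delegate;

  public NoPhraseAnalyzer(Analyzer delegate) {
    this.delegate = delegate;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PositionFilter(delegate.tokenStream(fieldName, reader));
  }
}
{code}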




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867112#action_12867112
 ] 

Robert Muir commented on LUCENE-2458:
-

{quote}
This is why I like the token attr based solution
{quote}

I am, and will always be, -1 to this solution. Why can't we try to think about 
Lucene from a proper internationalization architecture perspective?

You shouldn't design APIs around "e-mail" phenomena in English; that's absurd.

{quote}
BTW, this appears to not be an English-only need; this page
(http://www.seobythesea.com/?p=1206) lists these example languages as
also using "English-like" compound words: "Some example languages that
use compound words include: Afrikaans, Danish, Dutch-Flemish, English,
Faroese, Frisian, High German, Gutnish, Icelandic, Low German,
Norwegian, Swedish, and Yiddish."
{quote}

Please don't try to insinuate that phrases are the way to handle compound 
terms in these languages unless you have some actual evidence that they should 
be used instead of "normal decompounding".

These languages have different syntax and word formation, and it's simply not 
appropriate.
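
For context, a minimal sketch of "normal decompounding" with the contrib 
DictionaryCompoundWordTokenFilter; the tiny inline dictionary is made up for 
illustration:

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class DecompoundSketch {
  public static void main(String[] args) throws Exception {
    String[] dict = { "fuss", "ball", "verein" };
    TokenStream ts = new DictionaryCompoundWordTokenFilter(
        new WhitespaceTokenizer(new StringReader("fussballverein")), dict);
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    // Prints the original compound plus each dictionary subword, all
    // indexed at the same position: fussballverein, fuss, ball, verein
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
{code}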






[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain

2010-05-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2450:
---

Attachment: LUCENE-2450.patch

New patch attached.

This patch adds a new pipeline stage called AppendingStage.  You
provide it multiple things to analyze (currently as a String[], but we
can generalize that), and it will step through them one at a time,
logically appending their tokens.

You also give it posIncrGap and offsetGap, which it adds in when
switching to the next value.

I think this is a compelling way to handle fields with multiple
values, and it can make our "decouple indexing from analysis" story
even stronger.

Ie, today the indexer is hardwired to call the analyzer's
getPositionIncrementGap/getOffsetGap.

But with this AppendingStage approach, how multi-valued fields are
appended is purely an analysis detail, hidden from the indexer.  EG you
could make a stage that inserts some kind of marker token on each
field transition, instead.  And since it's a fully pluggable stage,
you're free to move it anywhere (beginning, middle, end) in your
pipeline.
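
For concreteness, a self-contained sketch of the appending logic described 
above; the Token shape and method signature are stand-ins for the patch's 
hypothetical pipeline API, not real Lucene types:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class AppendingStageSketch {
  // Stand-in for whatever token representation the pipeline API ends up with.
  static class Token {
    final String term; final int posIncr, startOffset, endOffset;
    Token(String term, int posIncr, int startOffset, int endOffset) {
      this.term = term; this.posIncr = posIncr;
      this.startOffset = startOffset; this.endOffset = endOffset;
    }
  }

  /** Logically append the token streams of multiple field values, adding
   *  posIncrGap/offsetGap on each transition to the next value. */
  static List<Token> append(List<List<Token>> values,
                            int posIncrGap, int offsetGap) {
    List<Token> out = new ArrayList<Token>();
    int offsetBase = 0;
    boolean firstValue = true;
    for (List<Token> value : values) {
      boolean firstToken = true;
      int maxEndOffset = 0;
      for (Token t : value) {
        int incr = t.posIncr;
        if (!firstValue && firstToken) {
          incr += posIncrGap;  // the gap between consecutive field values
        }
        out.add(new Token(t.term, incr,
            t.startOffset + offsetBase, t.endOffset + offsetBase));
        maxEndOffset = Math.max(maxEndOffset, t.endOffset);
        firstToken = false;
      }
      offsetBase += maxEndOffset + offsetGap;
      firstValue = false;
    }
    return out;
  }
}
{code}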


> Explore write-once attr bindings in the analysis chain
> --
>
> Key: LUCENE-2450
> URL: https://issues.apache.org/jira/browse/LUCENE-2450
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael McCandless
> Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
>
>
> I'd like to propose a new means of tracking attrs through the analysis
> chain, whereby a given stage in the pipeline cannot overwrite attrs
> from stages before it (write once).  It can only write to new attrs
> (possibly w/ the same name) that future stages can see; it can never
> alter the attrs or bindings from the prior stages.
> I coded up a prototype chain in python (I'll attach), showing the
> equivalent of WhitespaceTokenizer -> StopFilter -> SynonymFilter ->
> Indexer.
> Each stage "sees" a frozen namespace of attr bindings as its input;
> these attrs are all read-only from its standpoint.  Then, it writes to
> an "output namespace", which is read/write, eg it can add new attrs,
> remove attrs from its input, change the values of attrs.  If that
> stage doesn't alter a given attr it "passes through", unchanged.
> This would be an enormous change to how attrs are managed... so this
> is very very exploratory at this point.  Once we decouple indexer from
> analysis, creating such an alternate chain should be possible -- it'd
> at least be a good test that we've decoupled enough :)
> I think the idea offers some compelling improvements over the "global
> read/write namespace" (AttrFactory) approach we have today:
>   * Injection filters can be more efficient -- they need not
> capture/restoreState at all
>   * No more need for the initial tokenizer to "clear all attrs" --
> each stage becomes responsible for clearing the attrs it "owns"
>   * You can truly stack stages (vs having to make a custom
> AttrFactory) -- eg you could make a Bocu1 stage which can stack
> onto any other stage.  It'd look up the CharTermAttr, remove it
> from its output namespace, and add a BytesRefTermAttr.
>   * Indexer should be more efficient, in that it doesn't need to
> re-get the attrs on each next() -- it gets them up front, and
> re-uses them.
> Note that in this model, the indexer itself is just another stage in
> the pipeline, so you could do some wild things like use 2 indexer
> stages (writing to different indexes, or maybe the same index but
> somehow with further processing or something).
> Also, in this approach, the analysis chain is more informed about the
> what each stage is allowed to change, up front after the chain is
> created.  EG (say) we will know that only 2 stages write to the term
> attr, and that only 1 writes posIncr/offset attrs, etc.  Not sure
> if/how this helps us... but it's more strongly typed than what we have
> today.
> I think we could use a similar chain for processing a document at the
> field level, ie, different stages could add/remove/change different
> fields in the doc
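
For concreteness, a self-contained sketch of the write-once idea, using plain 
maps as stand-ins for the hypothetical attr namespaces (nothing here is a real 
Lucene type):

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class WriteOnceStageSketch {
  /** One stage: reads a frozen input namespace, writes an output namespace.
   *  Unchanged bindings pass through; the input is never mutated. */
  static Map<String, Object> runStage(Map<String, Object> input) {
    Map<String, Object> frozen = Collections.unmodifiableMap(input);
    // Output starts as a pass-through copy of the input bindings...
    Map<String, Object> output = new HashMap<String, Object>(frozen);
    // ...then this stage may add/remove/rebind attrs in its own output,
    // e.g. the "Bocu1 stage" from the description: drop the char term,
    // add a byte term.
    String charTerm = (String) frozen.get("charTerm");
    output.remove("charTerm");
    output.put("bytesTerm", encode(charTerm));
    return output;
  }

  // Stand-in for a real BOCU-1 encoder.
  static byte[] encode(String s) {
    return s.getBytes();
  }
}
{code}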




[jira] Commented: (LUCENE-2450) Explore write-once attr bindings in the analysis chain

2010-05-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867104#action_12867104
 ] 

Uwe Schindler commented on LUCENE-2450:
---

This sounds interesting. Currently I only see problems with multi-interface 
attr implementations, but we can simply drop those for trunk. I have to go 
through the patch to understand better how it is intended to work.





[jira] Closed: (LUCENE-2448) Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Incompatible format version: 2 expected 1 or lower

2010-05-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed LUCENE-2448.
-

Resolution: Invalid

This is not a bug.

Lucene 3.0 changed the stored fields to no longer support compressed fields. To 
mark a stored fields file that has already been "converted" (had compression 
removed), the version is upgraded to the 3.0 one. Lucene 2.9 is then no longer 
able to read the index because of the upgraded version.

Theoretically it could: when I implemented the stored field upgrade with 
Michael Busch, I thought about adding support for the higher version as an 
"alias" in 2.9, but the release schedule for 2.9 was too fast. The best 
solution would have been to force 2.9 to *always* write stored field segments 
with the old version, while still being able to *read* the new version 
signature.
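
For illustration, the shape of the check that produces this exception; the 
constant name and value here are illustrative, not copied from FieldsReader:

{code:java}
import org.apache.lucene.index.CorruptIndexException;

public class FormatCheckSketch {
  // Illustrative: the newest stored-fields format this release understands.
  static final int FORMAT_CURRENT = 1;

  static void checkFormat(int format) throws CorruptIndexException {
    // A 3.0-written (version 2) stored fields file trips this check in 2.9.
    if (format > FORMAT_CURRENT) {
      throw new CorruptIndexException("Incompatible format version: " + format
          + " expected " + FORMAT_CURRENT + " or lower");
    }
  }
}
{code}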

> Exception in thread "main" org.apache.lucene.index.CorruptIndexException: 
> Incompatible format version: 2 expected 1 or lower
> 
>
> Key: LUCENE-2448
> URL: https://issues.apache.org/jira/browse/LUCENE-2448
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9.1
> Environment: Windows Vista Home Premium, Lucene version 2.9.1, JRE6
>Reporter: Bill Herbert
> Attachments: _0.cfs, _0.cfx, AddressBookSearcher.java, segments.gen, 
> segments_2
>
>
> The attached code is intended to search the contents of an indexed file.  
> Upon execution, it generates the following stacktrace.  I would appreciate any 
> assistance in interpreting and correcting this error.  Also, how should I 
> address the warning about the deprecated API?
> Thanks,  Bill
> C:\lucene-3.0.1\src>javac AddressBookSearcher.java
> Note: AddressBookSearcher.java uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> C:\lucene-3.0.1\src>java AddressBookSearcher
> Exception in thread "main" org.apache.lucene.index.CorruptIndexException: 
> Incompatible format version: 2 expected 1 or lower
> at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:117)
> at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:277)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:599)
> at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:104)
> at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:76)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:243)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:222)
> at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:62)
> at AddressBookSearcher.main(AddressBookSearcher.java:22)




[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present

2010-05-13 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867094#action_12867094
 ] 

Paul Elschot commented on LUCENE-1252:
--

LUCENE-2410 has solved this partially for PhraseQuery/PhraseScorer by computing 
only the first matching phrase to determine a possible match, and by delaying 
the computation of the remaining matches until score() is called.
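
For illustration, a self-contained sketch of that lazy evaluation; the types 
and method names are illustrative, not the actual LUCENE-2410 code:

{code:java}
abstract class LazyPhraseScorerSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private int freq = -1;  // -1: only the first occurrence has been verified

  /** Advance to the next doc with at least one phrase occurrence: finding
   *  a single match is enough to know the doc matches. */
  int nextDoc() {
    int doc = advanceToCandidateDoc();
    while (doc != NO_MORE_DOCS && !findFirstMatch(doc)) {
      doc = advanceToCandidateDoc();
    }
    freq = -1;
    return doc;
  }

  float score() {
    if (freq == -1) {
      // Only now pay for counting the remaining occurrences.
      freq = 1 + countRemainingMatches();
    }
    return tf(freq);
  }

  abstract int advanceToCandidateDoc();
  abstract boolean findFirstMatch(int doc);
  abstract int countRemainingMatches();
  abstract float tf(int freq);
}
{code}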

> Avoid using positions when not all required terms are present
> -
>
> Key: LUCENE-1252
> URL: https://issues.apache.org/jira/browse/LUCENE-1252
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
>
> In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, 
> currently next() and skipTo() will use position information even when other 
> parts of the query cannot match because some required terms are not present.
> This could be avoided by adding some methods to Scorer that relax the 
> postcondition of next() and skipTo() to something like "all required terms 
> are present, but no position info was checked yet", and implementing these 
> methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and 
> SpanScorer/NearSpans.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867093#action_12867093
 ] 

Michael McCandless commented on LUCENE-2458:


I'd like a solution that lets us have our cake and eat it too...

Ie, we clearly have to fix the disastrous out-of-the-box experience
that non-whitespace languages (CJK) now have with Lucene.  This is
clear.

But, when an analyzer that splits English-like compound words (eg
e-mail -> e mail) is used, I think this should also continue to create
a PhraseQuery, out-of-the-box.

Today when a user searches for "e-mail", s/he will correctly see only
"email"/"e-mail" hits highlighted in the search results.  If we break
this behaviour, ie no longer produce a PhraseQuery out-of-the-box,
suddenly hits with just "mail" will be returned, which is bad.

So a single setter on QueryParser w/ a global default is not a good
enough solution -- it means either CJK or English-like compound words
will be bad.

This is why I like the token attr based solution -- those analyzers
that are doing "English-like" de-compounding can mark the tokens as
such.  Then QueryParser can notice this attr and (if configured to do so, via
setter), create a PhraseQuery out of that sequence of tokens.

This then pushes the decision of which series of tokens were produced
via "English-like" de-compounding into the analyzer.  EG I think
WordDelimiterFilter should by default mark its tokens as such (the
majority of users use it this way).  When StandardAnalyzer splits a
part-number-like token, it should do so as well.
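
For concreteness, a sketch of what such a marker attribute could look like; 
this interface does not exist in Lucene, it is purely hypothetical:

{code:java}
import org.apache.lucene.util.Attribute;

/** Hypothetical marker attribute: set by analyzers on tokens produced by
 *  "English-like" de-compounding (eg e-mail -> e, mail), so QueryParser
 *  can (if configured via a setter) turn the run into a PhraseQuery. */
public interface DecompoundedAttribute extends Attribute {
  void setDecompounded(boolean decompounded);
  boolean isDecompounded();
}
{code}

QueryParser would then inspect this attribute while consuming the TokenStream 
and emit a PhraseQuery only for runs of marked tokens, leaving everything else 
as plain term queries.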

This isn't a perfect solution: it's not easy, in general, for an
analyzer to "know" its splits are "English-like" de-compounding, but
this would still give us a solid step forward (progress not
perfection).  And, since the decision point is now in the analyzer,
per-token, it gives users complete flexibility to customize as needed.

BTW, this appears to not be an English-only need; this page
(http://www.seobythesea.com/?p=1206) lists these example languages as
also using "English-like" compound words: "Some example languages that
use compound words include: Afrikaans, Danish, Dutch-Flemish, English,
Faroese, Frisian, High German, Gutnish, Icelandic, Low German,
Norwegian, Swedish, and Yiddish."







[jira] Commented: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt

2010-05-13 Thread Martin Blech (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867087#action_12867087
 ] 

Martin Blech commented on LUCENE-2239:
--

It is causing an issue in a JAX-RS application that uses Sun's Jersey reference 
implementation and is deployed on the Grizzly servlet container. Apparently, 
Grizzly's ThreadPool implementation uses Thread.interrupt() extensively.
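
For reference, a minimal self-contained demonstration of the underlying NIO 
limitation: interrupting a thread that is using a FileChannel closes the 
channel, for every other thread too:

{code:java}
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class InterruptDemo {
  public static void main(String[] args) throws Exception {
    // args[0]: path to any readable file
    final FileChannel ch = new RandomAccessFile(args[0], "r").getChannel();
    Thread reader = new Thread() {
      @Override public void run() {
        try {
          ch.read(ByteBuffer.allocate(1024), 0);
        } catch (Exception e) {
          // Typically java.nio.channels.ClosedByInterruptException -- and
          // the channel is now closed for all threads, which is what bites
          // NIOFSDirectory users when a container interrupts pooled threads.
          System.out.println(e);
        }
      }
    };
    reader.start();
    reader.interrupt();  // simulate a thread-pool interrupt, as in Grizzly
    reader.join();
  }
}
{code}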

> Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
> --
>
> Key: LUCENE-2239
> URL: https://issues.apache.org/jira/browse/LUCENE-2239
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Store
>Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
> Attachments: LUCENE-2239.patch
>
>
> I created this issue as a spin off from 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e
> We should decide what to do with NIOFSDirectory, whether we want to keep it as 
> the default on non-Windows platforms, and how we want to document this.




[jira] Commented: (SOLR-1163) Solr Explorer - A generic GWT client for Solr

2010-05-13 Thread Peter Sturge (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867080#action_12867080
 ] 

Peter Sturge commented on SOLR-1163:


Hi Uri,

The configuration for URL connections in {solr-explorer.xml} looks pretty 
straightforward. For the example instance, there is no 'named' core, so I used 
an empty string here. 
For my own core, I used the name ('active'), and the URLs work fine when put 
straight into a browser:

So, this works:
{   http://localhost:8983/solr/active/select/?&q=*:*&wt=json&indent=on}


But this gives the JSON timeout error:
{code:title=solr-explorer.xml [excerpt]|borderStyle=solid}


http://localhost:8983/solr/"/>


*:*



name
features


{code}

(in Firefox 3.6)

Thanks,
Peter


> Solr Explorer - A generic GWT client for Solr
> -
>
> Key: SOLR-1163
> URL: https://issues.apache.org/jira/browse/SOLR-1163
> Project: Solr
>  Issue Type: New Feature
>  Components: web gui
>Affects Versions: 1.3
>Reporter: Uri Boness
> Attachments: graphics.zip, SOLR-1163.zip, SOLR-1163.zip, 
> solr-explorer.patch, solr-explorer.patch
>
>
> The attached patch is a GWT generic client for solr. It is currently 
> standalone, meaning that once built, one can open the generated HTML file in 
> a browser and communicate with any deployed solr. It is configured with its 
> own configuration file, where one can configure the solr instance/core to 
> connect to. Since it's currently standalone and completely client side based, 
> it uses JSON with padding (JSONP, for cross-site requests) to connect to 
> remote solr servers. Some of the supported features:
> - Simple query search
> - Sorting - one can dynamically define new sort criteria
> - Search results are rendered very much like Google search results are 
> rendered. It is also possible to view all stored field values for every hit. 
> - Custom hit rendering - It is possible to show thumbnails (images) per hit 
> and also customize a view for a hit based on HTML templates
> - Faceting - one can dynamically define field and query facets via the UI. It 
> is also possible to pre-configure these facets in the configuration file.
> - Highlighting - you can dynamically configure highlighting. It can also be 
> pre-configured in the configuration file
> - Spellchecking - you can dynamically configure spell checking. Can also be 
> done in the configuration file. Supports collation. It is also possible to 
> send "build" and "reload" commands.
> - Data import handler - if used, it is possible to send a "full-import" and 
> "status" command ("delta-import" is not implemented yet, but it's easy to add)
> - Console - For development time, there's a small console which can help to 
> better understand what's going on behind the scenes. One can use it to:
> ** view the client logs
> ** browse the solr schema
> ** view a break down of the current search context
> ** view a break down of the query URL that is sent to solr
> ** view the raw JSON response returning from Solr
> This client is actually a platform that can be greatly extended for more 
> things. The goal is to have a client where the explorer part is just one view 
> of it. Other future views include: Monitoring, Administration, Query Builder, 
> DataImportHandler configuration, and more...
> To get a better view of what's currently possible, we've set up a public 
> version of this client at: http://search.jteam.nl/explorer. This client is 
> configured with one solr instance where crawled YouTube movies were indexed. 
> You can also check out a screencast for this deployed client: 
> http://search.jteam.nl/help
> The patch creates a new folder in the contrib directory. Since the patch 
> doesn't contain binaries, an additional zip file is provided that needs to be 
> extracted to add all the required graphics. This module is maven2 based and is 
> configured in such a way that all GWT related tools/libraries are 
> automatically downloaded when the module is compiled. One of the artifacts 
> of the build is a war file which can be deployed in any servlet container.
> NOTE: this client works best on WebKit based browsers (for performance 
> reasons) but also works on Firefox and IE 7+. That said, it should be taken 
> into account that it is still under development.




Hudson build is back to normal : Lucene-3.x #10

2010-05-13 Thread Apache Hudson Server
See 


