[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113912#comment-13113912
 ] 

Chris Male commented on LUCENE-1536:


Actually, the more I look at the nocommits, the less I like what I've 
suggested.  I think having getRandomAccessBits as it is in the patch is fine.  
But I think we should maybe make setLiveDocsOnly and 
setAllowRandomAccessFiltering first-class features of the Bits interface.

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Prettify JS and CSS excluded from Javadocs

2011-09-24 Thread Shai Erera
Hi Steve,

As I noted before, jarring prettify won't solve the problem entirely, as the
references in the HTML will point to an incorrect location.

Why don't we just package prettify.js in the jar like we do with
stylesheet+prettify.css? We only need this .js as we only write in Java ...

It's a tiny file, and it will simplify the whole process. What do you think?

Shai

On Thu, Sep 22, 2011 at 5:05 PM, Steven A Rowe sar...@syr.edu wrote:

 The patch I gave for lucene/contrib-build.xml's javadocs target was wrong
 (I placed the <nested> tag outside of the <jarify> invocation).  Here's a
 fixed patch:

 Index: lucene/contrib/contrib-build.xml
 ===================================================================
 --- lucene/contrib/contrib-build.xml    (revision 1165174)
 +++ lucene/contrib/contrib-build.xml    (revision )
 @@ -95,7 +95,11 @@
          <packageset dir="${src.dir}"/>
        </sources>
      </invoke-javadoc>
 -  <jarify basedir="${javadoc.dir}/contrib-${name}"
 destfile="${build.dir}/${final.name}-javadoc.jar"/>
 +  <jarify basedir="${javadoc.dir}/contrib-${name}"
 destfile="${build.dir}/${final.name}-javadoc.jar">
 +    <nested>
 +      <fileset dir="${prettify.dir}"/>
 +    </nested>
 +   </jarify>
   </sequential>
    </target>




[jira] [Commented] (LUCENE-3452) The native FS lock used in test-framework's o.a.l.util.LuceneJUnitResultFormatter prohibits testing on a multi-user system

2011-09-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113932#comment-13113932
 ] 

Uwe Schindler commented on LUCENE-3452:
---

bq. On the first pass, the build hung in the middle of the lucene core tests - 
I killed the process after half an hour with no output. I restarted the tests, 
and the build made it through the Lucene tests, but then at least one Solr core 
test failed.

We had this several times on Jenkins, too. I killed the JVM approx. 4 times in 
the last 2 weeks.

 The native FS lock used in test-framework's 
 o.a.l.util.LuceneJUnitResultFormatter prohibits testing on a multi-user system
 --

 Key: LUCENE-3452
 URL: https://issues.apache.org/jira/browse/LUCENE-3452
 Project: Lucene - Java
  Issue Type: Bug
  Components: general/test
Affects Versions: 3.4, 4.0
Reporter: Steven Rowe
Priority: Minor
 Attachments: LUCENE-3452.patch


 {{LuceneJUnitResultFormatter}} uses a lock to buffer test suites' output, so 
 that when run in parallel, they don't interrupt each other when they are 
 displayed on the console.
 The current implementation uses a fixed directory ({{lucene_junit_lock/}} in 
 {{java.io.tmpdir}}, by default {{/tmp/}} on Unix/Linux systems) as the 
 location of this lock.  This functionality was introduced in SOLR-1835.
 As Shawn Heisey reported on SOLR-2739, some tests fail when run as root, but 
 succeed when run as a non-root user.  
 On #lucene IRC today, Shawn wrote:
 {quote}
 (2:06:07 PM) elyograg: Now that I know I can't run the tests as root, I have 
 discovered /tmp/lucene_junit_lock.  Once you run the tests as user A, you 
 cannot run them again as user B until that directory is deleted, and only 
 root or the original user can do so.
 {quote}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113957#comment-13113957
 ] 

Michael McCandless commented on LUCENE-3449:


bq. No matter what, this loop needs 2 ifs per cycle

Duh, I was wrong about this!

We just need to change the sentinel value returned by nextSetBit when
there is no next set bit, from -1 to MAX_INT.  In fact we did this for
DISI.nextDoc, for the same reason (saves an if per cycle).

Then you just rotate the loop:

{noformat}
  if (bits.length() != 0) {
int bit = bits.nextSetBit(0);
final int limit = bits.length()-1;
    while (bit < limit) {
  // ...do something with bit...
  bit = bits.nextSetBit(1+bit);
}

if (bit == bits.length()-1) {
  // ...do something with bit...
}
  }
{noformat}

 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-09-24 Thread hadas raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113958#comment-13113958
 ] 

hadas raviv commented on LUCENE-2959:
-

Hi,

First of all, I would like to thank you for the great contribution you made by 
adding the state of the art ranking methods to lucene. I was waiting for these  
features for a long time, since they enable an IR researcher like me to use 
lucene, which is a powerful tool, for research purposes.

I downloaded the latest version of lucene trunk and played a little with the 
models you implemented. There is a question I have and I would really appreciate 
your answer (my apologies in advance - I'm new to lucene, so maybe this question 
is trivial for you):

I saw that you didn't change the default implementation of lucene for coding 
the document length which is used for ranking in language models (one byte for 
coding the document length together with boosting). Why did you decide that? Is 
it possible to save the real document length coded in some other way (maybe 
with the new flexible index)? Is there any example for such an implementation? 
It is just that I'm concerned with the effect of using an inaccurate document 
length on results quality. Did you check this issue?

In addition - do you know about intentions to implement some more advanced 
ranking models (such as relevance models, mrf) in the near future?

Thanks in advance,
Hadas

 [GSoC] Implementing State of the Art Ranking for Lucene
 ---

 Key: LUCENE-2959
 URL: https://issues.apache.org/jira/browse/LUCENE-2959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/query/scoring, general/javadocs, modules/examples
Reporter: David Mark Nemeskey
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: flexscoring branch, 4.0

 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, 
 LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, 
 implementation_plan.pdf, proposal.pdf


 Lucene employs the Vector Space Model (VSM) to rank documents, which compares 
 unfavorably to state of the art algorithms, such as BM25. Moreover, the 
 architecture is tailored specifically to VSM, which makes the addition of new 
 ranking functions a non-trivial task.
 This project aims to bring state of the art ranking methods to Lucene and to 
 implement a query architecture with pluggable ranking functions.
 The wiki page for the project can be found at 
 http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery

2011-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113960#comment-13113960
 ] 

Michael McCandless commented on LUCENE-3451:


Patch looks great Uwe!

Nice catch on the analyzers removing stop words and then making an all MUST_NOT 
BQ.  But, I think we should throw an exception in this case, since it's a 
horrible trap now?  User will get 0 results but that's flat out silently wrong?

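(For concreteness, the kind of check being suggested might look roughly like the sketch below; this is illustrative only, not the actual patch.)

{code:java}
// Sketch only: reject a BooleanQuery whose clauses are all MUST_NOT.
boolean allProhibited = !query.clauses().isEmpty();
for (BooleanClause clause : query.clauses()) {
  if (!clause.isProhibited()) {
    allProhibited = false;
    break;
  }
}
if (allProhibited) {
  throw new UnsupportedOperationException("pure negative BooleanQuery is not supported");
}
{code}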
 Remove special handling of pure negative Filters in BooleanFilter, disallow 
 pure negative queries in BooleanQuery
 -

 Key: LUCENE-3451
 URL: https://issues.apache.org/jira/browse/LUCENE-3451
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch


 We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows 
 pure negative Filter clauses. This is not supported by BooleanQuery and 
 confuses users (I think that's the problem in LUCENE-3450).
 The hack is buggy, as it does not respect deleted documents and returns them 
 in its DocIdSet.
 Also we should think about disallowing pure-negative Queries entirely and 
 throwing UOE.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113961#comment-13113961
 ] 

Dawid Weiss commented on LUCENE-3449:
-

Sorry, but my gut feeling says no to this loop logic. It just seems strangely 
complicated. If adherence to BitSet is not an issue why not:

{code}
for (int i = bs.firstSetBit(); i >= 0; i = bs.nextSetBitAfter(i)) {
{code}

this seems clearer on method naming, has a single if... and I think could be 
implemented nearly identically to what's already in the code. We can run 
microbenchmarks for fun and see what comes out better and by what margin. 

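(For illustration only, a minimal sketch of how a nextSetBitAfter(int) along those lines might look over a long[] backing array, returning -1 when exhausted. This is not the existing FixedBitSet code; the 64-bits-per-word layout and the assumption that bits at index >= numBits are never set are just the usual fixed bit set conventions.)

{code:java}
// Sketch, not the actual FixedBitSet implementation.
static int nextSetBitAfter(long[] words, int numBits, int index) {
  int i = index + 1;
  if (i >= numBits) {
    return -1;                         // out-of-range argument is legal here
  }
  int wordIndex = i >> 6;
  long word = words[wordIndex] >>> i;  // Java masks the long shift count to i & 63
  if (word != 0) {
    return i + Long.numberOfTrailingZeros(word);
  }
  for (wordIndex++; wordIndex < words.length; wordIndex++) {
    if (words[wordIndex] != 0) {
      return (wordIndex << 6) + Long.numberOfTrailingZeros(words[wordIndex]);
    }
  }
  return -1;
}
{code}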
 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113965#comment-13113965
 ] 

Michael McCandless commented on LUCENE-1536:


bq. But I think we should maybe make setLiveDocsOnly and 
setAllowRandomAccessFiltering first-class features of the Bits interface.

Hmm... that also makes me a bit nervous ;) Bits is too low-level for
these concepts?  I.e., whether a filter/DIS has already folded in live docs,
and whether the filter/DIS is best applied by iteration vs by
random access, are higher-level filter concepts, not low-level Bits
concepts, I think?

Also, Bits by definition is random-access so I don't think it should
have set/getAllowRandomAccessFiltering.

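(To make the distinction concrete, a rough sketch of applying a filter by random access "up high" in the search loop versus the iterator path; the getRandomAccessBits accessor and the context/scorer/collector variables here are placeholders, not the patch's actual API.)

{code:java}
// Sketch only -- hypothetical accessor, not the patch's API.
Bits acceptDocs = filter.getRandomAccessBits(context);
if (acceptDocs != null) {
  // random access: check each scored doc directly in the main search loop
  int doc;
  while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    if (acceptDocs.get(doc)) {
      collector.collect(doc);
    }
  }
} else {
  // iterator path: advance the filter's DocIdSetIterator in lock-step with the scorer
}
{code}
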

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3435) Create a Size Estimator model for Lucene and Solr

2011-09-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113967#comment-13113967
 ] 

Grant Ingersoll commented on LUCENE-3435:
-

A patch would be great for all of these things.  Thanks!

 Create a Size Estimator model for Lucene and Solr
 -

 Key: LUCENE-3435
 URL: https://issues.apache.org/jira/browse/LUCENE-3435
 Project: Lucene - Java
  Issue Type: Task
  Components: core/other
Affects Versions: 4.0
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor

 It is often handy to be able to estimate the amount of memory and disk space 
 that both Lucene and Solr use, given certain assumptions.  I intend to check 
 in an Excel spreadsheet that allows people to estimate memory and disk usage 
 for trunk.  I propose to put it under dev-tools, as I don't think it should 
 be official documentation just yet and like the IDE stuff, we'll see how well 
 it gets maintained.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113977#comment-13113977
 ] 

Robert Muir commented on LUCENE-2959:
-

{quote}
I saw that you didn't change the default implementation of lucene for coding 
the document length which is used for ranking in language models (one byte for 
coding the document length together with boosting). Why did you decide that?
{quote}

So that you can switch between ranking models without re-indexing.

{quote}
It is just that I'm concerned with the effect of using an inaccurate document 
length on results quality. Did you check this issue?
{quote}

I ran experiments on this a long time ago; the changes were not statistically 
significant.
But, there is an issue open to still switch norms to docvalues fields, for 
other reasons: LUCENE-3221

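(For background, a minimal sketch of the lossy one-byte norm round-trip being referred to; the boost/numTerms variable names are illustrative, while SmallFloat is the utility Lucene's default similarity used for norm encoding at the time.)

{code:java}
// Sketch: boost/sqrt(numTerms) squeezed into a single byte, so the
// decoded document length information is approximate, not exact.
float norm = boost * (float) (1.0 / Math.sqrt(numTerms));
byte encoded = SmallFloat.floatToByte315(norm);
float decoded = SmallFloat.byte315ToFloat(encoded);
{code}
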
{quote}
In addition - do you know about intentions to implement some more advanced 
ranking models (such as relevance models, mrf) in the near future?
{quote}

No, there won't be any additional work on this issue; GSoC is over. 






 [GSoC] Implementing State of the Art Ranking for Lucene
 ---

 Key: LUCENE-2959
 URL: https://issues.apache.org/jira/browse/LUCENE-2959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/query/scoring, general/javadocs, modules/examples
Reporter: David Mark Nemeskey
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: flexscoring branch, 4.0

 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, 
 LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, 
 implementation_plan.pdf, proposal.pdf


 Lucene employs the Vector Space Model (VSM) to rank documents, which compares 
 unfavorably to state of the art algorithms, such as BM25. Moreover, the 
 architecture is tailored specifically to VSM, which makes the addition of new 
 ranking functions a non-trivial task.
 This project aims to bring state of the art ranking methods to Lucene and to 
 implement a query architecture with pluggable ranking functions.
 The wiki page for the project can be found at 
 http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-24 Thread sebastian L. (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sebastian L. updated LUCENE-3440:
-

Fix Version/s: 4.0
Affects Version/s: 4.0

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, 
 LUCENE-4.0-SNAPSHOT-3440-3.patch


 The FastVectorHighlighter gives every term found in a fragment an equal 
 weight, which causes fragments with a high number of words or, in the worst 
 case, a high number of very common words to rank higher than fragments that 
 contain *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight += IDF of each unique term per fragment * boost of query. 
 The ranking formula should be the same as, or at least similar to, the one 
 used in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragments-scoring into a 
 separate class.
 - Switch scoring via parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not 
 - edismax/dismax-parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113984#comment-13113984
 ] 

Chris Male commented on LUCENE-3312:


Back on this wagon for a bit.

Just wondering about whether we need a StorableFieldType to accompany 
StorableField.

At this stage I'm struggling to identify candidate properties for a 
StorableFieldType.  Options include moving the Numeric.DataType and DocValues' 
ValueType to the FieldType.  While I sort of like this idea, it seems to have a 
couple of disadvantages:

- Any FieldTypes passed into NumericField and IndexDocValuesField would have to 
have these properties set from the beginning.  For both of these, this would 
mean it wouldn't be possible to simply initialize a field and then use one of 
the setters to define the Data/ValueType - they would need to be known at 
construction.
- It separates the 'data type' from the actual value.

If these properties were to stay on StorableFieldType, I can't really see the 
need for a StorableFieldType.

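(As a strawman for this discussion, one possible minimal shape for StorableField without any companion type; these names and methods are hypothetical, not the field type branch's actual API.)

{code:java}
// Hypothetical sketch only: a stored-only field view, where the concrete
// value accessors make a separate StorableFieldType largely unnecessary.
public interface StorableField {
  String name();
  String stringValue();     // null if the value is not a string
  BytesRef binaryValue();   // null if the value is not binary
  Number numericValue();    // null if the value is not numeric
}
{code}
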
 Break out StorableField from IndexableField
 ---

 Key: LUCENE-3312
 URL: https://issues.apache.org/jira/browse/LUCENE-3312
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
 Fix For: Field Type branch


 In the field type branch we have strongly decoupled
 Document/Field/FieldType impl from the indexer, by having only a
 narrow API (IndexableField) passed to IndexWriter.  This frees apps up
 to use their own documents instead of the user-space impls we provide
 in oal.document.
 Similarly, with LUCENE-3309, we've done the same thing on the
 doc/field retrieval side (from IndexReader), with the
 StoredFieldsVisitor.
 But, maybe we should break out StorableField from IndexableField,
 such that when you index a doc you provide two Iterables -- one for the
 IndexableFields and one for the StorableFields.  Either can be null.
 One downside is a possible perf hit for fields that are both indexed & 
 stored (ie, we visit them twice, lookup their name in a hash twice,
 etc.).  But the upside is a cleaner separation of concerns in the API.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (LUCENE-3312) Break out StorableField from IndexableField

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113984#comment-13113984
 ] 

Chris Male edited comment on LUCENE-3312 at 9/24/11 2:25 PM:
-

Back on this wagon for a bit.

Just wondering about whether we need a StorableFieldType to accompany 
StorableField.

At this stage I've struggling to identify candidate properties for a 
StorableFieldType.  Options include moving the Numeric.DataType and DocValues' 
ValueType to the FieldType.  While I sort of like this idea, it seems to have a 
couple of disadvantages:

- Any FieldTypes passed into NumericField and IndexDocValuesField would have to 
have these properties set from the beginning.  For both of these, this would 
mean it wouldn't be possible to simple initialize a field and then use one of 
the setters to define the Data/ValueType - they would need to be known at 
construction.
- It separates the 'data type' away from the actual value.

If these properties were to stay on StorableField, I can't really see the need 
for a StorableFieldType.

  was (Author: cmale):
Back on this wagon for a bit.

Just wondering about whether we need a StorableFieldType to accompany 
StorableField.

At this stage I've struggling to identify candidate properties for a 
StorableFieldType.  Options include moving the Numeric.DataType and DocValues' 
ValueType to the FieldType.  While I sort of like this idea, it seems to have a 
couple of disadvantages:

- Any FieldTypes passed into NumericField and IndexDocValuesField would have to 
have these properties set from the beginning.  For both of these, this would 
mean it wouldn't be possible to simple initialize a field and then use one of 
the setters to define the Data/ValueType - they would need to be known at 
construction.
- It separates the 'data type' away from the actual value.

If these properties were to stay on StorableFieldType, I can't really see the 
need for a StorableFieldType.
  
 Break out StorableField from IndexableField
 ---

 Key: LUCENE-3312
 URL: https://issues.apache.org/jira/browse/LUCENE-3312
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
 Fix For: Field Type branch


 In the field type branch we have strongly decoupled
 Document/Field/FieldType impl from the indexer, by having only a
 narrow API (IndexableField) passed to IndexWriter.  This frees apps up
 to use their own documents instead of the user-space impls we provide
 in oal.document.
 Similarly, with LUCENE-3309, we've done the same thing on the
 doc/field retrieval side (from IndexReader), with the
 StoredFieldsVisitor.
 But, maybe we should break out StorableField from IndexableField,
 such that when you index a doc you provide two Iterables -- one for the
 IndexableFields and one for the StorableFields.  Either can be null.
 One downside is a possible perf hit for fields that are both indexed & 
 stored (ie, we visit them twice, lookup their name in a hash twice,
 etc.).  But the upside is a cleaner separation of concerns in the API.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113988#comment-13113988
 ] 

Michael McCandless commented on LUCENE-3449:


{quote}
If adherence to BitSet is not an issue why not:

for (int i = bs.firstSetBit(); i >= 0; i = bs.nextSetBitAfter(i)) {
this seems clearer on method naming, has a single if... and I think could be 
implemented nearly identically to what's already in the code. We can run 
microbenchmarks for fun and see what comes out better and by what margin.
{quote}

Ooh I love that!

If in fact we can achieve such clean code (above), a clean API (all
methods require an in-bounds index), and not incur added cost in
nextSetBitAfter (vs the nextSetBit we have today), then I agree this
would be the best of all worlds.

I think we should give the sentinel a name (eg FBS.END)?  Then the end
condition can be {{i != FBS.END}}.

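(Putting the two suggestions together, the client loop could then look like the sketch below; firstSetBit/nextSetBitAfter and the END constant are proposed names from this thread, not existing FixedBitSet API.)

{code:java}
// Sketch of the proposed usage: END is the "no more set bits" sentinel.
for (int i = bs.firstSetBit(); i != FixedBitSet.END; i = bs.nextSetBitAfter(i)) {
  // operate on index i here
}
{code}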

 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] [Commented] (LUCENE-3435) Create a Size Estimator model for Lucene and Solr

2011-09-24 Thread Erik Hatcher
What about putting this in Google Docs for easy collaboration?  Patching an 
Excel file will be tough to coordinate. 

On Sep 24, 2011, at 9:11, Grant Ingersoll (JIRA) j...@apache.org wrote:

 
[ 
 https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113967#comment-13113967
  ] 
 
 Grant Ingersoll commented on LUCENE-3435:
 -
 
 A patch would be great for all of these things.  Thanks!
 
 Create a Size Estimator model for Lucene and Solr
 -
 
Key: LUCENE-3435
URL: https://issues.apache.org/jira/browse/LUCENE-3435
Project: Lucene - Java
 Issue Type: Task
 Components: core/other
   Affects Versions: 4.0
   Reporter: Grant Ingersoll
   Assignee: Grant Ingersoll
   Priority: Minor
 
 It is often handy to be able to estimate the amount of memory and disk space 
 that both Lucene and Solr use, given certain assumptions.  I intend to check 
 in an Excel spreadsheet that allows people to estimate memory and disk usage 
 for trunk.  I propose to put it under dev-tools, as I don't think it should 
 be official documentation just yet and like the IDE stuff, we'll see how 
 well it gets maintained.
 
 --
 This message is automatically generated by JIRA.
 For more information on JIRA, see: http://www.atlassian.com/software/jira
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3453) remove IndexDocValuesField

2011-09-24 Thread Robert Muir (JIRA)
remove IndexDocValuesField
--

 Key: LUCENE-3453
 URL: https://issues.apache.org/jira/browse/LUCENE-3453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir


It's confusing how we present CSF functionality to the user; it's actually not a 
field but an attribute of a field, like STORED or INDEXED.

Otherwise, it's really hard to think about CSF because there is a mismatch 
between the APIs and the index format.

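(As a strawman, the direction hinted at above could look roughly like the sketch below on the user side; the doc-values setter, enum, and Field constructor shown are hypothetical, not an existing API.)

{code:java}
// Hypothetical sketch: doc values expressed as a FieldType attribute,
// like stored/indexed, instead of a dedicated IndexDocValuesField class.
FieldType type = new FieldType();
type.setIndexed(false);
type.setStored(false);
type.setDocValuesType(ValueType.FIXED_INTS_64);  // hypothetical attribute
Field priceField = new Field("price", type, 42L); // hypothetical constructor
{code}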
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113997#comment-13113997
 ] 

Chris Male commented on LUCENE-3453:


I'm not sure what the better alternative is, but +1 to removing this class.

 remove IndexDocValuesField
 --

 Key: LUCENE-3453
 URL: https://issues.apache.org/jira/browse/LUCENE-3453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir

 It's confusing how we present CSF functionality to the user; it's actually not 
 a field but an attribute of a field, like STORED or INDEXED.
 Otherwise, it's really hard to think about CSF because there is a mismatch 
 between the APIs and the index format.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-2769:
--

Attachment: SOLR-2769-branch_3x.patch

Attaching branch_3x patch, identical except for the package name for 
WhitespaceTokenizer.

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769.patch, 
 SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-2769:
--

Attachment: SOLR-2769-branch_3x.patch

Same for branch

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, 
 SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-09-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2621:


Attachment: LUCENE-2621_rote.patch

Here is a minimal 'rote refactor' for stored fields; there is a lot more to do 
(e.g. filenames/extensions should come from codec, segmentmerger optimizations 
(bulk merging) should not be in the API but customized by the codec, the codec 
name (format) of fields should be recorded in the index, we should implement a 
simpletext version and refactor/generalize, ...)

but more importantly, I think we need to restructure the class hierarchy: Codec 
is a per-field thing currently but I think the name Codec should represent 
the entire index... 

maybe what is Codec now should be named FieldCodec? maybe the parts of 
CodecProvider (e.g. segmentinfosreader, storedfields, etc) should be moved 
to this new Codec class? in this world maybe PreFlex codec for example returns 
its hardcoded representation for every field since in 3.x this stuff is *not* 
per field, and with more of the back compat code refactored down into PreFlex.

Would be good to come up with a nice class naming/hierarchy that represents 
reality here.

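(One possible shape for such an index-wide Codec, purely as a sketch of the restructuring being discussed; the class and method names below are placeholders and are not claimed to be the branch's actual API.)

{code:java}
// Sketch only: an index-wide codec that owns postings *and* stored fields;
// term vectors, segment infos, field infos, etc. could follow the same pattern.
public abstract class Codec {
  public abstract FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException;
  public abstract FieldsProducer fieldsProducer(SegmentReadState state) throws IOException;
  public abstract StoredFieldsWriter storedFieldsWriter(Directory dir, String segment) throws IOException;
  public abstract StoredFieldsReader storedFieldsReader(Directory dir, SegmentInfo si) throws IOException;
}
{code}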
 Extend Codec to handle also stored fields and term vectors
 --

 Key: LUCENE-2621
 URL: https://issues.apache.org/jira/browse/LUCENE-2621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2621_rote.patch


 Currently Codec API handles only writing/reading of term-related data, while 
 stored fields data and term frequency vector data writing/reading is handled 
 elsewhere.
 I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-2769:
--

Attachment: SOLR-2769.patch

Better Javadoc with example XML and link to dictionaries

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, 
 SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl resolved SOLR-2769.
---

Resolution: Fixed

Checked in for trunk and 3x

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, 
 SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114022#comment-13114022
 ] 

Michael McCandless commented on LUCENE-2621:


Awesome!

I think, like the postings, we can add a .merge() method, and the impl for that 
would do bulk-merge when it can?

On the restructuring, maybe we can go back to a PerFieldCodecWrapper, which is 
itself a Codec?  This would simplify CodecProvider back to just being a 
name-to-Codec-instance provider?  We would still use SegmentCodecs/FieldInfo(s) to 
compute/record the codecID, though in theory this could become private to 
PFCW once it's a Codec again.

 Extend Codec to handle also stored fields and term vectors
 --

 Key: LUCENE-2621
 URL: https://issues.apache.org/jira/browse/LUCENE-2621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2621_rote.patch


 Currently Codec API handles only writing/reading of term-related data, while 
 stored fields data and term frequency vector data writing/reading is handled 
 elsewhere.
 I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114024#comment-13114024
 ] 

Robert Muir commented on LUCENE-2621:
-

I think so, assuming FieldInfos etc are *also* read/written by the codec.

Then I think PFCW could be an abstract class that writes per-field 
configuration into the index, but for example PreFlexCodec would *not* extend 
this class, as a 3.x index is the same codec across all fields.

I think if we do things this way we have a lot more flexibility with backwards 
compatibility instead of all this if-then-else conditional version-checking 
code when reading these files... 

Really, for example, if someone wanted to make a Codec that reads lucene 2.x 
indexes (compressed fields and all), they should be able to do this if we 
reorganize this right.


 Extend Codec to handle also stored fields and term vectors
 --

 Key: LUCENE-2621
 URL: https://issues.apache.org/jira/browse/LUCENE-2621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2621_rote.patch


 Currently Codec API handles only writing/reading of term-related data, while 
 stored fields data and term frequency vector data writing/reading is handled 
 elsewhere.
 I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025
 ] 

Dawid Weiss commented on LUCENE-3449:
-

Eh... I shouldn't be throwing suggestions not backed up by patches... ;) I'm 
working on something else tonight, but I'll add it to my queue. If anybody 
(Uwe, Uwe, Uwe! :) wants to give it a take, go ahead.

 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025
 ] 

Dawid Weiss edited comment on LUCENE-3449 at 9/24/11 5:54 PM:
--

Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm 
working on something else tonight, but I'll add it to my queue. If anybody 
(Uwe, Uwe, Uwe! :)) wants to give it a take, go ahead.

  was (Author: dweiss):
Eh... I shouldn't be throwing suggestions not backed up by patches... ;) 
I'm working on something else tonight, but I'll add it to my queue. If anybody 
(Uwe, Uwe, Uwe! :) wants to give it a take, go ahead.
  
 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025
 ] 

Dawid Weiss edited comment on LUCENE-3449 at 9/24/11 5:55 PM:
--

Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm 
working on something else tonight, but I'll add it to my queue. If anybody 
(Uwe, Uwe, Uwe! :)) wants to give it a go, go ahead.

  was (Author: dweiss):
Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) 
I'm working on something else tonight, but I'll add it to my queue. If anybody 
(Uwe, Uwe, Uwe! :)) wants to give it a take, go ahead.
  
 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114036#comment-13114036
 ] 

Robert Muir commented on LUCENE-2621:
-

I created a branch 
(https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2621) for extending 
and refactoring the codec API to cover more portions of the index...

I think it would be really nice to flesh this out for 4.0


 Extend Codec to handle also stored fields and term vectors
 --

 Key: LUCENE-2621
 URL: https://issues.apache.org/jira/browse/LUCENE-2621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Attachments: LUCENE-2621_rote.patch


 Currently Codec API handles only writing/reading of term-related data, while 
 stored fields data and term frequency vector data writing/reading is handled 
 elsewhere.
 I propose to extend the Codec API to handle this data as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField

2011-09-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114037#comment-13114037
 ] 

Simon Willnauer commented on LUCENE-3453:
-

+1

 remove IndexDocValuesField
 --

 Key: LUCENE-3453
 URL: https://issues.apache.org/jira/browse/LUCENE-3453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir

 It's confusing how we present CSF functionality to the user; it's actually not 
 a field but an attribute of a field, like STORED or INDEXED.
 Otherwise, it's really hard to think about CSF because there is a mismatch 
 between the APIs and the index format.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114041#comment-13114041
 ] 

Robert Muir commented on LUCENE-1536:
-

I didn't look too hard here at what's going on, but maybe we could use the 
RandomAccess marker interface from the JDK?
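
Just to illustrate the idea (the names below are mine, not the patch's API): the JDK marker could tag Bits implementations that are cheap to probe per document.

{code:java}
import java.util.RandomAccess;

import org.apache.lucene.util.Bits;

// Sketch only: a consumer could branch on the marker to decide whether to probe
// bits.get(doc) directly in the hot loop or fall back to the iterator-based path.
public final class FilterBitsSupport {
  public static boolean supportsRandomAccess(Bits bits) {
    return bits instanceof RandomAccess;
  }
}
{code}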

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: 4.0

 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
 LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Prettify JS and CSS exceluded from Javadocs

2011-09-24 Thread Steven A Rowe
Hi Shai,

Sure, that sounds fine to me.

Steve

From: Shai Erera [mailto:ser...@gmail.com]
Sent: Saturday, September 24, 2011 3:12 AM
To: dev@lucene.apache.org
Subject: Re: Prettify JS and CSS exceluded from Javadocs

Hi Steve,

As I noted before, jarring prettify won't solve the problem entirely, as the 
references in the HTML will point to an incorrect location.

Why don't we just package prettify.js in the jar like we do with 
stylesheet+prettify.css? We only need this .js as we only write in Java ...

It's a tiny file, and it will simplify the whole process. Wha do you think?

Shai
On Thu, Sep 22, 2011 at 5:05 PM, Steven A Rowe 
sar...@syr.edu wrote:
The patch I gave for lucene/contrib-build.xml's javadocs target was wrong (I 
placed the nested tag outside of the jarify invocation).  Here's a fixed 
patch:

Index: lucene/contrib/contrib-build.xml
===
--- lucene/contrib/contrib-build.xml(revision 1165174)
+++ lucene/contrib/contrib-build.xml(revision )
@@ -95,7 +95,11 @@
   <packageset dir="${src.dir}"/>
 </sources>
   </invoke-javadoc>
-  <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar"/>
+  <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar">
+    <nested>
+      <fileset dir="${prettify.dir}"/>
+    </nested>
+   </jarify>
 </sequential>
   </target>



[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114055#comment-13114055
 ] 

Uwe Schindler commented on LUCENE-3449:
---

bq (Uwe, Uwe, Uwe! :))

Yes, but today is/was freetime :-)

 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114055#comment-13114055
 ] 

Uwe Schindler edited comment on LUCENE-3449 at 9/24/11 7:45 PM:


bq. (Uwe, Uwe, Uwe! :))

Yes, but today is/was freetime :-)

  was (Author: thetaphi):
bq (Uwe, Uwe, Uwe! :))

Yes, but today is/was freetime :-)
  
 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book

2011-09-24 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114057#comment-13114057
 ] 

Dawid Weiss commented on LUCENE-3449:
-

I was just teasing you, enjoy your weekend :)

 Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in 
 every programming book
 ---

 Key: LUCENE-3449
 URL: https://issues.apache.org/jira/browse/LUCENE-3449
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Affects Versions: 3.4, 4.0
Reporter: Uwe Schindler
Priority: Minor
 Attachments: LUCENE-3449.patch


 The usage pattern for nextSetBit/prevSetBit is the following:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
 }
 {code}
 The problem is that the i+1 at the end can be bs.length(), but the code in 
 nextSetBit does not allow this (same applies to prevSetBit(0)). The above 
 usage pattern is in every programming book, so it should really be supported. 
 The check has to be done in all cases (with the current impl in the calling 
 code).
 If the check is done inside xxxSetBit() it can also be optimized to be only 
 called seldom and not all the time, like in the ugly looking replacement, 
 that's currently needed:
 {code:java}
 for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
 }
 {code}
 We should change this and allow out-of-bounds indexes for those two methods 
 (they already do some checks in that direction). Enforcing this with an 
 assert is unusable on the client side.
 The test code for FixedBitSet also uses this, horrible. Please support the 
 common usage pattern for BitSets.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114062#comment-13114062
 ] 

Jan Høydahl commented on SOLR-2769:
---

Updated documentation:

http://wiki.apache.org/solr/Hunspell
http://wiki.apache.org/solr/HunspellStemFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
http://wiki.apache.org/solr/LanguageAnalysis#Stemming

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, 
 SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2792) Allow case insensitive Hunspell stemming

2011-09-24 Thread JIRA
Allow case insensitive Hunspell stemming


 Key: SOLR-2792
 URL: https://issues.apache.org/jira/browse/SOLR-2792
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.5, 4.0
Reporter: Jan Høydahl


Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3

Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus 
needs an option to allow case insensitive matching of the dictionaries.

Imagine a query for microsofts. It will never be stemmed to the dictionary 
word Microsoft because of the case difference. This problem cannot be fixed 
by putting LowercaseFilter before Hunspell.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2792) Allow case insensitive Hunspell stemming

2011-09-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114067#comment-13114067
 ] 

Jan Høydahl commented on SOLR-2792:
---

Propose an option ignoreCase=true for HunspellStemFilterFactory, which 
effectively lowercases everything before matching.

 Allow case insensitive Hunspell stemming
 

 Key: SOLR-2792
 URL: https://issues.apache.org/jira/browse/SOLR-2792
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.5, 4.0
Reporter: Jan Høydahl

 Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3
 Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus 
 needs an option to allow case insensitive matching of the dictionaries.
 Imagine a query for microsofts. It will never be stemmed to the dictionary 
 word Microsoft because of the case difference. This problem cannot be fixed 
 by putting LowercaseFilter before Hunspell.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1895) ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time

2011-09-24 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114068#comment-13114068
 ] 

Karl Wright commented on SOLR-1895:
---

I now have 4 versions of the plugin, all of which still are SearchComponents.  
The four are:

(1) uses filters and wildcards
(2) uses queries and wildcards
(3) uses filters and a special token to mark security fields that are empty
(4) uses queries and a special token to mark security fields that are empty

I've done some timings, using 5000 documents, a realistic number of user tokens 
(100), for 3000 user queries.  The numbers are interesting:

Filter + wildcard = 193948ms
Query + wildcard = 26137ms
Filter + token = 39012ms
Query + token = 25078ms

Since the current implementation is the first, and that's obviously by far the 
worst performance-wise, I recommend switching to a query-based implementation 
regardless of whether it's a SearchComponent or query parser plugin.
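
For illustration, a rough sketch of the query-based approach (not Karl's attached code), using the allow/deny fields from the issue description below:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Sketch only: require at least one of the user's allow tokens and exclude any
// document carrying one of the deny tokens.
public final class SecurityClauseBuilder {
  public static BooleanQuery buildSecurityClause(String[] allowTokens, String[] denyTokens) {
    BooleanQuery security = new BooleanQuery();
    for (String token : allowTokens) {
      security.add(new TermQuery(new Term("allow_token_document", token)), Occur.SHOULD);
    }
    for (String token : denyTokens) {
      security.add(new TermQuery(new Term("deny_token_document", token)), Occur.MUST_NOT);
    }
    return security;
  }
}
{code}

The resulting clause would typically be ANDed onto the user query; the special empty-field token from variants (3) and (4) would slot in as one more SHOULD clause.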


 ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search 
 time
 --

 Key: SOLR-1895
 URL: https://issues.apache.org/jira/browse/SOLR-1895
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Reporter: Karl Wright
  Labels: document, security, solr
 Fix For: 3.5, 4.0

 Attachments: LCFSecurityFilter.java, LCFSecurityFilter.java, 
 LCFSecurityFilter.java, LCFSecurityFilter.java, 
 SOLR-1895-service-plugin.patch, SOLR-1895-service-plugin.patch, 
 SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, 
 SOLR-1895.patch, SOLR-1895.patch


 I've written an LCF SearchComponent which filters returned results based on 
 access tokens provided by LCF's authority service.  The component requires 
 you to configure the appropriate authority service URL base, e.g.:
    <!-- LCF document security enforcement component -->
    <searchComponent name="lcfSecurity" class="LCFSecurityFilter">
      <str name="AuthorityServiceBaseURL">http://localhost:8080/lcf-authority-service</str>
    </searchComponent>
  Also required are the following schema.xml additions:
    <!-- Security fields -->
    <field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
  Finally, to tie it into the standard request handler, it seems to need to run last:
    <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <arr name="last-components">
        <str>lcfSecurity</str>
      </arr>
  ...
 I have not set a package for this code.  Nor have I been able to get it 
 reviewed by someone as conversant with Solr as I would prefer.  It is my 
 hope, however, that this module will become part of the standard Solr 1.5 
 suite of search components, since that would tie it in with LCF nicely.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114071#comment-13114071
 ] 

Robert Muir commented on SOLR-2769:
---

I think we should be more cautious on recommending Hunspell on the wiki here, 
for these reasons:
* The algorithm relies entirely on the quality of the dictionary; for many of 
these languages the dictionary is not good for this purpose: no affix rules, 
just a list of words, etc.
* Even in the case where a particular dictionary is pretty good, there are a 
number of problems: the primary use case of these dictionaries is spellchecking 
and that doesn't necessarily imply that the rules+affix combinations yield good 
results here.
* Finally, there are the usual problems of any dictionary-based technique: languages 
are not static and there is absolutely no handling for OOV words.

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, 
 SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3454) rename optimize to a less cool-sounding name

2011-09-24 Thread Robert Muir (JIRA)
rename optimize to a less cool-sounding name


 Key: LUCENE-3454
 URL: https://issues.apache.org/jira/browse/LUCENE-3454
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Robert Muir


I think users see the name optimize and feel they must do this, because who 
wants a suboptimal system? but this probably just results in wasted time and 
resources.

maybe rename to collapseSegments or something?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114121#comment-13114121
 ] 

Chris Male commented on LUCENE-3453:


Hey Robert,

Are you putting something together on this, or should I give it a shot?

 remove IndexDocValuesField
 --

 Key: LUCENE-3453
 URL: https://issues.apache.org/jira/browse/LUCENE-3453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir

 It's confusing how we present CSF functionality to the user; it's actually not 
 a field but an attribute of a field like STORED or INDEXED.
 Otherwise, it's really hard to think about CSF because there is a mismatch 
 between the APIs and the index format.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114122#comment-13114122
 ] 

Robert Muir commented on LUCENE-3453:
-

Please take it! It was just an idea after some discussion with Andrzej, who was 
experimenting in Luke (I think if you are not careful it's easy to get norms 
with your IndexDocValuesField?)

Also, I noticed in the tests that the added dv fields were hitting up 
Similarity... 

I have no ideas on naming or API, maybe UNINVERTED?

 remove IndexDocValuesField
 --

 Key: LUCENE-3453
 URL: https://issues.apache.org/jira/browse/LUCENE-3453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir

 It's confusing how we present CSF functionality to the user; it's actually not 
 a field but an attribute of a field like STORED or INDEXED.
 Otherwise, it's really hard to think about CSF because there is a mismatch 
 between the APIs and the index format.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-3453) remove IndexDocValuesField

2011-09-24 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male reassigned LUCENE-3453:
--

Assignee: Chris Male

 remove IndexDocValuesField
 --

 Key: LUCENE-3453
 URL: https://issues.apache.org/jira/browse/LUCENE-3453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
Assignee: Chris Male

 It's confusing how we present CSF functionality to the user; it's actually not 
 a field but an attribute of a field like STORED or INDEXED.
 Otherwise, it's really hard to think about CSF because there is a mismatch 
 between the APIs and the index format.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114124#comment-13114124
 ] 

Robert Muir commented on LUCENE-3454:
-

if anyone wants to take this, don't hesitate!

 rename optimize to a less cool-sounding name
 

 Key: LUCENE-3454
 URL: https://issues.apache.org/jira/browse/LUCENE-3454
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Robert Muir

 I think users see the name optimize and feel they must do this, because who 
 wants a suboptimal system? but this probably just results in wasted time and 
 resources.
 maybe rename to collapseSegments or something?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



SolrCloud Branch

2011-09-24 Thread Mark Miller
FYI

I'm going to make another SolrCloud branch to collect some ideas for the 
indexing side. I've got stuff I'm playing with, like leader election, that 
heavily intersects with other stuff I'm playing with, so juggling patches would 
just be a nightmare, if not impossible.


- Mark Miller
lucidimagination.com
2011.lucene-eurocon.org | Oct 17-20 | Barcelona











-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: SolrCloud Branch

2011-09-24 Thread Robert Muir
+1, we shouldn't hesitate to make branches I think. it makes it easier to
collaborate.
On Sep 24, 2011 8:39 PM, Mark Miller markrmil...@gmail.com wrote:
 FYI

 I'm going to make another SolrCloud branch to collect some ideas for the
indexing side. I've got stuff I'm playing with, like leader election, that
heavily intersect with other other stuff I'm playing, so juggling patches
would just be a nightmare to impossible.


 - Mark Miller
 lucidimagination.com
 2011.lucene-eurocon.org | Oct 17-20 | Barcelona











 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



Re: showItems for LRUCache missing?

2011-09-24 Thread Bill Bell
Let's add it. Add a SOLR ticket in JIRA.

On 9/22/11 7:59 AM, Eric Pugh ep...@opensourceconnections.com wrote:

Folks,

I was trying to figure out what explicitly was in my various Solr caches.
 After building a custom request handler and using Reflection to gain
access to the private Maps in the caches, I realized that showItems is
an option on both fieldValueCache and FastLRUCache, but isn't an option
on LRUCache...  Is there a reason for that?  If I wanted to submit a
patch, is it best to do it against Trunk?

Eric

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from
http://www.packtpub.com/solr-1-4-enterprise-search-server
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless of
whether attachments are marked as such.










-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Can we auto generate ID?

2011-09-24 Thread Bill Bell
It would be a great feature if the ID could be auto-generated by a GUID
inside the update or DIH handlers.

Thoughts?
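
For what it's worth, a minimal sketch of one way this could look, as a custom UpdateRequestProcessor (the uniqueKey field is assumed to be named id; the class name and factory wiring are hypothetical):

{code:java}
import java.io.IOException;
import java.util.UUID;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Sketch only: fill in a missing uniqueKey with a random UUID before the add proceeds.
public class GenerateIdProcessor extends UpdateRequestProcessor {
  public GenerateIdProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    if (doc.getFieldValue("id") == null) {
      doc.setField("id", UUID.randomUUID().toString());
    }
    super.processAdd(cmd);
  }
}
{code}

A matching factory would then be added to an updateRequestProcessorChain in solrconfig.xml so both the update handler and DIH pick it up.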




[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name

2011-09-24 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114133#comment-13114133
 ] 

Otis Gospodnetic commented on LUCENE-3454:
--

Would it be wise to stick with a name less specific than collapseSegments for 
example, in order not to have an incorrect name that requires another renaming 
when this command ends up doing something new in the future?

 rename optimize to a less cool-sounding name
 

 Key: LUCENE-3454
 URL: https://issues.apache.org/jira/browse/LUCENE-3454
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Robert Muir

 I think users see the name optimize and feel they must do this, because who 
 wants a suboptimal system? but this probably just results in wasted time and 
 resources.
 maybe rename to collapseSegments or something?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114136#comment-13114136
 ] 

Chris Male commented on LUCENE-3396:


I haven't actually committed yet.  I was hoping you'd have a chance to review 
before I did.  I'll now commit.

 Make TokenStream Reuse Mandatory for Analyzers
 --

 Key: LUCENE-3396
 URL: https://issues.apache.org/jira/browse/LUCENE-3396
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch


 In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having 
 to return reusable TokenStreams.  This is a big chunk of work, but it's time 
 to bite the bullet.
 I plan to attack this in the following way:
 - Collapse the logic of ReusableAnalyzerBase into Analyzer
 - Add a ReuseStrategy abstraction to Analyzer which controls whether the 
 TokenStreamComponents are reused globally (as they are today) or per-field.
 - Convert all Analyzers over to using TokenStreamComponents.  I've already 
 seen that some of the TokenStreams created in tests need some work to be 
 reusable (even if they aren't reused).
 - Remove Analyzer.reusableTokenStream and convert everything over to using 
 .tokenStream (which will now be returning reusable TokenStreams).
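
As a rough illustration of the end state this describes (a sketch, not the committed code; class locations are 3.x-style and differ on trunk where the analyzers moved to the analysis module):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Sketch only: the analyzer declares its components once; the reuse strategy
// decides whether they are cached globally or per field.
public final class MyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_40, source);
    return new TokenStreamComponents(source, result);
  }
}
{code}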

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2757) Switch min(a,b) function to min(a,b,...)

2011-09-24 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2757:


Attachment: SOLR-2757-2.patch

Test cases

 Switch min(a,b) function to min(a,b,...)
 

 Key: SOLR-2757
 URL: https://issues.apache.org/jira/browse/SOLR-2757
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.4
Reporter: Bill Bell
Priority: Minor
 Attachments: SOLR-2757-2.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Would like the ability to use min(1,5,10,11) to return 1. To do that today it 
 is a parenthesis nightmare:
 min(min(min(1,5),10),11)
 Should extend max() as well.
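
Just to pin down the requested semantics (plain Java, not the Solr ValueSource plumbing):

{code:java}
// Sketch of the desired behaviour: min over any number of arguments,
// so min(1,5,10,11) evaluates to 1 without nested calls.
public static float min(float first, float... rest) {
  float result = first;
  for (float value : rest) {
    result = Math.min(result, value);
  }
  return result;
}
{code}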

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2757) Switch min(a,b) function to min(a,b,...)

2011-09-24 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2757:


Attachment: (was: SOLR-2757.patch)

 Switch min(a,b) function to min(a,b,...)
 

 Key: SOLR-2757
 URL: https://issues.apache.org/jira/browse/SOLR-2757
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.4
Reporter: Bill Bell
Priority: Minor
 Attachments: SOLR-2757-2.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Would like the ability to use min(1,5,10,11) to return 1. To do that today it 
 is a parenthesis nightmare:
 min(min(min(1,5),10),11)
 Should extend max() as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114140#comment-13114140
 ] 

Robert Muir commented on LUCENE-2279:
-

Since we have merged Lucene and Solr, and Chris has fixed Analyzer to have a 
performant API not just for experts but by default, I think we can mark this issue 
resolved?

 eliminate pathological performance on StopFilter when using a Set<String> 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: thushara wijeratna
Priority: Minor

 passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set<String> is 
 used in the StopFilter all the elements of the set are copied to a 
 CharArraySet, as we can see in its ctor:
 public StopFilter(boolean enablePositionIncrements, TokenStream input, Set 
 stopWords, boolean ignoreCase)
   {
 super(input);
 if (stopWords instanceof CharArraySet) {
   this.stopWords = (CharArraySet)stopWords;
 } else {
   this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
   this.stopWords.addAll(stopWords);
 }
 this.enablePositionIncrements = enablePositionIncrements;
 init();
   }
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of the StopFilter as they all result in a copy for each invocation 
 of Analyzer.tokenStream().
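
The fast path, for reference, is to build the stop set once as a CharArraySet and reuse it (a sketch with 3.x class locations):

{code:java}
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.util.Version;

// Sketch only: constructed once and shared, so StopFilter never has to copy a
// plain Set<String> on every Analyzer.tokenStream() call.
public final class StopWords {
  public static final CharArraySet STOP_SET =
      new CharArraySet(Version.LUCENE_34, Arrays.asList("a", "an", "the"), true /* ignoreCase */);
}
{code}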

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114148#comment-13114148
 ] 

Robert Muir commented on LUCENE-3055:
-

Chris has made a ton of progress here, and I think we are very close, though it 
would be good to revisit LUCENE-2788 in the future and ensure that for 4.0 
CharFilters have a reusable API as well (this is currently not the case).


 LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
 --

 Key: LUCENE-3055
 URL: https://issues.apache.org/jira/browse/LUCENE-3055
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 3.1
Reporter: Ian Soboroff

 LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes 
 ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. 
 StandardAnalyzer to make a small modification e.g. to tokenStream().  These 
 issues don't indicate a new method of doing this.  The issues don't give a 
 reason except for design considerations, which seems a poor reason to make a 
 backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114149#comment-13114149
 ] 

Robert Muir commented on LUCENE-3396:
-

+1!

 Make TokenStream Reuse Mandatory for Analyzers
 --

 Key: LUCENE-3396
 URL: https://issues.apache.org/jira/browse/LUCENE-3396
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch


 In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having 
 to return reusable TokenStreams.  This is a big chunk of work, but it's time 
 to bite the bullet.
 I plan to attack this in the following way:
 - Collapse the logic of ReusableAnalyzerBase into Analyzer
 - Add a ReuseStrategy abstraction to Analyzer which controls whether the 
 TokenStreamComponents are reused globally (as they are today) or per-field.
 - Convert all Analyzers over to using TokenStreamComponents.  I've already 
 seen that some of the TokenStreams created in tests need some work to be 
 reusable (even if they aren't reused).
 - Remove Analyzer.reusableTokenStream and convert everything over to using 
 .tokenStream (which will now be returning reusable TokenStreams).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114151#comment-13114151
 ] 

Chris Male commented on LUCENE-3396:


Committed revision 1175297.

I don't want to mark this as resolved just yet.  I want to spin off another 
sub-task to move all consumers over to reusableTokenStream (and then rename it 
back to tokenStream).

 Make TokenStream Reuse Mandatory for Analyzers
 --

 Key: LUCENE-3396
 URL: https://issues.apache.org/jira/browse/LUCENE-3396
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, 
 LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch


 In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having 
 to return reusable TokenStreams.  This is a big chunk of work, but it's time 
 to bite the bullet.
 I plan to attack this in the following way:
 - Collapse the logic of ReusableAnalyzerBase into Analyzer
 - Add a ReuseStrategy abstraction to Analyzer which controls whether the 
 TokenStreamComponents are reused globally (as they are today) or per-field.
 - Convert all Analyzers over to using TokenStreamComponents.  I've already 
 seen that some of the TokenStreams created in tests need some work to be 
 reusable (even if they aren't reused).
 - Remove Analyzer.reusableTokenStream and convert everything over to using 
 .tokenStream (which will now be returning reusable TokenStreams).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory

2011-09-24 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114152#comment-13114152
 ] 

Chris Male commented on SOLR-2769:
--

Hey Jan,

I fixed the @see javadoc in the factory.  Both IntelliJ and ant javadoc 
reported that you can't do @see like that (with URLs).

 HunspellStemFilterFactory
 -

 Key: SOLR-2769
 URL: https://issues.apache.org/jira/browse/SOLR-2769
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
  Labels: stemming
 Fix For: 3.5, 4.0

 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, 
 SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch


 Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a 
 Factory for it in Solr

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3455) All Analysis Consumers should use reusableTokenStream

2011-09-24 Thread Chris Male (JIRA)
All Analysis Consumers should use reusableTokenStream
-

 Key: LUCENE-3455
 URL: https://issues.apache.org/jira/browse/LUCENE-3455
 Project: Lucene - Java
  Issue Type: Sub-task
Reporter: Chris Male


With Analyzer now using TokenStreamComponents, there's no reason for Analysis 
consumers to use tokenStream() (it just gives bad performance).  Consequently 
all consumers will be moved over to using reusableTokenStream().  The only 
challenge here is that reusableTokenStream throws an IOException which many 
consumers are not rigged to deal with.

Once all consumers have been moved, we can rename reusableTokenStream() back to 
tokenStream().
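
For reference, the consumer pattern this points to looks roughly like the following sketch (handling the IOException rather than swallowing it):

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch only: always go through reusableTokenStream(...) and follow the
// reset/incrementToken/end/close contract.
public final class AnalysisConsumer {
  public static void consume(Analyzer analyzer, String field, String text) throws IOException {
    TokenStream ts = analyzer.reusableTokenStream(field, new StringReader(text));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(termAtt.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}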

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3455) All Analysis Consumers should use reusableTokenStream

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114153#comment-13114153
 ] 

Robert Muir commented on LUCENE-3455:
-

+1, there is a lot of crazy code around this area, consumers catching the 
exception from reusableTokenStream() and falling back to tokenStream() and 
other silly things.

 All Analysis Consumers should use reusableTokenStream
 -

 Key: LUCENE-3455
 URL: https://issues.apache.org/jira/browse/LUCENE-3455
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: modules/analysis
Reporter: Chris Male

 With Analyzer now using TokenStreamComponents, there's no reason for Analysis 
 consumers to use tokenStream() (it just gives bad performance).  Consequently 
 all consumers will be moved over to using reusableTokenStream().  The only 
 challenge here is that reusableTokenStream throws an IOException which many 
 consumers are not rigged to deal with.
 Once all consumers have been moved, we can rename reusableTokenStream() back 
 to tokenStream().

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114154#comment-13114154
 ] 

Robert Muir commented on LUCENE-3454:
-

Otis: that's a good point. Currently optimize is just a request; we should 
probably figure out what it should be.

Should it be collapseSegments, which is a well-defined request without a cool 
sounding name?

Or, should it be something else, which gives you a more optimal configuration 
for search performance... (I still think optimize is a bad name even for this)? 
Personally I suspect it's going to be hard to support this case, e.g. you would 
really need to know things like whether the user has an executionService set on the 
IndexSearcher and how big the threadpool is, and things like that, to make an 
'optimal' configuration... and we don't have a nice way of knowing that 
information today.
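
For context on the executionService remark, the searcher can already be handed a thread pool; that is exactly the kind of runtime detail an 'optimal' layout would need to know about (sketch only):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Sketch only: a searcher backed by a fixed pool; how many segments exist
// interacts with how well such a pool can be used.
public final class ParallelSearcherFactory {
  public static IndexSearcher newParallelSearcher(IndexReader reader) {
    ExecutorService pool = Executors.newFixedThreadPool(4);  // size chosen arbitrarily here
    return new IndexSearcher(reader, pool);
  }
}
{code}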


 rename optimize to a less cool-sounding name
 

 Key: LUCENE-3454
 URL: https://issues.apache.org/jira/browse/LUCENE-3454
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Robert Muir

 I think users see the name optimize and feel they must do this, because who 
 wants a suboptimal system? but this probably just results in wasted time and 
 resources.
 maybe rename to collapseSegments or something?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org