[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113912#comment-13113912 ] Chris Male commented on LUCENE-1536: Actually, the more I look at the nocommits, the less I like what I've suggested. I think having getRandomAccessBits as it is in the patch is fine. But I think we should maybe make setLiveDocsOnly and setAllowRandomAccessFiltering 1st class features of the Bits interface.

if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch

I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop.
Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
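Abstractly, the two application strategies under test can be sketched as follows; every name and shape below is illustrative, not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative sketch (not the patch's API) of the two ways a filter
// can be applied to the docs a scorer produces.
class FilterSketch {
    // Random access ("method high"): one bits.get(doc)-style check per
    // candidate doc inside the search loop.
    static List<Integer> randomAccess(int[] scorerDocs, IntPredicate bits) {
        List<Integer> out = new ArrayList<>();
        for (int doc : scorerDocs) {
            if (bits.test(doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    // Iterator style (trunk's baseline): leapfrog the scorer's docs
    // against the filter's sorted doc-id stream.
    static List<Integer> byIterator(int[] scorerDocs, int[] filterDocs) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int doc : scorerDocs) {
            while (j < filterDocs.length && filterDocs[j] < doc) {
                j++;
            }
            if (j < filterDocs.length && filterDocs[j] == doc) {
                out.add(doc);
            }
        }
        return out;
    }
}
```

The random-access form trades one cheap membership test per candidate doc against the iterator form's advance/leapfrog bookkeeping, which is why filter density dominates the results.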
Re: Prettify JS and CSS excluded from Javadocs
Hi Steve, As I noted before, jarring prettify won't solve the problem entirely, as the references in the HTML will point to an incorrect location. Why don't we just package prettify.js in the jar like we do with stylesheet+prettify.css? We only need this .js as we only write in Java ... It's a tiny file, and it will simplify the whole process. What do you think? Shai

On Thu, Sep 22, 2011 at 5:05 PM, Steven A Rowe sar...@syr.edu wrote: The patch I gave for lucene/contrib-build.xml's javadocs target was wrong (I placed the nested tag outside of the jarify invocation). Here's a fixed patch:

Index: lucene/contrib/contrib-build.xml
===================================================================
--- lucene/contrib/contrib-build.xml	(revision 1165174)
+++ lucene/contrib/contrib-build.xml	(revision )
@@ -95,7 +95,11 @@
         <packageset dir="${src.dir}"/>
       </sources>
     </invoke-javadoc>
-    <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar"/>
+    <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar">
+      <nested>
+        <fileset dir="${prettify.dir}"/>
+      </nested>
+    </jarify>
   </sequential>
 </target>
[jira] [Commented] (LUCENE-3452) The native FS lock used in test-framework's o.a.l.util.LuceneJUnitResultFormatter prohibits testing on a multi-user system
[ https://issues.apache.org/jira/browse/LUCENE-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113932#comment-13113932 ] Uwe Schindler commented on LUCENE-3452: --- bq. On the first pass, the build hung in the middle of the lucene core tests - I killed the process after half an hour with no output. I restarted the tests, and the build made it through the Lucene tests, but then at least one Solr core test failed.

We had this several times on Jenkins, too. I killed the JVM approx. 4 times in the last 2 weeks.

The native FS lock used in test-framework's o.a.l.util.LuceneJUnitResultFormatter prohibits testing on a multi-user system -- Key: LUCENE-3452 URL: https://issues.apache.org/jira/browse/LUCENE-3452 Project: Lucene - Java Issue Type: Bug Components: general/test Affects Versions: 3.4, 4.0 Reporter: Steven Rowe Priority: Minor Attachments: LUCENE-3452.patch

{{LuceneJUnitResultFormatter}} uses a lock to buffer test suites' output, so that when run in parallel, they don't interrupt each other when they are displayed on the console. The current implementation uses a fixed directory ({{lucene_junit_lock/}}) in {{java.io.tmpdir}} (by default {{/tmp/}} on Unix/Linux systems) as the location of this lock. This functionality was introduced in SOLR-1835. As Shawn Heisey reported on SOLR-2739, some tests fail when run as root, but succeed when run as a non-root user. On #lucene IRC today, Shawn wrote:

{quote} (2:06:07 PM) elyograg: Now that I know I can't run the tests as root, I have discovered /tmp/lucene_junit_lock. Once you run the tests as user A, you cannot run them again as user B until that directory is deleted, and only root or the original user can do so. {quote}
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113957#comment-13113957 ] Michael McCandless commented on LUCENE-3449: bq. No matter what, this loop needs 2 ifs per cycle

Duh, I was wrong about this! We just need to change the sentinel value returned by nextSetBit when there is no next set bit, from -1 to MAX_INT. In fact we did this for DISI.nextDoc, for the same reason (saves an if per cycle). Then you just rotate the loop:

{noformat}
if (bits.length() != 0) {
  int bit = bits.nextSetBit(0);
  final int limit = bits.length()-1;
  while (bit < limit) {
    // ...do something with bit...
    bit = bits.nextSetBit(1+bit);
  }
  if (bit == bits.length()-1) {
    // ...do something with bit...
  }
}
{noformat}

Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book --- Key: LUCENE-3449 URL: https://issues.apache.org/jira/browse/LUCENE-3449 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.4, 4.0 Reporter: Uwe Schindler Priority: Minor Attachments: LUCENE-3449.patch

The usage pattern for nextSetBit/prevSetBit is the following:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
}
{code}

The problem is that the i+1 at the end can be bs.length(), but the code in nextSetBit does not allow this (same applies to prevSetBit(0)). The above usage pattern is in every programming book, so it should really be supported. The check has to be done in all cases (with the current impl, in the calling code). If the check is done inside xxxSetBit() it can also be optimized to be called only seldom and not all the time, unlike in the ugly-looking replacement that's currently needed:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
}
{code}

We should change this and allow out-of-bounds indexes for those two methods (they already do some checks in that direction). Enforcing this with an assert is unusable on the client side. The test code for FixedBitSet also uses this; horrible. Please support the common usage pattern for BitSets.
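For comparison, java.util.BitSet already tolerates the out-of-bounds fromIndex: nextSetBit accepts any non-negative index, even one past the highest set bit, and returns -1 when exhausted, which is exactly what lets the textbook loop run unguarded. A minimal runnable sketch of that pattern:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class TextbookLoop {
    // Collect the indexes of all set bits with the classic loop.
    // java.util.BitSet.nextSetBit(i + 1) is legal even when i is the
    // highest set bit, so the caller needs no extra bounds check.
    public static List<Integer> setBits(BitSet bs) {
        List<Integer> out = new ArrayList<>();
        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        bs.set(1);
        bs.set(3);
        bs.set(63); // last bit of the first 64-bit word
        System.out.println(setBits(bs)); // [1, 3, 63]
    }
}
```

The issue asks that FixedBitSet accept the same loop without throwing on the final nextSetBit(i+1) call.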
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113958#comment-13113958 ] hadas raviv commented on LUCENE-2959: - Hi, First of all, I would like to thank you for the great contribution you made by adding the state of the art ranking methods to lucene. I was waiting for these features for a long time, since they enable an IR researcher like me to use lucene, which is a powerful tool, for research purposes. I downloaded the latest version of lucene trunk and played a little with the models you implemented. There is a question I have and I would really appreciate your answer (my apologies in advance - I'm new to lucene so maybe this question is trivial for you): I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that? Is it possible to save the real document length coded in some other way (maybe with the new flexible index)? Is there any example for such an implementation? It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue? In addition - do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future?
Thanks in advance, Hadas

[GSoC] Implementing State of the Art Ranking for Lucene --- Key: LUCENE-2959 URL: https://issues.apache.org/jira/browse/LUCENE-2959 Project: Lucene - Java Issue Type: New Feature Components: core/query/scoring, general/javadocs, modules/examples Reporter: David Mark Nemeskey Assignee: Robert Muir Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: flexscoring branch, 4.0 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, implementation_plan.pdf, proposal.pdf

Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task. This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions. The wiki page for the project can be found at http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.
[jira] [Commented] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113960#comment-13113960 ] Michael McCandless commented on LUCENE-3451: Patch looks great Uwe! Nice catch on the analyzers removing stop words and then making an all-MUST_NOT BQ. But, I think we should throw an exception in this case, since it's a horrible trap now? The user will get 0 results, but that's flat-out silently wrong?

Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.0 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch

We should, at least in Lucene 4.0, remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throwing UOE.
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113961#comment-13113961 ] Dawid Weiss commented on LUCENE-3449: - Sorry, but my gut feeling says no to this loop logic. It just seems strangely complicated. If adherence to BitSet is not an issue, why not:

{code}
for (int i = bs.firstSetBit(); i >= 0; i = bs.nextSetBitAfter(i)) {
{code}

This seems clearer on method naming, has a single if... and I think could be implemented nearly identically to what's already in the code. We can run microbenchmarks for fun and see what comes out better and by what margin.

Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book --- Key: LUCENE-3449 URL: https://issues.apache.org/jira/browse/LUCENE-3449 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.4, 4.0 Reporter: Uwe Schindler Priority: Minor Attachments: LUCENE-3449.patch

The usage pattern for nextSetBit/prevSetBit is the following:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
}
{code}

The problem is that the i+1 at the end can be bs.length(), but the code in nextSetBit does not allow this (same applies to prevSetBit(0)). The above usage pattern is in every programming book, so it should really be supported. The check has to be done in all cases (with the current impl, in the calling code). If the check is done inside xxxSetBit() it can also be optimized to be called only seldom and not all the time, unlike in the ugly-looking replacement that's currently needed:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
}
{code}

We should change this and allow out-of-bounds indexes for those two methods (they already do some checks in that direction). Enforcing this with an assert is unusable on the client side. The test code for FixedBitSet also uses this; horrible. Please support the common usage pattern for BitSets.
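A rough sketch of how the proposed pair could be implemented over a raw long[], the way FixedBitSet stores its bits. The method names firstSetBit/nextSetBitAfter come from the comment above; the class, the bounds handling, and everything else are assumptions, not the actual FixedBitSet code:

```java
// Hypothetical implementation of the proposed API over a raw long[]
// like FixedBitSet's backing words. The single bounds check lives
// inside nextSetBitAfter, hidden from the caller; -1 is the
// exhausted sentinel.
public class SetBitCursor {
    private final long[] words; // bit i lives in words[i >> 6]
    private final int numBits;

    public SetBitCursor(long[] words, int numBits) {
        this.words = words;
        this.numBits = numBits;
    }

    public int firstSetBit() {
        return nextSetBitAfter(-1);
    }

    public int nextSetBitAfter(int index) {
        int i = index + 1;
        if (i >= numBits) {
            return -1; // the one "if" per cycle
        }
        int word = i >> 6;
        long bits = words[word] >>> i; // Java shifts a long by i & 63
        if (bits != 0) {
            return i + Long.numberOfTrailingZeros(bits);
        }
        for (word++; word < words.length; word++) {
            if (words[word] != 0) {
                return (word << 6) + Long.numberOfTrailingZeros(words[word]);
            }
        }
        return -1;
    }
}
```

The caller's loop then reads `for (int i = c.firstSetBit(); i >= 0; i = c.nextSetBitAfter(i)) { ... }`, with no bounds arithmetic at the call site.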
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113965#comment-13113965 ] Michael McCandless commented on LUCENE-1536: bq. But I think we should maybe make setLiveDocsOnly and setAllowRandomAccessFiltering 1st class features of the Bits interface.

Hmm... that also makes me a bit nervous ;) Bits is too low-level for these concepts? Ie whether a filter/DIS folded in live docs already, and whether the filter/DIS is best applied by iteration vs by random access, are higher level filter concepts, not low level Bits concepts, I think? Also, Bits by definition is random-access so I don't think it should have set/getAllowRandomAccessFiltering.

if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch

I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search).
* I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop).
[jira] [Commented] (LUCENE-3435) Create a Size Estimator model for Lucene and Solr
[ https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113967#comment-13113967 ] Grant Ingersoll commented on LUCENE-3435: - A patch would be great for all of these things. Thanks! Create a Size Estimator model for Lucene and Solr - Key: LUCENE-3435 URL: https://issues.apache.org/jira/browse/LUCENE-3435 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 4.0 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It is often handy to be able to estimate the amount of memory and disk space that both Lucene and Solr use, given certain assumptions. I intend to check in an Excel spreadsheet that allows people to estimate memory and disk usage for trunk. I propose to put it under dev-tools, as I don't think it should be official documentation just yet and like the IDE stuff, we'll see how well it gets maintained.
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113977#comment-13113977 ] Robert Muir commented on LUCENE-2959: - {quote} I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that? {quote} So that you can switch between ranking models without re-indexing. {quote} It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue? {quote} I ran experiments on this a long time ago; the changes were not statistically significant. But there is an issue open to still switch norms to docvalues fields, for other reasons: LUCENE-3221 {quote} In addition - do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future? {quote} No, there won't be any additional work on this issue; GSoC is over.

[GSoC] Implementing State of the Art Ranking for Lucene --- Key: LUCENE-2959 URL: https://issues.apache.org/jira/browse/LUCENE-2959 Project: Lucene - Java Issue Type: New Feature Components: core/query/scoring, general/javadocs, modules/examples Reporter: David Mark Nemeskey Assignee: Robert Muir Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: flexscoring branch, 4.0 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, implementation_plan.pdf, proposal.pdf

Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task.
This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions. The wiki page for the project can be found at http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sebastian L. updated LUCENE-3440: - Fix Version/s: 4.0 Affects Version/s: 4.0

FastVectorHighlighter: IDF-weighted terms for ordered fragments Key: LUCENE-3440 URL: https://issues.apache.org/jira/browse/LUCENE-3440 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.5, 4.0 Reporter: sebastian L. Priority: Minor Labels: FastVectorHighlighter Fix For: 3.5, 4.0 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, LUCENE-4.0-SNAPSHOT-3440-3.patch

The FastVectorHighlighter assigns an equal weight to every term found in a fragment, which gives a higher ranking to fragments with a high number of words, or, in the worst case, a high number of very common words, than to fragments that contain *all* of the terms used in the original query. This patch provides ordered fragments with IDF-weighted terms: total weight = total weight + (IDF of unique term per fragment * boost of query). The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer. The patch is simple, but it works for us. Some ideas: - A better approach would be moving the whole fragment scoring into a separate class. - Switch scoring via parameter. - Exact phrases should be given an even better score, regardless of whether a phrase query was executed or not. - The edismax/dismax parameters pf, ps and pf^boost should be observed and corresponding fragments ranked higher.
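The proposed scoring rule can be sketched as follows; the class and method names are illustrative, not the patch's actual classes:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the proposed rule: score a fragment by summing
// IDF * query boost over the *unique* query terms it contains, so a
// fragment holding all the rare query terms outranks one padded with
// many repetitions of a single common term. Names are illustrative.
class FragmentScoreSketch {
    static double score(Set<String> uniqueTermsInFragment,
                        Map<String, Double> idf,
                        double queryBoost) {
        double total = 0.0;
        for (String term : uniqueTermsInFragment) {
            total += idf.getOrDefault(term, 0.0) * queryBoost;
        }
        return total;
    }
}
```

Counting each term once per fragment is the key difference from equal-weight counting, where repeated common terms inflate the score.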
[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField
[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113984#comment-13113984 ] Chris Male commented on LUCENE-3312: Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I'm struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: - Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simply initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. - It separates the 'data type' away from the actual value. If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.

Break out StorableField from IndexableField --- Key: LUCENE-3312 URL: https://issues.apache.org/jira/browse/LUCENE-3312 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Fix For: Field Type branch

In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own documents instead of the user-space impls we provide in oal.document. Similarly, with LUCENE-3309, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor. But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables -- one for the IndexableFields and one for the StorableFields. Either can be null.
One downside is a possible perf hit for fields that are both indexed and stored (ie, we visit them twice, look up their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in the API.
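A rough sketch of the proposed split. Only the names IndexableField and StorableField come from the issue; the methods, the TextField class, and the writer are illustrative assumptions:

```java
// Sketch of the proposed split: the indexer takes one Iterable per
// concern, and either may be null. Only the two interface names come
// from the issue; everything else here is illustrative.
interface IndexableField {
    String name();
    String stringValue(); // inverted by the indexer
}

interface StorableField {
    String name();
    String stringValue(); // written verbatim to stored fields
}

// A field that is both indexed and stored implements both interfaces:
// this is the "visit it twice" perf concern from the description.
class TextField implements IndexableField, StorableField {
    private final String name, value;

    TextField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String name() { return name; }
    public String stringValue() { return value; }
}

class SketchIndexWriter {
    int indexedCount, storedCount;

    void addDocument(Iterable<? extends IndexableField> indexed,
                     Iterable<? extends StorableField> stored) {
        if (indexed != null) for (IndexableField f : indexed) indexedCount++;
        if (stored != null) for (StorableField f : stored) storedCount++;
    }
}
```

An index-only or store-only app simply passes null for the Iterable it doesn't need, which is the cleaner separation the description argues for.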
[jira] [Issue Comment Edited] (LUCENE-3312) Break out StorableField from IndexableField
[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113984#comment-13113984 ] Chris Male edited comment on LUCENE-3312 at 9/24/11 2:25 PM: - Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I'm struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: - Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simply initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. - It separates the 'data type' away from the actual value. If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.

was (Author: cmale): Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I'm struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: - Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simply initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. - It separates the 'data type' away from the actual value. If these properties were to stay on StorableFieldType, I can't really see the need for a StorableFieldType.
Break out StorableField from IndexableField --- Key: LUCENE-3312 URL: https://issues.apache.org/jira/browse/LUCENE-3312 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Fix For: Field Type branch

In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own documents instead of the user-space impls we provide in oal.document. Similarly, with LUCENE-3309, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor. But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables -- one for the IndexableFields and one for the StorableFields. Either can be null. One downside is a possible perf hit for fields that are both indexed and stored (ie, we visit them twice, look up their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in the API.
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113988#comment-13113988 ] Michael McCandless commented on LUCENE-3449: {quote} If adherence to BitSet is not an issue why not:

for (int i = bs.firstSetBit(); i >= 0; i = bs.nextSetBitAfter(i)) {

this seems clearer on method naming, has a single if... and I think could be implemented nearly identically to what's already in the code. We can run microbenchmarks for fun and see what comes out better and by what margin. {quote}

Ooh I love that! If in fact we can achieve such clean code (above), a clean API (all methods require an in-bounds index), and not incur added cost in nextSetBitAfter (vs the nextSetBit we have today) then I agree this would be the best of all worlds. I think we should give the sentinel a name (eg FBS.END)? Then the end condition can be {{i != FBS.END}}.

Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book --- Key: LUCENE-3449 URL: https://issues.apache.org/jira/browse/LUCENE-3449 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.4, 4.0 Reporter: Uwe Schindler Priority: Minor Attachments: LUCENE-3449.patch

The usage pattern for nextSetBit/prevSetBit is the following:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
}
{code}

The problem is that the i+1 at the end can be bs.length(), but the code in nextSetBit does not allow this (same applies to prevSetBit(0)). The above usage pattern is in every programming book, so it should really be supported. The check has to be done in all cases (with the current impl, in the calling code). If the check is done inside xxxSetBit() it can also be optimized to be called only seldom and not all the time, unlike in the ugly-looking replacement that's currently needed:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
}
{code}

We should change this and allow out-of-bounds indexes for those two methods (they already do some checks in that direction). Enforcing this with an assert is unusable on the client side. The test code for FixedBitSet also uses this; horrible. Please support the common usage pattern for BitSets.
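A hedged sketch of what a named sentinel might look like, using java.util.BitSet as a stand-in backing store; END, the class, and the loop shape are assumptions mirroring DocIdSetIterator.NO_MORE_DOCS (Integer.MAX_VALUE), not committed API:

```java
import java.util.BitSet;

// Sketch of the named-sentinel idea: nextSetBit returns END
// (Integer.MAX_VALUE, like DocIdSetIterator.NO_MORE_DOCS) instead of
// -1, so the loop tests a single condition per cycle. java.util.BitSet
// stands in for FixedBitSet; END and the class are assumed names.
public class SentinelBits {
    public static final int END = Integer.MAX_VALUE;

    private final BitSet bs;

    public SentinelBits(BitSet bs) {
        this.bs = bs;
    }

    public int nextSetBit(int from) {
        int i = bs.nextSetBit(from);
        return i < 0 ? END : i;
    }

    // Sum of all set-bit indexes; the loop exits before ever calling
    // nextSetBit(END + 1), so END never overflows anything.
    public static int sum(SentinelBits bits) {
        int total = 0;
        for (int i = bits.nextSetBit(0); i != END; i = bits.nextSetBit(i + 1)) {
            total += i;
        }
        return total;
    }
}
```

With a MAX_VALUE sentinel, `i != END` also composes naturally with range checks like `i < limit`, which a -1 sentinel does not.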
Re: [jira] [Commented] (LUCENE-3435) Create a Size Estimator model for Lucene and Solr
What about putting this in Google Docs for easy collaboration? Patching an Excel file will be tough to coordinate. On Sep 24, 2011, at 9:11, Grant Ingersoll (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113967#comment-13113967 ] Grant Ingersoll commented on LUCENE-3435: - A patch would be great for all of these things. Thanks! Create a Size Estimator model for Lucene and Solr - Key: LUCENE-3435 URL: https://issues.apache.org/jira/browse/LUCENE-3435 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 4.0 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It is often handy to be able to estimate the amount of memory and disk space that both Lucene and Solr use, given certain assumptions. I intend to check in an Excel spreadsheet that allows people to estimate memory and disk usage for trunk. I propose to put it under dev-tools, as I don't think it should be official documentation just yet and like the IDE stuff, we'll see how well it gets maintained. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3453) remove IndexDocValuesField
remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir It's confusing how we present CSF functionality to the user: it's actually not a field but an attribute of a field, like STORED or INDEXED. Otherwise, it's really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113997#comment-13113997 ] Chris Male commented on LUCENE-3453: I'm not sure what the better alternative is, but +1 to removing this class.
[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2769: -- Attachment: SOLR-2769-branch_3x.patch Attaching branch_3x patch, identical except for the package name for WhitespaceTokenizer HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2769: -- Attachment: SOLR-2769-branch_3x.patch Same for branch HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2621: Attachment: LUCENE-2621_rote.patch Here is a minimal 'rote refactor' for stored fields; there is a lot more to do (e.g. filenames/extensions should come from the codec, segmentmerger optimizations (bulk merging) should not be in the API but customized by the codec, the codec name (format) of fields should be recorded in the index, we should implement a simpletext version and refactor/generalize, ...) but more importantly, I think we need to restructure the class hierarchy: Codec is a per-field thing currently but I think the name Codec should represent the entire index... maybe what is Codec now should be named FieldCodec? maybe the parts of CodecProvider (e.g. segmentinfosreader, storedfields, etc.) should be moved to this new Codec class? in this world maybe the PreFlex codec for example returns its hardcoded representation for every field, since in 3.x this stuff is *not* per field, and with more of the back compat code refactored down into PreFlex. Would be good to come up with a nice class naming/hierarchy that represents reality here. Extend Codec to handle also stored fields and term vectors -- Key: LUCENE-2621 URL: https://issues.apache.org/jira/browse/LUCENE-2621 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Andrzej Bialecki Assignee: Robert Muir Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2621_rote.patch Currently Codec API handles only writing/reading of term-related data, while stored fields data and term frequency vector data writing/reading is handled elsewhere. I propose to extend the Codec API to handle this data as well. -- This message is automatically generated by JIRA. 
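The restructuring Robert sketches above (a whole-index Codec whose per-field part becomes a "FieldCodec", with a 3.x-style PreFlex codec returning one hardcoded format for every field) could look roughly like the following. All names here are hypothetical illustrations of the proposal, not the committed Lucene 4.0 API:

```java
public class CodecSketch {
    // Hypothetical per-field postings format (what "Codec" is today).
    interface FieldCodec {
        String name();
    }

    // Hypothetical whole-index Codec: owns the per-field mapping, and would
    // also grow stored-fields/segment-infos responsibilities per the issue.
    interface Codec {
        String name();
        FieldCodec fieldCodec(String field);
    }

    // A 3.x-style codec is *not* per-field: every field gets the same
    // hardcoded representation, so back-compat logic stays inside it.
    static class PreFlexStyleCodec implements Codec {
        private final FieldCodec only = () -> "PreFlex";
        public String name() { return "PreFlex"; }
        public FieldCodec fieldCodec(String field) { return only; }
    }

    public static void main(String[] args) {
        Codec c = new PreFlexStyleCodec();
        System.out.println(c.fieldCodec("title").name());
        // Same instance for every field: the per-field machinery is bypassed.
        System.out.println(c.fieldCodec("body") == c.fieldCodec("title"));
    }
}
```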
[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2769: -- Attachment: SOLR-2769.patch Better Javadoc with example XML and link to dictionaries HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl resolved SOLR-2769. --- Resolution: Fixed Checked in for trunk and 3x HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114022#comment-13114022 ] Michael McCandless commented on LUCENE-2621: Awesome! I think, like the postings, we can add a .merge() method, and the impl for that would do bulk-merge when it can? On the restructuring, maybe we can go back to a PerFieldCodecWrapper, which is itself a Codec? This would simplify CodecProvider back to just being a name -> Codec instance provider? We would still use SegmentCodecs/FieldInfo(s) to compute/record the codecID, though in theory this could become private to PFCW once it's a Codec again.
[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114024#comment-13114024 ] Robert Muir commented on LUCENE-2621: - I think so, assuming FieldInfos etc. are *also* read/written by the codec. Then I think PFCW could be an abstract class that writes per-field configuration into the index, but for example PreFlexCodec would *not* extend this class, as a 3.x index is the same codec across all fields. I think if we do things this way we have a lot more flexibility with backwards compatibility, instead of all this if-then-else conditional version-checking code when reading these files... Really, for example, if someone wanted to make a Codec that reads Lucene 2.x indexes (compressed fields and all), they should be able to do this if we reorganize this right.
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025 ] Dawid Weiss commented on LUCENE-3449: - Eh... I shouldn't be throwing suggestions not backed up by patches... ;) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :) wants to give it a take, go ahead.
[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025 ] Dawid Weiss edited comment on LUCENE-3449 at 9/24/11 5:54 PM: -- Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :)) wants to give it a take, go ahead. was (Author: dweiss): Eh... I shouldn't be throwing suggestions not backed up by patches... ;) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :) wants to give it a take, go ahead.
[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025 ] Dawid Weiss edited comment on LUCENE-3449 at 9/24/11 5:55 PM: -- Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :)) wants to give it a go, go ahead. was (Author: dweiss): Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :)) wants to give it a take, go ahead.
[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114036#comment-13114036 ] Robert Muir commented on LUCENE-2621: - I created a branch (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2621) for extending and refactoring the codec API to cover more portions of the index... I think it would be really nice to flesh this out for 4.0.
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114037#comment-13114037 ] Simon Willnauer commented on LUCENE-3453: - +1
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114041#comment-13114041 ] Robert Muir commented on LUCENE-1536: - I didn't look too hard here at what's going on, but maybe we could use the RandomAccess marker interface from the JDK? if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). 
* Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
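The two strategies benchmarked above can be sketched with plain Java stand-ins. The Bits interface and the leapfrog loop below are simplified illustrations of the random-access vs. iterator styles, not the actual Lucene API:

```java
import java.util.BitSet;

public class FilterApplication {
    // Minimal stand-in for a random-access filter view (cf. Lucene's Bits).
    interface Bits {
        boolean get(int index);
    }

    // "High" method: the scorer drives iteration and the filter is
    // consulted per candidate doc via random access.
    static int countHitsRandomAccess(int[] scorerDocs, Bits filter) {
        int hits = 0;
        for (int doc : scorerDocs) {
            if (filter.get(doc)) hits++;
        }
        return hits;
    }

    // Iterator baseline: the filter is exposed as a sorted doc-id stream
    // and leapfrogged against the scorer's docs.
    static int countHitsIterator(int[] scorerDocs, BitSet filter) {
        int hits = 0;
        int f = filter.nextSetBit(0);
        for (int doc : scorerDocs) {
            while (f >= 0 && f < doc) f = filter.nextSetBit(f + 1);
            if (f == doc) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(2); filter.set(5); filter.set(9);
        int[] scorerDocs = {1, 2, 3, 5, 8, 9};
        Bits randomAccess = filter::get;
        // Both strategies accept the same docs; the benchmark compares
        // their cost, not their results.
        System.out.println(countHitsRandomAccess(scorerDocs, randomAccess)); // 3
        System.out.println(countHitsIterator(scorerDocs, filter));           // 3
    }
}
```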
RE: Prettify JS and CSS excluded from Javadocs
Hi Shai, Sure, that sounds fine to me. Steve

From: Shai Erera [mailto:ser...@gmail.com] Sent: Saturday, September 24, 2011 3:12 AM To: dev@lucene.apache.org Subject: Re: Prettify JS and CSS excluded from Javadocs

Hi Steve, As I noted before, jarring prettify won't solve the problem entirely, as the references in the HTML will point to an incorrect location. Why don't we just package prettify.js in the jar like we do with stylesheet+prettify.css? We only need this .js as we only write in Java ... It's a tiny file, and it will simplify the whole process. What do you think? Shai

On Thu, Sep 22, 2011 at 5:05 PM, Steven A Rowe sar...@syr.edu wrote: The patch I gave for lucene/contrib-build.xml's javadocs target was wrong (I placed the nested tag outside of the jarify invocation). Here's a fixed patch:

Index: lucene/contrib/contrib-build.xml
===================================================================
--- lucene/contrib/contrib-build.xml    (revision 1165174)
+++ lucene/contrib/contrib-build.xml    (revision )
@@ -95,7 +95,11 @@
       <packageset dir="${src.dir}"/>
     </sources>
   </invoke-javadoc>
-  <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar"/>
+  <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar">
+    <nested>
+      <fileset dir="${prettify.dir}"/>
+    </nested>
+  </jarify>
 </sequential>
</target>
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114055#comment-13114055 ] Uwe Schindler commented on LUCENE-3449: --- bq (Uwe, Uwe, Uwe! :)) Yes, but today is/was freetime :-)
[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114055#comment-13114055 ] Uwe Schindler edited comment on LUCENE-3449 at 9/24/11 7:45 PM: bq. (Uwe, Uwe, Uwe! :)) Yes, but today is/was freetime :-) was (Author: thetaphi): bq (Uwe, Uwe, Uwe! :)) Yes, but today is/was freetime :-)
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114057#comment-13114057 ] Dawid Weiss commented on LUCENE-3449: - I was just teasing you, enjoy your weekend :)
[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114062#comment-13114062 ] Jan Høydahl commented on SOLR-2769: --- Updated documentation: http://wiki.apache.org/solr/Hunspell http://wiki.apache.org/solr/HunspellStemFilterFactory http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters http://wiki.apache.org/solr/LanguageAnalysis#Stemming HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2792) Allow case insensitive Hunspell stemming
Allow case insensitive Hunspell stemming Key: SOLR-2792 URL: https://issues.apache.org/jira/browse/SOLR-2792 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.5, 4.0 Reporter: Jan Høydahl Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3 Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus needs an option to allow case insensitive matching of the dictionaries. Imagine a query for microsofts. It will never be stemmed to the dictionary word Microsoft because of the case difference. This problem cannot be fixed by putting LowercaseFilter before Hunspell. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2792) Allow case insensitive Hunspell stemming
[ https://issues.apache.org/jira/browse/SOLR-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114067#comment-13114067 ] Jan Høydahl commented on SOLR-2792: --- Propose an option ignoreCase=true for HunspellStemFilterFactory, which effectively lowercases everything before matching. Allow case insensitive Hunspell stemming Key: SOLR-2792 URL: https://issues.apache.org/jira/browse/SOLR-2792 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.5, 4.0 Reporter: Jan Høydahl Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3 Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus needs an option to allow case insensitive matching of the dictionaries. Imagine a query for microsofts. It will never be stemmed to the dictionary word Microsoft because of the case difference. This problem cannot be fixed by putting LowercaseFilter before Hunspell. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
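A sketch of how the proposed option might look in a schema.xml analyzer chain. The ignoreCase attribute is the option proposed in this comment (not committed at the time of writing), and the dictionary/affix file names are placeholders:

```
<filter class="solr.HunspellStemFilterFactory"
        dictionary="en_GB.dic"
        affix="en_GB.aff"
        ignoreCase="true"/>
```

With lowercasing applied both to the dictionary at load time and to tokens at match time, a token like "microsofts" could then match the dictionary entry "Microsoft".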
[jira] [Commented] (SOLR-1895) ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time
[ https://issues.apache.org/jira/browse/SOLR-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114068#comment-13114068 ] Karl Wright commented on SOLR-1895: --- I now have 4 versions of the plugin, all of which are still SearchComponents. The four are:
(1) uses filters and wildcards
(2) uses queries and wildcards
(3) uses filters and a special token to mark security fields that are empty
(4) uses queries and a special token to mark security fields that are empty
I've done some timings, using 5000 documents, a realistic number of user tokens (100), for 3000 user queries. The numbers are interesting:
Filter + wildcard = 193948ms
Query + wildcard = 26137ms
Filter + token = 39012ms
Query + token = 25078ms
Since the current implementation is the first, and that's obviously by far the worst performance-wise, I recommend switching to a query-based implementation regardless of whether it's a SearchComponent or query parser plugin. ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time -- Key: SOLR-1895 URL: https://issues.apache.org/jira/browse/SOLR-1895 Project: Solr Issue Type: New Feature Components: SearchComponents - other Reporter: Karl Wright Labels: document, security, solr Fix For: 3.5, 4.0 Attachments: LCFSecurityFilter.java, LCFSecurityFilter.java, LCFSecurityFilter.java, LCFSecurityFilter.java, SOLR-1895-service-plugin.patch, SOLR-1895-service-plugin.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch I've written an LCF SearchComponent which filters returned results based on access tokens provided by LCF's authority service.
The component requires you to configure the appropriate authority service URL base, e.g.:
<!-- LCF document security enforcement component -->
<searchComponent name="lcfSecurity" class="LCFSecurityFilter">
  <str name="AuthorityServiceBaseURL">http://localhost:8080/lcf-authority-service</str>
</searchComponent>
Also required are the following schema.xml additions:
<!-- Security fields -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
Finally, to tie it into the standard request handler, it seems to need to run last:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <arr name="last-components">
    <str>lcfSecurity</str>
  </arr>
  ...
I have not set a package for this code. Nor have I been able to get it reviewed by someone as conversant with Solr as I would prefer. It is my hope, however, that this module will become part of the standard Solr 1.5 suite of search components, since that would tie it in with LCF nicely. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114071#comment-13114071 ] Robert Muir commented on SOLR-2769: --- I think we should be more cautious about recommending Hunspell on the wiki here, for these reasons:
* The algorithm relies entirely on the quality of the dictionary; for many of these languages the dictionary is not good for this purpose: no affix rules, just a list of words, etc.
* Even in the case where a particular dictionary is pretty good, there are a number of problems: the primary use case of these dictionaries is spellchecking, and that doesn't necessarily imply that the rules+affix combinations yield good results here.
* Finally, there are the usual problems of a dictionary-based technique: languages are not static, and there is absolutely no handling for OOV words.
HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3454) rename optimize to a less cool-sounding name
rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? but this probably just results in wasted time and resources. maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114121#comment-13114121 ] Chris Male commented on LUCENE-3453: Hey Robert, Are you putting something together on this, or should I give it a shot? remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Its confusing how we present CSF functionality to the user, its actually not a field but an attribute of a field like STORED or INDEXED. Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114122#comment-13114122 ] Robert Muir commented on LUCENE-3453: - please take it! it was just an idea after some discussion with Andrzej, who was experimenting in Luke (I think if you are not careful its easy to get norms with your indexdocvaluesfield?) also I noticed in the tests that the added dv fields were hitting up Similarity... I have no ideas on naming or api, maybe UNINVERTED? remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Its confusing how we present CSF functionality to the user, its actually not a field but an attribute of a field like STORED or INDEXED. Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male reassigned LUCENE-3453: -- Assignee: Chris Male remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Assignee: Chris Male Its confusing how we present CSF functionality to the user, its actually not a field but an attribute of a field like STORED or INDEXED. Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114124#comment-13114124 ] Robert Muir commented on LUCENE-3454: - if anyone wants to take this, don't hesitate! rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? but this probably just results in wasted time and resources. maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
SolrCloud Branch
FYI I'm going to make another SolrCloud branch to collect some ideas for the indexing side. I've got stuff I'm playing with, like leader election, that heavily intersects with other stuff I'm playing with, so juggling patches would just be a nightmare, verging on impossible. - Mark Miller lucidimagination.com 2011.lucene-eurocon.org | Oct 17-20 | Barcelona - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: SolrCloud Branch
+1, we shouldn't hesitate to make branches, I think. It makes it easier to collaborate. On Sep 24, 2011 8:39 PM, Mark Miller markrmil...@gmail.com wrote: FYI I'm going to make another SolrCloud branch to collect some ideas for the indexing side. I've got stuff I'm playing with, like leader election, that heavily intersects with other stuff I'm playing with, so juggling patches would just be a nightmare, verging on impossible. - Mark Miller lucidimagination.com 2011.lucene-eurocon.org | Oct 17-20 | Barcelona - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: showItems for LRUCache missing?
Let's add it. Add a SOLR ticket in JIRA. On 9/22/11 7:59 AM, Eric Pugh ep...@opensourceconnections.com wrote: Folks, I was trying to figure out what explicitly was in my various Solr caches. After building a custom request handler and using Reflection to gain access to the private Map's in the caches, I realized that showItems is an option on both fieldValueCache and FastLRUCache, but isn't an option on LRUCache... Is there a reason for that? If I wanted to submit a patch, is it best to do it against Trunk? Eric - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Can we auto generate ID?
It would be a great feature if the ID could be auto-generated by a GUID inside the update or DIH handlers. Thoughts?
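As a sketch of what the generation itself could look like: java.util.UUID already produces GUID-style identifiers. How this would be wired into the update or DIH handlers is left open, so the class and method names below are purely illustrative:

```java
import java.util.UUID;

public class AutoId {
    // Illustrative only: produce a GUID-style id for a document that
    // arrives without one. The Solr wiring (update handler / DIH) is
    // not shown; this is just the id generation.
    static String generateId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Canonical UUID text form, e.g. "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
        System.out.println(generateId().length()); // prints 36
    }
}
```

The interesting design questions are elsewhere: whether the id is assigned before or after distributed routing, and whether re-posting the same document should get a new id or reuse the old one.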
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114133#comment-13114133 ] Otis Gospodnetic commented on LUCENE-3454: -- Would it be wise to stick with a name less specific than collapseSegments for example, in order not to have an incorrect name that requires another renaming when this command ends up doing something new in the future? rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? but this probably just results in wasted time and resources. maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114136#comment-13114136 ] Chris Male commented on LUCENE-3396: I haven't actually committed yet. I was hoping you'd have a chance to review before I did. I'll now commit. Make TokenStream Reuse Mandatory for Analyzers -- Key: LUCENE-3396 URL: https://issues.apache.org/jira/browse/LUCENE-3396 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet. I plan to attack this in the following way: - Collapse the logic of ReusableAnalyzerBase into Analyzer - Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field. - Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused). - Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2757) Switch min(a,b) function to min(a,b,...)
[ https://issues.apache.org/jira/browse/SOLR-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2757: Attachment: SOLR-2757-2.patch Test cases Switch min(a,b) function to min(a,b,...) Key: SOLR-2757 URL: https://issues.apache.org/jira/browse/SOLR-2757 Project: Solr Issue Type: Improvement Affects Versions: 3.4 Reporter: Bill Bell Priority: Minor Attachments: SOLR-2757-2.patch Original Estimate: 1h Remaining Estimate: 1h Would like the ability to use min(1,5,10,11) to return 1. To do that today it is parenthesis nightmare: min(min(min(1,5),10),11) Should extend max() as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2757) Switch min(a,b) function to min(a,b,...)
[ https://issues.apache.org/jira/browse/SOLR-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2757: Attachment: (was: SOLR-2757.patch) Switch min(a,b) function to min(a,b,...) Key: SOLR-2757 URL: https://issues.apache.org/jira/browse/SOLR-2757 Project: Solr Issue Type: Improvement Affects Versions: 3.4 Reporter: Bill Bell Priority: Minor Attachments: SOLR-2757-2.patch Original Estimate: 1h Remaining Estimate: 1h Would like the ability to use min(1,5,10,11) to return 1. To do that today it is parenthesis nightmare: min(min(min(1,5),10),11) Should extend max() as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
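The variadic form being requested is straightforward at the Java level. A minimal sketch of the semantics (the actual patch works in terms of Solr function-query ValueSources, which this does not show):

```java
public class VarMin {
    // Sketch of min(a, b, ...): fold Math.min over the trailing arguments.
    static float min(float first, float... rest) {
        float m = first;
        for (float v : rest) m = Math.min(m, v);
        return m;
    }

    public static void main(String[] args) {
        // min(1,5,10,11) replaces the nested min(min(min(1,5),10),11)
        System.out.println(min(1, 5, 10, 11)); // prints 1.0
    }
}
```

Extending max() is the same fold with Math.max.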
[jira] [Commented] (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114140#comment-13114140 ] Robert Muir commented on LUCENE-2279: - since we have merged lucene and solr, and Chris has fixed analyzer to have a performant api, not by experts but by default, I think we can mark this issue resolved? eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet - Key: LUCENE-2279 URL: https://issues.apache.org/jira/browse/LUCENE-2279 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: thushara wijeratna Priority: Minor passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a very slow filter. This is because for each document, Analyzer.tokenStream() is called, which ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:
public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase) {
  super(input);
  if (stopWords instanceof CharArraySet) {
    this.stopWords = (CharArraySet)stopWords;
  } else {
    this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
    this.stopWords.addAll(stopWords);
  }
  this.enablePositionIncrements = enablePositionIncrements;
  init();
}
I feel we should make the StopFilter signature specific, as in specifying CharArraySet vs Set&lt;String&gt;, and there should be a JavaDoc warning on using the other variants of the StopFilter, as they all result in a copy for each invocation of Analyzer.tokenStream(). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
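The cost being described is one full copy per Analyzer.tokenStream() call, i.e. per document. A self-contained sketch of the hazard using plain JDK collections (CharArraySet is Lucene's class, so a HashSet subclass stands in for the "optimized" set here; all names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopSetDemo {
    static int copies = 0; // counts how often the stop set gets rebuilt

    // Marker for the already-converted form (CharArraySet in Lucene).
    static class OptimizedSet extends HashSet<String> {}

    // Stand-in for StopFilter's ctor behavior: a plain Set<String> gets
    // copied element-by-element; an already-optimized set is used as-is.
    static Set<String> asOptimized(Set<String> stopWords) {
        if (stopWords instanceof OptimizedSet) return stopWords;
        copies++;
        OptimizedSet s = new OptimizedSet();
        s.addAll(stopWords);
        return s;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "the", "of"));

        // Anti-pattern: the plain set is re-copied on every call,
        // once per document analyzed.
        for (int doc = 0; doc < 3; doc++) asOptimized(plain);
        System.out.println("plain set copies: " + copies); // prints 3

        // Fix: convert once up front, then reuse the optimized instance.
        copies = 0;
        Set<String> optimized = asOptimized(plain);          // the single copy
        for (int doc = 0; doc < 3; doc++) asOptimized(optimized); // no further copies
        System.out.println("pre-converted copies: " + copies); // prints 1
    }
}
```

The caller-side fix is the second half of main(): build the optimized set once and pass that same instance to every filter construction, which is exactly what tightening the signature to CharArraySet would force.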
[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114148#comment-13114148 ] Robert Muir commented on LUCENE-3055: - Chris has made a ton of progress here, I think we are very close, though it would be good to revisit LUCENE-2788 in the future and ensure that for 4.0 charfilters have a reusable API as well (this is currently not the case). LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers -- Key: LUCENE-3055 URL: https://issues.apache.org/jira/browse/LUCENE-3055 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Affects Versions: 3.1 Reporter: Ian Soboroff LUCENE-2372 and LUCENE-2389 marked all analyzers as final. This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification, e.g. to tokenStream(). These issues don't indicate a new method of doing this. The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114149#comment-13114149 ] Robert Muir commented on LUCENE-3396: - +1! Make TokenStream Reuse Mandatory for Analyzers -- Key: LUCENE-3396 URL: https://issues.apache.org/jira/browse/LUCENE-3396 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet. I plan to attack this in the following way: - Collapse the logic of ReusableAnalyzerBase into Analyzer - Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field. - Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused). - Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114151#comment-13114151 ] Chris Male commented on LUCENE-3396: Committed revision 1175297. I don't want to mark this as resolved just yet. I want to spin off another sub-task to move all consumers over to reusableTokenStream (and then rename it back to tokenStream). Make TokenStream Reuse Mandatory for Analyzers -- Key: LUCENE-3396 URL: https://issues.apache.org/jira/browse/LUCENE-3396 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet. I plan to attack this in the following way: - Collapse the logic of ReusableAnalyzerBase into Analyzer - Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field. - Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused). - Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114152#comment-13114152 ] Chris Male commented on SOLR-2769: -- Hey Jan, I fixed the @see javadoc in the factory. Both IntelliJ and ant javadoc reported that you can't do @see like that (with URLs). HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3455) All Analysis Consumers should use reusableTokenStream
All Analysis Consumers should use reusableTokenStream - Key: LUCENE-3455 URL: https://issues.apache.org/jira/browse/LUCENE-3455 Project: Lucene - Java Issue Type: Sub-task Reporter: Chris Male With Analyzer now using TokenStreamComponents, there's no reason for Analysis consumers to use tokenStream() (it just gives bad performance). Consequently all consumers will be moved over to using reusableTokenStream(). The only challenge here is that reusableTokenStream throws an IOException which many consumers are not rigged to deal with. Once all consumers have been moved, we can rename reusableTokenStream() back to tokenStream(). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3455) All Analysis Consumers should use reusableTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114153#comment-13114153 ] Robert Muir commented on LUCENE-3455: - +1, there is a lot of crazy code around this area, consumers catching the exception from reusableTokenStream() and falling back to tokenStream() and other silly things. All Analysis Consumers should use reusableTokenStream - Key: LUCENE-3455 URL: https://issues.apache.org/jira/browse/LUCENE-3455 Project: Lucene - Java Issue Type: Sub-task Components: modules/analysis Reporter: Chris Male With Analyzer now using TokenStreamComponents, theres no reason for Analysis consumers to use tokenStream() (it just gives bad performance). Consequently all consumers will be moved over to using reusableTokenStream(). The only challenge here is that reusableTokenStream throws an IOException which many consumers are not rigged to deal with. Once all consumers have been moved, we can rename reusableTokenStream() back to tokenStream(). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114154#comment-13114154 ] Robert Muir commented on LUCENE-3454: - Otis: that's a good point, currently optimize is just a request, we should probably figure out what it should be. Should it be collapseSegments, which is a well-defined request without a cool-sounding name? Or should it be something else, which gives you a more optimal configuration for search performance (I still think optimize is a bad name even for this)? Personally I suspect it's going to be hard to support this case, e.g. you would really need to know things like whether the user has an executionService set on the IndexSearcher, how big the threadpool is, and things like that to make an 'optimal' configuration... and we don't have a nice way of knowing that information today. rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? But this probably just results in wasted time and resources. Maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org