Need admin help with unsubscribe

2011-01-09 Thread Imre András
Hello all,

Sorry for posting this, but unsubscribe did not work for me. Please refer to my 
second attempt below. The first was on 01/12/2010.


Thanks,
  András


--Original message--
Date:    Tuesday, December 7, 2010, 02:16:57
From:    Imre András ia...@freemail.hu
Subject: unsubscribe
To:      pylucene-dev-unsubscr...@lucene.apache.org
unsubscribe



[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979274#action_12979274
 ] 

Simon Willnauer commented on LUCENE-2855:
-

+1 - just put your name after the description in the changes.txt 

 Contrib queryparser should not use CharSequence as Map key
 --

 Key: LUCENE-2855
 URL: https://issues.apache.org/jira/browse/LUCENE-2855
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0.3
Reporter: Adriano Crestani
Assignee: Adriano Crestani
 Fix For: 3.0.4

 Attachments: lucene_2855_adriano_crestani_2011_01_08.patch


 Today, contrib query parser uses Map<CharSequence,...> in many different 
 places, which may lead to problems, since the CharSequence interface does not 
 enforce the implementation of hashCode and equals methods. Today, it's 
 causing a problem with QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder), 
 which does not work as expected.
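The pitfall in the description can be reproduced with a few lines of plain Java (illustrative only, not the contrib queryparser code itself):

```java
import java.util.HashMap;
import java.util.Map;

// Two CharSequence implementations holding the same characters need not be
// equal, so a Map<CharSequence, V> silently misses entries depending on which
// implementation is used for the lookup.
public class CharSequenceKeyPitfall {
    public static void main(String[] args) {
        Map<CharSequence, String> builders = new HashMap<>();
        builders.put("field", "builder-for-field");          // key is a String

        CharSequence sameChars = new StringBuilder("field"); // same chars, different class
        // StringBuilder inherits Object's identity-based equals/hashCode,
        // so this lookup is guaranteed to fail:
        System.out.println(builders.get(sameChars));            // prints: null
        System.out.println(builders.get("field"));              // prints: builder-for-field

        // Normalizing keys to String at the map boundary is the usual fix:
        System.out.println(builders.get(sameChars.toString())); // prints: builder-for-field
    }
}
```

Normalizing keys to a type with value-based equals/hashCode (such as String) at the map boundary avoids the problem entirely.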

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979276#action_12979276
 ] 

Earwin Burrfoot commented on LUCENE-2840:
-

bq. But doesn't that mean that an app w/ rare queries but each query is massive 
fails to use all available concurrency?
Yes. But that's not my case. And likely not someone else's.

I think if you want to be super-generic, it's better to defer exact threading 
to the user, instead of doing a one-size-fits-all solution. Else you risk 
conjuring another ConcurrentMergeScheduler.
While we're at it, we can throw in some sample implementation, which can 
satisfy some of the users, but not everyone.

 Multi-Threading in IndexSearcher (after removal of MultiSearcher and 
 ParallelMultiSearcher)
 ---

 Key: LUCENE-2840
 URL: https://issues.apache.org/jira/browse/LUCENE-2840
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Search
Reporter: Uwe Schindler
Priority: Minor
 Fix For: 4.0


 Spin-off from parent issue:
 {quote}
 We should discuss about how many threads should be spawned. If you have an 
 index with many segments, even small ones, I think only the larger segments 
 should be separate threads, all others should be handled sequentially. So 
 maybe add a maxThreads count, then sort the IndexReaders by maxDoc and then 
 only spawn maxThreads-1 threads for the bigger readers and then one 
 additional thread for the rest?
 {quote}
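The partitioning sketched in the quote could look roughly like this (hypothetical names and types, not Lucene's actual API): each of the (maxThreads - 1) largest segments gets its own work unit, and all remaining small segments are lumped into one final unit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the proposed thread partitioning, driven only by
// per-segment sizes (maxDoc values).
public class SegmentPartitioner {
    static List<List<Integer>> partition(List<Integer> maxDocs, int maxThreads) {
        List<Integer> sorted = new ArrayList<>(maxDocs);
        sorted.sort(Comparator.reverseOrder());  // biggest segments first
        List<List<Integer>> units = new ArrayList<>();
        int big = Math.min(maxThreads - 1, sorted.size());
        for (int i = 0; i < big; i++) {
            units.add(List.of(sorted.get(i)));   // one thread per large segment
        }
        List<Integer> rest = sorted.subList(big, sorted.size());
        if (!rest.isEmpty()) {
            units.add(new ArrayList<>(rest));    // one thread handles all small ones
        }
        return units;
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(1000, 10, 500, 20, 5), 3));
        // prints: [[1000], [500], [20, 10, 5]]
    }
}
```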




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979277#action_12979277
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

And we're nearing a day when we keep the whole term dictionary in memory (as 
Sphinx does, for instance).
At that point, a gazillion term lookup-related hacks (like the lookup cache) 
become obsolete :)
The term dictionary itself can also be memory-mapped after this, instead of being 
read and built from disk, which makes new segment opening 
near-instantaneous.
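The memory-mapping idea can be sketched with java.nio (the file name and layout here are invented; this is not Lucene's terms dict format):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Rather than reading a terms file and building structures up front at
// segment-open time, map it read-only and let the page cache fault bytes
// in lazily.
public class MmapSketch {
    static byte byteAt(Path file, int offset) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            // map() is cheap: no bytes are copied eagerly, so "opening" is near-instant.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(offset);
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("terms", ".bin");
        Files.write(p, new byte[]{10, 20, 30, 40});
        System.out.println(byteAt(p, 2));  // prints: 30
        Files.delete(p);
    }
}
```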

 Add variable-gap terms index impl.
 --

 Key: LUCENE-2843
 URL: https://issues.apache.org/jira/browse/LUCENE-2843
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2843.patch, LUCENE-2843.patch


 PrefixCodedTermsReader/Writer (used by all real core codecs) already
 supports pluggable terms index impls.
 The only impl we have now is FixedGapTermsIndexReader/Writer, which
 picks every Nth (default 32) term and holds it in efficient packed
 int/byte arrays in RAM.  This is already an enormous improvement (RAM
 reduction, init time) over 3.x.
 This patch adds another impl, VariableGapTermsIndexReader/Writer,
 which lets you specify an arbitrary IndexTermSelector to pick which
 terms are indexed, and then uses an FST to hold the indexed terms.
 This is typically even more memory efficient than packed int/byte
 arrays, though, it does not support ord() so it's not quite a fair
 comparison.
 I had to relax the terms index plugin api for
 PrefixCodedTermsReader/Writer to not assume that the terms index impl
 supports ord.
 I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
 out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
 when the FST is used as a terms index but seekCeil when it's holding
 all terms in the index (ie which SimpleText uses FSTs for).




[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979284#action_12979284
 ] 

Doron Cohen commented on LUCENE-2840:
-

Is it possible that with this, searching a large optimized index (single 
segment) might be slower than searching an un-optimized index of the same size, 
since the latter enjoys concurrency? If so, is it too wild for more than one 
thread to handle that single segment?




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979292#action_12979292
 ] 

Michael McCandless commented on LUCENE-2843:


In-memory terms dict would be great.  I agree it'd fundamentally change how we 
execute eg the automaton queries (suddenly we can just intersect against the 
terms dict instead of doing the seek/next thing); FuzzyQuery might be a direct 
search through the terms dict instead of first building the LevN DFA; 
respelling similarly...

But, I suspect we'll always have to support the on-disk only option because 
some apps seem to have an insane number of terms.




[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979293#action_12979293
 ] 

Michael McCandless commented on LUCENE-2840:


bq. I think if you want to be super-generic, it's better to defer exact 
threading to the user, instead of doing a one-size-fits-all solution. Else you 
risk conjuring another ConcurrentMergeScheduler.

I think something like CMS (basically a custom ES w/ proper thread 
prio/scheduling) will be necessary here.

Until Java can schedule threads the way an OS schedules processes we'll need to 
emulate it ourselves.

You want long running queries (or, merges) to be gracefully down prioritized so 
that new/fast queries (merges) finish quickly.

And you want searches (merges) to use the allowed concurrency fully.
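A toy illustration of the priority-based scheduling idea (an assumption-laden sketch, not ConcurrentMergeScheduler or any real Lucene class):

```java
import java.util.concurrent.PriorityBlockingQueue;

// Work items carry a priority that a custom executor could use to let
// new/fast queries jump ahead of long-running ones. Only the queue ordering
// is shown; a real scheduler would also re-prioritize in-flight work.
public class PrioritySchedulingSketch {
    static class Work implements Comparable<Work> {
        final int priority;  // lower = more urgent
        final String name;
        Work(int priority, String name) { this.priority = priority; this.name = name; }
        public int compareTo(Work o) { return Integer.compare(priority, o.priority); }
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<Work> queue = new PriorityBlockingQueue<>();
        queue.put(new Work(5, "long-running query"));  // submitted first...
        queue.put(new Work(1, "fresh fast query"));    // ...but loses to lower priority
        System.out.println(queue.poll().name);  // prints: fresh fast query
        System.out.println(queue.poll().name);  // prints: long-running query
    }
}
```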




[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979295#action_12979295
 ] 

Simon Willnauer commented on LUCENE-1260:
-

bq. For trunk, here is what i suggest:
I didn't follow the entire thread here, but is it worth all the effort Robert 
is suggesting, or should we simply land the docvalues branch and make norms a 
DocValues field? The infrastructure is already there, it's integrated into codec, 
and it gives users the freedom to use any Type they want. 

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?
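For illustration, a naive quantizer (not Lucene's actual SmallFloat norm codec) makes the span/resolution trade-off concrete, using the 100f-250f range and 60 buckets from the use case above:

```java
// Mapping a float range onto a fixed number of byte buckets shows why a
// static span/resolution cannot suit every application: values closer
// together than one bucket width collapse to the same encoded byte.
public class NormQuantizerSketch {
    static final float MIN = 100f, MAX = 250f;
    static final int BUCKETS = 60;  // the custom resolution from the use case above

    static byte encode(float norm) {
        float clamped = Math.max(MIN, Math.min(MAX, norm));
        return (byte) Math.round((clamped - MIN) / (MAX - MIN) * (BUCKETS - 1));
    }

    static float decode(byte b) {
        return MIN + (b & 0xFF) * (MAX - MIN) / (BUCKETS - 1);
    }

    public static void main(String[] args) {
        // 150f and 151f land in the same bucket (each bucket is 150/59 ~ 2.54 wide):
        System.out.println(encode(150f) == encode(151f));  // prints: true
        System.out.println(decode(encode(175f)));          // a value near 175f, not exactly 175f
    }
}
```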




Lucene-Solr-tests-only-trunk - Build # 3570 - Failure

2011-01-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3570/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads

Error Message:
CheckIndex failed

Stack Trace:
java.lang.RuntimeException: CheckIndex failed
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:87)
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:73)
at org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:131)
at org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:137)
at org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads(TestIndexWriterOnJRECrash.java:61)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1049)




Build Log (for compile errors):
[...truncated 3068 lines...]






[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979303#action_12979303
 ] 

Robert Muir commented on LUCENE-1260:
-

bq. I didn't follow the entire thread here, but is it worth all the effort Robert 
is suggesting, or should we simply land the docvalues branch and make norms a 
DocValues field? The infrastructure is already there, it's integrated into codec, 
and it gives users the freedom to use any Type they want.

Simon, the problem is that encode/decode is in Similarity (instead of somewhere 
else).

So, you would have the same problem with DocValues!




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979305#action_12979305
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

As I said, there's already a search server with a strictly in-memory (in the mmap 
sense; it can theoretically be paged out) terms dict AND widespread adoption. 
Their users somehow manage.

My guess is that's because people with insane number of terms store various 
crap like unique timestamps as terms. With CSF (attributes in Sphinx lingo), 
and some nice filters that can work over CSF, there's no longer any need to 
stuff your timestamps in the same place you stuff your texts. That can be 
reflected in documentation, and then, suddenly, we can drop on-disk only 
support.




[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979306#action_12979306
 ] 

Earwin Burrfoot commented on LUCENE-2840:
-

A lot of fork-join type frameworks don't even care, even though scheduling 
threads is something people supposedly use them for.
Why? I guess that's due to a low yield/cost ratio.
You frequently quote "progress, not perfection" in relation to the code, so 
why don't we apply this same principle to our threading guarantees?
I don't want to use the allowed concurrency fully. That's not realistic. I want 85% 
of it. That's already a huge leap ahead of single-threaded searches.





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979308#action_12979308
 ] 

Michael McCandless commented on LUCENE-2324:


{quote}
I think with B
we're saying even if the calling thread is bound to DWPT #1, if DWPT #2 is
greater in size and the aggregate RAM usage exceeds the max, using the calling
thread, we take DWPT #2 out of production, flush, and return it?
{quote}
Right -- the thread affinity has nothing to do with which thread gets to flush 
which DWPT.  Once flush is triggered, the thread doing the flushing is free to 
flush any DWPT.

{quote}
Maybe we can simply throw out the DWPT
and put recycling byte[]s and/or pooling DWPTs back in later if it's necessary?
{quote}

OK let's start there and put back re-use only if we see a real perf issue?

bq. What I meant was the following situation: Suppose we have two DWPTs and 
IW.commit() is called. The first DWPT finishes flushing successfully, is 
returned to the pool and idle again. The second DWPT flush fails with an 
aborting exception. 

Hmm, tricky.  I think I'd lean towards keeping segment 1.  Discarding it would 
be inconsistent w/ aborts hit during the "flushed by RAM" case?  EG if seg 1 
was flushed due to RAM usage, succeeds, and then later seg 2 is flushed due to 
RAM usage, but aborts.  In this case we would still keep seg 1?

I think aborting a flush should only lose the docs in that one DWPT (as it is 
today).

Remember, a call to commit may succeed in flushing seg 1 to disk, and updating 
the in-memory segment infos, but on hitting the aborting exc to seg 2, will 
throw that to the caller, not having committed *any* change to the index.  
Exceptions thrown during the prepareCommit (phase 1) part of commit mean 
nothing is changed in the index.

Alternatively... we could abort the entire IW session (as eg we handle OOME 
today) if ever an aborting exception was hit?  This might be cleaner?  But it's 
really a nuke the world option, which scares me.  EG it could be a looong 
indexing session (app doesn't call commit() until the end) and we could be 
throwing away *a lot* of progress.
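The commit semantics described above can be modeled as a toy two-phase commit (hedged sketch; not IndexWriter's real implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Segments flushed during phase 1 (prepareCommit) only become visible if every
// flush succeeds; an aborting exception during phase 1 leaves the previously
// committed index completely unchanged.
public class TwoPhaseCommitSketch {
    final List<String> published = new ArrayList<>();  // segments visible to readers

    // Each pending segment either flushes cleanly or aborts.
    void commit(List<String> pending, List<Boolean> flushOk) {
        List<String> staged = new ArrayList<>(published);
        for (int i = 0; i < pending.size(); i++) {       // phase 1: flush everything
            if (!flushOk.get(i)) {
                // 'published' has not been touched: nothing changed in the index
                throw new RuntimeException("aborting exception in " + pending.get(i));
            }
            staged.add(pending.get(i));
        }
        published.clear();                                // phase 2: atomic publish
        published.addAll(staged);
    }

    public static void main(String[] args) {
        TwoPhaseCommitSketch iw = new TwoPhaseCommitSketch();
        try {
            iw.commit(List.of("seg1", "seg2"), List.of(true, false)); // seg2 aborts
        } catch (RuntimeException e) {
            System.out.println(iw.published);  // prints: [] -- seg1 was not committed
        }
        iw.commit(List.of("seg1"), List.of(true));
        System.out.println(iw.published);      // prints: [seg1]
    }
}
```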

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979313#action_12979313
 ] 

Robert Muir commented on LUCENE-2843:
-

bq. As I said, there's already a search server with strictly in-memory (in mmap 
sense. it can theoretically be paged out) terms dict AND widespread adoption. 
Their users somehow manage

I don't like the reasoning that, just because sphinx does it and their 'users 
manage', that makes it ok.
sphinx also requires mysql, which only recently started supporting *real* utf-8?! 
(not that 3-byte crap they tried to pass off instead)

I don't think we should really be looking there for inspiration.




[jira] Updated: (LUCENE-2846) omitTF is viral, but omitNorms is anti-viral.

2011-01-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2846:


Attachment: LUCENE-2846.patch

here's an initial patch hacked up by mike and I... it also removes the 
multireader norms method that 
takes a byte[]+offset from IndexReader.

one oddity is that MultiNorms.norms() always returns a filled byte[] here for 
non-atomic readers (never null).
But i think this is ok for MultiNorms; it's not used in searching (only for 
SlowMultiReaderWrapper etc).

i think somehow it would be good to have more tests that cover "doesn't have 
the field" versus "omits norms",
and also (likely not in this issue) we should think about IR's norm-setting 
methods.

I don't like that these use Similarity.getDefault(): it seems we could require 
you to pass in the Sim for the float case.
I also don't like that we expose a public setNorm that takes a byte value 
either!

Long-term we should look at pulling this norm-encoding stuff out of Sim... the 
Sim should just be dealing with floats,
this encoding stuff belongs somewhere else.


 omitTF is viral, but omitNorms is anti-viral.
 -

 Key: LUCENE-2846
 URL: https://issues.apache.org/jira/browse/LUCENE-2846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2846.patch


 omitTF is viral. if you add document 1 with field foo as omitTF, then 
 document 2 has field foo without omitTF, they are both treated as omitTF.
 but omitNorms is the opposite. if you have a million documents with field 
 foo with omitNorms, then you add just one document without omitting norms, 
 now you suddenly have a million 'real norms'.
 I think it would be good for omitNorms to be viral too, just for consistency, 
 and also to prevent huge byte[]'s.
 but another option is to make omitTF anti-viral, which is more schemaless i 
 guess.
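The viral vs. anti-viral behavior described above amounts to OR-merging vs. AND-merging a per-field flag; a minimal sketch (illustrative only, not FieldInfo's actual code):

```java
// A viral flag, once set by any document, stays set for the field; an
// anti-viral flag is cleared as soon as any document arrives without it.
public class FlagMergeSketch {
    static boolean mergeViral(boolean existing, boolean incoming) {
        return existing || incoming;   // omitTF today: one omitting doc infects the field
    }
    static boolean mergeAntiViral(boolean existing, boolean incoming) {
        return existing && incoming;   // omitNorms today: one non-omitting doc revives norms
    }
    public static void main(String[] args) {
        System.out.println(mergeViral(true, false));      // prints: true
        System.out.println(mergeAntiViral(true, false));  // prints: false
    }
}
```

Making omitNorms viral would mean switching its merge rule to the OR form as well.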




[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979317#action_12979317
 ] 

Simon Willnauer commented on LUCENE-1260:
-

bq. So, you would have the same problem with DocValues!
hmm, not sure if I understand this correctly. How values are encoded / decoded 
depends on the DocValues implementation, which can be customized since it is 
exposed via codec. That means that users of the API always operate on floats and 
the encoding and decoding happen inside the codec and per field. So encode/decode 
in Sim would be obsolete, right?




[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979319#action_12979319
 ] 

Robert Muir commented on LUCENE-1260:
-

{quote}
hmm, not sure if I understand this correctly. How values are encoded / decoded 
depends on the DocValues implementation, which can be customized since it is 
exposed via codec. That means that users of the API always operate on floats and 
the encoding and decoding happen inside the codec and per field. So encode/decode 
in Sim would be obsolete, right?
{quote}

the issues remaining here mostly involve fake norms, for the omitNorms case 
(also empty norms, I think).
So the stuff I listed must be fixed regardless, to clean up the fake norms 
case; it does not matter whether real norms are encoded with CSF or not.

Doing things like cleaning up how we deal with fake norms, and removing 
Similarity.get/setDefault, is completely unrelated to DocValues... it's just 
stuff we must fix.

As long as we have these statics like Similarity.get/setDefault, it's not even 
useful to think about things like flexible scoring or per-field Similarity...!





[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979326#action_12979326
 ] 

Michael McCandless commented on LUCENE-1260:


I think we need to stop faking norms, independent of whether/when we cut over 
to CSF to store norms / index stats?

Ie the two issues are orthogonal (and both are important!).




[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979328#action_12979328
 ] 

Yonik Seeley commented on LUCENE-1260:
--

bq. I think we need to stop faking norms, independent of whether/when we 
cutover to CSF to store norms / index stats? 

+1, it was only intended to be a short-term thing for back compat (see way back 
to LUCENE-448)




[jira] Commented: (LUCENE-2846) omitTF is viral, but omitNorms is anti-viral.

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979331#action_12979331
 ] 

Robert Muir commented on LUCENE-2846:
-

An alternative that would totally clear up the faking here, which Mike thought 
of:

If we can somehow differentiate between omitNorms (null) and 'doesn't have the 
field' (say, an exception), we wouldn't need to fake. In MultiNorms we could 
then safely return null if any reader returns null, but throw an exception if 
all readers throw an exception.
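The merge rule described above can be sketched as a small helper. This is a hypothetical illustration only: the names SubNorms, FieldMissingException, and mergedNorms are invented here, not Lucene APIs.

```java
import java.util.Arrays;
import java.util.List;

public class NormsMergeSketch {
    /** Marker for "this sub-reader does not have the field at all". */
    public static class FieldMissingException extends RuntimeException {}

    /** A sub-reader either throws (no such field) or returns norms (null = omitNorms). */
    public interface SubNorms {
        byte[] norms() throws FieldMissingException;
    }

    /** Merge rule: null if any reader omits norms; throw only if every reader lacks the field. */
    public static byte[] mergedNorms(List<SubNorms> readers) {
        boolean anyHasField = false;
        for (SubNorms r : readers) {
            try {
                byte[] n = r.norms();
                anyHasField = true;
                if (n == null) {
                    return null; // omitted norms win across the merge, no faking needed
                }
            } catch (FieldMissingException e) {
                // this segment simply lacks the field; keep looking
            }
        }
        if (!anyHasField) {
            throw new FieldMissingException(); // no reader has the field
        }
        return new byte[0]; // placeholder: real code would concatenate per-segment norms
    }

    public static void main(String[] args) {
        List<SubNorms> mixed = Arrays.asList(() -> null, () -> new byte[4]);
        System.out.println(mergedNorms(mixed)); // null: one reader omitted norms
    }
}
```

The point of the sketch is that "field missing" and "norms omitted" become distinguishable signals, so no fake norms array ever has to be materialized.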


 omitTF is viral, but omitNorms is anti-viral.
 -

 Key: LUCENE-2846
 URL: https://issues.apache.org/jira/browse/LUCENE-2846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2846.patch


 omitTF is viral. if you add document 1 with field foo as omitTF, then 
 document 2 has field foo without omitTF, they are both treated as omitTF.
 but omitNorms is the opposite. if you have a million documents with field 
 foo with omitNorms, then you add just one document without omitting norms, 
 now you suddenly have a million 'real norms'.
 I think it would be good for omitNorms to be viral too, just for consistency, 
 and also to prevent huge byte[]'s.
 but another option is to make omitTF anti-viral, which is more schemaless i 
 guess.
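The asymmetry in the description above boils down to two different boolean merge rules. A toy illustration (hypothetical helper, not the actual FieldInfo merging code):

```java
public class ViralityDemo {
    // omitTF is viral: if either field instance omits term frequencies,
    // the merged field does too.
    static boolean mergeOmitTf(boolean a, boolean b) {
        return a || b;
    }

    // omitNorms is currently anti-viral: a single field instance with norms
    // forces real norms back on for every document.
    static boolean mergeOmitNorms(boolean a, boolean b) {
        return a && b;
    }

    public static void main(String[] args) {
        System.out.println(mergeOmitTf(true, false));    // true  (viral: OR)
        System.out.println(mergeOmitNorms(true, false)); // false (anti-viral: AND)
    }
}
```

Making omitNorms viral, as proposed, would mean switching its rule from AND to OR for consistency.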




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979334#action_12979334
 ] 

Michael McCandless commented on LUCENE-2843:


Yes doc values should cut back on these large term dicts.

But, I'm not a fan of pure disk-based terms dict.  Expecting the OS to make 
good decisions on what gets swapped out is risky -- Lucene is better informed 
than the OS on which data structures are worth spending RAM on (norms, terms 
index, field cache, del docs).

If indeed the terms dict (thanks to FSTs) becomes small enough to fit in RAM, 
then we should load it into RAM (and do away w/ the terms index).

 Add variable-gap terms index impl.
 --

 Key: LUCENE-2843
 URL: https://issues.apache.org/jira/browse/LUCENE-2843
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2843.patch, LUCENE-2843.patch


 PrefixCodedTermsReader/Writer (used by all real core codecs) already
 supports pluggable terms index impls.
 The only impl we have now is FixedGapTermsIndexReader/Writer, which
 picks every Nth (default 32) term and holds it in efficient packed
 int/byte arrays in RAM.  This is already an enormous improvement (RAM
 reduction, init time) over 3.x.
 This patch adds another impl, VariableGapTermsIndexReader/Writer,
 which lets you specify an arbitrary IndexTermSelector to pick which
 terms are indexed, and then uses an FST to hold the indexed terms.
 This is typically even more memory efficient than packed int/byte
 arrays, though, it does not support ord() so it's not quite a fair
 comparison.
 I had to relax the terms index plugin api for
 PrefixCodedTermsReader/Writer to not assume that the terms index impl
 supports ord.
 I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
 out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
 when the FST is used as a terms index but seekCeil when it's holding
 all terms in the index (ie which SimpleText uses FSTs for).
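The seekCeil/seekFloor distinction mentioned above is the same floor-vs-ceiling semantics that java.util.TreeSet exposes. A quick illustration, with TreeSet standing in for the FST (this is not the FSTEnum API):

```java
import java.util.TreeSet;

public class SeekSemanticsDemo {
    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        terms.add("apple");
        terms.add("cherry");
        terms.add("plum");

        // seekCeil-style: smallest entry >= the target
        // (what you want when the structure holds every term, as in SimpleText)
        System.out.println(terms.ceiling("banana")); // cherry

        // seekFloor-style: largest entry <= the target
        // (what you want when the structure is a sparse terms index:
        //  find the nearest indexed term at or before the target, then scan)
        System.out.println(terms.floor("banana"));   // apple
    }
}
```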




[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979337#action_12979337
 ] 

Michael McCandless commented on LUCENE-2840:


bq. You frequently quote progress, not perfection in relation to the code, 
but why don't we apply this same principle to our threading guarantees?

Oh we should definitely apply progress not perfection here -- in fact we 
already are: for starters (today), we bind concurrency to segments (so eg an 
optimized index has no concurrency), and we just use an ES (punt this thread 
scheduling problem to the caller).  This is better than nothing, but not good 
enough -- we can do better.

There's another quote that applies here: big dreams, small steps.  My comment 
above is dreaming but when it comes time to actually get the real work done / 
making progress towards that dream, of course we take baby steps / progress not 
perfection.

Design discussions should start w/ the big dreams but then once you've got a 
rough sense of where you want to get to in the future you shift back to the 
baby steps you do today, in the direction of that future goal.

Maybe I should wrap my comments in /dream tags and /babysteps tags!

 Multi-Threading in IndexSearcher (after removal of MultiSearcher and 
 ParallelMultiSearcher)
 ---

 Key: LUCENE-2840
 URL: https://issues.apache.org/jira/browse/LUCENE-2840
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Search
Reporter: Uwe Schindler
Priority: Minor
 Fix For: 4.0


 Spin-off from parent issue:
 {quote}
 We should discuss about how many threads should be spawned. If you have an 
 index with many segments, even small ones, I think only the larger segments 
 should be separate threads, all others should be handled sequentially. So 
 maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then 
 only spawn maxThreads-1 threads for the bigger readers and then one 
 additional thread for the rest?
 {quote}
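The scheme in the quoted description can be sketched as a simple partitioning step. Illustrative only: real code would slice IndexReaders, not bare maxDoc counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SegmentPartitionSketch {
    /**
     * Sort segment sizes (maxDoc) descending, give the maxThreads-1 largest
     * segments a slice of their own, and put all remaining small segments
     * into one final slice to be handled sequentially.
     */
    static List<List<Integer>> partition(List<Integer> maxDocs, int maxThreads) {
        List<Integer> sorted = new ArrayList<>(maxDocs);
        sorted.sort(Collections.reverseOrder());
        List<List<Integer>> slices = new ArrayList<>();
        int solo = Math.min(maxThreads - 1, sorted.size());
        for (int i = 0; i < solo; i++) {
            slices.add(Collections.singletonList(sorted.get(i)));
        }
        if (solo < sorted.size()) {
            // one slice handles the long tail of small segments sequentially
            slices.add(new ArrayList<>(sorted.subList(solo, sorted.size())));
        }
        return slices;
    }

    public static void main(String[] args) {
        // five segments, three threads -> [[900], [100], [40, 5, 3]]
        System.out.println(partition(Arrays.asList(100, 5, 3, 900, 40), 3));
    }
}
```

Each resulting slice would then be submitted as one task to the executor, so thread count is bounded while the biggest segments still get dedicated workers.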




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979345#action_12979345
 ] 

Yonik Seeley commented on LUCENE-2843:
--

bq. Their users somehow manage. 

That neglects to count those who are not users because they could not manage 
with the limitations ;-)

Anyway, being able to optionally keep the term dict in memory, per field, if 
it's below a certain limit (terms/memory or whatever), would be very cool!




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979346#action_12979346
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

bq. I don't like the reasoning that, just because sphinx does it and their 
'users manage', that makes it ok.
I'm in no way advocating it as an all-round better solution. It has its 
wrinkles just as anything else.
My reasoning is merely that the alternative exists, and it is viable, as proven 
by pretty high-profile users.
They have a memory-resident term dictionary, and it works; I have heard no 
complaints regarding this, ever.

bq. sphinx also requires mysql
Have you read anything at all? It has an integration ready for the layman user 
who just wants to stick fulltext search into their little app, but it is in no 
way reliant on it.
Sphinx is a direct alternative to Solr.

{quote}
But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good 
decisions on what gets swapped out is risky - Lucene is better informed than 
the OS on which data structures are worth spending RAM on (norms, terms index, 
field cache, del docs).
If indeed the terms dict (thanks to FSTs) becomes small enough to fit in RAM, 
then we should load it into RAM (and do away w/ the terms index).
{quote}
That's a bit delusional. If a system is forced to swap, it'll swap your 
explicitly managed RAM just as readily as memory-mapped files. I've seen this 
countless times.
But then you have a number of benefits, like sharing the filesystem cache when 
opening the same file multiple times, offloading things from the Java heap 
(which is almost always a good thing), and the fastest load-into-memory times 
possible.


Sorry if I sound offensive at times, but, damn, there's a whole world of 
simple and efficient code lying ahead in that direction :)




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979347#action_12979347
 ] 

Robert Muir commented on LUCENE-2843:
-

bq. Have you read anything at all?

Nope, haven't looked at their code... I think I stopped at the documentation 
when I saw how they analyzed text!

bq. Sorry, if I sound offending at times, but, damn, there's a whole world of 
simple and efficient code lying ahead in that direction

So where is the problem?

You can make your own all-on-disk impl, or all-in-RAM impl, and contribute it. 
And you don't have to implement a terms dict cache; that's contained in the 
implementation.

My problem is that we shouldn't assume all users can fit all their terms in 
RAM.

I think it's great to offer alternative impls that work all in RAM, and maybe 
if the terms dict < X, where X is some configurable value, even consider using 
these automatically in the standard codec... but I don't see any benefit of 
'forcing' this when we have this whole flexible indexing thing!
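The "terms dict < X" suggestion above amounts to size-based dispatch between implementations. A toy sketch (all names here are hypothetical; nothing in it is the codec API):

```java
public class TermsDictChooser {
    interface TermsDict {}
    static class AllInRamTermsDict implements TermsDict {}  // small dicts: load fully
    static class OnDiskTermsDict implements TermsDict {}    // huge dicts: stay on disk

    /** Pick an impl automatically: RAM-resident only below a configurable threshold. */
    static TermsDict choose(long termCount, long threshold) {
        return termCount < threshold ? new AllInRamTermsDict() : new OnDiskTermsDict();
    }

    public static void main(String[] args) {
        // a modest field stays in RAM; a pathological one falls back to disk
        System.out.println(choose(50_000, 1_000_000).getClass().getSimpleName());
        System.out.println(choose(2_000_000_000L, 1_000_000).getClass().getSimpleName());
    }
}
```

Applied per field, this kind of dispatch would let ordinary indexes enjoy an all-in-RAM dictionary without breaking extreme cases like Test2BTerms.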





[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979348#action_12979348
 ] 

Yonik Seeley commented on LUCENE-2843:
--

bq. My reasoning is merely that alternative exists, and it is viable. As proven 
by pretty high-profile users.

Actually, I sort of agree. I read the "in memory" too fast and didn't realize 
you were talking about memory-mapped.
There are other parts of Sphinx that are kept directly in memory (not 
memory-mapped) and do limit its single-node scalability too much, IMO.
Unfortunately, Java has additional overhead wrt mmap, and you also can't do 
some things that you could do in C. All this means is that trade-offs that 
made sense for C/C++ solutions may or may not make sense for Java solutions.




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979353#action_12979353
 ] 

Robert Muir commented on LUCENE-2843:
-

bq. Unfortunately, Java has additional overhead wrt mmap

It's not just that: you can't assume mmap even works (32-bit platforms, even 
some troubles on 64-bit Windows).
Because this is a search engine library, not just a server on 64-bit Linux 
only, we need to support other situations, like 32-bit users doing desktop 
search.

In other words, Test2BTerms in src/test should pass on my 32-bit Windows 
machine with whatever we default to.




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979366#action_12979366
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

bq. Nope, havent looked at their code... i think i stopped at the documentation 
when i saw how they analyzed text!
All my points are contained within their documentation. No need to look at the 
code (it's as shady as Lucene's).
In the same manner, Lucene had crappy analysis for years, until you took hold 
of the (Unicode) police baton.
So let's not allow color differences between our analyzers to affect our 
judgement on other parts of ours : )

bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows 
machine with whatever we default to.
I'm questioning whether there is any legal, adequate reason to have that many 
terms.
I'm agreeing on the mmap+32-bit/mmap+Windows point for a reasonable number of 
terms, though :/

A hybrid solution, with term-dict being loaded completely into memory (either 
via mmap, or into arrays) on per-field basis, is probably best in the end, 
however sad it may be.




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979372#action_12979372
 ] 

Robert Muir commented on LUCENE-2843:
-

bq. A hybrid solution, with term-dict being loaded completely into memory 
(either via mmap, or into arrays) on per-field basis, is probably best in the 
end, however sad it may be.

What's the sad part again? Why does it bother you if there is another 
alternative codec setup or terms dict implementation, if you aren't using it?
Should we also have only RAMDirectory and MMapDirectory, and is it sad that we 
have NIOFSDirectory?





[jira] Updated: (LUCENE-2846) omitTF is viral, but omitNorms is anti-viral.

2011-01-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2846:


Attachment: LUCENE-2846.patch

here's an updated patch:
* IR.setNorm(float) is also removed, forcing the user to use the correct 
similarity instead of us using the wrong one (the static).
* MultiNorms doesn't fake norms anymore; instead it handles the case of a 
non-existent field versus omitted norms.
* When a document doesn't have a field, its (undefined) norms are written as 
zero bytes instead of Similarity.getDefault().encodeNorm(1f).
* All uses of Similarity.get/setDefault are now gone in Lucene core, except 
for in IndexSearcher and IndexWriterConfig.





[jira] Updated: (LUCENE-2846) omitTF is viral, but omitNorms is anti-viral.

2011-01-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2846:


Attachment: LUCENE-2846.patch

Sorry, I had a piece of backwards logic in MultiNorms.

Of course all tests pass either way, which is why we need a good mixed-schema 
test (with RIW) for this issue before it can go in (no matter what we do).

 omitTF is viral, but omitNorms is anti-viral.
 -

 Key: LUCENE-2846
 URL: https://issues.apache.org/jira/browse/LUCENE-2846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2846.patch, LUCENE-2846.patch, LUCENE-2846.patch


 omitTF is viral. if you add document 1 with field foo as omitTF, then 
 document 2 has field foo without omitTF, they are both treated as omitTF.
 but omitNorms is the opposite. if you have a million documents with field 
 foo with omitNorms, then you add just one document without omitting norms, 
 now you suddenly have a million 'real norms'.
 I think it would be good for omitNorms to be viral too, just for consistency, 
 and also to prevent huge byte[]'s.
 but another option is to make omitTF anti-viral, which is more schemaless i 
 guess.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979382#action_12979382
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

bq. Once flush is triggered, the thread doing the flushing is free to flush any 
DWPT.

OK.

bq. OK let's start there and put back re-use only if we see a real perf issue?

I think that's best.  Balancing RAM isn't implemented in the branch, and we can't 
predict the future usage of DWPT(s) (which could languish consuming RAM with 
byte[]s well after they're flushed due to a sudden drop in the number of 
calling threads external to IW).

{quote}But it's really a nuke the world option which scares me. EG it could 
be a looong indexing session (app doesn't call commit() until the end) and we 
could be throwing away alot of progress.{quote}

Right.  Another option is, on commit, to try to flush all segments, meaning even 
if one DWPT/segment aborts, we continue on with the other DWPTs (ie, a best 
effort).  Then perhaps throw an exception with a report of which segment 
flushes succeeded, or simply return a report object detailing what happened 
during commit (somewhat expert usage though).  Either way I think we need to 
give a few options to the user, then choose a default and see if it sticks.  
The default should probably be best effort.



 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.




[jira] Updated: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-09 Thread Adriano Crestani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adriano Crestani updated LUCENE-2855:
-

Attachment: lucene_2855_adriano_crestani_2011_01_09.patch

Thanks for pointing out the problems, here is the new patch

 Contrib queryparser should not use CharSequence as Map key
 --

 Key: LUCENE-2855
 URL: https://issues.apache.org/jira/browse/LUCENE-2855
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0.3
Reporter: Adriano Crestani
Assignee: Adriano Crestani
 Fix For: 3.0.4

 Attachments: lucene_2855_adriano_crestani_2011_01_08.patch, 
 lucene_2855_adriano_crestani_2011_01_09.patch


 Today, contrib query parser uses Map<CharSequence,...> in many different 
 places, which may lead to problems, since the CharSequence interface does not 
 enforce the implementation of hashCode and equals methods. Today, it's 
 causing a problem with the QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder) 
 method, which does not work as expected.
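The pitfall described in the issue can be reproduced in a few lines of plain Java (a minimal sketch; the map contents here are made up, not from the patch):

```java
import java.util.HashMap;
import java.util.Map;

public class CharSequenceKeyDemo {
    public static void main(String[] args) {
        Map<CharSequence, String> builders = new HashMap<>();
        builders.put("field", "someBuilder");

        // An equal character sequence of a different type misses, because
        // StringBuilder inherits identity-based equals/hashCode from Object.
        CharSequence sameChars = new StringBuilder("field");
        System.out.println(builders.get("field"));   // someBuilder
        System.out.println(builders.get(sameChars)); // null
    }
}
```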




[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-01-09 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979390#action_12979390
 ] 

David Smiley commented on LUCENE-2611:
--

Steven,
  I don't know if another issue should be created, but there are some extra 
additions to the IntelliJ setup that would be nice.  In vcs.xml, add this:
{code:xml}
<component name="IssueNavigationConfiguration">
  <option name="links">
    <list>
      <IssueNavigationLink>
        <option name="issueRegexp" value="[A-Z]+\-\d+" />
        <option name="linkRegexp" value="http://issues.apache.org/jira/browse/$0" />
      </IssueNavigationLink>
    </list>
  </option>
</component>
{code}
And in workspace.xml, /project/compone...@name=ChangeListManager]/ add 
{code:xml}
<ignored path=".idea/" />
<ignored mask="*.iml" />
{code}
And perhaps the copyright setup should be set up for ASL.


 IntelliJ IDEA and Eclipse setup
 ---

 Key: LUCENE-2611
 URL: https://issues.apache.org/jira/browse/LUCENE-2611
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Build
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2611-branch-3x-part2.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test_2.patch


 Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
 The attached patches add a new top level directory {{dev-tools/}} with 
 sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
 as well as top-level ant targets named idea and eclipse that copy these 
 files into the proper locations.  This arrangement avoids the messiness 
 attendant to in-place project configuration files directly checked into 
 source control.
 The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
 Solr contrib, and each analysis module.  A JUnit run configuration per module 
 is included.
 The Eclipse configuration includes a source entry for each 
 source/test/resource location and classpath setup: a library entry for each 
 jar.
 For IDEA, once {{ant idea}} has been run, the only configuration that must be 
 performed manually is configuring the project-level JDK.  For Eclipse, once 
 {{ant eclipse}} has been run, the user has to refresh the project 
 (right-click on the project and choose Refresh).
 If these patches are committed, Subversion svn:ignore properties should be 
 added/modified to ignore the destination IDEA and Eclipse configuration 
 locations.
 Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
 for applying the 3.X branch patch for IDEA: 
 http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-01-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979395#action_12979395
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into Lucene 
4.x?  What's left to be done?

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: CSF branch, 4.0

 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, 
 LUCENE-2186.patch, LUCENE-2186.patch, mem.py


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage & GC load of 
 this index values API should be much better than field cache, since 
 it does not create object per document (instead shares big long[] and
 byte[] across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.
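As a rough illustration of the "variable bit precision" idea behind the PackedInts mentioned above, here is a toy packer. It assumes a bit width that divides 64 and non-negative values; the real implementation (LUCENE-1990) handles arbitrary widths and is far more sophisticated:

```java
// Toy packed-int storage: n small values stored at a fixed bit width in a
// long[], instead of one full int/long per value. Illustrative only.
public class PackedIntsDemo {
    // bitsPerValue must divide 64 in this sketch; values must fit in that width.
    static long[] pack(int[] values, int bitsPerValue) {
        int perWord = 64 / bitsPerValue;
        long[] packed = new long[(values.length + perWord - 1) / perWord];
        for (int i = 0; i < values.length; i++) {
            packed[i / perWord] |= ((long) values[i]) << ((i % perWord) * bitsPerValue);
        }
        return packed;
    }

    static int get(long[] packed, int bitsPerValue, int index) {
        int perWord = 64 / bitsPerValue;
        long mask = (1L << bitsPerValue) - 1;
        return (int) ((packed[index / perWord] >>> ((index % perWord) * bitsPerValue)) & mask);
    }

    public static void main(String[] args) {
        int[] vals = {3, 7, 0, 12, 9};
        long[] packed = pack(vals, 4); // 16 four-bit values per long
        for (int i = 0; i < vals.length; i++) {
            System.out.println(get(packed, 4, i));
        }
    }
}
```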




Lucene-Solr-tests-only-trunk - Build # 3586 - Failure

2011-01-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3586/

1 tests failed.
REGRESSION:  org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple

Error Message:
expected:<3> but was:<2>

Stack Trace:
junit.framework.AssertionFailedError: expected:<3> but was:<2>
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1049)
at 
org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple(TestLBHttpSolrServer.java:126)




Build Log (for compile errors):
[...truncated 8211 lines...]






[jira] Assigned: (LUCENE-2839) Visibility of Scorer.score(Collector, int, int) is wrong

2011-01-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-2839:
-

Assignee: Uwe Schindler

 Visibility of Scorer.score(Collector, int, int) is wrong
 

 Key: LUCENE-2839
 URL: https://issues.apache.org/jira/browse/LUCENE-2839
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.0


 The method for scoring subsets in Scorer has wrong visibility; it's marked 
 protected, but protected methods should not be called from other classes. 
 Protected methods are intended for methods that should be overridden by 
 subclasses and are called by (often) final methods of the same class. They 
 should never be called from foreign classes.
 This method is called from another class out-of-scope: BooleanScorer(2) - so 
 it must be public, but it's protected. This does not lead to a compiler error 
 because BS(2) is in same package, but may lead to problems if subclasses from 
 other packages override it. When implementing LUCENE-2838 I hit a trap, as I 
 thought this method should only be called from the class or Scorer itself, 
 but in fact its called from outside, leading to bugs, because I had not 
 overridden it. As ConstantScorer did not use it I have overridden it with 
 throw UOE and suddenly BooleanQuery was broken, which made it clear that it's 
 called from outside (which is not the intention of protected methods).
 We cannot fix this in 3.x, as it would break backwards for classes that 
 overwrite this method, but we can fix visibility in trunk.




[jira] Updated: (LUCENE-2839) Visibility of Scorer.score(Collector, int, int) is wrong

2011-01-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2839:
--

Attachment: LUCENE-2839-3.x.patch
LUCENE-2839.patch

Here is the patch for trunk and 3.x; will commit soon. In 3.x I simply added a 
note to Scorer's javadocs telling the user that subclasses in user code 
should declare the method as public to ease the transition to 4.0.

 Visibility of Scorer.score(Collector, int, int) is wrong
 

 Key: LUCENE-2839
 URL: https://issues.apache.org/jira/browse/LUCENE-2839
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-2839-3.x.patch, LUCENE-2839.patch


 The method for scoring subsets in Scorer has wrong visibility; it's marked 
 protected, but protected methods should not be called from other classes. 
 Protected methods are intended for methods that should be overridden by 
 subclasses and are called by (often) final methods of the same class. They 
 should never be called from foreign classes.
 This method is called from another class out-of-scope: BooleanScorer(2) - so 
 it must be public, but it's protected. This does not lead to a compiler error 
 because BS(2) is in same package, but may lead to problems if subclasses from 
 other packages override it. When implementing LUCENE-2838 I hit a trap, as I 
 thought this method should only be called from the class or Scorer itself, 
 but in fact its called from outside, leading to bugs, because I had not 
 overridden it. As ConstantScorer did not use it I have overridden it with 
 throw UOE and suddenly BooleanQuery was broken, which made it clear that it's 
 called from outside (which is not the intention of protected methods).
 We cannot fix this in 3.x, as it would break backwards for classes that 
 overwrite this method, but we can fix visibility in trunk.




[jira] Resolved: (LUCENE-2839) Visibility of Scorer.score(Collector, int, int) is wrong

2011-01-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-2839.
---

Resolution: Fixed

Committed trunk revision: 1057010,
Committed javadoc updates revision: 1057011

 Visibility of Scorer.score(Collector, int, int) is wrong
 

 Key: LUCENE-2839
 URL: https://issues.apache.org/jira/browse/LUCENE-2839
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-2839-3.x.patch, LUCENE-2839.patch


 The method for scoring subsets in Scorer has wrong visibility; it's marked 
 protected, but protected methods should not be called from other classes. 
 Protected methods are intended for methods that should be overridden by 
 subclasses and are called by (often) final methods of the same class. They 
 should never be called from foreign classes.
 This method is called from another class out-of-scope: BooleanScorer(2) - so 
 it must be public, but it's protected. This does not lead to a compiler error 
 because BS(2) is in same package, but may lead to problems if subclasses from 
 other packages override it. When implementing LUCENE-2838 I hit a trap, as I 
 thought this method should only be called from the class or Scorer itself, 
 but in fact its called from outside, leading to bugs, because I had not 
 overridden it. As ConstantScorer did not use it I have overridden it with 
 throw UOE and suddenly BooleanQuery was broken, which made it clear that it's 
 called from outside (which is not the intention of protected methods).
 We cannot fix this in 3.x, as it would break backwards for classes that 
 overwrite this method, but we can fix visibility in trunk.




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979404#action_12979404
 ] 

Simon Willnauer commented on LUCENE-2186:
-

bq. Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into 
Lucene 4.x? What's left to be done?
we are very close - to land on trunk there is about an evening of work left. 
JDoc is missing here and there plus some tests for FieldComparators - that's it!

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: CSF branch, 4.0

 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, 
 LUCENE-2186.patch, LUCENE-2186.patch, mem.py


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage & GC load of 
 this index values API should be much better than field cache, since 
 it does not create object per document (instead shares big long[] and
 byte[] across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.




[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-01-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979405#action_12979405
 ] 

Steven Rowe commented on LUCENE-2611:
-

Hi David,

Thanks for the input.

I don't think another issue is necessary.

I added the {{.idea/vcs.xml}} change to auto-linkify issues in log comments.  I 
didn't know this option existed.  Where does it do the auto-linkification? I 
don't see it in the log comment editor, and I also don't see it when I 
browse an individual file's log messages (using the popup from the svnbar 
plugin toolbar icon).

But I did not add the {{.idea/workspace.xml}} change you propose (ignoring 
{{.idea/}} and {{.iml}} files), because those files are already ignored via 
{{svn:ignore}} properties.  When I added them, nothing changed for me - the 
files still show up in the project tree view greyed out, just as they did 
before I added the option.

I'm not sure it's a good idea to add copyright setup for ASL - I don't know 
enough about what this plugin does.

 IntelliJ IDEA and Eclipse setup
 ---

 Key: LUCENE-2611
 URL: https://issues.apache.org/jira/browse/LUCENE-2611
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Build
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2611-branch-3x-part2.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test_2.patch


 Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
 The attached patches add a new top level directory {{dev-tools/}} with 
 sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
 as well as top-level ant targets named idea and eclipse that copy these 
 files into the proper locations.  This arrangement avoids the messiness 
 attendant to in-place project configuration files directly checked into 
 source control.
 The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
 Solr contrib, and each analysis module.  A JUnit run configuration per module 
 is included.
 The Eclipse configuration includes a source entry for each 
 source/test/resource location and classpath setup: a library entry for each 
 jar.
 For IDEA, once {{ant idea}} has been run, the only configuration that must be 
 performed manually is configuring the project-level JDK.  For Eclipse, once 
 {{ant eclipse}} has been run, the user has to refresh the project 
 (right-click on the project and choose Refresh).
 If these patches are committed, Subversion svn:ignore properties should be 
 added/modified to ignore the destination IDEA and Eclipse configuration 
 locations.
 Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
 for applying the 3.X branch patch for IDEA: 
 http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-01-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979407#action_12979407
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

bq. we are very close - to land on trunk there is about an evening of work 
left. JDoc is missing here and there plus some tests for FieldComparators - 
thats it!

Nice!  Once it's in I'll try to get started on the RT field cache/doc values, 
which can likely be implemented and tested somewhat independently of the RT 
inverted index.

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: CSF branch, 4.0

 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, 
 LUCENE-2186.patch, LUCENE-2186.patch, mem.py


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage & GC load of 
 this index values API should be much better than field cache, since 
 it does not create object per document (instead shares big long[] and
 byte[] across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.




Lucene-Solr-tests-only-trunk - Build # 3590 - Failure

2011-01-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3590/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads

Error Message:
CheckIndex failed

Stack Trace:
java.lang.RuntimeException: CheckIndex failed
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:87)
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:73)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:131)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:137)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads(TestIndexWriterOnJRECrash.java:61)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1049)




Build Log (for compile errors):
[...truncated 3101 lines...]






[jira] Updated: (SOLR-2272) Join

2011-01-09 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2272:
-

Component/s: search

 Join
 

 Key: SOLR-2272
 URL: https://issues.apache.org/jira/browse/SOLR-2272
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: SOLR-2272.patch


 Limited join functionality for Solr, mapping one set of IDs matching a query 
 to another set of IDs, based on the indexed tokens of the fields.
 Example:
 fq={!join from=parent_ptr to=parent_id}child_doc:query
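As an illustration of the idea (a hypothetical sketch, not Solr's implementation), the join can be thought of as collecting the {{from}}-field values of the documents matching the inner query, then selecting the documents whose {{to}} field holds one of those values:

```java
import java.util.*;

// Hypothetical sketch of the join idea (not Solr's implementation):
// collect the "from" field values of docs matching the inner query,
// then select docs whose "to" field contains any of those values.
public class JoinSketch {
    record Doc(String id, Map<String, String> fields) {}

    static List<String> join(List<Doc> docs, Set<String> matchingIds,
                             String from, String to) {
        // Gather join keys from the matched documents.
        Set<String> keys = new HashSet<>();
        for (Doc d : docs) {
            if (matchingIds.contains(d.id())) {
                keys.add(d.fields().get(from));
            }
        }
        // Map to the documents whose "to" field matches a collected key.
        List<String> result = new ArrayList<>();
        for (Doc d : docs) {
            if (keys.contains(d.fields().get(to))) {
                result.add(d.id());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("c1", Map.of("parent_ptr", "p1", "parent_id", "")),
            new Doc("p1", Map.of("parent_ptr", "", "parent_id", "p1")),
            new Doc("p2", Map.of("parent_ptr", "", "parent_id", "p2")));
        // Child doc c1 matched the inner query; the join maps it to parent p1.
        System.out.println(join(docs, Set.of("c1"), "parent_ptr", "parent_id")); // [p1]
    }
}
```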




[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-01-09 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979454#action_12979454
 ] 

Chris Male commented on LUCENE-2611:


bq. I'm not sure it's a good idea to add copyright setup for ASL - I don't know 
enough about what this plugin does.

I've used the copyright plugin a lot and it's a great way to ensure that the ASL 
is added to any new files.  Might be useful to add it to reduce the hassle for 
new contributors.

 IntelliJ IDEA and Eclipse setup
 ---

 Key: LUCENE-2611
 URL: https://issues.apache.org/jira/browse/LUCENE-2611
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Build
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2611-branch-3x-part2.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, 
 LUCENE-2611-branch-3x.patch, LUCENE-2611-part2.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, 
 LUCENE-2611_eclipse.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, 
 LUCENE-2611_test_2.patch


 Setting up Lucene/Solr in IntelliJ IDEA or Eclipse can be time-consuming.
 The attached patches add a new top level directory {{dev-tools/}} with 
 sub-dirs {{idea/}} and {{eclipse/}} containing basic setup files for trunk, 
 as well as top-level ant targets named idea and eclipse that copy these 
 files into the proper locations.  This arrangement avoids the messiness 
 attendant to in-place project configuration files directly checked into 
 source control.
 The IDEA configuration includes modules for Lucene and Solr, each Lucene and 
 Solr contrib, and each analysis module.  A JUnit run configuration per module 
 is included.
 The Eclipse configuration includes a source entry for each 
 source/test/resource location and classpath setup: a library entry for each 
 jar.
 For IDEA, once {{ant idea}} has been run, the only configuration that must be 
 performed manually is configuring the project-level JDK.  For Eclipse, once 
 {{ant eclipse}} has been run, the user has to refresh the project 
 (right-click on the project and choose Refresh).
 If these patches are committed, Subversion svn:ignore properties should be 
 added/modified to ignore the destination IDEA and Eclipse configuration 
 locations.
 Iam Jambour has written up on the Lucene wiki a detailed set of instructions 
 for applying the 3.X branch patch for IDEA: 
 http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ




[jira] Issue Comment Edited: (LUCENE-2611) IntelliJ IDEA and Eclipse setup

2011-01-09 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979454#action_12979454
 ] 

Chris Male edited comment on LUCENE-2611 at 1/9/11 8:51 PM:


bq. I'm not sure it's a good idea to add copyright setup for ASL - I don't know 
enough about what this plugin does.

I've used the copyright plugin a lot and it's a great way to ensure that the ASL 
is added to any new files.  Might be useful to add it to reduce the hassle for 
new contributors.

  was (Author: cmale):
.bq I'm not sure it's a good idea to add copyright setup for ASL - I don't 
know enough about what this plugin does.

I've used the copyright plugin a lot and its a great way to ensure that the ASL 
is added to any new files.  Might be useful to add it to reduce the hassle for 
new contributors.
  



Lucene-trunk - Build # 1421 - Failure

2011-01-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1421/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads

Error Message:
CheckIndex failed

Stack Trace:
java.lang.RuntimeException: CheckIndex failed
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:87)
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:73)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:131)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:137)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads(TestIndexWriterOnJRECrash.java:61)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1049)




Build Log (for compile errors):
[...truncated 7055 lines...]






[jira] Created: (SOLR-2310) DocBuilder's getTimeElapsedSince Error

2011-01-09 Thread tom liu (JIRA)
DocBuilder's getTimeElapsedSince Error
--

 Key: SOLR-2310
 URL: https://issues.apache.org/jira/browse/SOLR-2310
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 4.0
 Environment: JDK1.6
Reporter: tom liu


I have a job which runs for about 65 hours, but the dataimport?command=status 
HTTP request returns 5 hours.

in the getTimeElapsedSince method of DocBuilder:
{noformat} 
static String getTimeElapsedSince(long l) {
    l = System.currentTimeMillis() - l;
    return (l / (60000 * 60)) % 60 + ":" + (l / 60000) % 60 + ":" + (l / 1000)
        % 60 + "." + l % 1000;
}
{noformat} 

The hours computation is wrong; it should be:
{noformat} 
static String getTimeElapsedSince(long l) {
    l = System.currentTimeMillis() - l;
    return (l / (60000 * 60)) + ":" + (l / 60000) % 60 + ":" + (l / 1000)
        % 60 + "." + l % 1000;
}
{noformat} 
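To make the bug concrete, here is a self-contained sketch (hypothetical class and method names; it takes the elapsed interval in milliseconds directly instead of a start timestamp) showing why a 65-hour run is reported as 5 hours: the hours term is reduced modulo 60.

```java
public class ElapsedTimeDemo {
    // Buggy: the hours term is taken modulo 60, so 65 hours reads as 5.
    static String buggy(long elapsedMs) {
        return (elapsedMs / (60000L * 60)) % 60 + ":" + (elapsedMs / 60000) % 60
                + ":" + (elapsedMs / 1000) % 60 + "." + elapsedMs % 1000;
    }

    // Fixed: drop the % 60 on the hours term.
    static String fixed(long elapsedMs) {
        return (elapsedMs / (60000L * 60)) + ":" + (elapsedMs / 60000) % 60
                + ":" + (elapsedMs / 1000) % 60 + "." + elapsedMs % 1000;
    }

    public static void main(String[] args) {
        long elapsed = 65L * 60 * 60 * 1000; // 65 hours in milliseconds
        System.out.println(buggy(elapsed)); // 5:0:0.0
        System.out.println(fixed(elapsed)); // 65:0:0.0
    }
}
```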




Re: Lucene-trunk - Build # 1421 - Failure

2011-01-09 Thread Robert Muir
On Sun, Jan 9, 2011 at 9:40 PM, Apache Hudson Server
hud...@hudson.apache.org wrote:
 Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1421/

 1 tests failed.
 REGRESSION:  org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads

 Error Message:
 CheckIndex failed

Maybe this is specific to pulsing? I noticed it's failed 3 times with
this identical pulsing stack trace:
Lucene-trunk/1421, tests-only/3590, tests-only/3570

However, this time it failed in a nightly build (perhaps the indexes
are still available on the hudson machine if we salvage them before the
next nightly build?). They should be under lucene/build/test/N/jrecrashXXtmp/

all 3 times the stacktrace is:
test: terms, freq, prox...ERROR [Java heap space]
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.lucene.index.codecs.pulsing.PulsingPostingsWriterImpl$Position.clone(PulsingPostingsWriterImpl.java:104)
at 
org.apache.lucene.index.codecs.pulsing.PulsingPostingsWriterImpl$Document.clone(PulsingPostingsWriterImpl.java:74)
at 
org.apache.lucene.index.codecs.pulsing.PulsingPostingsReaderImpl$PulsingTermState.clone(PulsingPostingsReaderImpl.java:72)
at 
org.apache.lucene.index.codecs.pulsing.PulsingPostingsReaderImpl$PulsingDocsEnum.reset(PulsingPostingsReaderImpl.java:234)
at 
org.apache.lucene.index.codecs.pulsing.PulsingPostingsReaderImpl.docs(PulsingPostingsReaderImpl.java:189)
at 
org.apache.lucene.index.codecs.PrefixCodedTermsReader$FieldReader$SegmentTermsEnum.docs(PrefixCodedTermsReader.java:515)
at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:756)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:489)
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:83)
at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:73)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:131)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:137)
at 
org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads(TestIndexWriterOnJRECrash.java:61)




[jira] Updated: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly

2011-01-09 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2657:


Attachment: LUCENE-2657.patch

Added profiles to populate internal repositories at {{lucene/dist/maven/}} and 
{{solr/dist/maven/}} with generated artifacts.

To populate {{lucene/dist/maven/}} with POMs and binary, source and javadoc 
artifacts, run the following from the top level Lucene/Solr directory:

{code}
mvn -N -P bootstrap,deploy-to-lucene-dist-maven-repository deploy
cd lucene
mvn -DskipTests 
-Pdeploy-to-lucene-dist-maven-repository,javadocs-jar,source-jar deploy
cd ../modules
mvn -DskipTests 
-Pdeploy-to-lucene-dist-maven-repository,javadocs-jar,source-jar deploy
{code}

To populate {{solr/dist/maven/}}, run the following from the top level 
Lucene/Solr directory:

{code}
mvn -N -P bootstrap install
cd solr
mvn -DskipTests -Pdeploy-to-solr-dist-maven-repository,javadocs-jar,source-jar 
deploy
{code}

 Replace Maven POM templates with full POMs, and change documentation 
 accordingly
 

 Key: LUCENE-2657
 URL: https://issues.apache.org/jira/browse/LUCENE-2657
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Assignee: Steven Rowe
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, 
 LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, 
 LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch


 The current Maven POM templates only contain dependency information, the bare 
 bones necessary for uploading artifacts to the Maven repository.
 The full Maven POMs in the attached patch include the information necessary 
 to run a multi-module Maven build, in addition to serving the same purpose as 
 the current POM templates.
 Several dependencies are not available through public maven repositories.  A 
 profile in the top-level POM can be activated to install these dependencies 
 from the various {{lib/}} directories into your local repository.  From the 
 top-level directory:
 {code}
 mvn -N -Pbootstrap install
 {code}
 Once these non-Maven dependencies have been installed, to run all Lucene/Solr 
 tests via Maven's surefire plugin, and populate your local repository with 
 all artifacts, from the top level directory, run:
 {code}
 mvn install
 {code}
 When one Lucene/Solr module depends on another, the dependency is declared on 
 the *artifact(s)* produced by the other module and deposited in your local 
 repository, rather than on the other module's un-jarred compiler output in 
 the {{build/}} directory, so you must run {{mvn install}} on the other module 
 before its changes are visible to the module that depends on it.
 To create all the artifacts without running tests:
 {code}
 mvn -DskipTests install
 {code}
 I almost always include the {{clean}} phase when I do a build, e.g.:
 {code}
 mvn -DskipTests clean install
 {code}




[jira] Issue Comment Edited: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly

2011-01-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979496#action_12979496
 ] 

Steven Rowe edited comment on LUCENE-2657 at 1/10/11 2:38 AM:
--

Added profiles to populate internal repositories at {{lucene/dist/maven/}} and 
{{solr/dist/maven/}} with generated artifacts.

To populate {{lucene/dist/maven/}} with POMs and binary, source and javadoc 
artifacts, run the following from the top level Lucene/Solr directory:

{code}
mvn -N -P bootstrap,deploy-to-lucene-dist-maven-repository deploy
cd lucene
mvn -DskipTests 
-Pdeploy-to-lucene-dist-maven-repository,javadocs-jar,source-jar deploy
cd ../modules
mvn -DskipTests 
-Pdeploy-to-lucene-dist-maven-repository,javadocs-jar,source-jar deploy
{code}

To populate {{solr/dist/maven/}}, run the following from the top level 
Lucene/Solr directory:

{code}
mvn -N -P bootstrap install
cd solr
mvn -DskipTests -Pdeploy-to-solr-dist-maven-repository,javadocs-jar,source-jar 
deploy
{code}

  was (Author: steve_rowe):
Added profiles to populate internal repositories at {{lucene/dist/maven/}} 
and {{solr/dist/maven/}} with generated artifacts.

To populate {{lucene/dist/maven/}} with POMs and binary, source and javadoc 
artifacts, run the following from the top level Lucene/Solr directory:

{code}
mvn -N -P bootstrap,deploy-to-lucene-dist-maven-repository deploy
cd lucene
mvn -DskipTests 
-Pdeploy-to-lucene-dist-maven-repository,javadocs-jar,source-jar deploy
cd ../modules
mvn -DskipTests 
-Pdeploy-to-lucene-dist-maven-repository,javadocs-jar,source-jar deploy
{code}

To populate {{lucene/dist/solr/}}, run the following from the top level 
Lucene/Solr directory:

{code}
mvn -N -P bootstrap
cd solr
mvn -DskipTests -Pdeploy-to-solr-dist-maven-repository,javadocs-jar,source-jar 
deploy
{code}
  