Lucene-Solr-tests-only-3.x - Build # 3533 - Failure

2011-01-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3533/

1 tests failed.
FAILED:  org.apache.lucene.util.TestVersion.testFilter

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not 
reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please 
note the time in the report does not reflect the time until the VM exit.
at java.lang.Thread.run(Thread.java:636)




Build Log (for compile errors):
[...truncated 8470 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979264#action_12979264
 ] 

Lance Norskog commented on SOLR-2129:
-

bq. Don't want to at least log this? } catch (AnalysisEngineProcessException 
e) { // do nothing }

bq. I wanted the UIMA enrichment pipeline to be error-safe, but I agree it'd be 
reasonable to log the error in this case (even if I don't like logging 
exceptions in general).

Please do not hide errors in any way. Nobody reads logs. If it fails in 
production, I want to know immediately and fix it. Please just throw all 
exceptions up the stack.
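As a sketch of the behavior Lance is asking for, the hypothetical enricher below wraps the checked exception and rethrows it instead of swallowing it. The class and method names here are illustrative stand-ins, not the actual SOLR-2129 code:

```java
import java.util.Map;

// Illustrative stand-ins for the UIMA processor under discussion; these
// are not the actual SOLR-2129 classes.
class EnrichmentException extends RuntimeException {
    EnrichmentException(String msg, Throwable cause) { super(msg, cause); }
}

class UimaEnricher {
    // Instead of `catch (AnalysisEngineProcessException e) { // do nothing }`,
    // wrap the failure and throw it up the stack so it surfaces immediately.
    static void enrich(Map<String, String> doc) {
        try {
            process(doc);
        } catch (Exception e) { // stands in for AnalysisEngineProcessException
            throw new EnrichmentException(
                "UIMA pipeline failed for doc " + doc.get("id"), e);
        }
    }

    private static void process(Map<String, String> doc) throws Exception {
        if (!doc.containsKey("text")) {
            throw new Exception("document has no text field");
        }
    }
}
```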

> Provide a Solr module for dynamic metadata extraction/indexing with Apache 
> UIMA
> ---
>
> Key: SOLR-2129
> URL: https://issues.apache.org/jira/browse/SOLR-2129
> Project: Solr
>  Issue Type: New Feature
>Reporter: Tommaso Teofili
>Assignee: Robert Muir
> Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
> SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
> SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be 
> exploited when indexing documents.
> The purpose of this is to get unstructured information "inside" a document 
> and create structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which 
> triggers UIMA while indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
> (with a tokenizer and a hidden Markov model tagger), named entities, 
> language, suggested category, keywords and concepts (exploiting external 
> services from OpenCalais and AlchemyAPI). Such an implementation can be 
> easily extended by adding or selecting different UIMA analysis engines, 
> either from UIMA repositories on the web or by creating new ones from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979255#action_12979255
 ] 

Uwe Schindler commented on LUCENE-2855:
---

One thing in your patch: Lucene tests should always extend LuceneTestCase 
(which is JUnit 4).

> Contrib queryparser should not use CharSequence as Map key
> --
>
> Key: LUCENE-2855
> URL: https://issues.apache.org/jira/browse/LUCENE-2855
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0.3
>Reporter: Adriano Crestani
>Assignee: Adriano Crestani
> Fix For: 3.0.4
>
> Attachments: lucene_2855_adriano_crestani_2011_01_08.patch
>
>
> Today, contrib query parser uses Map with CharSequence keys in many different 
> places, which may lead to problems, since the CharSequence interface does not 
> enforce the implementation of hashCode and equals methods. Today, it's 
> causing a problem with the QueryTreeBuilder.setBuilder(CharSequence, 
> QueryBuilder) method, which does not work as expected.




[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979253#action_12979253
 ] 

Uwe Schindler commented on LUCENE-2855:
---

+1 to commit.

In general, one should never use interfaces as keys in maps (as long as they 
don't declare the equals and hashCode methods inside the interface).
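Uwe's point is easy to demonstrate: two CharSequences holding the same characters need not be equal as map keys, since StringBuilder, for example, inherits identity-based equals/hashCode from Object:

```java
import java.util.HashMap;
import java.util.Map;

class CharSequenceKeyDemo {
    // A String key can never be found via a StringBuilder with the same
    // characters, because StringBuilder does not override equals/hashCode.
    static String lookup() {
        Map<CharSequence, String> cache = new HashMap<>();
        cache.put("field", "builder");
        String miss = cache.get(new StringBuilder("field")); // null: identity equality
        String hit = cache.get("field");                     // "builder"
        return miss + "/" + hit;
    }
}
```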





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979252#action_12979252
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}I think segment 1 shouldn't be committed, ie. a global flush should be 
all or nothing. This means we would have to delay the commit of the segments 
until all DWPTs flushed successfully.{quote}

If a DWPT aborts during flush, we simply throw an exception; however, we still 
keep the successfully flushed segment(s).  If there's an abort on any DWPT 
during commit, then we throw away any successfully flushed segments as well.  I 
think that makes sense, i.e., all or nothing.
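The all-or-nothing commit could be sketched roughly as follows: flush each DWPT into a private list and only publish the segments once every flush has succeeded. Dwpt here is an illustrative stand-in, not the real IndexWriter internals:

```java
import java.util.ArrayList;
import java.util.List;

class CommitSketch {
    interface Dwpt {
        String flush(); // returns the flushed segment's name; may throw
    }

    // All or nothing: if any DWPT fails to flush, discard the segments
    // already flushed during this commit instead of publishing them.
    static List<String> commit(List<Dwpt> writers) {
        List<String> pending = new ArrayList<>();
        try {
            for (Dwpt w : writers) {
                pending.add(w.flush());
            }
        } catch (RuntimeException e) {
            pending.clear(); // throw away successfully flushed segments too
            throw e;
        }
        return pending; // only now do the new segments become visible
    }
}
```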

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Realtime Branch
>
> Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Updated: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Adriano Crestani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adriano Crestani updated LUCENE-2855:
-

Attachment: lucene_2855_adriano_crestani_2011_01_08.patch

Here is the fix for the problem raised at thread [1]. The patch also includes a 
junit to make sure the problem doesn't show up again.

If there are no concerns in two days, I will go ahead and commit the patch.

[1] - http://lucene.markmail.org/thread/mbb5wlxttsa6sges





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979248#action_12979248
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
I think start simple - the addDocument always happens? Ie it's never 
coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since 
flush is single threaded, there should always be a free DWPT?
{quote}

Yeah I agree.  The change I'll make then is to not have the global lock and 
return a DWPT immediately to the pool and set it to 'idle' after its flush 
completed.

{quote}
I think we should continue what we do today? Ie, if it's an 'aborting' 
exception, then the entire segment held by that DWPT is discarded? And we then 
throw this exc back to caller (and don't try to flush any other segments)?
{quote}

What I meant was the following situation: Suppose we have two DWPTs and 
IW.commit() is called.  The first DWPT finishes flushing successfully, is 
returned to the pool and idle again.  The second DWPT flush fails with an 
aborting exception.  Should the segment of the first DWPT make it into the 
index or not?  I think segment 1 shouldn't be committed, ie. a global flush 
should be all or nothing.  This means we would have to delay the commit of the 
segments until all DWPTs flushed successfully.





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979247#action_12979247
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. I think the risk is a new DWPT likely will have been created during flush, 
which'd make the returning DWPT inutile.

The DWPT will not be removed from the pool, just marked as busy during flush, 
exactly as its state is busy (or currently called "non-idle" in the code) 
during addDocument().  So no new DWPT would be created during flush if the 
maxThreadState limit was already reached.
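A toy model of that pool behavior: DWPTs are checked out (non-idle) for addDocument or flush and simply marked idle again afterwards, so a flush never causes a new DWPT to be created once the limit is reached. This is illustrative only, not the real DocumentsWriter code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class DwptPool {
    private final Deque<String> idle = new ArrayDeque<>();
    private final int maxThreadStates;
    private int created = 0;

    DwptPool(int maxThreadStates) { this.maxThreadStates = maxThreadStates; }

    // Check out an idle DWPT; only create a new one below the limit.
    synchronized String acquire() {
        if (!idle.isEmpty()) return idle.pollFirst();
        if (created < maxThreadStates) return "dwpt" + (created++);
        return null; // limit reached and all DWPTs busy: caller must wait
    }

    // A DWPT stays in the pool during flush; it is simply returned
    // (marked idle) once its flush completes.
    synchronized void release(String dwpt) { idle.addFirst(dwpt); }

    synchronized int createdCount() { return created; }
}
```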








[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979243#action_12979243
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

To further clarify, we also no longer have global aborts?  Each abort only 
applies to an individual DWPT?  





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979229#action_12979229
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}the "flush the world" case? (Ie the app calls IW.commit or
IW.getReader). In this case the thread just one by one pulls all DWPTs that
have any indexed docs out of production, flushes them, clears them, and returns
them to production?{quote}

The 2 cases are: A) flush every DWPT sequentially (aka flush the world) and 
B) flush by RAM usage when adding docs or deleting. A is clear! I think with B
we're saying even if the calling thread is bound to DWPT #1, if DWPT #2 is
greater in size and the aggregate RAM usage exceeds the max, using the calling
thread, we take DWPT #2 out of production, flush it, and return it?

{quote}The behavior of calling IW.close while other threads are still adding
docs has never been defined (and, shouldn't be) except that we won't corrupt
your index, and we'll get all docs indexed before .close was called, committed.
So I think even for this case we don't need a global lock.{quote}

Great, that simplifies and clarifies that we do not require a global lock.

{quote}But, you're right: maybe we should sometimes "prune" DWPTs. Or simply
stop recycling any RAM, so that a just-flushed DWPT is an empty shell.{quote}

I'm not sure how we'd prune; typically object pools have a separate eviction
thread, and I think that's going overboard? Maybe we can simply throw out the
DWPT, and put recycling byte[]s and/or pooling DWPTs back in later if necessary?







[jira] Commented: (SOLR-236) Field collapsing

2011-01-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979193#action_12979193
 ] 

Samuel García Martínez commented on SOLR-236:
-

The NPE noticed by Shekhar Nirkhe is caused by some errors in the filter query 
cache and the signature key that is used to store cached results. 

To sum up, if you perform a filter query and then perform the same query using 
a collapse field, the query result is already cached, but not cached in the 
form this component expects. As a result, the DocSet implementation is not the 
expected one and, since the result comes from the cache, the DocumentCollector 
is never executed.

As soon as I can, I'll post a patch that caches results under a combined key, 
formed by the collector class and the query itself.
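The combined key described above could be shaped roughly like this: a small value object over the collector class and the query, with equals/hashCode covering both, so plain and collapsed results for the same query occupy distinct cache entries. The names are hypothetical, not the eventual patch:

```java
import java.util.Objects;

// Hypothetical composite cache key: plain and collapsed lookups for the
// same query string map to different cache entries.
final class FilterCacheKey {
    private final Class<?> collectorClass;
    private final String query;

    FilterCacheKey(Class<?> collectorClass, String query) {
        this.collectorClass = collectorClass;
        this.query = query;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof FilterCacheKey)) return false;
        FilterCacheKey k = (FilterCacheKey) o;
        return collectorClass.equals(k.collectorClass) && query.equals(k.query);
    }

    @Override public int hashCode() {
        return Objects.hash(collectorClass, query);
    }
}
```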

Colbenson - Findability Experts 
http://www.colbenson.es/



> Field collapsing
> 
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Emmanuel Keller
>Assignee: Shalin Shekhar Mangar
> Fix For: Next
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
> field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
> quasidistributed.additional.patch, 
> SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, 
> SOLR-236-distinctFacet.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
> SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
> SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)




[jira] Updated: (LUCENE-2829) improve termquery "pk lookup" performance

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2829:
---

Attachment: LUCENE-2829.patch

New patch.  I added VirtualMethods to Sim to make sure Sim subclasses that 
don't override the idfExplain variant that takes docFreq are still called.

> improve termquery "pk lookup" performance
> -
>
> Key: LUCENE-2829
> URL: https://issues.apache.org/jira/browse/LUCENE-2829
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Robert Muir
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2829.patch, LUCENE-2829.patch, LUCENE-2829.patch
>
>
> For things that are like primary keys and don't exist in some segments (worst 
> case is primary/unique key that only exists in 1)
> we do wasted seeks.
> While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
> not sure we could ever backport that to 3.1, for example.
> This is a simpler solution here just to solve this one problem in 
> termquery... we could just revert it in trunk when we resolve LUCENE-2694,
> but I don't think we should leave things as they are in 3.x




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979189#action_12979189
 ] 

Michael McCandless commented on LUCENE-2324:


bq. The proposed change is simply that the thread calling add doc will flush its 
DWPT if needed, take it offline while doing so, and return it when completed.

Wait -- this is the "addDocument" case right?  (I thought we were still talking 
about the "flush the world" case...).

bq.  I think the risk is a new DWPT likely will have been created during flush, 
which'd make the returning DWPT inutile?

A new DWPT will have been created only if more than one thread is indexing docs 
right?  In which case this is fine?  Ie the old DWPT (just flushed) will just 
go back into rotation, and when another thread comes in it can take it?

But, you're right: maybe we should sometimes "prune" DWPTs.  Or simply stop 
recycling any RAM, so that a just-flushed DWPT is an empty shell.

bq. However I think we may still need the global lock for close, eg, today 
we're preventing the user from adding docs during close, after this issue is 
merged that behavior would change?

Well, the threads still adding docs will hit AlreadyClosedException?  (But, 
that's just "best effort").  The behavior of calling IW.close while other 
threads are still adding docs has never been defined (and, shouldn't be) except 
that we won't corrupt your index, and we'll get all docs indexed before .close 
was called, committed.  So I think even for this case we don't need a global 
lock.





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979190#action_12979190
 ] 

Michael McCandless commented on LUCENE-2324:


{quote}
And there's the case of the thread calling flush doesn't yet have a DWPT, it's 
going to need to get one assigned to it, however the one assigned may not be 
the max ram consumer. What'll we do then? If the user explicitly called flush 
we can a) do nothing b) flush (the max ram consumer) thread's DWPT, however 
that gets hairy with wait notifies (almost like the global lock?).
{quote}

Wait -- why would the thread calling flush need to have a DWPT assigned to it?  
You're talking about the "flush the world" case?  (Ie the app calls IW.commit 
or IW.getReader).  In this case the thread just one by one pulls all DWPTs that 
have any indexed docs out of production, flushes them, clears them, and returns 
them to production?
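That "flush the world" loop might look like the following sketch, under the assumption of a simple pool of DWPTs. Dwpt and its methods are illustrative stand-ins, not the real classes:

```java
import java.util.ArrayList;
import java.util.List;

class FlushTheWorld {
    interface Dwpt {
        boolean hasDocs();
        String flushAndClear(); // flush pending docs, reset buffers, return segment
    }

    // One thread sequentially takes each DWPT with indexed docs out of
    // production, flushes and clears it, then returns it to the pool.
    static List<String> flushAll(List<Dwpt> pool) {
        List<String> segments = new ArrayList<>();
        for (Dwpt w : pool) {
            if (w.hasDocs()) {
                segments.add(w.flushAndClear());
            }
        }
        return segments;
    }
}
```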





[jira] Commented: (SOLR-2288) clean up compiler warnings

2011-01-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979188#action_12979188
 ] 

Hoss Man commented on SOLR-2288:


Reminder to self: feedback from rmuir on the mailing list to replace the static 
EMPTY set/map refs w/type info that I added with direct usage like this...

-  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
+  this(fieldName, fieldType, analyzer, Collections.emptySet());
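Both approaches rest on the same javadoc guarantee, which is easy to check: Collections.emptySet() returns one shared immutable instance regardless of the inferred type parameter:

```java
import java.util.Collections;
import java.util.Set;

class EmptySetDemo {
    // A typed static ref like the one in the commit under discussion.
    static final Set<String> EMPTY_STRING_SET = Collections.emptySet();

    // Every call hands back the same singleton; only the compile-time
    // type parameter differs.
    static boolean sameInstance() {
        Set<Integer> ints = Collections.emptySet();
        return (Object) EMPTY_STRING_SET == (Object) ints;
    }
}
```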


> clean up compiler warnings
> --
>
> Key: SOLR-2288
> URL: https://issues.apache.org/jira/browse/SOLR-2288
> Project: Solr
>  Issue Type: Improvement
>Reporter: Hoss Man
>Assignee: Hoss Man
> Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch
>
>
> there's a ton of compiler warnings in the solr tree, and it's high time we 
> cleaned them up, or annotated them to be suppressed so we can start making a 
> bigger stink when/if code is added to the tree that produces warnings (we'll 
> never do a good job of noticing new warnings when we have ~175 existing ones)
> Using this issue to track related commits
> The goal of this issue should not be to change any functionality or APIs, 
> just deal with each warning in the most appropriate way:
> * fix generic declarations
> * add a SuppressWarnings annotation if it's safe in context
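
The two cleanup strategies listed above can be sketched in isolation (class and method names here are illustrative, not taken from the Solr patch):

```java
import java.util.ArrayList;
import java.util.List;

public class WarningCleanupDemo {
    // Strategy 1: fix the generic declaration. A raw List here would
    // produce an "unchecked" warning at the add() call; typing it removes
    // the warning outright.
    static List<String> fixedDeclaration() {
        List<String> names = new ArrayList<String>();
        names.add("solr");
        return names;
    }

    // Strategy 2: suppress, but only where the cast is provably safe in
    // context (e.g. a legacy API that hands back a raw collection).
    @SuppressWarnings("unchecked")
    static List<String> suppressed(Object rawFromLegacyApi) {
        return (List<String>) rawFromLegacyApi;
    }

    public static void main(String[] args) {
        System.out.println(fixedDeclaration().get(0)); // solr
        System.out.println(suppressed(fixedDeclaration()).size()); // 1
    }
}
```

Fixing the declaration is preferable when possible; suppression is the fallback for warnings that cannot be eliminated without API changes.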




Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/

2011-01-08 Thread Chris Hostetter

: > +  public static final Set<String> EMPTY_STRING_SET = 
Collections.emptySet();
: > +
: 
: I don't know about this commit... i see a lot of EMPTY set's and maps
: defined statically here.
...
: I think we should be using the Collection methods, for example on your
: first file:

Hmmm... I am using the Collections method; it's the same set/map in each 
case, I'm just creating static refs to them with the type information.

My reading of the javadocs was that the implementation of emptySet() was 
going to just return the same immutable instance every time anyway, so 
there didn't seem to be any functional diff in reusing it like this -- it 
seemed like the natural way to migrate from using Collections.EMPTY_SET: 
use our own local ref of the same object w/ type info.

: -  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
: +  this(fieldName, fieldType, analyzer, Collections.<String>emptySet());

Ah... see, I didn't even know that syntax was valid for binding the generic on 
a static method.  I'd only ever done the binding in the assignment.

Yeah, sure -- I'll make a note to myself to go back and clean those up.
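
For reference, the two idioms under discussion can be shown side by side (a standalone sketch; EMPTY_STRING_SET mirrors the commit, the rest is illustrative):

```java
import java.util.Collections;
import java.util.Set;

public class EmptySetDemo {
    // The commit's approach: a typed static ref to the shared singleton.
    static final Set<String> EMPTY_STRING_SET = Collections.emptySet();

    // rmuir's suggestion: bind the generic on the static method itself
    // with an explicit type witness -- no extra field needed.
    static Set<String> viaWitness() {
        return Collections.<String>emptySet();
    }

    public static void main(String[] args) {
        // Both refer to the same immutable singleton, so there is no
        // functional difference -- only where the type info lives.
        System.out.println(EMPTY_STRING_SET == viaWitness()); // true
        System.out.println(viaWitness().isEmpty());           // true
    }
}
```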

-Hoss


[jira] Resolved: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2854.


Resolution: Fixed

> Deprecate SimilarityDelegator and Similarity.lengthNorm
> ---
>
> Key: LUCENE-2854
> URL: https://issues.apache.org/jira/browse/LUCENE-2854
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch
>
>
> SimilarityDelegator is a back compat trap (see LUCENE-2828).
> Apps should just [statically] subclass Sim or DefaultSim; if they really need 
> "runtime subclassing" then they can make their own app-level delegator.
> Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
> in favor of computeNorm.




[jira] Commented: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979178#action_12979178
 ] 

Michael McCandless commented on LUCENE-2828:


We won't fix this for 3.x or 4.0, since we've deprecated SimilarityDelegator, 
and forced hard cutover from Sim.lengthNorm -> Sim.computeNorm (LUCENE-2854).

But I'll leave this open in case we do another 2.9/3.0 release.

> SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
> --
>
> Key: LUCENE-2828
> URL: https://issues.apache.org/jira/browse/LUCENE-2828
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3
>Reporter: Michael McCandless
> Fix For: 2.9.5, 3.0.4
>
> Attachments: LUCENE-2828.patch
>
>
> In LUCENE-1420, we added Similarity.computeNorm to let the norm computation 
> have access to the raw information (length, boost, etc.).
> But this class broke back compat with SimilarityDelegator.  We did add 
> computeNorm there, but its impl just forwards to the delegee's computeNorm. 
>  In the case where a subclass of SimilarityDelegator overrides lengthNorm, 
> that method will no longer be invoked.
> Not quite sure how to fix this since, somehow, we have to determine whether 
> the delegee's impl of computeNorm should be favored over the subclass's impl 
> of the "legacy" lengthNorm.




[jira] Updated: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2828:
---

Fix Version/s: 3.0.4
   2.9.5





[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979174#action_12979174
 ] 

Michael McCandless commented on LUCENE-2854:


bq. Is it possible to remove this method Query.getSimilarity also? I don't 
understand why we need this method!

I would love to!  But I think that's for another day...

I looked into this and got stuck with BoostingQuery, which rewrites to an anon 
subclass of BQ that overrides getSimilarity in order to, in turn, override its 
coord method.  Rather twisted... if we can do this differently I think we could 
remove Query.getSimilarity.
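
The "twisted" rewrite described here looks roughly like this (self-contained stand-ins for the 3.x-era classes; names echo Lucene's but this is not actual Lucene code):

```java
public class GetSimilarityDemo {
    // Minimal stand-ins for the 3.x-era API under discussion.
    static class Similarity {
        float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;
        }
    }

    static class Query {
        Similarity getSimilarity(Similarity searcherDefault) {
            return searcherDefault;
        }
    }

    static class BooleanQueryStub extends Query {}

    public static void main(String[] args) {
        // BoostingQuery-style trick: an anonymous subclass overrides
        // getSimilarity, which returns a Similarity that in turn
        // overrides coord() -- the indirection that makes removing
        // Query.getSimilarity hard.
        Query rewritten = new BooleanQueryStub() {
            @Override
            Similarity getSimilarity(Similarity searcherDefault) {
                return new Similarity() {
                    @Override
                    float coord(int overlap, int maxOverlap) {
                        return 1.0f; // neutralize the coord factor
                    }
                };
            }
        };
        System.out.println(rewritten.getSimilarity(new Similarity()).coord(2, 5)); // 1.0
    }
}
```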





[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979164#action_12979164
 ] 

Robert Muir commented on LUCENE-1260:
-

bq. Is there no way to remove this stupid static default and deprecate 
Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for 
the case of NormsWriter?

I think this is totally what we should try to do in trunk, especially after 
LUCENE-2846.

In this case, I want to fix the issue in a backwards-compatible way for Lucene 
3.x.
The warning is a little crazy, I know; really, people shouldn't rely upon their 
encoder being used for *fake norms*.
But I think it's fair to document the corner case, just because it's not easily 
fixable in 3.x.

For trunk, here is what I suggest:
* LUCENE-2846: remove all uses of fake norms. We never fill fake norms at all, 
once we fix this issue. If you have a non-atomic reader with two segments, and 
one has no norms, then the whole norms[] should be null. This is consistent 
with omitTF. So, for example, MultiNorms would never create fake norms.
* LUCENE-2854: Mike is working on some issues, I think, where BooleanQuery uses 
this static or some other silliness with Similarity; I think we can clean that 
up there.
* Finally, at this point, I would like to remove 
Similarity.getDefault/setDefault altogether. I would prefer instead that 
IndexSearcher have a single DefaultSimilarity that is the default value if you 
don't provide one, and likewise with IndexWriterConfig.
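
The last bullet can be sketched with stand-in classes (names simplified; this is a hypothetical shape, not actual Lucene API):

```java
public class DefaultSimDemo {
    static class Similarity {}
    static class DefaultSimilarity extends Similarity {}

    // The proposal: no global Similarity.setDefault. Each searcher carries
    // its own instance, falling back to DefaultSimilarity only when the
    // caller doesn't provide one.
    static class IndexSearcherStub {
        final Similarity sim;
        IndexSearcherStub() { this(new DefaultSimilarity()); }
        IndexSearcherStub(Similarity sim) { this.sim = sim; }
    }

    public static void main(String[] args) {
        // Default when none is provided...
        System.out.println(new IndexSearcherStub().sim instanceof DefaultSimilarity); // true
        // ...explicit instance otherwise; no mutable static state involved.
        Similarity custom = new Similarity();
        System.out.println(new IndexSearcherStub(custom).sim == custom); // true
    }
}
```

The same pattern would apply to IndexWriterConfig: inject per-instance, default locally.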


> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
>Reporter: Karl Wettin
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
> Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
> LUCENE-1260_defaultsim.patch
>
>
> The static span and resolution of the 8 bit norms codec might not fit with 
> all applications. 
> My use case requires that 100f-250f is discretized in 60 bags instead of the 
> default.. 10?




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979162#action_12979162
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

And there's the case where the thread calling flush doesn't yet have a DWPT: it's 
going to need to get one assigned to it, but the one assigned may not be 
the max RAM consumer.  What'll we do then?  If the user explicitly called flush, 
we can a) do nothing, or b) flush the max-RAM-consumer thread's DWPT; however, 
that gets hairy with wait/notifies (almost like the global lock?).





[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979160#action_12979160
 ] 

Uwe Schindler commented on LUCENE-1260:
---

bq. Here's a patch for the general case, and it also adds a warning that you 
should set your similarity with Similarity.setDefault, especially if you omit 
norms. 

Is there no way to remove this stupid static default and deprecate 
Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for 
the case of NormsWriter?





[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1260:


Attachment: LUCENE-1260_defaultsim.patch

Here's a patch for the general case, and it also adds a warning
that you should set your similarity with Similarity.setDefault, especially if 
you omit norms.

We can backport this to 3.x

The other cases involve fake norms, which I think we should completely remove 
in trunk with LUCENE-2846; then there is no longer an issue and we can remove 
the warning in trunk.






[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979149#action_12979149
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}As soon as a DWPT is pulled from production for flushing, it loses all 
thread affinity and becomes unavailable until its flush finishes. When a thread 
needs a DWPT, it tries to pick the one it last had (affinity) but if that one's 
busy, it picks a new one. If none are available but we are below our max DWPT 
count, it spins up a new one?{quote}

Right.

{quote}With the proposed approach, all docs added (or in the process of being 
added) will make it into the flushed segments once the flush returns; newly 
added docs after the flush call started may or may not make it. But this is 
fine? I mean, if the app has stronger requirements then it should externally 
sync?{quote}

Ok.  The proposed change is simply that the thread calling add doc will flush its 
DWPT if needed, take it offline while doing so, and return it when completed.  
I think the risk is that a new DWPT will likely have been created during the 
flush, which'd make the returning DWPT useless?

{quote}Why would we lose them? Wouldn't that DWPT just go back into rotation 
once the flush is done?{quote}

Yes, we just need to change the existing code a bit then.

However, I think we may still need the global lock for close; e.g., today we're 
preventing the user from adding docs during close, and after this issue is merged 
that behavior would change?





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979146#action_12979146
 ] 

Michael McCandless commented on LUCENE-2324:


{quote}
What if the user wants a guaranteed hard flush of all state up to the point of
the flush call (won't they want this sometimes with getReader)? If we're
flushing sequentially (without pausing all threads) we're removing that? Maybe
we'll need to give the option of global lock/stop or sequential flush?
{quote}

What's a "hard flush"?

With the proposed approach, all docs added (or in the process of being added) 
will make it into the flushed segments once the flush returns; newly added docs 
after the flush call started may or may not make it.  But this is fine?  I mean, 
if the app has stronger requirements then it should externally sync?

bq. Also I think we need to clear the thread bindings of a DWPT just prior to 
the flush of the DWPT? 

Right.

As soon as a DWPT is pulled from production for flushing, it loses all thread 
affinity and becomes unavailable until its flush finishes.  When a thread needs 
a DWPT, it tries to pick the one it last had (affinity) but if that one's busy, 
it picks a new one.  If none are available but we are below our max DWPT count, 
it spins up a new one?

{quote}
Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
recycling?
{quote}

Why would we lose them?  Wouldn't that DWPT just go back into rotation once the 
flush is done?
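
The checkout policy described here can be sketched as a small pool (class and method names are hypothetical, not from the patch):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class DwptPool {
    public static class Dwpt {} // stands in for a per-thread RAM segment

    private final ConcurrentLinkedQueue<Dwpt> idle = new ConcurrentLinkedQueue<Dwpt>();
    private final AtomicInteger created = new AtomicInteger();
    private final int maxDwpts;
    // Affinity: the DWPT this thread used last, preferred on checkout.
    private final ThreadLocal<Dwpt> lastUsed = new ThreadLocal<Dwpt>();

    public DwptPool(int maxDwpts) { this.maxDwpts = maxDwpts; }

    /** Returns a DWPT, or null if all are busy/flushing and we're at the cap. */
    public Dwpt checkout() {
        Dwpt preferred = lastUsed.get();
        // A DWPT pulled out for flushing is simply absent from the idle
        // queue, so it cannot be picked until its flush finishes and it
        // is released back into rotation.
        if (preferred != null && idle.remove(preferred)) return preferred;
        Dwpt d = idle.poll();
        if (d == null) {
            // Below the cap: spin up a new one; otherwise caller waits.
            if (created.incrementAndGet() <= maxDwpts) d = new Dwpt();
            else created.decrementAndGet();
        }
        if (d != null) lastUsed.set(d);
        return d;
    }

    /** After a flush the DWPT goes back into rotation, keeping its buffers. */
    public void release(Dwpt d) { idle.add(d); }
}
```

Because `release` simply re-enqueues the instance, the recycled byte[] buffers inside a flushed DWPT are not lost; the next checkout (often by the same thread, via affinity) reuses them.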





[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979144#action_12979144
 ] 

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:


Changes are:
- Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
- Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
- Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache "core aware", each core has now an AEProvider for each analysis engine's 
path.
- The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

  was (Author: teofili):
Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache "core aware", each core has now an AEProvider for each analysis engine's 
path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.
  
> Provide a Solr module for dynamic metadata extraction/indexing with Apache 
> UIMA
> ---
>
> Key: SOLR-2129
> URL: https://issues.apache.org/jira/browse/SOLR-2129
> Project: Solr
>  Issue Type: New Feature
>Reporter: Tommaso Teofili
>Assignee: Robert Muir
> Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
> SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
> SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be 
> exploited when indexing documents.
> The purpose of this is to get unstructured information "inside" a document 
> and create structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which 
> triggers UIMA while indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
> (with a tokenizer and a hidden Markov model tagger), named entities, 
> language, suggested category, keywords and concepts (exploiting external 
> services from OpenCalais and AlchemyAPI). Such an implementation can be 
> easily extended adding or selecting different UIMA analysis engines, both 
> from UIMA repositories on the web or creating new ones from scratch.




[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979144#action_12979144
 ] 

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:


Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache "core aware", each core has now an AEProvider for each analysis engine's 
path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

  was (Author: teofili):
Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache "core aware", each core has now an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
  




[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version-5.patch

Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache "core aware", each core has now an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
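
A minimal sketch of what a synchronized, core-aware provider cache might look like (the real patch's internals may differ; AEProvider here is a stub, and the key scheme is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

public class AEProviderFactoryDemo {
    public interface AEProvider {} // stands in for the UIMA engine provider

    private final Map<String, AEProvider> cache = new HashMap<String, AEProvider>();

    // Synchronized so concurrent update requests across cores cannot race
    // while populating the cache; one cached provider per
    // (core name, analysis engine path) pair makes the cache "core aware".
    public synchronized AEProvider getAEProvider(String coreName, String aePath) {
        String key = coreName + "|" + aePath;
        AEProvider provider = cache.get(key);
        if (provider == null) {
            provider = new AEProvider() {}; // would build the real provider here
            cache.put(key, provider);
        }
        return provider;
    }
}
```

Keying on both core name and engine path is what lets two cores use the same descriptor path without sharing (and racing on) one provider instance.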





Lucene-Solr-tests-only-3.x - Build # 3511 - Failure

2011-01-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3511/

1 tests failed.
REGRESSION:  org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety

Error Message:
unable to create new native thread

Stack Trace:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:614)
at 
org.apache.lucene.search.TestThreadSafe.doTest(TestThreadSafe.java:133)
at 
org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety(TestThreadSafe.java:152)
at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:255)




Build Log (for compile errors):
[...truncated 8566 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979141#action_12979141
 ] 

Robert Muir commented on LUCENE-2854:
-

Is it possible to remove this method Query.getSimilarity also? I don't 
understand why we need this method!

{noformat}
  /** Expert: Returns the Similarity implementation to be used for this query.
   * Subclasses may override this method to specify their own Similarity
   * implementation, perhaps one that delegates through that of the Searcher.
   * By default the Searcher's Similarity implementation is returned.*/
{noformat}

> Deprecate SimilarityDelegator and Similarity.lengthNorm
> ---
>
> Key: LUCENE-2854
> URL: https://issues.apache.org/jira/browse/LUCENE-2854
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch
>
>
> SimilarityDelegator is a back compat trap (see LUCENE-2828).
> Apps should just [statically] subclass Sim or DefaultSim; if they really need 
> "runtime subclassing" then they can make their own app-level delegator.
> Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
> in favor of computeNorm.




[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2854:


Attachment: LUCENE-2854_fuzzylikethis.patch

here is the patch for fuzzylikethis for trunk... so you can remove the 
delegator completely in trunk.






[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979139#action_12979139
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Also, don't we need the global lock for commit/close?

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Realtime Branch
>
> Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




RE: LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright

>>
Nope - wasn't me that added the license stuff into NOTICE.txt ;-)
But, including Jetty's NOTICE seems appropriate for our NOTICE.  It's
just the license parts of the HSQLDB and SLF4J that should be moved to
LICENSE.txt
<<

The NOTICE text is actually different from the LICENSE text for HSQLDB, which 
is why I thought it must have come from an HSQLDB NOTICE file.

Karl





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979138#action_12979138
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}So all that's guaranteed after the global flush() returns is that all
state present prior to when flush() is invoked, is moved to disk. Ie if addDocs
are still happening concurrently then the DWPTs will start filling up again
even while the "global flush" runs. That's fine.{quote}

What if the user wants a guaranteed hard flush of all state up to the point of
the flush call (won't they want this sometimes with getReader)? If we're
flushing sequentially (without pausing all threads) we're removing that? Maybe
we'll need to give the option of global lock/stop or sequential flush?

Also I think we need to clear the thread bindings of a DWPT just prior to the
flush of the DWPT? Otherwise (when multiple threads are mapped to a single
DWPT) the other threads will wait on the [main] DWPT flush when they should be
spinning up a new DWPT? 

Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
recycling? Maybe we need to and share and sync the byte[] pooling between DWPTs
or will that noticeably affect indexing performance? 





Re: LICENSE/NOTICE file contents

2011-01-08 Thread Robert Muir
On Sat, Jan 8, 2011 at 10:06 AM, Yonik Seeley
 wrote:

> There also wasn't any business about "and then add _nothing_ unless
> you can find explicit policy documented
> somewhere in the ASF that says it is required."  I was following
> examples from other projects and any docs I could find at the time,
> but this was back in '06.
>

Not sure there is now either, this is likely just someone's opinion.




Re: LICENSE/NOTICE file contents

2011-01-08 Thread Yonik Seeley
On Sat, Jan 8, 2011 at 8:10 AM,   wrote:
> From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff.  
> Yonik, do you remember why the HSQLDB and Jetty notice text was included in 
> Solr's NOTICE.txt?

Nope - wasn't me that added the license stuff into NOTICE.txt ;-)
But, including Jetty's NOTICE seems appropriate for our NOTICE.  It's
just the license parts of the HSQLDB and SLF4J that should be moved to
LICENSE.txt

There also wasn't any business about "and then add _nothing_ unless
you can find explicit policy documented
somewhere in the ASF that says it is required."  I was following
examples from other projects and any docs I could find at the time,
but this was back in '06.

-Yonik
http://www.lucidimagination.com




Re: LICENSE/NOTICE file contents

2011-01-08 Thread Grant Ingersoll
Because they are shipped with Solr.  I don't see why it hurts to give people 
information about what's in the download.


On Jan 8, 2011, at 8:10 AM,   
wrote:

> From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff.  
> Yonik, do you remember why the HSQLDB and Jetty notice text was included in 
> Solr's NOTICE.txt?  The incubator won't release ManifoldCF until we answer 
> this question. ;-)
> 
> Karl
> 
> 
> From: ext Robert Muir [rcm...@gmail.com]
> Sent: Saturday, January 08, 2011 7:11 AM
> To: dev@lucene.apache.org
> Subject: Re: LICENSE/NOTICE file contents
> 
> You are probably right... the LICENSE.txt also contains many instances
> of incorrect capitalization, I noticed that all versions of this
> file I can find anywhere have this problem :)
> 
> On Sat, Jan 8, 2011 at 6:14 AM,   wrote:
>> This list might be interested to know that the current Solr LICENSE and 
>> NOTICE file contents are not Apache standard.  The ManifoldCF project based 
>> its LICENSE and NOTICE files on the Solr ones and got the following icy 
>> reception in the incubator:
>> 
 
>> The NOTICE file is still incorrect and includes a lot of unnecessary
>> stuff. Understanding how to do releases with the correct legal files
>> is one of the important parts of incubation and as this is the first
>> release for the poddling i think this needs to be sorted out.
>> 
>> For the NOTICE file, start with the following text (between the ---'s):
>> 
>> ---
>> Apache ManifestCF
>> Copyright 2010 The Apache Software Foundation
>> 
>> This product includes software developed by
>> The Apache Software Foundation (http://www.apache.org/).
>> ---
>> 
>> and then add _nothing_ unless you can find explicit policy documented
>> somewhere in the ASF that says it is required. If someone wants to add
>> something ask for the URL where the requirement is documented. The
>> NOTICE file should only include required notices, the other text thats
>> in the current NOTICE file could go in a README file, see
>> http://www.apache.org/legal/src-headers.html#notice
>> 
>> For the LICENSE file, it should start with the AL as the current one
>> does, and then include the text for all the other licenses used in the
>> distribution. Those license that are currently in the NOTICE file
>> should be moved to the LICENSE file and then you need to verify that
>> all the 3rd party dependencies in the src and binary distributions are
>> also in the LICENSE files of those distributions.
>> 
>> <<
>> 
>> Our NOTICE includes the following, which was taken from Solr (because we 
>> have a similar dependency).  I'd like to know whether it is a valid thing to 
>> include, and where it says that "somewhere in Apache":
>> 
 
>> =
>> == Jetty Notice==
>> =
>> ==
>> Jetty Web Container
>> Copyright 1995-2006 Mort Bay Consulting Pty Ltd
>> ==
>> 
>> This product includes some software developed at The Apache Software
>> Foundation (http://www.apache.org/).
>> 
>> The javax.servlet package used by Jetty is copyright
>> Sun Microsystems, Inc and Apache Software Foundation. It is
>> distributed under the Common Development and Distribution License.
>> You can obtain a copy of the license at
>> https://glassfish.dev.java.net/public/CDDLv1.0.html.
>> 
>> The UnixCrypt.java code ~Implements the one way cryptography used by
>> Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
>> modified April 2001  by Iris Van den Broeke, Daniel Deville.
>> 
>> The default JSP implementation is provided by the Glassfish JSP engine
>> from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
>> Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.
>> 
>> Some portions of the code are Copyright:
>> 2006 Tim Vernum
>> 1999 Jason Gilbert.
>> 
>> The jboss integration module contains some LGPL code.
>> 
>> =
>> == HSQLDB Notice   ==
>> =
>> 
>> For content, code, and products originally developed by Thomas Mueller and 
>> the Hypersonic SQL Group:
>> 
>> Copyright (c) 1995-2000 by the Hypersonic SQL Group.
>> All rights reserved.
>> 
>> Redistribution and use in source and binary forms, with or without
>> modification, are permitted provided that the following conditions are met:
>> 
>> Redistributions of source code must retain the above copyright notice, this
>> list of conditions and the following disclaimer.
>> 
>> Redistributions in binary form must reproduce the 

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979129#action_12979129
 ] 

Michael McCandless commented on LUCENE-2324:


bq. I guess we don't really need the global lock. A thread performing the 
"global flush" could still acquire each thread state before it starts flushing, 
but return a threadState to the pool once that particular threadState is done 
flushing?

Good question... we could (in theory) also flush them concurrently?  But, since 
we don't "own" the threads in IW, we can't easily do that, so I think no global 
lock, go through all DWPTs w/ current thread and flush, sequentially?  So all 
that's guaranteed after the global flush() returns is that all state present 
prior to when flush() is invoked, is moved to disk.  Ie if addDocs are still 
happening concurrently then the DWPTs will start filling up again even while 
the "global flush" runs.  That's fine.

{quote}

A related question is: Do we want to piggyback on multiple threads when a 
global flush happens? Eg. Thread 1 called commit, Thread 2 shortly afterwards 
addDocument(). When should addDocument() happen? 
a) After all DWPTs finished flushing? 
b) After at least one DWPT finished flushing and is available again?
c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1?

a) is currently implemented, but I think not really what we want.
b) is probably best for RT, because it means the lowest indexing latency for 
the new document to be added.
c) probably means the best overall throughput (depending even on hardware like 
disk speed, etc)
{quote}

I think start simple -- the addDocument always happens?  Ie it's never 
coordinated w/ the ongoing flush.  It picks a free DWPT like normal, and since 
flush is single threaded, there should always be a free DWPT?

Longer term c) would be great, or, if IW has an ES then it'd send multiple 
flush jobs to the ES.

{quote}
For whatever option we pick, we'll have to carefully think about error 
handling. It's quite straightforward for a) (just commit all flushed segments 
to SegmentInfos when the global flush completed succesfully). But for b) and c) 
it's unclear what should happen if a DWPT flush fails after some completed 
already successfully before.
{quote}

I think we should continue what we do today?  Ie, if it's an 'aborting' 
exception, then the entire segment held by that DWPT is discarded?  And we then 
throw this exc back to caller (and don't try to flush any other segments)?
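The sequential "global flush" behavior discussed in this thread can be sketched with a minimal stand-in: the calling thread walks the per-thread writers one at a time, while addDocument() keeps picking any writer not currently being flushed. Names and structure are illustrative only, not the actual IndexWriter/DWPT code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the DWPT design: a "global flush" visits the
// per-thread states sequentially; concurrent adds go to any free state,
// so only state present before the flush call is guaranteed on disk after.
class DwptFlushSketch {

    static class ThreadStateSketch {
        final List<String> buffered = new ArrayList<>();
        boolean flushing = false;
    }

    final List<ThreadStateSketch> states = new ArrayList<>();
    final List<String> onDisk = new ArrayList<>();

    DwptFlushSketch(int n) {
        for (int i = 0; i < n; i++) states.add(new ThreadStateSketch());
    }

    // Pick a free (non-flushing) state; since flush is single threaded,
    // at most one state is busy at a time, so a free one always exists.
    synchronized void addDocument(String doc) {
        for (ThreadStateSketch s : states) {
            if (!s.flushing) { s.buffered.add(doc); return; }
        }
        throw new IllegalStateException("no free thread state");
    }

    // Sequential global flush: snapshot and write each state in turn.
    void globalFlush() {
        for (ThreadStateSketch s : states) {
            List<String> segment;
            synchronized (this) {
                s.flushing = true;
                segment = new ArrayList<>(s.buffered);
                s.buffered.clear();
            }
            onDisk.addAll(segment);  // "write the segment"
            synchronized (this) { s.flushing = false; }
        }
    }
}
```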





[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979128#action_12979128
 ] 

Michael McCandless commented on LUCENE-2854:


The above patch applies to 3.x

For trunk I plan to remove SimliarityDelegator from core, and move it 
(deprecated) into contrib/queries/... (private to FuzzyLikeThisQ).  At some 
point [later] we can fix FuzzyLikeThisQ to not use it...

> Deprecate SimilarityDelegator and Similarity.lengthNorm
> ---
>
> Key: LUCENE-2854
> URL: https://issues.apache.org/jira/browse/LUCENE-2854
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2854.patch
>
>
> SimilarityDelegator is a back compat trap (see LUCENE-2828).
> Apps should just [statically] subclass Sim or DefaultSim; if they really need 
> "runtime subclassing" then they can make their own app-level delegator.
> Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
> in favor of computeNorm.




Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

2011-01-08 Thread Grant Ingersoll
The weird thing is, all of our collectors, IMO, are optimized for the 
non-paging scenario, when I would venture to guess that the very large majority 
of users out there do paging.  AFAICT, about the only people who don't do 
paging are those who do deep, downstream analysis which requires them to 
retrieve 100's or 1000's or more of results at a time (I've seen as much as a 
million used in production) as part of a batch job.

See https://issues.apache.org/jira/browse/LUCENE-2215 and 
https://issues.apache.org/jira/browse/SOLR-1726 for the issues tracking this.
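The cost pattern behind large start= values can be sketched in plain Java: a top-N collector has to track start + rows candidates, so the heap (and, under sharding, the number of candidates each shard ships to the merger) grows with the offset. This is a hypothetical helper for illustration, not Solr's actual collector code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrates why start=1000000&rows=10 is expensive: the collector must
// retain start + rows candidates to return the requested page.
class DeepPagingSketch {

    // Collect the top (start + rows) scores, return just the requested page.
    static List<Double> page(List<Double> scores, int start, int rows) {
        int heapSize = start + rows;                        // grows with offset
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double s : scores) {
            heap.offer(s);
            if (heap.size() > heapSize) heap.poll();        // drop current worst
        }
        List<Double> top = new ArrayList<>(heap);
        top.sort(Collections.reverseOrder());               // best first
        return top.subList(Math.min(start, top.size()),
                           Math.min(start + rows, top.size()));
    }
}
```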

-Grant

On Jan 8, 2011, at 7:11 AM, Earwin Burrfoot wrote:

> On Mon, Jan 3, 2011 at 18:18, Yonik Seeley  wrote:
>> On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / 
>> Cominvent wrote:
>>> The problem with large "start" is probably worse when sharding is involved. 
>>> Anyone know how the shard component goes about fetching 
>>> start=100&rows=10 from say 10 shards? Does it have to merge sorted 
>>> lists of 1mill+10 docsids from each shard which is the worst case?
>> 
>> Yep, that's how it works today.
>> 
> 
> Technically, if your docs have a non-biased (in regards to their
> sort-value) distribution across shards, you can fetch much less than
> topN docs from each shard.
> I played with the idea, and it worked for me. Though later I dropped
> the opto, as it complicated things somewhat and my users aren't
> querying gazillions of docs often.
> 
> 
> -- 
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Phone: +7 (495) 683-567-4
> ICQ: 104465785
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 






RE: LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright
>From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff.  
>Yonik, do you remember why the HSQLDB and Jetty notice text was included in 
>Solr's NOTICE.txt?  The incubator won't release ManifoldCF until we answer 
>this question. ;-)

Karl


From: ext Robert Muir [rcm...@gmail.com]
Sent: Saturday, January 08, 2011 7:11 AM
To: dev@lucene.apache.org
Subject: Re: LICENSE/NOTICE file contents

You are probably right... the LICENSE.txt also contains many instances
of incorrect capitalization, I noticed that all versions of this
file I can find anywhere have this problem :)

On Sat, Jan 8, 2011 at 6:14 AM,   wrote:
> This list might be interested to know that the current Solr LICENSE and 
> NOTICE file contents are not Apache standard.  The ManifoldCF project based 
> its LICENSE and NOTICE files on the Solr ones and got the following icy 
> reception in the incubator:
>
>>>
> The NOTICE file is still incorrect and includes a lot of unnecessary
> stuff. Understanding how to do releases with the correct legal files
> is one of the important parts of incubation and as this is the first
> release for the poddling i think this needs to be sorted out.
>
> For the NOTICE file, start with the following text (between the ---'s):
>
> ---
> Apache ManifestCF
> Copyright 2010 The Apache Software Foundation
>
> This product includes software developed by
> The Apache Software Foundation (http://www.apache.org/).
> ---
>
> and then add _nothing_ unless you can find explicit policy documented
> somewhere in the ASF that says it is required. If someone wants to add
> something ask for the URL where the requirement is documented. The
> NOTICE file should only include required notices, the other text thats
> in the current NOTICE file could go in a README file, see
> http://www.apache.org/legal/src-headers.html#notice
>
> For the LICENSE file, it should start with the AL as the current one
> does, and then include the text for all the other licenses used in the
> distribution. Those license that are currently in the NOTICE file
> should be moved to the LICENSE file and then you need to verify that
> all the 3rd party dependencies in the src and binary distributions are
> also in the LICENSE files of those distributions.
>
> <<
>
> Our NOTICE includes the following, which was taken from Solr (because we have 
> a similar dependency).  I'd like to know whether it is a valid thing to 
> include, and where it says that "somewhere in Apache":
>
>>>
> =
> == Jetty Notice==
> =
> ==
>  Jetty Web Container
>  Copyright 1995-2006 Mort Bay Consulting Pty Ltd
> ==
>
> This product includes some software developed at The Apache Software
> Foundation (http://www.apache.org/).
>
> The javax.servlet package used by Jetty is copyright
> Sun Microsystems, Inc and Apache Software Foundation. It is
> distributed under the Common Development and Distribution License.
> You can obtain a copy of the license at
> https://glassfish.dev.java.net/public/CDDLv1.0.html.
>
> The UnixCrypt.java code ~Implements the one way cryptography used by
> Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
> modified April 2001  by Iris Van den Broeke, Daniel Deville.
>
> The default JSP implementation is provided by the Glassfish JSP engine
> from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
> Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.
>
> Some portions of the code are Copyright:
>  2006 Tim Vernum
>  1999 Jason Gilbert.
>
> The jboss integration module contains some LGPL code.
>
> =
> == HSQLDB Notice   ==
> =
>
> For content, code, and products originally developed by Thomas Mueller and 
> the Hypersonic SQL Group:
>
> Copyright (c) 1995-2000 by the Hypersonic SQL Group.
> All rights reserved.
>
> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions are met:
>
> Redistributions of source code must retain the above copyright notice, this
> list of conditions and the following disclaimer.
>
> Redistributions in binary form must reproduce the above copyright notice,
> this list of conditions and the following disclaimer in the documentation
> and/or other materials provided with the distribution.
>
> Neither the name of the Hypersonic SQL Group nor the names of its
> contributors may be used to endorse or promote products derived from this

[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2854:
---

Attachment: LUCENE-2854.patch

I think we should simply make a hard break on the Sim.lengthNorm ->
computeNorm cutover.  Subclassing sim is an expert thing, and, I'd
rather apps see a compilation error on upgrade so that they realize
their lengthNorm wasn't being called this whole time because of
LUCENE-2828 (and that they must now cutover to computeNorm).

So I made lengthNorm final (and throws UOE), computeNorm abstract.  I
deprecated SimilarityDelegator, and fixed BQ to not use it anymore.
The only other use is FuzzyLikeThisQuery, but fixing that is a little
too involved for today.
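The cutover an application would make can be sketched with simplified stand-in classes (hypothetical names, not the real Lucene 3.x/4.0 API): the old lengthNorm body moves into computeNorm, which reads the length out of the full field state instead of receiving it directly.

```java
// Stand-in sketch of the lengthNorm -> computeNorm cutover described above.
class ComputeNormSketch {

    static class FieldStateSketch {   // stand-in for FieldInvertState
        final int length;             // number of indexed terms
        final float boost;
        FieldStateSketch(int length, float boost) {
            this.length = length; this.boost = boost;
        }
    }

    abstract static class SimilaritySketch {
        // Before: subclasses overrode lengthNorm(field, numTerms).
        // After:  computeNorm is abstract and subsumes it.
        abstract float computeNorm(String field, FieldStateSketch state);
    }

    // App-level subclass after the cutover: the old lengthNorm logic,
    // now reading the length out of the field state.
    static class MySimilarity extends SimilaritySketch {
        @Override
        float computeNorm(String field, FieldStateSketch state) {
            return state.boost * (float) (1.0 / Math.sqrt(state.length));
        }
    }
}
```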






[jira] Commented: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979118#action_12979118
 ] 

Michael McCandless commented on LUCENE-2831:


bq. It seems we also need to migrate FieldComparator to use ReaderContext 
(eventually AtomicReaderContext)?

+1

And also Collector?
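The context-struct idea from this issue can be sketched with stand-in types (illustrative only, not the actual ReaderContext/AtomicReaderContext API): instead of handing per-segment code a bare reader, pass a small object carrying the parent, the sub-reader's ordinal, and its doc base, so hits can be mapped back to top-level doc ids.

```java
// Stand-in for the "pass a struct instead of a reader" idea: parent link,
// ordinal of the sub-reader, and the segment's starting top-level docID.
class ReaderContextSketch {
    final ReaderContextSketch parent; // null for the top-level reader
    final int ord;                    // index of this sub-reader in the parent
    final int docBase;                // first top-level docID in this segment

    ReaderContextSketch(ReaderContextSketch parent, int ord, int docBase) {
        this.parent = parent; this.ord = ord; this.docBase = docBase;
    }

    // A per-segment hit maps to a top-level docID by adding the doc base.
    int topLevelDoc(int segmentDoc) { return docBase + segmentDoc; }
}
```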

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.




Re: LICENSE/NOTICE file contents

2011-01-08 Thread Robert Muir
You are probably right... the LICENSE.txt also contains many instances
of incorrect capitalization, I noticed that all versions of this
file I can find anywhere have this problem :)

On Sat, Jan 8, 2011 at 6:14 AM,   wrote:
> This list might be interested to know that the current Solr LICENSE and 
> NOTICE file contents are not Apache standard.  The ManifoldCF project based 
> its LICENSE and NOTICE files on the Solr ones and got the following icy 
> reception in the incubator:
>
>>>
> The NOTICE file is still incorrect and includes a lot of unnecessary
> stuff. Understanding how to do releases with the correct legal files
> is one of the important parts of incubation and as this is the first
> release for the poddling i think this needs to be sorted out.
>
> For the NOTICE file, start with the following text (between the ---'s):
>
> ---
> Apache ManifestCF
> Copyright 2010 The Apache Software Foundation
>
> This product includes software developed by
> The Apache Software Foundation (http://www.apache.org/).
> ---
>
> and then add _nothing_ unless you can find explicit policy documented
> somewhere in the ASF that says it is required. If someone wants to add
> something ask for the URL where the requirement is documented. The
> NOTICE file should only include required notices, the other text thats
> in the current NOTICE file could go in a README file, see
> http://www.apache.org/legal/src-headers.html#notice
>
> For the LICENSE file, it should start with the AL as the current one
> does, and then include the text for all the other licenses used in the
> distribution. Those license that are currently in the NOTICE file
> should be moved to the LICENSE file and then you need to verify that
> all the 3rd party dependencies in the src and binary distributions are
> also in the LICENSE files of those distributions.
>
> <<
>
> Our NOTICE includes the following, which was taken from Solr (because we have 
> a similar dependency).  I'd like to know whether it is a valid thing to 
> include, and where it says that "somewhere in Apache":
>
>>>
> =========================================================================
> ==     Jetty Notice                                                    ==
> =========================================================================
> ==============================================================
>  Jetty Web Container
>  Copyright 1995-2006 Mort Bay Consulting Pty Ltd
> ==============================================================
>
> This product includes some software developed at The Apache Software
> Foundation (http://www.apache.org/).
>
> The javax.servlet package used by Jetty is copyright
> Sun Microsystems, Inc and Apache Software Foundation. It is
> distributed under the Common Development and Distribution License.
> You can obtain a copy of the license at
> https://glassfish.dev.java.net/public/CDDLv1.0.html.
>
> The UnixCrypt.java code ~Implements the one way cryptography used by
> Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
> modified April 2001  by Iris Van den Broeke, Daniel Deville.
>
> The default JSP implementation is provided by the Glassfish JSP engine
> from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
> Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.
>
> Some portions of the code are Copyright:
>  2006 Tim Vernum
>  1999 Jason Gilbert.
>
> The jboss integration module contains some LGPL code.
>
> =========================================================================
> ==     HSQLDB Notice                                                   ==
> =========================================================================
>
> For content, code, and products originally developed by Thomas Mueller and 
> the Hypersonic SQL Group:
>
> Copyright (c) 1995-2000 by the Hypersonic SQL Group.
> All rights reserved.
>
> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions are met:
>
> Redistributions of source code must retain the above copyright notice, this
> list of conditions and the following disclaimer.
>
> Redistributions in binary form must reproduce the above copyright notice,
> this list of conditions and the following disclaimer in the documentation
> and/or other materials provided with the distribution.
>
> Neither the name of the Hypersonic SQL Group nor the names of its
> contributors may be used to endorse or promote products derived from this
> software without specific prior written permission.
>
> THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> ARE DISCLAIMED. IN NO EVENT SHALL THE HYPERSONIC SQL GROUP,
> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
> EXEMPLARY, OR CONSEQUE

Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

2011-01-08 Thread Earwin Burrfoot
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley  wrote:
> On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / 
> Cominvent wrote:
>> The problem with large "start" is probably worse when sharding is involved. 
>> Anyone know how the shard component goes about fetching 
>> start=1000000&rows=10 from, say, 10 shards? Does it have to merge sorted 
>> lists of 1mill+10 docids from each shard, which is the worst case?
>
> Yep, that's how it works today.
>

Technically, if your docs have an unbiased distribution across shards
(with regard to their sort value), you can fetch far fewer than the top
start+rows docs from each shard.
I played with the idea, and it worked for me. Though I later dropped the
optimization, as it complicated things somewhat and my users aren't
querying gazillions of docs often.
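The merge cost Yonik confirms can be sketched with a toy heap merge (hypothetical shard data and class names, not Solr's actual QueryComponent code): each shard must return its top start+rows entries, and the coordinator pops start+rows results off a priority queue only to discard the first start of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class DeepPagingMerge {
    // Merge per-shard descending score lists and return the global page
    // [start, start + rows). Each shard has to supply start + rows entries,
    // so both transfer and merge cost grow linearly with the offset.
    static List<Float> mergePage(List<List<Float>> shardTopDocs, int start, int rows) {
        // Max-heap of {score, shardIndex, positionInShard} triples, one per shard.
        PriorityQueue<float[]> heap =
                new PriorityQueue<>((a, b) -> Float.compare(b[0], a[0]));
        for (int s = 0; s < shardTopDocs.size(); s++) {
            if (!shardTopDocs.get(s).isEmpty()) {
                heap.add(new float[] { shardTopDocs.get(s).get(0), s, 0 });
            }
        }
        List<Float> page = new ArrayList<>();
        for (int rank = 0; rank < start + rows && !heap.isEmpty(); rank++) {
            float[] top = heap.poll();
            if (rank >= start) {
                page.add(top[0]); // only the final `rows` hits are kept
            }
            int shard = (int) top[1], next = (int) top[2] + 1;
            if (next < shardTopDocs.get(shard).size()) {
                heap.add(new float[] { shardTopDocs.get(shard).get(next), shard, next });
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<List<Float>> shards = Arrays.asList(
                Arrays.asList(9f, 7f, 5f, 3f),
                Arrays.asList(8f, 6f, 4f, 2f));
        // Global order is 9,8,7,6,5,4,3,2; page at start=4, rows=2:
        System.out.println(mergePage(shards, 4, 2)); // [5.0, 4.0]
    }
}
```

Under the unbiased-distribution assumption above, each shard could instead return roughly (start+rows)/numShards entries plus a safety margin, which is the optimization being described.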


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/

2011-01-08 Thread Robert Muir
On Fri, Jan 7, 2011 at 10:47 PM,   wrote:
>
> +  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();
> +

I don't know about this commit... I see a lot of EMPTY sets and maps
defined statically here.
There is no advantage to doing this; even the javadocs explain:
"Implementation note: Implementations of this method need not create a
separate (Set|Map|List) object for each call. Using this method is
likely to have comparable cost to using the like-named field. (Unlike
this method, the field does not provide type safety.)"

I think we should be using the Collections methods, for example on your
first file:
Index: solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java
===================================================================
--- solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java	(revision 1056691)
+++ solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java	(working copy)
@@ -47,8 +47,6 @@
   */
 public abstract class AnalysisRequestHandlerBase extends RequestHandlerBase {
 
-  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();
-
   public void handleRequestBody(SolrQueryRequest req,
       SolrQueryResponse rsp) throws Exception {
     rsp.add("analysis", doAnalysis(req));
   }
@@ -343,7 +341,7 @@
    *
    */
   public AnalysisContext(String fieldName, FieldType fieldType,
       Analyzer analyzer) {
-    this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
+    this(fieldName, fieldType, analyzer, Collections.<String>emptySet());
   }
 
   /**
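For illustration, a small standalone example (class and method names are mine, not from the patch) of what that javadoc note means in practice:

```java
import java.util.Collections;
import java.util.Set;

public class EmptySetDemo {
    // The pattern being removed: a raw static constant. It avoids no
    // allocation (emptySet() is a shared singleton anyway) and loses
    // type safety, since the raw Set fits any element type.
    @SuppressWarnings("rawtypes")
    static final Set EMPTY_STRING_SET = Collections.emptySet();

    static int count(Set<String> names) {
        return names.size();
    }

    public static void main(String[] args) {
        // Every call returns the same immutable singleton instance,
        // so there is no per-call cost versus a cached field.
        Set<String> a = Collections.emptySet();
        Set<String> b = Collections.emptySet();
        System.out.println(a == b); // true

        // In argument position (no assignment target to infer from),
        // older javac needs an explicit type witness:
        System.out.println(count(Collections.<String>emptySet())); // 0
    }
}
```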




LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright
This list might be interested to know that the current Solr LICENSE and NOTICE 
file contents are not Apache standard.  The ManifoldCF project based its 
LICENSE and NOTICE files on the Solr ones and got the following icy reception 
in the incubator:

>>
The NOTICE file is still incorrect and includes a lot of unnecessary
stuff. Understanding how to do releases with the correct legal files
is one of the important parts of incubation and as this is the first
release for the podling, I think this needs to be sorted out.

For the NOTICE file, start with the following text (between the ---'s):

---
Apache ManifoldCF
Copyright 2010 The Apache Software Foundation

This product includes software developed by
The Apache Software Foundation (http://www.apache.org/).
---

and then add _nothing_ unless you can find explicit policy documented
somewhere in the ASF that says it is required. If someone wants to add
something ask for the URL where the requirement is documented. The
NOTICE file should only include required notices; the other text that's
in the current NOTICE file could go in a README file, see
http://www.apache.org/legal/src-headers.html#notice

For the LICENSE file, it should start with the AL as the current one
does, and then include the text for all the other licenses used in the
distribution. Those license that are currently in the NOTICE file
should be moved to the LICENSE file and then you need to verify that
all the 3rd party dependencies in the src and binary distributions are
also in the LICENSE files of those distributions.

<<

Our NOTICE includes the following, which was taken from Solr (because we have a 
similar dependency).  I'd like to know whether it is a valid thing to include, 
and where it says that "somewhere in Apache":

>>
=========================================================================
==     Jetty Notice                                                    ==
=========================================================================
==============================================================
 Jetty Web Container
 Copyright 1995-2006 Mort Bay Consulting Pty Ltd
==============================================================

This product includes some software developed at The Apache Software 
Foundation (http://www.apache.org/).

The javax.servlet package used by Jetty is copyright 
Sun Microsystems, Inc and Apache Software Foundation. It is 
distributed under the Common Development and Distribution License.
You can obtain a copy of the license at 
https://glassfish.dev.java.net/public/CDDLv1.0.html.

The UnixCrypt.java code ~Implements the one way cryptography used by
Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
modified April 2001  by Iris Van den Broeke, Daniel Deville.

The default JSP implementation is provided by the Glassfish JSP engine
from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.

Some portions of the code are Copyright:
  2006 Tim Vernum 
  1999 Jason Gilbert.

The jboss integration module contains some LGPL code.

=========================================================================
==     HSQLDB Notice                                                   ==
=========================================================================

For content, code, and products originally developed by Thomas Mueller and the 
Hypersonic SQL Group:

Copyright (c) 1995-2000 by the Hypersonic SQL Group.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

Neither the name of the Hypersonic SQL Group nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE HYPERSONIC SQL GROUP,
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This software cons