RE: [JENKINS] Lucene-trunk - Build # 1548 - Still Failing
Robert: Thanks for fixing, I triggered a new full build!

- Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de

-----Original Message-----
From: Apache Jenkins Server [mailto:hud...@hudson.apache.org]
Sent: Monday, May 02, 2011 4:08 AM
To: dev@lucene.apache.org
Subject: [JENKINS] Lucene-trunk - Build # 1548 - Still Failing

Build: https://builds.apache.org/hudson/job/Lucene-trunk/1548/

No tests ran.

Build Log (for compile errors):
[...truncated 9474 lines...]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3057) LuceneTestCase#newFSDirectoryImpl misses to set LockFactory if ctor call throws exception
[ https://issues.apache.org/jira/browse/LUCENE-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027570#comment-13027570 ]

Simon Willnauer commented on LUCENE-3057:
-----------------------------------------

bq. Hi Simon, I think you meant to set the lockfactory in the finally block?

Thanks Robert for catching this. I removed the return statement in revision 1098375. Backported to 3.x in revision 1098505.

LuceneTestCase#newFSDirectoryImpl misses to set LockFactory if ctor call throws exception

Key: LUCENE-3057
URL: https://issues.apache.org/jira/browse/LUCENE-3057
Project: Lucene - Java
Issue Type: Bug
Components: Tests
Affects Versions: 4.0
Reporter: Simon Willnauer
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-3057.patch, LUCENE-3057_bug.patch

selckin reported on IRC that if you run "ant test -Dtestcase=TestLockFactory -Dtestmethod=testNativeFSLockFactoryPrefix -Dtests.directory=FSDirectory" the test fails. Since FSDirectory is an abstract class it cannot be instantiated, so our code falls back to FSDirectory.open, yet we fail to set the given lockFactory.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
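The shape of the fix can be illustrated with a small, self-contained sketch. This is not the actual LuceneTestCase code; the class and method names below are stand-ins for FSDirectory/LockFactory. The point is the control flow: construct reflectively, fall back when the class is abstract, and set the lock factory on a single exit path so neither branch (and no early return) can skip it.

```java
// Self-contained analog of the LUCENE-3057 fix; names are illustrative,
// not the real Lucene API.
public class NewDirSketch {
    public interface Dir {
        void setLockFactory(String lf);
        String lockFactory();
    }

    public static class SimpleDir implements Dir {
        private String lf;
        public SimpleDir() {}
        public void setLockFactory(String lf) { this.lf = lf; }
        public String lockFactory() { return lf; }
    }

    // Cannot be instantiated reflectively, like the abstract FSDirectory.
    public static abstract class AbstractDir implements Dir {}

    public static Dir newDir(Class<? extends Dir> clazz, String lockFactory) throws Exception {
        Dir d;
        try {
            // Fails with InstantiationException for an abstract class.
            d = clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            d = new SimpleDir(); // fallback path, like FSDirectory.open
        }
        // Single exit point: the lock factory is applied on BOTH paths.
        d.setLockFactory(lockFactory);
        return d;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(newDir(SimpleDir.class, "native").lockFactory());
        System.out.println(newDir(AbstractDir.class, "native").lockFactory());
    }
}
```

With the removed early return, both calls above report the requested lock factory; before the fix, the fallback path would have silently dropped it.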
[jira] [Assigned] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-3041:
---------------------------------------

Assignee: Simon Willnauer

Support Query Visting / Walking

Key: LUCENE-3041
URL: https://issues.apache.org/jira/browse/LUCENE-3041
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Chris Male
Assignee: Simon Willnauer
Priority: Minor
Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch

Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:

{code}
public interface QueryVisitor {
  Query visit(Query query);
}
{code}

and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys that they are interested in.
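A minimal, self-contained sketch of the reflection-based dispatch idea (the classes below are toy stand-ins, not the patch's actual API or Lucene's Query types): the visitor declares overloads only for the subclasses it cares about, and the dispatcher resolves the most specific visit(...) at runtime, walking up the class hierarchy to the generic visit(Query) fallback.

```java
import java.lang.reflect.Method;

// Hedged sketch of reflection-based visitor dispatch; illustrative names only.
public class VisitorSketch {
    public static class Query {}
    public static class TermQuery extends Query {}
    public static class BooleanQuery extends Query {}

    public interface QueryVisitor { Query visit(Query query); }

    // Walk up the hierarchy looking for a matching visit(...) overload;
    // the interface guarantees visit(Query) exists as a last resort.
    public static Query dispatch(QueryVisitor visitor, Query query) throws Exception {
        for (Class<?> c = query.getClass(); c != Object.class; c = c.getSuperclass()) {
            try {
                Method m = visitor.getClass().getMethod("visit", c);
                return (Query) m.invoke(visitor, query);
            } catch (NoSuchMethodException ignored) {
                // no overload for this exact type: try the superclass next
            }
        }
        return visitor.visit(query);
    }

    public static class Rewriter implements QueryVisitor {
        public Query visit(Query q) { return q; }                      // generic fallback
        public Query visit(TermQuery q) { return new BooleanQuery(); } // specific overload
    }

    public static void main(String[] args) throws Exception {
        QueryVisitor v = new Rewriter();
        Query t = new TermQuery();
        Query b = new BooleanQuery();
        System.out.println(dispatch(v, t) == t); // false: TermQuery overload rewrote it
        System.out.println(dispatch(v, b) == b); // true: fallback left it untouched
    }
}
```

An implementor here only writes visit(TermQuery); every other Query type silently falls through to the generic method, which is the appeal of the approach.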
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027575#comment-13027575 ]

Simon Willnauer commented on LUCENE-3041:
-----------------------------------------

bq. New patch that implements what I said in the previous comments (except for the IS changes).

Chris, patch looks good! Are you going to add the IS changes here too? I wonder if we could move the MethodDispatchException into InvocationDispatcher as a nested class; I don't think we need an extra file for this class.
[jira] [Updated] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3056:
------------------------------------

Component/s: Search
Lucene Fields: [New, Patch Available] (was: [New])
Affects Version/s: 4.0
Fix Version/s: 4.0

Support Query Rewriting Caching

Key: LUCENE-3056
URL: https://issues.apache.org/jira/browse/LUCENE-3056
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 4.0
Reporter: Chris Male
Fix For: 4.0
Attachments: LUCENE-3056.patch

Out of LUCENE-3041, it's become apparent that using a Visitor / Walker isn't right for caching the rewrites of Querys. Although we still intend to introduce the Query Visitor / Walker for advanced query transformations, rewriting still serves a purpose for very specific implementation-detail rewriting. As such, it can be very expensive, so I think we should introduce first-class support for rewrite caching. I also feel the key is to make the caching as transparent as possible, to reduce the strain on Query implementors. The TermState idea gave me the idea of maybe making a RewriteState / RewriteCache / RewriteInterceptor, which would be consulted for rewritten Querys. It would maintain an internal cache that it would check; if a value wasn't found, it'd then call Query#rewrite and cache the result. By having this external rewrite source, people could 'pre' rewrite Querys if they were particularly expensive but also common.
[jira] [Updated] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3041:
------------------------------------

Lucene Fields: [New, Patch Available] (was: [New])
Affects Version/s: 4.0
Fix Version/s: 4.0
[jira] [Commented] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027578#comment-13027578 ]

Simon Willnauer commented on LUCENE-3056:
-----------------------------------------

Hey Chris, here are some comments:

* I like that you only have to change BooleanQuery to enable this!! Nice!
* Can we rename RewriteState to RewriteContext? It's just more consistent with all the other ctxs we pass to query and scorer.
* Can we rename DefaultRewriteState to CachingRewriteContext and make a RewriteContext that simply does query.rewrite()? That way nothing changes by default, and we can use a static instance in Query#rewrite(IndexReader), maybe as an anonymous inner class in Query.
* Can we move CachingRewriteContext into lucene/src/java/org/apache/lucene/util?

This change somewhat depends on LUCENE-3041, since we might wanna pass that RewriteContext on a per-segment level, right? So maybe we should link those issues.
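The "external rewrite source" idea can be sketched in a few lines. This is a minimal, self-contained analog, not the patch's code: RewriteCacheSketch stands in for the proposed CachingRewriteContext, and the toy Query's rewrite() plays the role of an expensive Query#rewrite. The context consults its cache first and only delegates on a miss.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a caching rewrite context; names are illustrative.
public class RewriteCacheSketch {
    public static class Query {
        final String repr;
        public int rewrites = 0; // counts how often the expensive path ran
        public Query(String repr) { this.repr = repr; }
        public Query rewrite() { rewrites++; return new Query(repr + "'"); }
    }

    public static class CachingRewriteContext {
        // A real cache would key on the query's equals/hashCode; this sketch
        // relies on instance identity, which is enough to show the flow.
        private final Map<Query, Query> cache = new HashMap<>();

        public Query rewrite(Query q) {
            // miss: call Query#rewrite once and remember the result
            return cache.computeIfAbsent(q, Query::rewrite);
        }
    }

    public static void main(String[] args) {
        Query q = new Query("a:b");
        CachingRewriteContext ctx = new CachingRewriteContext();
        Query r1 = ctx.rewrite(q);
        Query r2 = ctx.rewrite(q);
        System.out.println(r1 == r2);   // second call is served from the cache
        System.out.println(q.rewrites); // the expensive rewrite ran only once
    }
}
```

Pre-rewriting, as the issue suggests, would just mean populating the map before search time.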
[jira] [Updated] (SOLR-2480) Text extraction of password protected files
[ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinichiro Abe updated SOLR-2480:
---------------------------------

Attachment: SOLR-2480-idea1.patch

Text extraction of password protected files

Key: SOLR-2480
URL: https://issues.apache.org/jira/browse/SOLR-2480
Project: Solr
Issue Type: Improvement
Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 3.1
Reporter: Shinichiro Abe
Priority: Minor
Attachments: SOLR-2480-idea1.patch

Proposal: There are password-protected files: PDF and Office documents in the 2007 and 97 formats. These files are posted using SolrCell. We cannot read these files if we do not know their reading password, so no text can be extracted from them. My requirement is that these files should be processed normally, without extracting text and without throwing an exception.

Background: Currently, when you post a password-protected file, Solr returns a 500 server error: Solr catches the error in ExtractingDocumentLoader and throws a TikaException. I use ManifoldCF. If the Solr server responds with a 500, ManifoldCF judges that the document should be retried, since it has no idea what happened, and it attempts to retry posting many times without getting the password. In another case, my customer posts files with embedded images. Sometimes Solr seems to throw a TikaException of unknown cause. He wants to post just the metadata without extracting text, but the exception stops him from posting.
[jira] [Commented] (SOLR-2480) Text extraction of password protected files
[ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027579#comment-13027579 ]

Shinichiro Abe commented on SOLR-2480:
--------------------------------------

{quote}
But I think you want Solr to skip the content field because Tika cannot extract it for some reason, but add metadata fields, right?
{quote}

Yes, I want to post the metadata without the contents that throw the parse error. ExtractingDocumentLoader also should be fixed. This patch expresses improvement idea (1), and I think SOLR-445 can resolve improvement idea (2).
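The requested behavior boils down to one try/catch around the extraction step. The following self-contained sketch (stand-in names, not the SOLR-2480 patch or the Tika API) shows the intent: metadata is always collected, and when extraction fails with an ignore flag set, the content field is skipped instead of surfacing a 500 to the client.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of "skip content, keep metadata" on extraction failure.
public class SkipContentSketch {
    public static class ExtractionException extends Exception {}

    // Stand-in for Tika: throws for password-protected input.
    static String extractText(byte[] bytes, boolean passwordProtected) throws ExtractionException {
        if (passwordProtected) throw new ExtractionException();
        return "full text";
    }

    public static Map<String, String> load(byte[] bytes, boolean passwordProtected,
                                           boolean ignoreExtractionError) throws ExtractionException {
        Map<String, String> doc = new HashMap<>();
        doc.put("stream_size", String.valueOf(bytes.length)); // metadata is always available
        try {
            doc.put("content", extractText(bytes, passwordProtected));
        } catch (ExtractionException e) {
            if (!ignoreExtractionError) throw e; // old behavior: propagates as HTTP 500
            // new behavior: skip the content field, keep the metadata
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> doc = load(new byte[10], true, true);
        System.out.println(doc.containsKey("content")); // false: content skipped
        System.out.println(doc.get("stream_size"));     // metadata survived
    }
}
```

With the flag off, the loader behaves like today's ExtractingDocumentLoader and the exception propagates; with it on, crawlers like ManifoldCF get a successful response and stop retrying.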
[jira] [Created] (LUCENE-3060) Revise ThreadAffinityDocumentsWriterThreadPool queue handling
Revise ThreadAffinityDocumentsWriterThreadPool queue handling

Key: LUCENE-3060
URL: https://issues.apache.org/jira/browse/LUCENE-3060
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Priority: Minor
Fix For: 4.0

Spin-off from LUCENE-3023... In ThreadAffinityDocumentsWriterThreadPool#getAndLock() we had talked about switching from a per-threadstate queue (safeway model) to a single queue (whole foods).
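The trade-off behind the two models can be shown with a toy calculation (this is not the DWPT thread-pool code; it only illustrates the queueing argument). With per-threadstate queues a doc pinned to a busy state waits behind that state's backlog even while other states idle; with one shared queue, any free thread takes the next doc.

```java
// Hedged toy model of the "safeway" vs "whole foods" queueing discussion.
public class QueueModelSketch {
    // Per-threadstate model: each doc is pinned to one queue, so drain time
    // is bounded by the longest queue even if other workers sit idle.
    public static int perThreadStateRounds(int[] queueSizes) {
        int max = 0;
        for (int s : queueSizes) max = Math.max(max, s);
        return max;
    }

    // Single shared queue: whichever worker frees up takes the head, so
    // drain time is total work spread evenly across the workers.
    public static int singleQueueRounds(int[] queueSizes, int workers) {
        int total = 0;
        for (int s : queueSizes) total += s;
        return (total + workers - 1) / workers; // ceiling division
    }

    public static void main(String[] args) {
        int[] pinned = {3, 1}; // worker 1 has 3 docs queued, worker 2 has 1
        System.out.println(perThreadStateRounds(pinned)); // 3 rounds, worker 2 idles
        System.out.println(singleQueueRounds(pinned, 2)); // 2 rounds, no idling
    }
}
```

The real decision also weighs the affinity benefit (per-thread buffers stay warm), which this toy deliberately ignores.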
[jira] [Commented] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027583#comment-13027583 ]

Simon Willnauer commented on LUCENE-3023:
-----------------------------------------

bq. In ThreadAffinityDocumentsWriterThreadPool#getAndLock() we had talked about switching from a per-threadstate queue (safeway model) to a single queue (whole foods). I'm wondering if we should do that before we commit or change that later as a separate patch?

I opened LUCENE-3060 for this. @buschmi maybe you can add some more info to that issue if you recall the discussion?

{quote}
Committed merged branch to trunk revision: 1098427
Moved branch away as tag in revision: 1098428
{quote}

AWESOME! :)

Land DWPT on trunk

Key: LUCENE-3023
URL: https://issues.apache.org/jira/browse/LUCENE-3023
Project: Lucene - Java
Issue Type: Task
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Fix For: 4.0
Attachments: LUCENE-3023-svn-diff.patch, LUCENE-3023-ws-changes.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_iw_iwc_jdoc.patch, LUCENE-3023_simonw_review.patch, LUCENE-3023_svndiff.patch, LUCENE-3023_svndiff.patch, diffMccand.py, diffSources.patch, diffSources.patch, realtime-TestAddIndexes-3.txt, realtime-TestAddIndexes-5.txt, realtime-TestIndexWriterExceptions-assert-6.txt, realtime-TestIndexWriterExceptions-npe-1.txt, realtime-TestIndexWriterExceptions-npe-2.txt, realtime-TestIndexWriterExceptions-npe-4.txt, realtime-TestOmitTf-corrupt-0.txt

With LUCENE-2956 we have resolved the last remaining issue for LUCENE-2324, so we can proceed with landing the DWPT development on trunk soon. I think one of the bigger issues here is to make sure that all JavaDocs for IW etc. are still correct; I will start going through that first.
[jira] [Commented] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027587#comment-13027587 ]

Uwe Schindler commented on LUCENE-3023:
---------------------------------------

The first full Jenkins build also succeeded. When reviewing the first Clover build report, I noticed 2 new final classes that have no code coverage at all (see [https://builds.apache.org/hudson/job/Lucene-trunk/1549/clover-report/org/apache/lucene/index/pkg-summary.html]):
- DocFieldConsumers
- DocFieldConsumersPerField

I am not sure if those are old relics (dead code) or newly added ones that are not yet used.
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027589#comment-13027589 ]

Chris Male commented on LUCENE-3041:
------------------------------------

bq. Are you going to add the IS changes here too?

Yup, I'm just working through the best way to expose the API in the IS while supporting per-segment walking. I'll have something together in the next day or two.

bq. I wonder if we could move the MethodDispatchException into InvocationDispatcher as a nested class

Good call. I'll make the change and upload something immediately.
[jira] [Commented] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027592#comment-13027592 ]

Chris Male commented on LUCENE-3056:
------------------------------------

bq. This change somewhat depends on LUCENE-3041 since we might wanna pass that RewriteContext on a per segment level right?

Yeah, that's very true. I'm wondering whether it's best to rethink the signatures of the #search methods in IS, since we need to incorporate both this and LUCENE-3041. I'll upload a patch shortly addressing the other improvements.
[jira] [Updated] (LUCENE-3056) Support Query Rewriting Caching
[ https://issues.apache.org/jira/browse/LUCENE-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Male updated LUCENE-3056:
-------------------------------

Attachment: LUCENE-3056.patch

Patch implementing Simon's suggestions:
- RewriteState -> RewriteContext
- DefaultRewriteState -> org.apache.lucene.util.CachingRewriteContext
- Query now has a static anonymous inner class instance which does a simple rewrite.
How should one impl own MergeScheduler
Hi,

I wanted to impl my own MergeScheduler (a variation of SerialMergeScheduler which does minor additional work), and found out I cannot really, for lack of visible API on IndexWriter, such as getNextMerge() and merge(OneMerge) -- both exist, but are package-private.

It got me thinking -- how can anyone impl his own MergeScheduler today? Perhaps people impl MergePolicy only? Would it make sense to open this API to our users? Is there other API we should consider opening w.r.t. MergeScheduler/Policy?

Shai
[jira] [Updated] (SOLR-2472) StatsComponent should support hierarchical facets
[ https://issues.apache.org/jira/browse/SOLR-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Drozdov updated SOLR-2472:
---------------------------------

Affects Version/s: 4.0

StatsComponent should support hierarchical facets

Key: SOLR-2472
URL: https://issues.apache.org/jira/browse/SOLR-2472
Project: Solr
Issue Type: New Feature
Affects Versions: 3.1, 4.0
Reporter: Dmitry Drozdov
Attachments: SOLR-2472.patch
Original Estimate: 24h
Remaining Estimate: 24h

It is currently possible to get only a single layer of faceting in StatsComponent. The proposal is to make it possible to specify the stats.facet parameter like this:

stats=true&stats.field=sField&stats.facet=fField1,fField2

and get a response like this:

<lst name="stats">
 <lst name="stats_fields">
  <lst name="sField">
   <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">4.0</double> <long name="count">4</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
   <lst name="facets">
    <lst name="fField1">
     <lst name="fField1Value1">
      <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">2.0</double> <long name="count">2</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
      <lst name="facets">
       <lst name="fField2">
        <lst name="fField2Value1">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
        <lst name="fField2Value2">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
       </lst>
      </lst>
     </lst>
     <lst name="fField1Value2">
      <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">2.0</double> <long name="count">2</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
      <lst name="facets">
       <lst name="fField2">
        <lst name="fField2Value1">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
        <lst name="fField2Value2">
         <double name="min">1.0</double> <double name="max">1.0</double> <double name="sum">1.0</double> <long name="count">1</long> <long name="missing">0</long> <double name="sumOfSquares"/> <double name="mean"/> <double name="stddev"/>
        </lst>
       </lst>
      </lst>
     </lst>
    </lst>
   </lst>
  </lst>
 </lst>
</lst>
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7642/

1 tests failed.

REGRESSION: org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message: Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2894)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
    at java.lang.StringBuffer.append(StringBuffer.java:337)
    at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
    at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
    at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
    at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1091)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1023)

Build Log (for compile errors):
[...truncated 9253 lines...]
Re: How should one impl own MergeScheduler
I think we should open up these APIs. And we should make a test case that lives outside of oal.index, to assert that all the needed APIs are in fact not package-private.

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 5:09 AM, Shai Erera ser...@gmail.com wrote:
> Hi
> I wanted to impl my own MergeScheduler (a variation of SerialMergeScheduler, which does minor additional work), and found out I cannot really, for lack of visible API on IndexWriter, such as getNextMerge() and merge(OneMerge) -- both exist, but are package-private.
> It got me thinking -- how can anyone impl his own MergeScheduler today? Perhaps people impl MergePolicy only? Would it make sense to open this API to our users? Is there other API we should consider opening w.r.t. MergeScheduler/Policy?
> Shai
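The visibility test Mike suggests could look something like the sketch below. This is a hedged, self-contained illustration, not an actual Lucene test: the helper reflects over a class's declared methods and reports whether a named method is callable from outside the package; a real version would point at org.apache.lucene.index.IndexWriter and the methods (getNextMerge, merge) once they are opened up.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hedged sketch of an out-of-package API-visibility check.
public class VisibilityCheckSketch {
    // True if the first declared method with this name is public or protected,
    // i.e. reachable by a subclass or caller outside the package.
    public static boolean isCallableFromOutside(Class<?> clazz, String methodName) {
        for (Method m : clazz.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                int mod = m.getModifiers();
                return Modifier.isPublic(mod) || Modifier.isProtected(mod);
            }
        }
        return false; // not declared at all
    }

    public static void main(String[] args) {
        // Demonstrated on a JDK class since this sketch has no Lucene on the
        // classpath: StringBuilder.append is public...
        System.out.println(isCallableFromOutside(StringBuilder.class, "append"));
        // ...while an undeclared name reports false.
        System.out.println(isCallableFromOutside(StringBuilder.class, "getNextMerge"));
    }
}
```

Such a test would fail the build if a later refactor quietly dropped a method back to package-private, which is exactly the regression to guard against.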
[jira] [Created] (SOLR-2483) DIH - an uppercase problem in query parameters
DIH - an uppercase problem in query parameters

Key: SOLR-2483
URL: https://issues.apache.org/jira/browse/SOLR-2483
Project: Solr
Issue Type: Bug
Components: clients - java, contrib - DataImportHandler
Affects Versions: 3.1
Environment: Windows Vista, Java 1.6
Reporter: Lubo Torok

I have two tables called PROBLEM and KOMENTAR (means 'comment' in English) in the DB. One problem can have more comments. I want to index them all.

schema.xml looks as follows:

... some fields ...
<field name="problem_id" type="string" stored="true" required="true"/>
... some fields ...

data-config.xml:

<document name="problemy">
  <entity name="problem" pk="problem_id" query="select to_char(id) as problem_id, nazov as problem_nazov, cislo as problem_cislo, popis as problem_popis from problem">
    <entity name="komentar" query="select id as komentar_id, nazov as komentar_nazov, text as komentar_text from komentar where to_char(fk_problem)='${problem.PROBLEM_ID}'"/>
  </entity>
</document>

If you write '${problem.PROBLEM_ID}' in lower case, i.e. '${problem.problem_id}', Solr will not import the inner entity. This seems strange to me, and it took me some time to figure out. Note that the primary key in PROBLEM is called ID; I defined the alias problem_id (yes, lower case) in SQL, and in the schema this field is defined as problem_id, again in lower case. But when I run

http://localhost:8983/solr/dataimport?command=full-import&debug=true&verbose=on

so I can see some debug information, there is this part:

<lst name="verbose-output">
  <lst name="entity:problem">
    <lst name="document#1">
      <str name="query">select to_char(id) as problem_id, nazov as problem_nazov, cislo as problem_cislo, popis as problem_popis from problem</str>
      <str name="time-taken">0:0:0.465</str>
      <str>--- row #1-</str>
      <str name="PROBLEM_NAZOV">test zodpovedneho</str>
      <str name="PROBLEM_ID">2533274790395945</str>
      <str name="PROBLEM_CISLO">201009304</str>
      <str name="PROBLEM_POPIS">csfdewafedewfw</str>
      <str>-</str>
      <lst name="entity:komentar">
        <str name="query">select id as komentar_id, nazov as komentar_nazov, text as komentar_text from komentar where to_char(fk_problem)='2533274790395945'</str>
...

where you can see that, internally, the fields of PROBLEM are represented in uppercase even though the user (me) did not define them that way. My conclusion, I guess, is that a parameter referring to the parent entity, ${entity.field}, must always be written in uppercase, i.e. ${entity.FIELD}. Here is an example of the indexed entity as written after a full-import command with debug and verbose on:

<arr name="documents">
  <lst>
    <arr name="problem_nazov"><str>test zodpovedneho</str></arr>
    <arr name="problem_id"><str>2533274790395945</str></arr>
    <arr name="problem_cislo"><str>201009304</str></arr>
    <arr name="problem_popis"><str>csfdewafedewfw</str></arr>
    <arr name="komentar_id"><str>java.math.BigDecimal:5066549580791985</str></arr>
    <arr name="komentar_text"><str>a.TXT</str></arr>
  </lst>
</arr>

Here the field names are in lower case. I consider this a bug. Maybe I am wrong and it's a feature; I have worked with Solr for only a few days.
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot commented on LUCENE-3041:
-----------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

> Support Query Visting / Walking
> -------------------------------
>
>         Key: LUCENE-3041
>         URL: https://issues.apache.org/jira/browse/LUCENE-3041
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
> Affects Versions: 4.0
>    Reporter: Chris Male
>    Assignee: Simon Willnauer
>    Priority: Minor
>     Fix For: 4.0
> Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch
>
> Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:
>
> {code}
> public interface QueryVisitor {
>   Query visit(Query query);
> }
> {code}
>
> and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys they are interested in.
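For reference, the "JDK reflection and CHM" combination Earwin mentions can be sketched in a few lines. This is an illustrative toy, not the patch's actual code: a ConcurrentHashMap caches the resolved visit method per (visitor class, argument class) pair so concurrent lookups stay safe, and the visitor's own unchecked exceptions are rethrown transparently rather than wrapped:

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy thread-safe reflective dispatcher (illustrative only). */
class VisitorDispatcher {
    // One resolved Method per (visitor class, argument class) pair.
    private static final ConcurrentMap<String, Method> CACHE = new ConcurrentHashMap<>();

    static Object dispatch(Object visitor, Object arg) throws Exception {
        String key = visitor.getClass().getName() + "#" + arg.getClass().getName();
        Method m = CACHE.get(key);
        if (m == null) {
            // Walk up the argument's class hierarchy looking for visit(X).
            for (Class<?> c = arg.getClass(); c != null; c = c.getSuperclass()) {
                try {
                    m = visitor.getClass().getMethod("visit", c);
                    break;
                } catch (NoSuchMethodException ignored) {
                    // keep climbing
                }
            }
            if (m == null) {
                throw new IllegalArgumentException("no visit() method for " + arg.getClass());
            }
            m.setAccessible(true);           // allow invoking through non-public classes
            CACHE.putIfAbsent(key, m);
        }
        try {
            return m.invoke(visitor, arg);
        } catch (InvocationTargetException e) {
            // Rethrow the visitor's own unchecked exceptions transparently.
            Throwable t = e.getCause();
            if (t instanceof RuntimeException) throw (RuntimeException) t;
            if (t instanceof Error) throw (Error) t;
            throw e;
        }
    }
}
```

A real implementation would also need the ambiguity diagnostics Earwin refers to (two applicable visit methods for one argument type); this sketch only handles the happy path and the no-match case.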
[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM:
------------------------------------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

Same can be said for tests.

What about throwing the original invocation exception instead of the wrapper? Since we're emulating a language feature, a simple method call, it's logical to only throw custom exceptions in .. well .. exceptional cases, like ambiguity / no matching method. If client code throws Errors/RuntimeExceptions, they should be transparently rethrown.

was (Author: earwin):
The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

> Support Query Visting / Walking
> -------------------------------
>
>         Key: LUCENE-3041
>         URL: https://issues.apache.org/jira/browse/LUCENE-3041
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
> Affects Versions: 4.0
>    Reporter: Chris Male
>    Assignee: Simon Willnauer
>    Priority: Minor
>     Fix For: 4.0
> Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch
>
> Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:
>
> {code}
> public interface QueryVisitor {
>   Query visit(Query query);
> }
> {code}
>
> and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys they are interested in.
[jira] [Created] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
Open IndexWriter API to allow custom MergeScheduler implementation
------------------------------------------------------------------

        Key: LUCENE-3061
        URL: https://issues.apache.org/jira/browse/LUCENE-3061
    Project: Lucene - Java
 Issue Type: Improvement
 Components: Index
   Reporter: Shai Erera
   Assignee: Shai Erera
   Priority: Minor
    Fix For: 3.2, 4.0

IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these APIs, as well as any others that can be useful for custom MS implementations.
Re: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
I slurped this hprof down and opened it w/ YourKit... Something weird is going on, because there is a single massive (151 MB) string, stack-local to one of the threads, filled with character U+00B2.

The test itself looks innocuous; I don't think it creates any massive stack-local strings. I'm baffled. Robert, maybe something crazy is happening in RuleBasedCollator?

Mike
http://blog.mikemccandless.com

On Mon, May 2, 2011 at 5:53 AM, Apache Jenkins Server <hud...@hudson.apache.org> wrote:
> Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7642/
>
> 1 tests failed.
>
> REGRESSION: org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe
>
> Error Message:
> Java heap space
>
> Stack Trace:
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2894)
>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
>         at java.lang.StringBuffer.append(StringBuffer.java:337)
>         at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
>         at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
>         at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
>         at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
>         at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1091)
>         at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1023)
>
> Build Log (for compile errors):
> [...truncated 9253 lines...]
Re: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
On Mon, May 2, 2011 at 6:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> I slurped this hprof down and opened it w/ YourKit... Something weird is going on, because there is a single massive (151 MB) string, stack local to one of the threads, filled with character U+00B2.
>
> The test itself looks innocuous; I don't think it creates any massive stack local strings. I'm baffled. Robert maybe something crazy is happening in RuleBasedCollator?

thanks for debugging this... at first I thought it must be a JRE bug, just because the test has been toned down so many times. but this test doesn't OOM intermittently in trunk, right? So before disabling the test and saying it's out of our hands, it would be good to check that it's not a bug in the encoder (IndexableBinaryStringTools).
[jira] [Updated] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3061:
-------------------------------

Attachment: LUCENE-3061.patch

Open up the necessary API + add TestCustomMergeScheduler under src/test/o.a.l/index/publicapi. The changes are very trivial. If you would like to suggest an alternative package I should put the test in, I will gladly move it.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027621#comment-13027621 ]

Uwe Schindler commented on LUCENE-3061:
---------------------------------------

All of the public API tests are directly under o.a.lucene at the moment.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Updated] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3061:
-------------------------------

Attachment: LUCENE-3061.patch

Thanks Uwe! Following your comment, I noticed there is a TestMergeSchedulerExternal under o.a.l, which covers extending ConcurrentMergeScheduler. So I moved my MS impl + test case there. I think this is ready to commit.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027626#comment-13027626 ]

Earwin Burrfoot commented on LUCENE-3061:
-----------------------------------------

Mark these as @experimental?

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027627#comment-13027627 ]

Shai Erera commented on LUCENE-3061:
------------------------------------

I don't think they are experimental though -- they have existed for ages; we only made them public. I get your point -- you don't think we should commit to this API signature -- but IMO we should: if MS is a valid extension point for applications, we must support this API, otherwise MS cannot be extended at all. Also, getNextMerge()'s javadoc specifies "Expert: the MergeScheduler calls this method ..." -- that kind of made this API public a long time ago, only it wasn't.

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
Re: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 7642 - Failure
On Mon, May 2, 2011 at 6:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> I slurped this hprof down and opened it w/ YourKit... Something weird is going on, because there is a single massive (151 MB) string, stack local to one of the threads, filled with character U+00B2.
>
> The test itself looks innocuous; I don't think it creates any massive stack local strings. I'm baffled. Robert maybe something crazy is happening in RuleBasedCollator?

upon further investigation, I think it must be a JRE bug. For one, I cannot (and was never able to) repro this locally. For now I'd like to change the test to use randomSimpleString -- hopefully this is enough to dodge the bug!
[jira] [Issue Comment Edited] (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027629#comment-13027629 ]

Matthias Pigulla edited comment on SOLR-42 at 5/2/11 12:02 PM:
---------------------------------------------------------------

I don't think it's a duplicate, and the issue is still unresolved, at least in regard to [#comment-12625835] and the 1.4.1 release.

The input string ??xx yy xx will have the start offsets for xx, yy and xx at 3, 6 and 9 respectively, and is off by one. ? ?? ?xx yy xx [spaces added between question marks for JIRA display] will even have 6, 9 and 12; that is, every ?? (as a special degenerate kind of XML PI) shifts the offset by one.

was (Author: mpdude):
I don't think it's a duplicate and the issue is still unresolved at least in regard to [#comment-12625835] and the 1.4.1 release. The input string ??xx yy xx will have the start offsets for xx, yy and xx at 3, 6 and 9 respectively and is off by one. xx yy xx will even have 6, 9 and 12, that is, every ?? (as a special degenerated kind of XML PI) will shift the offset by one.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>         Key: SOLR-42
>         URL: https://issues.apache.org/jira/browse/SOLR-42
>     Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>    Reporter: Andrew May
>    Assignee: Grant Ingersoll
>    Priority: Minor
> Attachments: HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java, htmlStripReaderTest.html
>
> Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
>
> Example title field: <sup>40</sup>Ar/<sup>39</sup>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
>
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
>
> Response from Yonik on the solr-user mailing-list:
>
> HTMLStripWhitespaceTokenizerFactory works in two phases... HTMLStripReader removes the HTML and passes the result to WhitespaceTokenizer... at that point, Tokens are generated, but the offsets will correspond to the text after HTML removal, not before.
>
> I did it this way so that HTMLStripReader could go before any tokenizer (like StandardTokenizer).
>
> Can you open a JIRA bug for this? The fix would be a special version of HTMLStripReader integrated with a WhitespaceTokenizer to keep offsets correct.
Re: Index searcher can't find the doc of any field value
First, this kind of question is better suited for the Lucene users' list; this list is intended for people actively developing the Lucene code itself.

That said, your problem most likely is that you are indexing your fields UN_TOKENIZED, which means that the information isn't split into words. Try using TOKENIZED. By the way, what version are you using? UN_TOKENIZED has been deprecated for quite some time. You would probably get a lot of value from Luke.

Best
Erick

On Fri, Apr 29, 2011 at 10:44 PM, soheila dehghanzadeh <sally...@gmail.com> wrote:
> Hi Friends,
>
> I'm using Lucene to index a file where each line contains 4 elements separated by spaces. Because I want to retrieve any line with specific text in a specific part, I add each line to the index as a separate document with 4 fields, named A, B, C and D. This is the code I use to index my file:
>
>     try {
>         File file = new File("e://data3");
>         BufferedReader reader = new BufferedReader(new FileReader(file));
>         IndexWriter writer = new IndexWriter(indexDirectory, new SimpleAnalyzer(), true);
>         writer.setUseCompoundFile(true);
>         String line;
>         while ((line = reader.readLine()) != null) {
>             String[] index = line.split(" ");
>             Document document = new Document();
>             document.add(new Field("A", index[0], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             document.add(new Field("B", index[1], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             document.add(new Field("C", index[2], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             document.add(new Field("D", index[3], Field.Store.YES, Field.Index.UN_TOKENIZED));
>             writer.addDocument(document);
>             System.out.println(writer.docCount());
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
>
> But when I try to search this index for some text that exists in, for example, field A, it fails to find the document (line). My search code is as follows:
>
>     try {
>         IndexSearcher is = new IndexSearcher(FSDirectory.getDirectory(indexDirectory, false));
>         Query q = new TermQuery(new Term("A", "hello"));
>         Hits hits = is.search(q);
>         for (int i = 0; i < hits.length(); i++) {
>             Document doc = hits.doc(i);
>             System.out.println("A: " + doc.get("A") + " B: " + doc.get("B")
>                 + " C: " + doc.get("C") + " D: " + doc.get("D"));
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
>
> Kindly let me know if there is any error in my code. Thanks in advance.
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027631#comment-13027631 ]

Michael McCandless commented on LUCENE-3061:
--------------------------------------------

I think they should be @experimental? (Eg, MS itself is.)

> Open IndexWriter API to allow custom MergeScheduler implementation
> ------------------------------------------------------------------
>
>         Key: LUCENE-3061
>         URL: https://issues.apache.org/jira/browse/LUCENE-3061
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>    Priority: Minor
>     Fix For: 3.2, 4.0
> Attachments: LUCENE-3061.patch, LUCENE-3061.patch
>
> IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations.
MergePolicy Thresholds
Hi

Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.

However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold.

So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set, from the application level, than the current thresholds? I.e., wouldn't it be a simpler threshold to set, instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor?

Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges)? It could pick segments for merge such that it maximizes the result segment size (i.e., not necessarily merging in sequential order), but never more than mergeFactor at a time.

I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things.

What do you think of this? Am I trying to optimize too much? :)

Shai
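The arithmetic in the first two paragraphs is worth making explicit. A tiny sketch of the back-of-envelope model (this is illustrative arithmetic about the behavior described above, not Lucene code):

```java
/** Back-of-envelope model of the LogMP thresholds described above (illustrative). */
class LogMergePolicyMath {
    /** Only segments at or below maxMergeMB are merge candidates, so the
     *  largest segment a merge can produce is about maxMergeMB * mergeFactor. */
    static double largestResultMB(double maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    /** A segment already above the threshold is never picked for merging again. */
    static boolean eligibleForMerge(double segmentMB, double maxMergeMB) {
        return segmentMB <= maxMergeMB;
    }
}
```

With maxMergeMB = 2048 (2 GB) and mergeFactor = 10 this gives the ~20 GB ceiling from the example, while a pre-existing 5 GB or 7 GB segment is already over the 2 GB threshold and is never merged again -- exactly the gap Shai is pointing at.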
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027640#comment-13027640 ]

Robert Muir commented on LUCENE-3054:
-------------------------------------

{quote}
I propose to change SorterTemplate to fall back to mergeSort once it checks that the number of iterations grows beyond e.g. 20 (have to test a little bit).
{quote}

I like the idea of some guard here to prevent the stack overflow, and hopefully keep the quickSort performance for the places where we know it's better than mergesort.

> SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
> --------------------------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-3054
>         URL: https://issues.apache.org/jira/browse/LUCENE-3054
>     Project: Lucene - Java
>  Issue Type: Task
> Affects Versions: 3.1
>    Reporter: Robert Muir
>    Priority: Critical
> Attachments: LUCENE-3054.patch, LUCENE-3054.patch
>
> Looking at Otis's sort problem on the mailing list, he said:
> {noformat}
> * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump
> {noformat}
> I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
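The guard being discussed is essentially the introsort idea: give quicksort a recursion budget and fall back to a safe O(n log n) sort when degenerate input (or a broken comparator) exhausts it. A self-contained sketch on int[] follows; this is illustrative, not SorterTemplate's actual code, and the exact budget would need the benchmarking Uwe mentions:

```java
import java.util.Arrays;

/** Depth-guarded quicksort: falls back to a safe sort past a recursion budget. */
class GuardedSort {
    static void sort(int[] a) {
        quick(a, 0, a.length - 1, 2 * log2(a.length) + 20);
    }

    private static int log2(int n) {
        return n <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(n);
    }

    private static void quick(int[] a, int lo, int hi, int budget) {
        if (lo >= hi) return;
        if (budget <= 0) {
            // Degenerate recursion: bail out before the stack overflows.
            fallbackSortRange(a, lo, hi);
            return;
        }
        // Classic Hoare-style partition around the middle element.
        int p = a[lo + (hi - lo) / 2], i = lo, j = hi;
        while (i <= j) {
            while (a[i] < p) i++;
            while (a[j] > p) j--;
            if (i <= j) { int t = a[i]; a[i++] = a[j]; a[j--] = t; }
        }
        quick(a, lo, j, budget - 1);
        quick(a, i, hi, budget - 1);
    }

    private static void fallbackSortRange(int[] a, int lo, int hi) {
        // Stand-in for a real mergesort: copy out, sort, copy back.
        int[] copy = Arrays.copyOfRange(a, lo, hi + 1);
        Arrays.sort(copy);
        System.arraycopy(copy, 0, a, lo, copy.length);
    }
}
```

The attraction is exactly what Robert describes: the fast path keeps quicksort's behavior, and the budget only kicks in on the pathological inputs that would otherwise recurse too deep.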
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3054:
----------------------------------

Attachment: LUCENE-3054-stackoverflow.patch

Patch that shows the issue.

> SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
> --------------------------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-3054
>         URL: https://issues.apache.org/jira/browse/LUCENE-3054
>     Project: Lucene - Java
>  Issue Type: Task
> Affects Versions: 3.1
>    Reporter: Robert Muir
>    Priority: Critical
> Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch
>
> Looking at Otis's sort problem on the mailing list, he said:
> {noformat}
> * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump
> {noformat}
> I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027643#comment-13027643 ]

Uwe Schindler commented on LUCENE-3054:
---------------------------------------

As quicksort gets insanely slow when this type of data gets sorted, this also explains Otis's slowdown.

> SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
> --------------------------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-3054
>         URL: https://issues.apache.org/jira/browse/LUCENE-3054
>     Project: Lucene - Java
>  Issue Type: Task
> Affects Versions: 3.1
>    Reporter: Robert Muir
>    Priority: Critical
> Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch
>
> Looking at Otis's sort problem on the mailing list, he said:
> {noformat}
> * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump
> {noformat}
> I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Created] (LUCENE-3062) TestBytesRefHash#testCompact is broken
TestBytesRefHash#testCompact is broken
--------------------------------------

        Key: LUCENE-3062
        URL: https://issues.apache.org/jira/browse/LUCENE-3062
    Project: Lucene - Java
 Issue Type: Bug
Affects Versions: 4.0
   Reporter: Simon Willnauer
   Assignee: Simon Willnauer
    Fix For: 4.0
Attachments: LUCENE-3062.patch

TestBytesRefHash#testCompact fails when run with ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360

{noformat}
[junit] Testsuite: org.apache.lucene.util.TestBytesRefHash
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.454 sec
[junit]
[junit] ------------- Standard Error -------------
[junit] NOTE: reproduce with: ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360
[junit] NOTE: test params are: codec=PreFlex, locale=et, timezone=Pacific/Tahiti
[junit] NOTE: all tests run in this JVM: [TestBytesRefHash]
[junit] NOTE: Linux 2.6.35-28-generic amd64/Sun Microsystems Inc. 1.6.0_24 (64-bit)/cpus=12,threads=1,free=363421800,total=379322368
[junit] ------------------------------------------
[junit] Testcase: testCompact(org.apache.lucene.util.TestBytesRefHash): Caused an ERROR
[junit] bitIndex < 0: -27
[junit] java.lang.IndexOutOfBoundsException: bitIndex < 0: -27
[junit]     at java.util.BitSet.set(BitSet.java:262)
[junit]     at org.apache.lucene.util.TestBytesRefHash.testCompact(TestBytesRefHash.java:146)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)
[junit]
[junit] Test org.apache.lucene.util.TestBytesRefHash FAILED
{noformat}

The test expects that _TestUtil.randomRealisticUnicodeString(random, 1000) will never return the same string twice. I will upload a patch soon.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
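Simon's diagnosis amounts to: the test assumed every call to the random-string helper returns a distinct string, which isn't guaranteed. A common fix for this class of test bug is to deduplicate through a Set. A hedged sketch follows -- `randomString` is a hypothetical stand-in for _TestUtil.randomRealisticUnicodeString, and this is not the attached patch:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class UniqueRandomStrings {
    /** Generate count distinct strings, silently skipping duplicates, so later
     *  bookkeeping that assumes uniqueness (e.g. a BitSet of ids) holds. */
    static Set<String> generate(Random random, int count) {
        Set<String> seen = new HashSet<>();
        while (seen.size() < count) {
            seen.add(randomString(random));  // adding a duplicate is a no-op
        }
        return seen;
    }

    // Hypothetical stand-in for _TestUtil.randomRealisticUnicodeString.
    static String randomString(Random random) {
        int len = 1 + random.nextInt(8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }
}
```

The Set absorbs repeats from the generator, so downstream assertions about "each string seen exactly once" cannot be tripped by an unlucky seed.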
[jira] [Updated] (LUCENE-3062) TestBytesRefHash#testCompact is broken
[ https://issues.apache.org/jira/browse/LUCENE-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3062: Attachment: LUCENE-3062.patch here is a patch
Re: MergePolicy Thresholds
Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges)? It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much?
:) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785
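The arithmetic in the message above can be made concrete. This is an illustrative sketch only (the helper names are invented, not LogMergePolicy API): the LogMP-style settings cap each merge *input* at maxMergeMB, so the largest produced segment is roughly maxMergeMB * mergeFactor, and any segment already over the input cap is never merged again.

```java
public class MergeMath {
    // Largest segment a LogMP-style policy can produce:
    // mergeFactor inputs, each at most maxMergeMB.
    static double largestProducedMB(double maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    // A segment above the input cap is never selected for merging again --
    // this is exactly the 5/7 GB problem described in the thread.
    static boolean eligibleForMerge(double segmentMB, double maxMergeMB) {
        return segmentMB <= maxMergeMB;
    }

    public static void main(String[] args) {
        // Shai's example: 2 GB threshold, mergeFactor=10 => ~20 GB output.
        System.out.println(largestProducedMB(2048, 10)); // 20480.0
        // ...but a 5 GB segment already exceeds the 2 GB input cap,
        // so it is stuck well below the 20 GB goal.
        System.out.println(eligibleForMerge(5 * 1024, 2048)); // false
    }
}
```

A single maxResultSegmentSizeMB knob, as proposed, would invert this: eligibility would be decided by the projected *output* size rather than each input's size.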
[jira] [Assigned] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-3054: - Assignee: Uwe Schindler SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
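Robert's observation -- a comparator that returns 0 for many distinct elements drives quicksort into unbalanced partitions and deep recursion -- can be sketched as follows. The fields of this simplified PostingsAndFreq are an illustration, not Lucene's actual class; the point is the missing tiebreaker.

```java
import java.util.Comparator;

public class TieBreaker {
    // Simplified stand-in for the PostingsAndFreq mentioned in the issue.
    static class PostingsAndFreq {
        final int docFreq;
        final int position;
        PostingsAndFreq(int docFreq, int position) {
            this.docFreq = docFreq;
            this.position = position;
        }
    }

    // Broken: in a large array, thousands of entries sharing one docFreq
    // all compare equal, giving quicksort pathological partitions.
    static final Comparator<PostingsAndFreq> BROKEN =
            (a, b) -> Integer.compare(a.docFreq, b.docFreq);

    // Fixed: break ties on a second field so the ordering is total.
    static final Comparator<PostingsAndFreq> TIEBROKEN = (a, b) -> {
        int cmp = Integer.compare(a.docFreq, b.docFreq);
        return cmp != 0 ? cmp : Integer.compare(a.position, b.position);
    };

    public static void main(String[] args) {
        PostingsAndFreq x = new PostingsAndFreq(3, 1);
        PostingsAndFreq y = new PostingsAndFreq(3, 2);
        System.out.println(BROKEN.compare(x, y));    // 0: indistinguishable
        System.out.println(TIEBROKEN.compare(x, y)); // -1: now ordered
    }
}
```

This also explains Otis's workaround: switching the call site to mergeSort hides the symptom (merge sort's recursion depth is bounded regardless of the comparator), but the tiebreaker fixes the cause.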
Re: MergePolicy Thresholds
I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what the right combination is. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) [...]
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027662#comment-13027662 ] Uwe Schindler commented on LUCENE-3054: --- Due to the realtime merge (LUCENE-3023), suddenly DocFieldProcessor got a reincarnation of quicksort again... will remove, too
[jira] [Resolved] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-3061. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Committed revision 1098543 (3x). Committed revision 1098576 (trunk). Open IndexWriter API to allow custom MergeScheduler implementation -- Key: LUCENE-3061 URL: https://issues.apache.org/jira/browse/LUCENE-3061 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3061.patch, LUCENE-3061.patch IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3023: -- Attachment: LUCENE-3023-quicksort-reincarnation.patch Here is the patch. Will commit soon. Land DWPT on trunk -- Key: LUCENE-3023 URL: https://issues.apache.org/jira/browse/LUCENE-3023 Project: Lucene - Java Issue Type: Task Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3023-quicksort-reincarnation.patch, LUCENE-3023-svn-diff.patch, LUCENE-3023-ws-changes.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_CHANGES.patch, LUCENE-3023_iw_iwc_jdoc.patch, LUCENE-3023_simonw_review.patch, LUCENE-3023_svndiff.patch, LUCENE-3023_svndiff.patch, diffMccand.py, diffSources.patch, diffSources.patch, realtime-TestAddIndexes-3.txt, realtime-TestAddIndexes-5.txt, realtime-TestIndexWriterExceptions-assert-6.txt, realtime-TestIndexWriterExceptions-npe-1.txt, realtime-TestIndexWriterExceptions-npe-2.txt, realtime-TestIndexWriterExceptions-npe-4.txt, realtime-TestOmitTf-corrupt-0.txt With LUCENE-2956 we have resolved the last remaining issue for LUCENE-2324, so we can proceed with landing the DWPT development on trunk soon. I think one of the bigger issues here is to make sure that all JavaDocs for IW etc. are still correct though. I will start going through that first. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened LUCENE-3023: --- I reopen this one, as the merge added a reincarnation of quicksort in DocFieldProcessor (which was previously removed in the corresponding *PerThread class, but lost during the merge). I will fix soon.
Re: MergePolicy Thresholds
Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). [...] -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785
[jira] [Resolved] (LUCENE-3023) Land DWPT on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-3023. --- Resolution: Fixed Removed quicksort in revision 1098592
Re: MergePolicy Thresholds
> The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/

I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

> It neatly avoids uber-merges

I didn't see that I can define what an uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create a 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: [...]
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Here is the patch that combines Robert's optimization for PhraseQuery (a term with a lower docFreq will also have fewer positions) and the general safety net for quickSort.
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Fix Version/s: 4.0, 3.2, 3.1.1 Set fix versions (also backport to 3.1.1, as it's serious for some large PhraseQueries and a serious slowdown then).
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Sorry, the safety net is only needed at 40 (from my tests); otherwise it may affect BytesRefHash performance. I will commit later!
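One way to read the "safety net" under discussion is an introsort-style depth cap: let quicksort recurse normally, but past a fixed depth (40 is the value mentioned above) fall back to a sort whose recursion is bounded regardless of the comparator. The sketch below is an assumption about the approach, not the committed patch, and Arrays.sort stands in for the mergeSort fallback.

```java
import java.util.Arrays;

public class SafeQuickSort {
    static final int DEPTH_LIMIT = 40; // threshold mentioned in the thread

    static void sort(int[] a, int from, int to, int depth) {
        if (to - from <= 1) return;
        if (depth > DEPTH_LIMIT) {
            // Safety net: a broken comparator can now only slow things
            // down, never overflow the stack.
            Arrays.sort(a, from, to);
            return;
        }
        // Plain Hoare-partition quicksort on the middle element.
        int pivot = a[from + (to - from) / 2];
        int i = from, j = to - 1;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        sort(a, from, j + 1, depth + 1);
        sort(a, i, to, depth + 1);
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 3, 3, 1, 4, 3, 2};
        sort(data, 0, data.length, 0);
        System.out.println(Arrays.toString(data));
    }
}
```

The cap is deliberately generous: well-behaved inputs recurse to roughly log2(n) and never hit it, which matches the remark that a lower threshold would start costing BytesRefHash performance.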
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Better test that fails faster in case of the quickSort bug
[Lucene.Net] fw: resolving github mirror issues
Is there any reason not to replace the old mirror with the newly created one? - Michael -- Hi, On Tue, Apr 26, 2011 at 7:51 PM, Michael Herndon mhern...@wickedsoftware.net wrote: Would it be possible to get the git mirror to reflect that or at least create a new mirror for the lucene.net repo that is under incubator? Unfortunately our mirroring scripts can't handle an svn move that wasn't done as a single commit (svn move .../lucene/lucene.net .../incubator/lucene.net), so I'll need to recreate the mirror. If and when you move back to Lucene or to a TLP, I suggest you move the full svn tree in a single commit. Do you still need the old mirror repository, or is it OK if I simply replace it with the newly created one? BR, Jukka Zitting
[jira] [Resolved] (LUCENE-3062) TestBytesRefHash#testCompact is broken
[ https://issues.apache.org/jira/browse/LUCENE-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3062. - Resolution: Fixed TestBytesRefHash#testCompact is broken -- Key: LUCENE-3062 URL: https://issues.apache.org/jira/browse/LUCENE-3062 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3062.patch TestBytesRefHash#testCompact fails when run with ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360 {noformat} [junit] Testsuite: org.apache.lucene.util.TestBytesRefHash [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.454 sec [junit] [junit] - Standard Error - [junit] NOTE: reproduce with: ant test -Dtestcase=TestBytesRefHash -Dtestmethod=testCompact -Dtests.seed=-7961072421643387492:5612141247152835360 [junit] NOTE: test params are: codec=PreFlex, locale=et, timezone=Pacific/Tahiti [junit] NOTE: all tests run in this JVM: [junit] [TestBytesRefHash] [junit] NOTE: Linux 2.6.35-28-generic amd64/Sun Microsystems Inc. 1.6.0_24 (64-bit)/cpus=12,threads=1,free=363421800,total=379322368 [junit] - --- [junit] Testcase: testCompact(org.apache.lucene.util.TestBytesRefHash): Caused an ERROR [junit] bitIndex 0: -27 [junit] java.lang.IndexOutOfBoundsException: bitIndex 0: -27 [junit] at java.util.BitSet.set(BitSet.java:262) [junit] at org.apache.lucene.util.TestBytesRefHash.testCompact(TestBytesRefHash.java:146) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) [junit] [junit] [junit] Test org.apache.lucene.util.TestBytesRefHash FAILED {noformat} the test expects that _TestUtil.randomRealisticUnicodeString(random, 1000); will never return the same string. I will upload a patch soon. 
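The TestBytesRefHash failure above comes down to assuming random strings are always distinct. A minimal sketch of the fix direction, collecting generated strings through a Set so duplicates are handled explicitly (the simple digit generator below is just a stand-in for _TestUtil.randomRealisticUnicodeString, which is the actual Lucene test utility):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch only: random string generators can repeat values, so a test that
// maps strings to slots must deduplicate first instead of assuming uniqueness.
public class DistinctStrings {
    static Set<String> distinct(Random random, int count) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < count; i++) {
            // Tiny alphabet on purpose: repeats are near-certain, like the
            // collisions that broke testCompact with some seeds.
            seen.add(Integer.toString(random.nextInt(10)));
        }
        return seen;
    }
}
```

The point is that the number of distinct values can be far smaller than the number of calls to the generator, which is exactly what the failing seed exposed.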
Re: MergePolicy Thresholds
Actually the new TieredMergePolicy (only on trunk currently, but I plan to backport it for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges)? It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai
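The threshold arithmetic Shai describes can be sketched in plain Java. The method names here are hypothetical helpers, not part of the Lucene API; they just model why a LogMergePolicy-style cap of 2GB with mergeFactor=10 yields roughly a 20GB ceiling, while pre-existing 5GB and 7GB segments above the cap are never merged again:

```java
// Illustrative model of LogMP-style sizing, not actual Lucene code.
public class LogMpMath {
    // The largest segment the index ends up holding is roughly
    // threshold * mergeFactor (merging mergeFactor segments at the cap).
    static long approxLargestSegmentMB(long maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    // A segment already above maxMergeMB is never considered for merging,
    // which is why the 5 GB and 7 GB segments in the example get stuck.
    static boolean isMergeCandidate(long segmentSizeMB, long maxMergeMB) {
        return segmentSizeMB <= maxMergeMB;
    }
}
```

For example, approxLargestSegmentMB(2048, 10) gives 20480 (about 20GB), while isMergeCandidate(5 * 1024, 2048) is false: the existing 5GB segment sits above the 2GB threshold and is simply skipped.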
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054.patch Final patch. After some discussion with Robert: the use of quickSort is fine after the comparator was fixed to not sort only by docFreq.
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027702#comment-13027702 ] Uwe Schindler commented on LUCENE-3054: --- Committed trunk revision: 1098633 Now merging...
[jira] [Resolved] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-3054. --- Resolution: Fixed Merged 3.x revision: 1098639 Merged 3.1 revision: 1098641
[jira] [Updated] (LUCENE-3063) factor CharTokenizer/CharacterUtils into analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3063: Attachment: LUCENE-3063.patch factor CharTokenizer/CharacterUtils into analyzers module - Key: LUCENE-3063 URL: https://issues.apache.org/jira/browse/LUCENE-3063 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3063.patch Currently these analysis components are in the lucene core, but should really be .util in the analyzers module. Also, with MockTokenizer extending Tokenizer directly, we can add some additional checks in the future to try to ensure our consumers are being good consumers (e.g. calling reset). This is mentioned in http://wiki.apache.org/lucene-java/TestIdeas, I didn't implement it here yet, this is just the factoring. I think we should try to do this before LUCENE-3040. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3063) factor CharTokenizer/CharacterUtils into analyzers module
factor CharTokenizer/CharacterUtils into analyzers module - Key: LUCENE-3063 URL: https://issues.apache.org/jira/browse/LUCENE-3063 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3063.patch Currently these analysis components are in the lucene core, but should really be .util in the analyzers module. Also, with MockTokenizer extending Tokenizer directly, we can add some additional checks in the future to try to ensure our consumers are being good consumers (e.g. calling reset). This is mentioned in http://wiki.apache.org/lucene-java/TestIdeas, I didn't implement it here yet, this is just the factoring. I think we should try to do this before LUCENE-3040. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-3054: Reopening so we can discuss things further: QuickSort is dangerous! Yet, it's definitely faster than MergeSort for some cases (~20% faster when sorting terms for writing a segment, in a quick test I ran on Wikipedia content).

So the core issue is we should not use QS when there's a risk of any ties, because in that case it can run really slowly or hit infinite recursion. And we (well, Otis; thank you!) found one such place today (where MultiPhraseQuery sorts its terms) where we could have many ties and thus run very slowly / hit stack overflow.

I appreciate the motivation for the safety net, but it makes me nervous... because, say we had done this a few months back... then Otis likely would not have reported the issue? Ie, the MultiPhraseQuery would run slowly... which could evade detection (people may just think it's slow). I prefer brittle fails over silent slowdowns because the brittle fail gets your attention and you get a real fix in. Silent slowdowns evade detection. Sort of like the difference between a virus and spyware...

Also, what's preventing us from accidentally using QS somewhere in the future, where we shouldn't? What's going to catch us? Robert's first patch would catch this and protect us going forward? Or, maybe we could strengthen that approach and assert cmp != 0 inside QS (ie, no ties are allowed to be passed to QS)? Though, using asserts only is risky, because it could be that the comparator may return 0, but none of our test cases tickled it.

Maybe instead we could do this in a type-safe way: make a new NoTiesComparator whose compare method can only return LESS_THAN or GREATER_THAN? And then QS would require NoTiesComparator. Could that work?
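The type-safe idea floated in this reopen comment can be sketched in a few lines: give the comparator a return type that has no value meaning "equal", so a tie literally cannot be expressed. All names below are hypothetical, not actual Lucene classes:

```java
// Sketch of a "no ties" comparator contract: the enum has no EQUAL value,
// so an implementation is forced to break ties somehow (here, by ord).
public class NoTiesSketch {
    enum Order { LESS_THAN, GREATER_THAN }

    interface NoTiesComparator<T> {
        Order compare(T a, T b); // caller guarantees a and b are distinct entries
    }

    // Entries that may carry equal terms are disambiguated by their ord
    // (position in the original array), which is unique by construction.
    static class Entry {
        final String term;
        final int ord;
        Entry(String term, int ord) { this.term = term; this.ord = ord; }
    }

    static final NoTiesComparator<Entry> BY_TERM_THEN_ORD = (a, b) -> {
        int cmp = a.term.compareTo(b.term);
        if (cmp != 0) return cmp < 0 ? Order.LESS_THAN : Order.GREATER_THAN;
        // terms tie: the unique ord always decides
        return a.ord < b.ord ? Order.LESS_THAN : Order.GREATER_THAN;
    };
}
```

A quickSort that only accepts a NoTiesComparator would then reject tie-prone comparators at compile time rather than relying on asserts that a test seed may never tickle.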
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027730#comment-13027730 ] Michael McCandless commented on LUCENE-3054: Also, I think PQ.PostingsAndFreq.compare is still able to return ties, if the app puts the same term at the same position (which is a silly thing to do... but, still possible). I think instead of disambiguating by Term, we should disambiguate by ord (ie, position of this term in the array of the query itself), since that can never be the same for entries in the array?
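The ord-based tiebreak suggested here can be sketched with a plain Comparator. The class and field names are illustrative stand-ins, not the actual MultiPhraseQuery internals; the point is that since each entry's ord (its index in the query's own array) is unique, compare() can return 0 only for an entry against itself:

```java
import java.util.Comparator;

// Illustrative PostingsAndFreq-style entry with an ord tiebreak.
public class OrdTieBreak {
    static class PostingsAndFreqLike {
        final int position; // term position in the phrase
        final int ord;      // index of this entry in the query's own array
        PostingsAndFreqLike(int position, int ord) {
            this.position = position;
            this.ord = ord;
        }
    }

    // Compare by position first; equal positions fall through to the
    // unique ord, so no ties can reach the sort.
    static final Comparator<PostingsAndFreqLike> BY_POSITION_THEN_ORD =
        Comparator.<PostingsAndFreqLike>comparingInt(e -> e.position)
                  .thenComparingInt(e -> e.ord);
}
```

Unlike disambiguating by Term, this stays total even when the application adds the same term at the same position twice.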
Re: MergePolicy Thresholds
Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai On Mon, May 2, 2011 at 6:34 PM, Michael McCandless luc...@mikemccandless.com wrote: Actually the new TieredMergePolicy (only on trunk currently but I plan to backport for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com
Re: MergePolicy Thresholds
I think it should be an easy port... Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote: Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai
[jira] [Commented] (LUCENE-2945) Surround Query doesn't properly handle equals/hashcode
[ https://issues.apache.org/jira/browse/LUCENE-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027755#comment-13027755 ] Paul Elschot commented on LUCENE-2945: -- Does the latest patch solve the original problem as expected? Surround Query doesn't properly handle equals/hashcode -- Key: LUCENE-2945 URL: https://issues.apache.org/jira/browse/LUCENE-2945 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 3.1.1, 4.0 Attachments: LUCENE-2945-partial1.patch, LUCENE-2945.patch, LUCENE-2945.patch, LUCENE-2945.patch, LUCENE-2945c.patch, LUCENE-2945d.patch, LUCENE-2945d.patch In looking at using the surround queries with Solr, I am hitting issues caused by collisions due to equals/hashcode not being implemented on the anonymous inner classes that are created by things like DistanceQuery (branch 3.x, near line 76) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: MergePolicy Thresholds
Hi Shai and Mike, Testing the TieredMP on our large indexes has been on my todo list since I read Mike's blog post http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html. If you port it to the 3.x branch Shai, I'll be more than happy to test it with our very large (300GB+) indexes. Besides being able to set the max merged segment size, I'm especially interested in using the maxSegmentsPerTier parameter. From Mike's blog post: ...maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027772#comment-13027772 ] Dawid Weiss commented on LUCENE-3054: - I'm sure many of you know this, but there is a new implementation of mergesort in java.util.Collections -- it is based on a few clever heuristics (so it is a merge sort, only a finely tuned one) and was ported from / partially inspired by the sort in Python, as far as I recall. Maybe it'd be sensible to compare against this and see what happens. I know Lucene/Solr would rather have its own implementation so that it doesn't rely on the standard library, but in my benchmarks the implementation in Collections.sort() was hard to beat...
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027774#comment-13027774 ] Uwe Schindler commented on LUCENE-3054: --- Dawid: There are two problems we have seen with the native sort: - it always copies the array/collection first, which caused slowdowns in lots of places, especially in automaton -- so it never sorts in place; - we sometimes need to sort multiple arrays in parallel, one as the sort key, especially in TermsHash/BytesRefHash. This is where SorterTemplate comes into play: it supports separate swap(i,j) and compare(i,j) operations. Uwe
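The parallel-array point Uwe makes can be sketched with a toy sort: the algorithm only ever calls compare(i, j) and swap(i, j), so a payload array stays aligned with the keys and nothing is copied. The real SorterTemplate lives in org.apache.lucene.util and provides quickSort/mergeSort; the insertion sort below just stands in for those:

```java
// Toy SorterTemplate-style sort: the sort body knows only compare/swap,
// so parallel arrays are kept in sync and the sort happens in place.
public class ParallelSort {
    final int[] keys;
    final String[] payload; // swapped along with keys

    ParallelSort(int[] keys, String[] payload) {
        this.keys = keys;
        this.payload = payload;
    }

    int compare(int i, int j) { return Integer.compare(keys[i], keys[j]); }

    void swap(int i, int j) {
        int k = keys[i]; keys[i] = keys[j]; keys[j] = k;
        String p = payload[i]; payload[i] = payload[j]; payload[j] = p;
    }

    // Stand-in for SorterTemplate's quickSort/mergeSort entry points.
    void insertionSort() {
        for (int i = 1; i < keys.length; i++)
            for (int j = i; j > 0 && compare(j - 1, j) > 0; j--)
                swap(j - 1, j);
    }
}
```

This is also why Collections.sort() is a poor fit here: it sorts a copy of a single collection, while TermsHash/BytesRefHash need the key array and its companion arrays permuted together, in place.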
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 7659 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7659/ 1 tests failed. REGRESSION: org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple Error Message: expected:<3> but was:<2> Stack Trace: junit.framework.AssertionFailedError: expected:<3> but was:<2> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1112) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1040) at org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple(TestLBHttpSolrServer.java:127) Build Log (for compile errors): [...truncated 10762 lines...]
[jira] [Resolved] (LUCENE-3059) PulsingTermState.clone leaks memory
[ https://issues.apache.org/jira/browse/LUCENE-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3059. Resolution: Fixed PulsingTermState.clone leaks memory --- Key: LUCENE-3059 URL: https://issues.apache.org/jira/browse/LUCENE-3059 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3059.patch I looked at the heap dump from the OOME this morning (thank you Uwe for turning this on!), and I think it's a real memory leak. Well, not really a leak; rather, the cloned PulsingTermState, which we cache in the terms dict cache, is hanging onto large byte[] unnecessarily. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027780#comment-13027780 ] Dawid Weiss commented on LUCENE-3054: - Thanks Uwe, I didn't know about it. Still, the algorithm the OpenJDK folks have implemented is public, so an improvement can be filed -- maybe somebody will find the time to implement it in a version suitable for Lucene. http://en.wikipedia.org/wiki/Timsort
RE: Link to nightly build test reports on main Lucene site needs updating
Thanks for fixing++ Tom -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Sunday, May 01, 2011 6:05 AM To: dev@lucene.apache.org; simon.willna...@gmail.com; java-u...@lucene.apache.org Subject: RE: Link to nightly build test reports on main Lucene site needs updating I fixed the nightly docs, once the webserver mirrors them from SVN they should appear. The developer-resources page was completely broken. It now also contains references to the stable 3.x branch as most users would prefer that one to fix latest bugs but don’t want to have a backwards-incompatible version. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
[jira] [Resolved] (SOLR-2467) Custom analyzer load exceptions are not logged.
[ https://issues.apache.org/jira/browse/SOLR-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-2467. Resolution: Fixed Fix Version/s: 4.0 3.2 Thanks for reporting this Alex Committed revision 1098760. Committed revision 1098764. Custom analyzer load exceptions are not logged. --- Key: SOLR-2467 URL: https://issues.apache.org/jira/browse/SOLR-2467 Project: Solr Issue Type: Bug Affects Versions: 3.1 Reporter: Alexander Kistanov Priority: Minor Fix For: 3.2, 4.0 If any exception occurs while loading a custom analyzer, the following catch block runs: {code:title=solr/src/java/org/apache/solr/schema/IndexSchema.java} } catch (Exception e) { throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, "Cannot load analyzer: " + analyzerName ); } {code} The analyzer load exception e is not logged at all. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
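The fix pattern here is to keep the original exception chained as the cause (and log it) instead of swallowing it. A minimal standalone sketch of that pattern, with stand-in class names (ServerException substitutes for SolrException, and the analyzer class name is made up):

```java
// Standalone sketch of the SOLR-2467 fix pattern: chain the caught
// exception as the cause so its stack trace survives for logging.
// ServerException is a stand-in for SolrException; names are illustrative.
class AnalyzerLoadDemo {
  static class ServerException extends RuntimeException {
    ServerException(String msg, Throwable cause) { super(msg, cause); }
  }

  static Object loadAnalyzer(String analyzerName) {
    try {
      // simulate the reflective class load failing
      throw new ClassNotFoundException(analyzerName);
    } catch (Exception e) {
      // before the fix: new ServerException("Cannot load analyzer: " + analyzerName)
      // -- e and its stack trace were dropped entirely
      throw new ServerException("Cannot load analyzer: " + analyzerName, e);
    }
  }

  public static void main(String[] args) {
    try {
      loadAnalyzer("com.example.MyAnalyzer");  // hypothetical class name
    } catch (ServerException e) {
      // the root cause now survives for any logging framework to print
      System.out.println(e.getCause().getClass().getSimpleName());  // ClassNotFoundException
    }
  }
}
```

In Solr itself the equivalent is logging via the schema's logger and/or passing `e` to a SolrException constructor that accepts a cause.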
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027808#comment-13027808 ] Michael McCandless commented on LUCENE-3054: So, there are two known improvements to our QS, to try to avoid the O(N^2) worst-case, both from Robert Sedgewick. First, it's better to select the median of low/mid/high as the pivot (http://en.wikipedia.org/wiki/Quicksort#Choice_of_pivot). Second, we should handle equal values better (http://www.angelfire.com/pq/jamesbarbetti/articles/sorting/001_QuicksortIsBroken.htm#Duplicates). See also Lucy's nice QS impl: http://svn.apache.org/viewvc/incubator/lucy/trunk/core/Lucy/Util/SortUtils.c?revision=1098445&view=markup#l331 which I think addresses the above two issues, and goes even further (eq-to-pivot values are explicitly moved to the middle and then not recursed on). The thing is, fixing these will make our QS more general, at the expense of some added cost for the cases we know work fine today (eg sorting terms before flushing a segment). Maybe we leave our QS as is (except, changing the 40 to be dynamic depending on input length), noting that you should not use it if your comparator does not break ties, and even if it does there are still risks because of potentially bad pivot selection? Or, maybe we remove QS and always use MS? Yes, there's a hit to the sort when flushing the segment, but this is a tiny cost compared to the rest of segment flushing... 
Separately we can look into whether timsort is faster for sorting terms for flush. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
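The two improvements described above (median-of-3 pivot selection, better handling of equal values) can be sketched together as a three-way "fat pivot" quicksort, where the run of pivot-equal values is placed in the middle and never recursed on. This is an illustrative standalone version with made-up names, not Lucene's SorterTemplate.

```java
import java.util.Arrays;

// Sketch of the two Sedgewick-style fixes: median-of-3 pivot selection
// plus a three-way (Dutch-flag) partition, so arrays with only a few
// distinct values -- the broken-comparator case that blew the stack --
// never trigger deep recursion on the equal run.
class ThreeWayQuickSort {
  static void sort(int[] a) {
    sort(a, 0, a.length - 1);
  }

  private static void sort(int[] a, int lo, int hi) {
    if (lo >= hi) return;
    // median-of-3: order a[lo], a[mid], a[hi]; a[mid] is then the median
    int mid = (lo + hi) >>> 1;
    if (a[mid] < a[lo]) swap(a, mid, lo);
    if (a[hi] < a[lo]) swap(a, hi, lo);
    if (a[hi] < a[mid]) swap(a, hi, mid);
    int pivot = a[mid];
    // three-way partition: a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot
    int lt = lo, gt = hi, i = lo;
    while (i <= gt) {
      if (a[i] < pivot) swap(a, lt++, i++);
      else if (a[i] > pivot) swap(a, i, gt--);
      else i++;
    }
    sort(a, lo, lt - 1);  // the equal run a[lt..gt] is already in place
    sort(a, gt + 1, hi);
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i]; a[i] = a[j]; a[j] = t;
  }

  public static void main(String[] args) {
    int[] a = new int[200_000];
    for (int i = 0; i < a.length; i++) a[i] = i % 3;  // only 3 distinct values
    sort(a);  // completes without deep recursion
    System.out.println(Arrays.toString(Arrays.copyOf(a, 5)));  // [0, 0, 0, 0, 0]
  }
}
```

With a naive quicksort, the same few-distinct-values input degrades toward O(N^2) and deep recursion; here each level strips out the entire equal run before recursing.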
[jira] [Resolved] (LUCENE-3029) MultiPhraseQuery assigns different scores to identical docs when using 0 pos-incr
[ https://issues.apache.org/jira/browse/LUCENE-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-3029. Resolution: Fixed MultiPhraseQuery assigns different scores to identical docs when using 0 pos-incr - Key: LUCENE-3029 URL: https://issues.apache.org/jira/browse/LUCENE-3029 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.0.4, 3.2, 4.0 Attachments: LUCENE-3029.patch If you have two identical docs with tokens a b c all zero pos-incr (ie they occur on the same position), and you run a MultiPhraseQuery with [a, b] and [c] (all pos incr 0)... then the two docs will get different scores despite being identical. Admittedly it's a strange query... but I think the scorer ought to count the phrase as having tf=1 for each doc. The problem is that we are missing a tie-breaker for the PhraseQuery used by ExactPhraseScorer, and so the PQ ends up flip/flopping such that every other document gets the same score. Ie, even docIDs all get one score and odd docIDs all get another score. Once I added the hard tie-breaker (ord) the scores are the same. However... there's a separate bug, that can over-count the tf, such that if I create the MPQ like this: {noformat} mpq.add(new Term[] {new Term("field", "a")}, 0); mpq.add(new Term[] {new Term("field", "b"), new Term("field", "c")}, 0); {noformat} I get tf=2 per doc, but if I create it like this: {noformat} mpq.add(new Term[] {new Term("field", "b"), new Term("field", "c")}, 0); mpq.add(new Term[] {new Term("field", "a")}, 0); {noformat} I get tf=1 (which I think is correct?). This happens because MultipleTermPositions freely returns the same position more than once: it just unions the positions of the two streams, so when both have their term at pos=0, you'll get pos=0 twice, which is not good and leads to over-counting tf. Unfortunately, I don't see a performant way to fix that... 
and I'm not sure that it really matters that much in practice. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
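The over-counting Mike describes is easy to reproduce in isolation: a plain merge-union of two sorted position streams keeps duplicates, so when two terms both sit at position 0 the union yields 0 twice and tf is over-counted. This is a hypothetical sketch of the behavior, not the real MultipleTermPositions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why unioning two position streams over-counts tf: a plain
// sorted merge keeps duplicates, so co-located terms (pos-incr 0) emit
// the same position twice. Illustrative only.
class PositionUnionDemo {
  // merge two sorted position arrays, keeping duplicates (the buggy behavior)
  static List<Integer> union(int[] a, int[] b) {
    List<Integer> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length || j < b.length) {
      if (j == b.length || (i < a.length && a[i] <= b[j])) out.add(a[i++]);
      else out.add(b[j++]);
    }
    return out;
  }

  public static void main(String[] args) {
    // terms "b" and "c" both indexed at position 0 (zero pos-incr)
    System.out.println(union(new int[]{0}, new int[]{0}));  // [0, 0] -> tf over-counted
  }
}
```

De-duplicating during the merge would fix the count, but as the message notes, doing that without slowing down the common case is the hard part.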
[jira] [Updated] (SOLR-2484) Make SynonymFilterFactory more extendable
[ https://issues.apache.org/jira/browse/SOLR-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McKinley updated SOLR-2484: Attachment: SOLR-2484-SynonymFilterFactory.patch patch with a simple test Make SynonymFilterFactory more extendable - Key: SOLR-2484 URL: https://issues.apache.org/jira/browse/SOLR-2484 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2484-SynonymFilterFactory.patch As is, reading rules from the ResourceLoader is baked into inform(ResourceLoader loader). It would be nice to be able to load custom rules w/o needing to rewrite the whole thing. This issue changes two things: # Changes ListString to IterableString because we don't really need a list # adds protected IterableString loadRules( String synonyms, ResourceLoader loader ) -- so subclasses could fill their own -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2484) Make SynonymFilterFactory more extendable
Make SynonymFilterFactory more extendable - Key: SOLR-2484 URL: https://issues.apache.org/jira/browse/SOLR-2484 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2484-SynonymFilterFactory.patch As is, reading rules from the ResourceLoader is baked into inform(ResourceLoader loader). It would be nice to be able to load custom rules w/o needing to rewrite the whole thing. This issue changes two things: # Changes ListString to IterableString because we don't really need a list # adds protected IterableString loadRules( String synonyms, ResourceLoader loader ) -- so subclasses could fill their own -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
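The extension point this issue proposes -- a protected `loadRules(String, ResourceLoader)` hook returning `Iterable<String>` -- would let a subclass supply rules from any source without rewriting `inform`. A sketch with the Solr types stubbed out; the subclass and its rule strings are made up for illustration.

```java
import java.util.Arrays;

// Sketch of the SOLR-2484 extension point: subclasses override loadRules
// to supply synonym rules from somewhere other than a resource file.
// ResourceLoader is stubbed; the real one is Solr's.
class CustomRulesDemo {
  interface ResourceLoader {}  // stand-in for org.apache.solr.common.ResourceLoader

  static class SynonymFilterFactory {
    // the hook the issue adds: Iterable, because callers only iterate
    protected Iterable<String> loadRules(String synonyms, ResourceLoader loader) {
      // default behavior: read the named resource via the loader (omitted here)
      throw new UnsupportedOperationException("resource loading not shown");
    }
  }

  // hypothetical subclass feeding rules from, say, a database
  static class DatabaseSynonymFactory extends SynonymFilterFactory {
    @Override
    protected Iterable<String> loadRules(String synonyms, ResourceLoader loader) {
      return Arrays.asList("couch,sofa", "tv,television");
    }
  }

  public static void main(String[] args) {
    for (String rule : new DatabaseSynonymFactory().loadRules("ignored", null)) {
      System.out.println(rule);
    }
  }
}
```

Returning `Iterable<String>` rather than `List<String>` keeps the contract minimal, which is the second change the issue describes.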
[jira] [Updated] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Matheis (steffkes) updated SOLR-2399: Description: *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin Quick Tour: [Dashboard|http://files.mathe.is/solr-admin/01_dashboard.png], [Query-Form|http://files.mathe.is/solr-admin/02_query.png], [Plugins|http://files.mathe.is/solr-admin/05_plugins.png], [Logging|http://files.mathe.is/solr-admin/07_logging.png], [Analysis|http://files.mathe.is/solr-admin/04_analysis.png], [Schema-Browser|http://files.mathe.is/solr-admin/06_schema-browser.png], [Dataimport|http://files.mathe.is/solr-admin/08_dataimport.png], [Core-Admin|http://files.mathe.is/solr-admin/09_coreadmin.png] Newly created Wiki-Page: http://wiki.apache.org/solr/ReworkedSolrAdminGUI was: *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin Quick Tour: [Dashboard|http://files.mathe.is/solr-admin/01_dashboard.png], [Query-Form|http://files.mathe.is/solr-admin/02_query.png], [Plugins|http://files.mathe.is/solr-admin/05_plugins.png], [Logging|http://files.mathe.is/solr-admin/07_logging.png], [Analysis|http://files.mathe.is/solr-admin/04_analysis.png], [Schema-Browser|http://files.mathe.is/solr-admin/06_schema-browser.png], [Dataimport|http://files.mathe.is/solr-admin/08_dataimport.png] Newly created Wiki-Page: http://wiki.apache.org/solr/ReworkedSolrAdminGUI Solr Admin 
Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor Fix For: 4.0 *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin Quick Tour: [Dashboard|http://files.mathe.is/solr-admin/01_dashboard.png], [Query-Form|http://files.mathe.is/solr-admin/02_query.png], [Plugins|http://files.mathe.is/solr-admin/05_plugins.png], [Logging|http://files.mathe.is/solr-admin/07_logging.png], [Analysis|http://files.mathe.is/solr-admin/04_analysis.png], [Schema-Browser|http://files.mathe.is/solr-admin/06_schema-browser.png], [Dataimport|http://files.mathe.is/solr-admin/08_dataimport.png], [Core-Admin|http://files.mathe.is/solr-admin/09_coreadmin.png] Newly created Wiki-Page: http://wiki.apache.org/solr/ReworkedSolrAdminGUI -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027846#comment-13027846 ] Stefan Matheis (steffkes) commented on SOLR-2399: - Just because I had a quick idea for it this morning -- the [Core-Admin Screen|http://files.mathe.is/solr-admin/09_coreadmin.png]. Add Core will open an additional Layer with a Form where you could type all required information. Actually the Functionality for the Buttons is missing; it will be added tomorrow. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027850#comment-13027850 ] Otis Gospodnetic commented on SOLR-2399: Thanks for doing all this, Stefan! I looked at the Analysis screenshot and found it a bit hard to eyeball quickly because the whole thing feels very pale, which makes it hard for an eye to quickly jump from tokenizer, to token filter, to next token filter, etc. It's also not immediately obvious what the left side vs. right side are, so maybe a more visible "Index-time Analysis" and "Query-time Analysis" may help. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2485) Remove BaseResponseWriter, GenericBinaryResponseWriter, and GenericTextResponseWriter
Remove BaseResponseWriter, GenericBinaryResponseWriter, and GenericTextResponseWriter - Key: SOLR-2485 URL: https://issues.apache.org/jira/browse/SOLR-2485 Project: Solr Issue Type: Task Components: Response Writers Reporter: Ryan McKinley Fix For: 4.0 In SOLR-1566, we dramatically refactored the response writer framework -- BaseResponseWriter, GenericBinaryResponseWriter, and GenericTextResponseWriter got left out in the cold because they don't have any tests and it is unclear how they are supposed to work. With the new refactoring, I think the goals of these classes are better supported by extending BinaryResponseWriter and TextResponseWriter. in 3.x, these classes should be deprecated and suggest using BinaryResponseWriter and TextResponseWriter -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: MergePolicy Thresholds
The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ I agree. I wonder though if the knobs we give on LogMP are intuitive enough. It neatly avoids uber-merges I didn't see that I can define what uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create a 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. No, you can't. But you can tell it to have exactly (not 'at most') N top-tier segments and try to keep their sizes close with merges. Whatever that size may be. And this is exactly what I want. And defining a max cap on segment size is not what I want. So the same set of knobs can be intuitive and meaningful for one person, and useless for another. And you can't pick the best one. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. 
Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., the largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. 
I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail:
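The arithmetic behind the thread can be sketched with the hypothetical numbers Shai uses: under a LogMP-style cutoff, a segment above maxMergeMB is never a merge candidate, so the largest segment a merge can produce is roughly maxMergeMB * mergeFactor, and existing 5-7 GB segments sit forever above a 2 GB threshold. This is just the size arithmetic, not a MergePolicy implementation.

```java
// Back-of-envelope for the LogMP thresholds discussed above (sizes in MB).
// Hypothetical numbers from the thread; not an actual MergePolicy.
class MergeThresholdDemo {
  // LogMP-style cutoff: segments above maxMergeMB are never merge candidates
  static boolean mergeCandidate(double segMB, double maxMergeMB) {
    return segMB <= maxMergeMB;
  }

  // rough ceiling on the segment size a single merge can produce
  static double maxProducedMB(double maxMergeMB, int mergeFactor) {
    return maxMergeMB * mergeFactor;
  }

  public static void main(String[] args) {
    double maxMergeMB = 2048;  // 2 GB threshold
    int mergeFactor = 10;
    System.out.println(maxProducedMB(maxMergeMB, mergeFactor) / 1024 + " GB");  // 20.0 GB
    System.out.println(mergeCandidate(5 * 1024, maxMergeMB));  // false: 5 GB segment never merges
    System.out.println(mergeCandidate(7 * 1024, maxMergeMB));  // false: 7 GB segment never merges
  }
}
```

A maxResultSegmentSizeMB-style knob would instead bound the *output* size, letting the 5 GB and 7 GB segments merge as long as their sum stays under the cap.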
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7666 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7666/ All tests passed Build Log (for compile errors): [...truncated 7968 lines...] [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> whitetok = fieldNames.get("whitetok"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] indexPart = whitetok.get("index"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] queryPart = whitetok.get("query"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:288: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> keywordtok = fieldNames.get("keywordtok"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:291: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] indexPart = 
keywordtok.get("index"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:299: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>> [javac] queryPart = keywordtok.get("query"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:320: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> fieldTypes = result.get("field_types"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:322: warning: [unchecked] unchecked conversion [javac] found : org.apache.solr.common.util.NamedList [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList> [javac] NamedList<NamedList> textType = fieldTypes.get("charfilthtmlmap"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:331: warning: [unchecked] unchecked cast [javac] found : java.lang.Object [javac] required: java.util.List<org.apache.solr.common.util.NamedList> [javac] List<NamedList> tokenList = (List<NamedList>) indexPart.get("org.apache.lucene.analysis.core.WhitespaceTokenizer"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:154: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList [javac] spellchecker.add(AbstractLuceneSpellChecker.DICTIONARY_NAME, "default"); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:155: warning: [unchecked]
Re: modularization discussion
On Apr 27, 2011, at 11:45 PM, Greg Stein wrote: On Wed, Apr 27, 2011 at 09:25:14AM -0400, Yonik Seeley wrote: ... But as I said... it seems only fair to meet half way and use the solr namespace for some modules and the lucene namespace for others. Please explain this part to me... I really don't understand. At the risk of speaking for someone else, I think it has to do w/ wanting to maintain brand awareness for Solr. We, as the PMC, currently produce two products: Apache Lucene and Apache Solr. I believe Yonik's concern is that if everything is just labeled Lucene, then Solr is just seen as a very thin shell around Lucene (which, IMO, would still not be the case, since wiring together a server app like Solr is non-trivial, but that is my opinion and I'm not sure if Yonik shares it). Solr has never been a thin shell around Lucene and never will be. However, in some ways, this gets at why I believe Yonik was interested in a Solr TLP: so that Solr could stand on its own as a brand and as a first-class Apache product steered by a PMC that is aligned solely w/ producing the Solr (i.e. as a TLP) product as opposed to the two products we produce now. (Note, my vote on such a TLP was -1, so please don't confuse me as arguing for the point; I'm just trying to, hopefully, explain it.) That being said, 99% of consumers of Solr never even know what is in the underlying namespace b/c they only ever interact w/ Solr via HTTP (which has solr in the namespace by default) at the server API level, so at least in my mind, I don't care what the namespace used underneath is. Call it lusolr for all I care. What does fairness have to do with the codebase? I can't speak to this, but perhaps it's just the wrong choice of words and would have been better said: please don't take this as a reason to gut Solr and call everything Lucene. Isn't the whole point of the Lucene project to create the best code possible, for the benefit of our worldwide users? It is. 
We do that primarily through the release of two products: Lucene and Solr. Lucene is a Java class library. A good deal of programming is required to create anything meaningful in terms of a production-ready search server. Solr is a server that takes most things that are programming tasks in Lucene and makes them configuration tasks, as well as adds a fair bit of functionality (distributed search, replication, faceting, auto-suggest, etc.) and is thus that much easier to put in production (I've seen people be in production on Solr in a matter of days/weeks; I've never seen that with Lucene). The crux of this debate is whether these additional pieces are better served as modules (I think they are) or tightly coupled inside of Solr (which does have a few benefits from a dev. point of view, even though I firmly believe they are outweighed by the positives of modularization.) And, while I think most of us agree that modularization makes sense, that doesn't mean there aren't reasons against it. I also believe we need to take it on a case by case basis. I also don't think every patch has to be in its final place on first commit. As Otis so often says, it's just software. If it doesn't work, change it. Thus, if people contribute and it lands in Solr, the committer who commits it need not immediately move it (although, hopefully they will) or ask the contributor to do so, as that will likely dampen contributions. Likewise for Lucene. Along with that, if and when others wish to refactor, then they should by all means be allowed to do so, assuming of course, all tests across both products still pass. In short, I believe people should still contribute where they see they can add the most value and according to their time schedules. Additionally, others who have more time or the ability to refactor for reusability should be free to do so as well. 
I don't know what the outcome of this thread should be, so I guess we need to just move forward and keep coding away and working to make things better. Do others see anything broader here? A vote? That would be symbolic, I guess, but doesn't force anyone to do anything since there isn't a specific issue at hand other than a broad concept that is seen as good. -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027878#comment-13027878 ] Ryan McKinley commented on SOLR-2399: - Stefan -- this stuff is looking great! Would you mind uploading a snapshot of your repo to this issue? I would like to start a branch in the apache repo, but we need to have the proper Apache release boxes ticked (part of the process when you upload a patch). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7667 - Still Failing
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7667/

All tests passed

Build Log (for compile errors):
[...truncated 7958 lines...]
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> whitetok = fieldNames.get("whitetok");
    [javac]                                                    ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     indexPart = whitetok.get("index");
    [javac]                              ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     queryPart = whitetok.get("query");
    [javac]                              ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:288: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> keywordtok = fieldNames.get("keywordtok");
    [javac]                                                      ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:291: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     indexPart = keywordtok.get("index");
    [javac]                                ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:299: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<java.util.List<org.apache.solr.common.util.NamedList>>
    [javac]     queryPart = keywordtok.get("query");
    [javac]                                ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:320: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> fieldTypes = result.get("field_types");
    [javac]                                                  ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:322: warning: [unchecked] unchecked conversion
    [javac] found   : org.apache.solr.common.util.NamedList
    [javac] required: org.apache.solr.common.util.NamedList<org.apache.solr.common.util.NamedList>
    [javac]     NamedList<NamedList> textType = fieldTypes.get("charfilthtmlmap");
    [javac]                                                    ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:331: warning: [unchecked] unchecked cast
    [javac] found   : java.lang.Object
    [javac] required: java.util.List<org.apache.solr.common.util.NamedList>
    [javac]     List<NamedList> tokenList = (List<NamedList>)indexPart.get("org.apache.lucene.analysis.core.WhitespaceTokenizer");
    [javac]                                                               ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:154: warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of the raw type org.apache.solr.common.util.NamedList
    [javac]     spellchecker.add(AbstractLuceneSpellChecker.DICTIONARY_NAME, "default");
    [javac]                 ^
    [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:155: warning: [unchecked]
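All of the warnings above are one pattern: a raw `NamedList` flowing into a parameterized type, which javac cannot verify. A self-contained illustration with plain `java.util.List` (a hypothetical demo class, not the actual Solr test code) showing both the warning and the conventional fix:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: the same [unchecked] conversion warning javac
// emits in the log above, reproduced with a plain raw List.
public class UncheckedDemo {
    static List<String> viaRawType() {
        List raw = new ArrayList();    // raw type, like a raw NamedList
        raw.add("whitetok");
        List<String> names = raw;      // javac: warning: [unchecked] unchecked conversion
        return names;
    }

    static List<String> fixed() {
        List<String> typed = new ArrayList<>(); // parameterize the source: no warning
        typed.add("whitetok");
        return typed;
    }
}
```

The fix is mechanical but has to happen at the producing side (here, the declaration of `raw`); casting at the consuming side only trades the conversion warning for an unchecked-cast warning, as the `tokenList` line in the log shows.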
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027892#comment-13027892 ] Uwe Schindler commented on LUCENE-3054:

{quote} Maybe we leave our QS as is (except, changing the 40 to be dynamic depending on input length), noting that you should not use it if your comparator does not break ties, and even if it does there are still risks because of potentially bad pivot selection? {quote}

That looks like this: http://en.wikipedia.org/wiki/Introsort We only need a good recursion depth at which to switch!

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7668 - Still Failing
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7668/ All tests passed Build Log (for compile errors): [...truncated 7958 lines...] [javac] (identical [unchecked] warnings in FieldAnalysisRequestHandlerTest.java and SpellCheckComponentTest.java as build 7667) [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/handler/component/SpellCheckComponentTest.java:155: warning: [unchecked]
[jira] [Commented] (SOLR-2191) Change SolrException cstrs that take Throwable to default to alreadyLogged=false
[ https://issues.apache.org/jira/browse/SOLR-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027915#comment-13027915 ] Hoss Man commented on SOLR-2191:

Is anyone else interested in entertaining the notion that the alreadyLogged concept is more trouble than it's worth and we should just rip the whole damn thing out? (deprecate logOnce, etc...) is there such a thing as logging an exception too much? and if there is, couldn't we fix those code paths to be less chatty?

Change SolrException cstrs that take Throwable to default to alreadyLogged=false Key: SOLR-2191 URL: https://issues.apache.org/jira/browse/SOLR-2191 Project: Solr Issue Type: Bug Reporter: Mark Miller Fix For: Next Attachments: SOLR-2191.patch

Because of misuse, many exceptions are now not logged at all - can be painful when doing dev. I think we should flip this setting and work at removing any double logging - losing logging is worse (and it almost looks like we lose more logging than we would get in double logging) - and bad solrexception/logging patterns are proliferating.
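For context, the alreadyLogged concept under debate is essentially a flag carried on the exception so catch sites can skip duplicate logging. A self-contained sketch of that general pattern (hypothetical class and method names, not the real SolrException API):

```java
// Sketch of the "log once" pattern SOLR-2191 debates: the exception
// carries a flag so each catch site can claim the right to log it
// at most once. Hypothetical stand-in, not the actual Solr code.
public class LoggedOnceException extends RuntimeException {
    private boolean alreadyLogged;

    public LoggedOnceException(String msg, Throwable cause, boolean alreadyLogged) {
        super(msg, cause);
        this.alreadyLogged = alreadyLogged;
    }

    // Returns true exactly once; subsequent callers should not log again.
    public synchronized boolean markLogged() {
        if (alreadyLogged) return false;
        alreadyLogged = true;
        return true;
    }
}
```

The failure mode Mark describes is visible in this sketch: if a constructor defaults the flag to true, no catch site ever logs, and the exception silently disappears -- which is why flipping the default (or removing the mechanism entirely, as Hoss suggests) is on the table.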
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: LUCENE-3054-dynamic.patch

Here is a patch that implements what introsort does: if the depth of recursion is 75% of log2(n), switch to mergeSort. This patch also moves all remaining quickSort calls to mergeSort on the search side, where the comparators are not good. A few quickSort calls remain in the indexer, but those all sort unique sets of terms or field names (needs some more review tomorrow). Mike: What do you think, maybe you can do some benchmarking?

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-dynamic.patch, LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Commented] (SOLR-2484) Make SynonymFilterFactory more extendable
[ https://issues.apache.org/jira/browse/SOLR-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027920#comment-13027920 ] Steven Rowe commented on SOLR-2484:

Ryan, [Jenkins|https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7666/] is unhappy with {{import visad.UnimplementedException}}:

{noformat} compileTests: [mkdir] Created dir: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/tests [javac] Compiling 264 source files to /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/tests [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:32: package visad does not exist [javac] import visad.UnimplementedException; [javac] ^ {noformat}

Make SynonymFilterFactory more extendable - Key: SOLR-2484 URL: https://issues.apache.org/jira/browse/SOLR-2484 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2484-SynonymFilterFactory.patch

As is, reading rules from the ResourceLoader is baked into inform(ResourceLoader loader). It would be nice to be able to load custom rules w/o needing to rewrite the whole thing. This issue changes two things: # Changes List<String> to Iterable<String> because we don't really need a list # adds protected Iterable<String> loadRules( String synonyms, ResourceLoader loader ) -- so subclasses could fill their own
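The two changes in the patch amount to a template-method hook: rule loading becomes a protected method that subclasses override. A self-contained sketch of that shape, using hypothetical stand-in classes rather than the real Solr SynonymFilterFactory/ResourceLoader API:

```java
import java.util.Arrays;

// Stand-in for a resource loader: the default rule source.
class ResourceLoaderStub {
    Iterable<String> readLines(String resource) {
        // Canned rules for the sketch; the real loader reads the file.
        return Arrays.asList("foo=>bar", "baz=>qux");
    }
}

// Sketch of the SOLR-2484 idea: loadRules is a protected hook,
// returning Iterable<String> since callers only need to iterate.
class SynonymFactorySketch {
    protected Iterable<String> loadRules(String synonyms, ResourceLoaderStub loader) {
        return loader.readLines(synonyms); // default: delegate to the loader
    }
}

// A subclass can now supply rules from anywhere (database, network, ...)
// without reimplementing the rest of the factory's setup logic.
class FixedRulesFactory extends SynonymFactorySketch {
    @Override
    protected Iterable<String> loadRules(String synonyms, ResourceLoaderStub loader) {
        return Arrays.asList("a=>b");
    }
}
```

Returning `Iterable<String>` rather than `List<String>` is the smaller but telling change: it lets an override stream rules lazily instead of materializing them all up front.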
Re: modularization discussion
In short, I believe people should still contribute where they see they can add the most value and according to their time schedules. Additionally, others who have more time or the ability to refactor for reusability should be free to do so as well.

I agree that people should be able to contribute where they can; at the same time, as a single unified project (lucene+solr) I think there is an objective 'right' place for things -- code designed to have maximum utility and reusability (minimum dependencies without sacrificing functionality). Starting things in the right place is often easier than refactoring later -- that said, i don't think it should be a requirement as long as we all agree that things can (and should) be moved to a more reusable place if someone is willing to do the work.

Thinking about the issue that triggered this debate... in SOLR-2272 (the pseudo-join stuff), I think the heart of the problem was the idea that once committed, this new feature could not be moved around. With this discussion, I think we agree that it should be refactored if someone is willing to do the work. It may even be reasonable for someone to mark it as @lucene.experimental if there is serious concern about how hard it is to refactor (and that person is planning to put in some effort to move things in the right direction)

ryan
Re: jira issues falling off the radar -- Next JIRA version
: It'd be nice if Jira could auto-magically treat Next as whatever
: release really is next. EG, say we all agree 3.2 is our next
: release, then ideally Jira would treat all Next issues as if they were
: marked with 3.2.

FWIW: you can rename jira versions w/o losing information about what issues were associated with that version. (It's useful when you have release code names before you know what the version will actually be) but i don't really think we need to utilize that -- our release process makes it pretty self-evident what the next release on any given branch will be

: But... lacking that, maybe we really shouldn't use Next at all, and

Agreed. I take most of the blame for introducing the concept of Next... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3calpine.deb.1.10.1005251052040.24...@radix.cryptio.net%3E ... but in my defense: no one said they thought it was a bad idea. The way things shook out after the 3x branch was created just evolved differently than i anticipated, resulting in no *need* to track this concept as an abstract version -- we have feature releases from both branches, and people use their judgment to decide which features should be backported, updating Jira as they go. We should definitely kill off Next ... i would suggest just removing it, and not bulk applying a new version (there is no requirement that issues have a version)

: On a related note, I don't know what to make of the 1.5 version, nor what
: to make of issues marked as Closed for Next. Some house cleaning is in
: order.
:
: We should clean these up. Should we just roll them over to 3.2?

see the above link about the clean up i already did for the 1.5 Fix (an easier view of the full work log is via markmail: http://markmail.org/thread/7r4lfqddmjkqa3qy ).

I made a conscious decision at that time not to *remove* the 1.5 version from any issue, because those fixes/features do in fact exist on the 1.5 branch, which still exists; instead i focused on trying to ensure that all those issues had the *other* version info (ie: 3.1 or 4.0) tracked on them properly (as best i could based on CHANGES.txt)

I still don't think it makes sense to *remove* the 1.5 version completely, but I went ahead and updated Jira to change the status of 1.5 to Archived -- so it no longer shows up as an option when editing or searching for issues, but if you look at an issue that is mapped to 1.5 that metadata is still there.

As far as the 27 issues that are Closed|Resolved AND Next ... it looks like most of them are issues that were either duplicates, or abandoned (in which case people rarely remember to unset the version info)...

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+SOLR+AND+resolution+in+%28Fixed%2C+%22Won%27t+Fix%22%2C+Duplicate%2C+Invalid%2C+Incomplete%2C+%22Cannot+Reproduce%22%2C+Later%2C+%22Not+A+Problem%22%29+AND+fixVersion+%3D+12315093

...it looks like only 4 of them were genuinely Fixed and need their version info updated (which just means auditing the commits to see what branches and when) ... the rest can probably be ignored if we just delete Next as a version.

-Hoss
[jira] [Updated] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3054: -- Attachment: (was: LUCENE-3054-dynamic.patch)

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-dynamic.patch, LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
[jira] [Commented] (LUCENE-3054) SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays
[ https://issues.apache.org/jira/browse/LUCENE-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027948#comment-13027948 ] Uwe Schindler commented on LUCENE-3054:

Studying the C++ STL code showed that they use 2 * log2(n) as the depth limit. I implemented that. It showed that for most cases in Lucene (BytesRefHash), it uses quicksort (so no change to performance). The other cases already use mergeSort, and the bad test in TestArrayUtil successfully switches to mergeSort.

SorterTemplate.quickSort stack overflows on broken comparators that produce only few distinct values in large arrays Key: LUCENE-3054 URL: https://issues.apache.org/jira/browse/LUCENE-3054 Project: Lucene - Java Issue Type: Task Affects Versions: 3.1 Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.1.1, 3.2, 4.0 Attachments: LUCENE-3054-dynamic.patch, LUCENE-3054-stackoverflow.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch, LUCENE-3054.patch

Looking at Otis's sort problem on the mailing list, he said: {noformat} * looked for other places where this call is made - found it in MultiPhraseQuery$MultiPhraseWeight and changed that call from ArrayUtil.quickSort to ArrayUtil.mergeSort * now we no longer see SorterTemplate.quickSort in deep recursion when we do a thread dump {noformat} I thought this was interesting because PostingsAndFreq's comparator looks like it needs a tiebreaker. I think in our sorts we should add some asserts to try to catch some of these broken comparators.
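The scheme described in this thread (quicksort until the recursion depth exceeds 2 * log2(n), then switch to a stack-safe merge sort) is introsort. A minimal self-contained sketch of the idea, not the actual SorterTemplate code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Introsort-style driver: quicksort until the recursion depth exceeds
// 2*log2(n), then fall back to merge sort, so a broken comparator (or
// an unlucky pivot) cannot overflow the stack.
public class IntroSortSketch {
    public static <T> void sort(T[] a, Comparator<T> cmp) {
        // 2 * ceil(log2(n)): the C++ STL's depth limit mentioned above.
        int maxDepth = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(a.length, 1)));
        quickSort(a, 0, a.length - 1, cmp, maxDepth);
    }

    private static <T> void quickSort(T[] a, int lo, int hi, Comparator<T> cmp, int depth) {
        if (hi - lo < 1) return;
        if (depth <= 0) {              // depth limit hit: degenerate partitioning
            mergeSort(a, lo, hi, cmp); // would otherwise recurse too deeply
            return;
        }
        // Hoare-style partition around the middle element.
        T pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {
            while (cmp.compare(a[i], pivot) < 0) i++;
            while (cmp.compare(a[j], pivot) > 0) j--;
            if (i <= j) { T t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        quickSort(a, lo, j, cmp, depth - 1);
        quickSort(a, i, hi, cmp, depth - 1);
    }

    private static <T> void mergeSort(T[] a, int lo, int hi, Comparator<T> cmp) {
        // Arrays.sort on objects is a merge sort variant: O(n log n) worst case.
        Arrays.sort(a, lo, hi + 1, cmp);
    }
}
```

The point of the depth limit is that for well-behaved input quicksort runs unmodified (no performance change, as observed for BytesRefHash), while pathological inputs, such as large arrays with only a few distinct values under a tie-breaking-free comparator, hit the limit and finish in the merge sort in guaranteed O(n log n) and O(log n) stack.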
[jira] [Resolved] (LUCENE-3063) factor CharTokenizer/CharacterUtils into analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3063. - Resolution: Fixed Committed revision 1098871. If there are any problems with hudson i'll yank it... for now I'll open a followup issue to add the additional checks to MockTokenizer factor CharTokenizer/CharacterUtils into analyzers module - Key: LUCENE-3063 URL: https://issues.apache.org/jira/browse/LUCENE-3063 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3063.patch Currently these analysis components are in the lucene core, but should really be .util in the analyzers module. Also, with MockTokenizer extending Tokenizer directly, we can add some additional checks in the future to try to ensure our consumers are being good consumers (e.g. calling reset). This is mentioned in http://wiki.apache.org/lucene-java/TestIdeas, I didn't implement it here yet, this is just the factoring. I think we should try to do this before LUCENE-3040.
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 7670 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7670/ 2 tests failed. FAILED: org.apache.lucene.util.automaton.TestLevenshteinAutomata.testUpdateSameDoc Error Message: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. Stack Trace: junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. at java.lang.Thread.run(Thread.java:636) FAILED: TEST-org.apache.lucene.index.TestRollingUpdates.xml.init Error Message: Stack Trace: Test report file /home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/TEST-org.apache.lucene.index.TestRollingUpdates.xml was length 0 Build Log (for compile errors): [...truncated 3174 lines...]
[jira] [Created] (LUCENE-3064) add checks to MockTokenizer to enforce proper consumption
add checks to MockTokenizer to enforce proper consumption - Key: LUCENE-3064 URL: https://issues.apache.org/jira/browse/LUCENE-3064 Project: Lucene - Java Issue Type: Test Reporter: Robert Muir Fix For: 4.0 we can enforce things like consumer properly iterates through the tokenstream lifecycle via MockTokenizer. this could catch bugs in consumers that don't call reset(), etc.
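The checks this issue describes amount to a small state machine: the test tokenizer tracks where the consumer is in the lifecycle and fails fast when a step is skipped. A self-contained sketch of that idea with hypothetical stand-in classes, not the real Lucene TokenStream/MockTokenizer API:

```java
// Sketch of the lifecycle enforcement LUCENE-3064 proposes: track a
// state machine and throw when a consumer calls methods out of order
// (e.g. incrementToken() without reset()). Hypothetical stand-in.
public class CheckedTokenStream {
    private enum State { CREATED, RESET, INCREMENTING, END_CALLED, CLOSED }

    private State state = State.CREATED;
    private int remainingTokens = 3; // stand-in for real tokenization

    public void reset() {
        state = State.RESET;
        remainingTokens = 3;
    }

    public boolean incrementToken() {
        if (state != State.RESET && state != State.INCREMENTING) {
            // This is the bug class the issue wants to catch in tests.
            throw new IllegalStateException("incrementToken() without reset(); state=" + state);
        }
        state = State.INCREMENTING;
        return remainingTokens-- > 0;
    }

    public void end() {
        if (state != State.INCREMENTING && state != State.RESET) {
            throw new IllegalStateException("end() called out of order; state=" + state);
        }
        state = State.END_CALLED;
    }

    public void close() {
        state = State.CLOSED;
    }
}
```

A well-behaved consumer calls reset(), loops on incrementToken(), then end() and close(); a consumer that forgets reset() gets an immediate IllegalStateException instead of silently-wrong analysis output.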