[jira] [Commented] (LUCENE-3997) join module should not depend on grouping module
[ https://issues.apache.org/jira/browse/LUCENE-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257191#comment-13257191 ]

Michael McCandless commented on LUCENE-3997:
--------------------------------------------

+1

> join module should not depend on grouping module
> ------------------------------------------------
>
> Key: LUCENE-3997
> URL: https://issues.apache.org/jira/browse/LUCENE-3997
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 4.0
> Reporter: Robert Muir
> Fix For: 4.0
> Attachments: LUCENE-3997.patch, LUCENE-3997.patch
>
> I think TopGroups/GroupDocs should simply be in core? Both the grouping and
> join modules use these trivial classes, but join depends on grouping just for
> them. I think it's better that we try to minimize these inter-module
> dependencies. Of course, another option is to combine grouping and join into
> one module, but last time I brought that up nobody could agree on a name.
> Anyway, I think the change is pretty clean: it's similar to having basic stuff
> like Analyzer.java in core, so other things can work with Analyzer without
> depending on any specific implementing modules.
[jira] [Commented] (LUCENE-3972) Improve AllGroupsCollector implementations
[ https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252315#comment-13252315 ]

Michael McCandless commented on LUCENE-3972:
--------------------------------------------

Curious that it's so much faster ... BytesRefHash operates on the byte[] term
while the current approach operates on int ord. How large was the index? If it
was smallish, maybe the time was dominated by re-ord'ing after each reader...?

> Improve AllGroupsCollector implementations
> ------------------------------------------
>
> Key: LUCENE-3972
> URL: https://issues.apache.org/jira/browse/LUCENE-3972
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/grouping
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3972.patch
>
> I think that the performance of TermAllGroupsCollector, DVAllGroupsCollector.BR
> and DVAllGroupsCollector.SortedBR can be improved by using BytesRefHash to
> store the groups instead of an ArrayList.
[jira] [Commented] (LUCENE-3972) Improve AllGroupsCollector implementations
[ https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252518#comment-13252518 ]

Michael McCandless commented on LUCENE-3972:
--------------------------------------------

Actually, we are storing term ords here, not docIDs. I think the high number of
unique groups explains why the new patch is faster: the time is likely dominated
by re-ord'ing for each segment? If you have fewer unique groups (and as the
number of docs collected goes up), I think the current impl should be
faster...?

> Improve AllGroupsCollector implementations
> ------------------------------------------
>
> Key: LUCENE-3972
> URL: https://issues.apache.org/jira/browse/LUCENE-3972
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/grouping
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3972.patch, LUCENE-3972.patch
>
> I think that the performance of TermAllGroupsCollector, DVAllGroupsCollector.BR
> and DVAllGroupsCollector.SortedBR can be improved by using BytesRefHash to
> store the groups instead of an ArrayList.
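A minimal sketch of the BytesRefHash approach under discussion (assuming Lucene 4.x's org.apache.lucene.util.BytesRefHash API; the surrounding collector class is hypothetical, not the actual patch):

{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

// Hypothetical collector fragment: dedupe group values by their raw
// term bytes.  Because BytesRefHash hashes the bytes directly, nothing
// needs to be re-ord'ed when collection advances to the next segment.
class GroupValueSketch {
  private final BytesRefHash groups = new BytesRefHash();

  void collectGroup(BytesRef groupValue) {
    // add() copies the bytes and returns the new ord, or (-ord - 1) if
    // the value was already present -- either way it is stored once.
    groups.add(groupValue);
  }

  int uniqueGroupCount() {
    return groups.size();
  }
}
{code}

The ord-based approach instead compares small ints within a segment but must rebuild its ord mapping at every segment boundary, which is why it should win only when there are few unique groups and many collected docs.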
[jira] [Commented] (LUCENE-3970) Rename getUnique[Field/Terms]Count() into size()
[ https://issues.apache.org/jira/browse/LUCENE-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251004#comment-13251004 ]

Michael McCandless commented on LUCENE-3970:
--------------------------------------------

Thanks Iulius, this looks good ... I'll commit shortly.

> Rename getUnique[Field/Terms]Count() into size()
> ------------------------------------------------
>
> Key: LUCENE-3970
> URL: https://issues.apache.org/jira/browse/LUCENE-3970
> Project: Lucene - Java
> Issue Type: Task
> Components: core/index
> Reporter: Iulius Curt
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3970.patch
>
> Like Robert Muir said in LUCENE-3109:
> {quote}Also I think there are other improvements we can do here that would be
> more natural:
> Fields.getUniqueFieldCount() -> Fields.size()
> Terms.getUniqueTermCount() -> Terms.size(){quote}
> I believe this dramatically improves understandability (way less 'scary',
> actually beautiful).
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249512#comment-13249512 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Thanks Iulius, looks great! I'll commit...

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3963) improve smoketester to work on windows
[ https://issues.apache.org/jira/browse/LUCENE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249516#comment-13249516 ]

Michael McCandless commented on LUCENE-3963:
--------------------------------------------

+1

> improve smoketester to work on windows
> ---------------------------------------
>
> Key: LUCENE-3963
> URL: https://issues.apache.org/jira/browse/LUCENE-3963
> Project: Lucene - Java
> Issue Type: Task
> Reporter: Robert Muir
> Attachments: LUCENE-3963.patch
>
> After the changes in SOLR-3331, the smoketester won't work on Windows (things
> like path separators of : or ;). Not really critical; people will just have to
> smoketest on unix-like machines. But it would be more convenient for testers
> on Windows machines if it worked there too.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249520#comment-13249520 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

bq. We need to change CHANGES.txt and MIGRATE.txt to the new API, it's now heavily outdated.

Thanks Uwe, you're right, my bad.

bq. Should we change AtomicReader to have invertedField() instead of fields()?

+1

bq. Also the name FieldsEnum is now inconsistent. I think it should be InvertedFieldsEnum?

Iulius, do you want to make these changes? Or I can... let me know.

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249546#comment-13249546 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

OK I'll revert so we can discuss more...

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3967) nuke AtomicReader.termDocsEnum(termState) and termPositionsEnum(termState)
[ https://issues.apache.org/jira/browse/LUCENE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249570#comment-13249570 ]

Michael McCandless commented on LUCENE-3967:
--------------------------------------------

+1

> nuke AtomicReader.termDocsEnum(termState) and termPositionsEnum(termState)
> ---------------------------------------------------------------------------
>
> Key: LUCENE-3967
> URL: https://issues.apache.org/jira/browse/LUCENE-3967
> Project: Lucene - Java
> Issue Type: Task
> Reporter: Robert Muir
> Attachments: LUCENE-3967.patch
>
> These are simply sugar methods anyway, and so expert that I don't think we
> need sugar here at all. If someone wants to get DocsEnum via a saved TermState
> they can just use TermsEnum! But having these public in AtomicReader I think
> is pretty confusing and overwhelming. In fact, nothing in Lucene even uses
> these methods, except a sole assert statement in PhraseQuery, which I think
> can be written more clearly anyway:
> {noformat}
>       // PhraseQuery on a field that did not index
>       // positions.
>       if (postingsEnum == null) {
> -       assert reader.termDocsEnum(liveDocs, t.field(), t.bytes(), state, false) != null: "termstate found but no term exists in reader";
> +       assert te.seekExact(t.bytes(), false) : "termstate found but no term exists in reader";
> {noformat}
[jira] [Commented] (LUCENE-3965) consolidate all api modules in one place and un!@$# packaging for 4.0
[ https://issues.apache.org/jira/browse/LUCENE-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249284#comment-13249284 ]

Michael McCandless commented on LUCENE-3965:
--------------------------------------------

+1 to moving/merging modules/* and lucene/contrib/* under lucene. This is much
cleaner.

> consolidate all api modules in one place and un!@$# packaging for 4.0
> ----------------------------------------------------------------------
>
> Key: LUCENE-3965
> URL: https://issues.apache.org/jira/browse/LUCENE-3965
> Project: Lucene - Java
> Issue Type: Task
> Components: general/build
> Affects Versions: 4.0
> Reporter: Robert Muir
>
> I think users get confused about how svn/source is structured, when in fact we
> are just producing a modular build. I think it would be more clear if the
> lucene stuff was underneath modules/; that's where our modular API is. We
> could still package this up as lucene.tar.gz if we want, and even name
> modules/core lucene-core.jar, but I think this would be a lot better organized
> than the current confusion of:
> * lucene
> * lucene/contrib
> * modules
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249390#comment-13249390 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Thanks for the fast turnaround Iulius!

Did you use svn mv to rename the sources? (I'm guessing not -- I don't see the
removed original sources). But it's fine: I got this to apply quite easily.
Thanks! I'll commit shortly...

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249398#comment-13249398 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Hmm, one thing: I noticed the imports got changed into wildcards, eg:
{noformat}
+import org.apache.lucene.index.*;
 import org.apache.lucene.util.LuceneTestCase;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.TextField;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.TermsEnum;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.MultiFields;
+import org.apache.lucene.index.MultiInvertedFields;
{noformat}
In general I prefer seeing each import (not the wildcard)... can you redo the
patch putting them back? Thanks! (I'm assuming/hoping this is a simple setting
in your IDE?).

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (SOLR-3331) solr NOTICE.txt is missing information
[ https://issues.apache.org/jira/browse/SOLR-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248263#comment-13248263 ]

Michael McCandless commented on SOLR-3331:
------------------------------------------

I'll fix smoke tester... I already have a bunch of mods to add other checks to
it...

> solr NOTICE.txt is missing information
> ---------------------------------------
>
> Key: SOLR-3331
> URL: https://issues.apache.org/jira/browse/SOLR-3331
> Project: Solr
> Issue Type: Bug
> Reporter: Robert Muir
> Assignee: Michael McCandless
> Priority: Blocker
> Fix For: 3.6
>
> Solr depends on some modules from lucene, and is released separately (as a
> source release including lucene), thus its NOTICE.txt has a lucene section
> which includes notices from lucene:
> {noformat}
> =========================================================================
> ==  Apache Lucene Notice                                               ==
> =========================================================================
> {noformat}
> However, it's missing the IPADIC (which is required to be there). Furthermore,
> there is no way to check this, except via manual inspection. This gets
> complicated in 4.0 because of modularization, but we need to fix the 3.6
> situation in order to release (hence, this issue is set to 3.6 only and we can
> open a separate issue for 4.0 and discuss things like modules there; it's
> irrelevant here). My proposal for *3.6* is:
> 1. add the IPADIC notice
> 2. have smoketester.py look for this specific block of text indicating the
>    notices from lucene, and cross check them to ensure everything is
>    consistent.
[jira] [Commented] (SOLR-3316) Distributed Grouping fails in some scenarios.
[ https://issues.apache.org/jira/browse/SOLR-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248343#comment-13248343 ]

Michael McCandless commented on SOLR-3316:
------------------------------------------

Patch looks good! I guess it's OK to make the hard change to the
EndResultTransformer interface... (it's marked @experimental).

> Distributed Grouping fails in some scenarios.
> ----------------------------------------------
>
> Key: SOLR-3316
> URL: https://issues.apache.org/jira/browse/SOLR-3316
> Project: Solr
> Issue Type: Bug
> Components: SearchComponents - other
> Affects Versions: 3.4, 3.5
> Environment: Windows 7, JDK 6u26
> Reporter: Cody Young
> Assignee: Martijn van Groningen
> Priority: Blocker
> Labels: distributed, grouping
> Fix For: 4.0
> Attachments: SOLR-3316-3x.patch, SOLR-3316-3x.patch, SOLR-3316.patch,
> TestDistributedGrouping.java.patch
>
> During a distributed grouping request, if rows is set to 0 a 500 error is
> returned. If groups are unique to a shard and the row count is set to 1, then
> the matches number is only the matches from one shard. I've put together a
> failing test.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248604#comment-13248604 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Hi Iulius, this patch is great: this rename is badly needed...

I was able to apply the patch (resolving a few conflicts since the code has
shifted since it was created), but... some things seem to be missing (eg the
InvertedFieldsProducer rename). How did you generate the patch?

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247239#comment-13247239 ]

Michael McCandless commented on LUCENE-3932:
--------------------------------------------

OK I committed this to trunk (thanks Sean!).

{quote}
dataPagedBytes.getPointer() == 124973970
On disk the .tii file is 69508193 bytes
{quote}

OK, ~80% bigger... but in the overall index it's a minor increase (~0.1%). But
I think we should hold off on any more 3.x work until/unless we decide to do
another release off of it.

> Improve load time of .tii files
> --------------------------------
>
> Key: LUCENE-3932
> URL: https://issues.apache.org/jira/browse/LUCENE-3932
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.5
> Environment: Linux
> Reporter: Sean Bridges
> Attachments: LUCENE-3932.trunk.patch, perf.csv
>
> We have a large 50 gig index which is optimized as one segment, with a 66 MEG
> .tii file. This index has no norms, and no field cache. It takes about 5
> seconds to load this index; profiling reveals that 60% of the time is spent in
> GrowableWriter.set(index, value), and most of the time in set(...) is spent
> resizing PackedInts.Mutable current.
> In the constructor for TermInfosReaderIndex, you initialize the writer with
> the line:
> {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote}
> For our index, using four as the bit estimate results in 27 resizes. The last
> value in indexToTerms is going to be ~ tiiFileLength, and if instead you use:
> {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2));
> GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote}
> Load time improves to ~ 2 seconds.
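A minimal sketch of the committed idea (assuming the era's org.apache.lucene.util.packed APIs; the wrapping method and its arguments are illustrative, with tiiFileLength and indexSize supplied by the caller):

{code:java}
import org.apache.lucene.util.packed.GrowableWriter;
import org.apache.lucene.util.packed.PackedInts;

// Sketch: size the GrowableWriter from the largest value it will ever
// hold -- an offset bounded by the .tii file length -- instead of a
// fixed 4-bit guess, so loading the term index never triggers the
// repeated resizes that dominated the profile.
static GrowableWriter newIndexToTerms(long tiiFileLength, int indexSize) {
  int bitEstimate = PackedInts.bitsRequired(tiiFileLength);
  return new GrowableWriter(bitEstimate, indexSize, false);
}
{code}

PackedInts.bitsRequired does the same computation as the reporter's log10 arithmetic, without the floating-point detour.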
[jira] [Commented] (LUCENE-3946) improve docs ivy verification output to explain classpath problems and mention --noconfig
[ https://issues.apache.org/jira/browse/LUCENE-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246465#comment-13246465 ]

Michael McCandless commented on LUCENE-3946:
--------------------------------------------

Shawn, I'm not certain this is the same issue (it talks about an extra trailing
/ on ANT_HOME, but that didn't help me...), but it seems related:
https://bugzilla.redhat.com/show_bug.cgi?id=490542

> improve docs ivy verification output to explain classpath problems and mention --noconfig
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-3946
> URL: https://issues.apache.org/jira/browse/LUCENE-3946
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 3.6
> Reporter: Hoss Man
> Assignee: Hoss Man
> Fix For: 4.0
> Attachments: LUCENE-3946.patch
>
> Offshoot of LUCENE-3930, where Shawn reported...
> {quote}
> I can't get either branch_3x or trunk to build now, on a system that used to
> build branch_3x without complaint. It says that ivy is not available, even
> after doing ant ivy-bootstrap to download ivy into the home directory.
> Specifically I am trying to build solrj from trunk, but I can't even get ant
> in the root directory of the checkout to work. I'm on CentOS 6 with oracle
> jdk7 built using the city-fan.org SRPMs. Ant (1.7.1) and junit are installed
> from package repositories. Building a checkout of lucene_solr_3_5 on the same
> machine works fine.
> {quote}
> The root cause is that ant's global configs can be set up to ignore the user's
> personal lib dir. The suggested workaround is to run ant --noconfig, but we
> should also try to give the user feedback in our failure about exactly what
> classpath ant is currently using (because apparently ${java.class.path} is not
> actually it).
[jira] [Commented] (LUCENE-3943) Use ivy cachepath and cachefileset instead of ivy retrieve
[ https://issues.apache.org/jira/browse/LUCENE-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245123#comment-13245123 ]

Michael McCandless commented on LUCENE-3943:
--------------------------------------------

Would this also mean ivy doesn't have to copy the JARs from its cache into each
checkout? So this will save disk space for devs w/ multiple checkouts...

> Use ivy cachepath and cachefileset instead of ivy retrieve
> -----------------------------------------------------------
>
> Key: LUCENE-3943
> URL: https://issues.apache.org/jira/browse/LUCENE-3943
> Project: Lucene - Java
> Issue Type: Improvement
> Components: general/build
> Reporter: Chris Male
>
> In LUCENE-3930 we moved to resolving all external dependencies using
> ivy:retrieve. This process places the dependencies into the lib/ folder of the
> respective modules, which was ideal since it replicated the existing build
> process and limited the number of changes to be made to the build. However it
> can lead to multiple jars for the same dependency in the lib folder when the
> dependency is upgraded, and just isn't the most efficient way to use Ivy.
> Uwe pointed out that we can remove the ivy:retrieve calls and make use of
> ivy:cachepath and ivy:cachefileset to build our classpaths and packages
> respectively, which will go some way to addressing these limitations.
[jira] [Commented] (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245125#comment-13245125 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

Is there anyone who can volunteer to be a mentor for this issue...?

> Refactoring of IndexWriter
> ---------------------------
>
> Key: LUCENE-2026
> URL: https://issues.apache.org/jira/browse/LUCENE-2026
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Labels: gsoc2011, gsoc2012, lucene-gsoc-11, lucene-gsoc-12, mentor
> Fix For: 4.0
>
> I've been thinking for a while about refactoring the IndexWriter into two main
> components.
> One could be called a SegmentWriter and, as the name says, its job would be to
> write one particular index segment. The default one, just as today, will
> provide methods to add documents and flushes when its buffer is full. Other
> SegmentWriter implementations would do things like e.g. appending or copying
> external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments file and
> merging/deleting segments. It would know about DeletionPolicy, MergePolicy and
> MergeScheduler. Ideally it would provide hooks that allow users to manage
> external data structures and keep them in sync with Lucene's data during
> segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part affects all
> segments, whereas the new document is only being added to the new segment.
> Of course these should be lower level APIs for things like parallel indexing
> and related use cases. That's why we should still provide easy to use APIs
> like today for people who don't need to care about per-segment ops during
> indexing. So the current IndexWriter could probably keep most of its APIs and
> delegate to the new classes.
[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245343#comment-13245343 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

Hi Iulius,

The basic idea is to replace the fixed int[] that we now have (in
oal.index.MergeState's docMaps array) with a PackedInts store (see
oal.util.packed.PackedInts.getMutable). This should be fairly simple, since a
PackedInts store is conceptually just like an int[]. I think that (a rote swap)
would be phase one.

After that, we can save more RAM by storing either the new docID (what we do
today), or, inverting that, the number of del docs seen so far, depending on
which requires fewer bits. EG if we are merging 1M docs but only 100K are
deleted, it's cheaper to store the number of deletes...

> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-2357
> URL: https://issues.apache.org/jira/browse/LUCENE-2357
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Minor
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
> We allocate this int[] to remap docIDs due to compaction of deleted ones. This
> uses a lot of RAM for large segment merges, and can fail to allocate due to
> fragmentation on 32 bit JREs.
> Now that we have packed ints, a simple fix would be to use a packed int
> array... and maybe instead of storing abs docID in the mapping, we could store
> the number of del docs seen so far (so the remap would do a lookup then a
> subtract). This may add some CPU cost to merging but should bring down
> transient RAM usage quite a bit.
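A minimal sketch of the phase-two idea (assuming trunk's packed-ints API of the time; buildDelCountMap and its arguments are illustrative, not MergeState's actual code):

{code:java}
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.packed.PackedInts;

// Sketch: store, per old docID, the count of deleted docs seen so far.
// That value is bounded by the segment's delete count, so it often
// needs far fewer bits than the new docID would.  For a live doc the
// new docID is then oldDocID - map.get(oldDocID).
static PackedInts.Mutable buildDelCountMap(int maxDoc, int delCount, Bits liveDocs) {
  PackedInts.Mutable map =
      PackedInts.getMutable(maxDoc, PackedInts.bitsRequired(Math.max(1, delCount)));
  int del = 0;
  for (int doc = 0; doc < maxDoc; doc++) {
    if (liveDocs != null && !liveDocs.get(doc)) {
      del++;  // this doc is deleted; it has no new docID
    }
    map.set(doc, del);
  }
  return map;
}
{code}

In the 1M docs / 100K deletes example above, the map needs only bitsRequired(100000) = 17 bits per entry instead of the ~20 bits a new-docID encoding would need, and the gap widens as the delete ratio shrinks.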
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245374#comment-13245374 ]

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

The proposal at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/billybob/1
looks great! Some initial feedback:

* There are actually more than 2 codecs (eg we also have Lucene3x, SimpleText,
  sep/intblock (abstract), random codecs/postings formats for testing...), but
  our default codec now is Lucene40.
* I think you can use the existing abstract sep/intblock classes (ie, they
  implement layers like FieldsProducer/Consumer...), and then you can just
  implement the required methods (eg to encode/decode one int[] block).
* We may need to tune the skipper settings, based on profiling results from
  skip-intensive (Phrase, And) queries... since it's currently geared towards
  single-doc-at-once encoding. I don't think we should try to make a new
  skipper impl here... (there is a separate issue for that).
* Maybe explore the combination of pulsing and PForDelta codecs; seems like the
  combination of those two could be important, since for low docFreq terms,
  retrieving the docs is now more expensive...

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
> On the flex branch we explored a number of possible intblock encodings, but
> for whatever reason never brought them to completion. There are still a number
> of issues opened with patches in different states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.
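For readers unfamiliar with the encodings named in the title, here is a toy frame-of-reference (FOR) encoder for one int[] block -- the core operation such a postings format must implement. It is purely illustrative (the method name and header layout are invented here, not the codec API this issue would add):

{code:java}
// Toy FOR encoder: subtract the block minimum, then bit-pack the
// deltas with just enough bits per value.  Decoding reverses the two
// steps.  header[0] receives the frame of reference (the minimum),
// header[1] the bits used per packed value.
static long[] forEncode(int[] block, int[] header) {
  int min = Integer.MAX_VALUE;
  for (int v : block) min = Math.min(min, v);
  long maxDelta = 0;
  for (int v : block) maxDelta = Math.max(maxDelta, (long) v - min);
  int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxDelta));
  header[0] = min;
  header[1] = bits;
  long[] packed = new long[(block.length * bits + 63) >>> 6];
  long bitPos = 0;
  for (int v : block) {
    long delta = (long) v - min;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    packed[word] |= delta << shift;
    if (shift + bits > 64) {
      packed[word + 1] |= delta >>> (64 - shift);  // spill into next word
    }
    bitPos += bits;
  }
  return packed;
}
{code}

PFOR extends this by storing a few outlier values as exceptions so one large value doesn't inflate the bit width for the whole block.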
[jira] [Commented] (LUCENE-3930) nuke jars from source tree and use ivy
[ https://issues.apache.org/jira/browse/LUCENE-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245799#comment-13245799 ]

Michael McCandless commented on LUCENE-3930:
--------------------------------------------

Shawn, I have similar problems with the builtin ant on Fedora 13, and when I
add that same echo line, I can see that the ~/.ant/lib/ivy-2.2.0.jar is on the
CLASSPATH... yet it fails the ivy-availability-check. I never got to the bottom
of it ... but installing ant myself (1.8.2) and using that version instead
worked around it...

> nuke jars from source tree and use ivy
> ---------------------------------------
>
> Key: LUCENE-3930
> URL: https://issues.apache.org/jira/browse/LUCENE-3930
> Project: Lucene - Java
> Issue Type: Task
> Components: general/build
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Blocker
> Fix For: 3.6, 4.0
> Attachments: LUCENE-3930-skip-sources-javadoc.patch,
> LUCENE-3930-solr-example.patch, LUCENE-3930-solr-example.patch,
> LUCENE-3930.patch, LUCENE-3930.patch, LUCENE-3930.patch,
> LUCENE-3930__ivy_bootstrap_target.patch,
> LUCENE-3930_includetestlibs_excludeexamplexml.patch,
> ant_-verbose_clean_test.out.txt, langdetect-1.1.jar, noggit-commons-csv.patch,
> patch-jetty-build.patch, pom.xml
>
> As mentioned on the ML thread "switch jars to ivy mechanism?".
[jira] [Commented] (LUCENE-3946) improve docs ivy verification output to explain classpath problems and mention --noconfig
[ https://issues.apache.org/jira/browse/LUCENE-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245900#comment-13245900 ]

Michael McCandless commented on LUCENE-3946:
--------------------------------------------

Patch works -- I see lots of JARs on the classpath:
{noformat}
[echo] Current Classpath:
[echo] /usr/share/java/ant.jar
[echo] /usr/share/java/ant-launcher.jar
[echo] /usr/share/java/jaxp_parser_impl.jar
[echo] /usr/share/java/xml-commons-apis.jar
[echo] /usr/share/java/antlr.jar
[echo] /usr/share/java/ant/ant-antlr.jar
[echo] /usr/share/java/bcel.jar
[echo] /usr/share/java/ant/ant-apache-bcel.jar
[echo] /usr/share/java/oro.jar
[echo] /usr/share/java/ant/ant-apache-oro.jar
[echo] /usr/share/java/regexp.jar
[echo] /usr/share/java/ant/ant-apache-regexp.jar
[echo] /usr/share/java/xml-commons-resolver.jar
[echo] /usr/share/java/ant/ant-apache-resolver.jar
[echo] /usr/share/java/jakarta-commons-logging.jar
[echo] /usr/share/java/ant/ant-commons-logging.jar
[echo] /usr/share/java/javamail.jar
[echo] /usr/share/java/jaf.jar
[echo] /usr/share/java/ant/ant-javamail.jar
[echo] /usr/share/java/jdepend.jar
[echo] /usr/share/java/ant/ant-jdepend.jar
[echo] /usr/share/java/junit.jar
[echo] /usr/share/java/ant/ant-junit.jar
[echo] /usr/share/java/ant/ant-nodeps.jar
[echo] /usr/share/java/ant/ant-swing.jar
[echo] /usr/share/java/jaxp_transform_impl.jar
[echo] /usr/share/java/ant/ant-trax.jar
[echo] /usr/share/java/xalan-j2-serializer.jar
[echo] /usr/local/src/jdk1.6.0_21/lib/tools.jar
[echo] /home/mike/.ant/lib/maven-ant-tasks-2.1.3.jar
[echo] /home/mike/.ant/lib/ivy-2.2.0.jar
[echo] /usr/share/ant/lib/ant-swing.jar
[echo] /usr/share/ant/lib/ant-launcher.jar
[echo] /usr/share/ant/lib/ant-junit.jar
[echo] /usr/share/ant/lib/ant-bootstrap.jar
[echo] /usr/share/ant/lib/ant-apache-bcel.jar
[echo] /usr/share/ant/lib/ant-apache-oro.jar
[echo] /usr/share/ant/lib/ant-nodeps.jar
[echo] /usr/share/ant/lib/ant-apache-resolver.jar
[echo] /usr/share/ant/lib/ant-trax.jar
[echo] /usr/share/ant/lib/ant-apache-log4j.jar
[echo] /usr/share/ant/lib/ant-antlr.jar
[echo] /usr/share/ant/lib/ant-javamail.jar
[echo] /usr/share/ant/lib/ant-jdepend.jar
[echo] /usr/share/ant/lib/ant-apache-regexp.jar
[echo] /usr/share/ant/lib/ant-commons-logging.jar
{noformat}

That's just running ant, and it fails... ant --noconfig works (fortunately I
don't have/need ~/.antrc). Here's my /etc/ant.conf:
{noformat}
# ant.conf (Ant 1.7.x)
# JPackage Project <http://www.jpackage.org/>

# Validate --noconfig setting in case being invoked
# from pre Ant 1.6.x environment
if [ -z $no_config ] ; then
  no_config=true
fi

# Setup ant configuration
if $no_config ; then
  # Disable RPM layout
  rpm_mode=false
else
  # Use RPM layout
  rpm_mode=true
  # ANT_HOME for rpm layout
  ANT_HOME=/usr/share/ant
fi
{noformat}

> improve docs ivy verification output to explain classpath problems and mention --noconfig
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-3946
> URL: https://issues.apache.org/jira/browse/LUCENE-3946
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 3.6
> Reporter: Hoss Man
> Assignee: Hoss Man
> Fix For: 4.0
> Attachments: LUCENE-3946.patch
>
> Offshoot of LUCENE-3930, where Shawn reported...
> {quote}
> I can't get either branch_3x or trunk to build now, on a system that used to
> build branch_3x without complaint. It says that ivy is not available, even
> after doing ant ivy-bootstrap to download ivy into the home directory.
> Specifically I am trying to build solrj from trunk, but I can't even get ant
> in the root directory of the checkout to work. I'm on CentOS 6 with oracle
> jdk7 built using the city-fan.org SRPMs. Ant (1.7.1) and junit are installed
> from package repositories. Building a checkout of lucene_solr_3_5 on the same
> machine works fine.
> {quote}
> The root cause is that ant's global configs can be set up to ignore the user's
> personal lib dir. The suggested workaround is to run ant --noconfig, but we
> should also try to give the user feedback in our failure about exactly what
> classpath ant is currently using (because apparently ${java.class.path} is not
> actually it).
[jira] [Commented] (LUCENE-3946) improve docs ivy verification output to explain classpath problems and mention --noconfig
[ https://issues.apache.org/jira/browse/LUCENE-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245902#comment-13245902 ]

Michael McCandless commented on LUCENE-3946:
--------------------------------------------

I passed --execdebug to ant, and when it fails (w/ the builtin Fedora ant) I
get this:
{noformat}
exec /usr/local/src/jdk1.6.0_21/bin/java -classpath /usr/share/java/ant.jar:/usr/share/java/ant-launcher.jar:/usr/share/java/jaxp_parser_impl.jar:/usr/share/java/xml-commons-apis.jar:/usr/share/java/antlr.jar:/usr/share/java/ant/ant-antlr.jar:/usr/share/java/bcel.jar:/usr/share/java/ant/ant-apache-bcel.jar:/usr/share/java/ant.jar:/usr/share/java/oro.jar:/usr/share/java/ant/ant-apache-oro.jar:/usr/share/java/regexp.jar:/usr/share/java/ant/ant-apache-regexp.jar:/usr/share/java/xml-commons-resolver.jar:/usr/share/java/ant/ant-apache-resolver.jar:/usr/share/java/jakarta-commons-logging.jar:/usr/share/java/ant/ant-commons-logging.jar:/usr/share/java/javamail.jar:/usr/share/java/jaf.jar:/usr/share/java/ant/ant-javamail.jar:/usr/share/java/jdepend.jar:/usr/share/java/ant/ant-jdepend.jar:/usr/share/java/junit.jar:/usr/share/java/ant/ant-junit.jar:/usr/share/java/ant/ant-nodeps.jar:/usr/share/java/ant/ant-swing.jar:/usr/share/java/jaxp_transform_impl.jar:/usr/share/java/ant/ant-trax.jar:/usr/share/java/xalan-j2-serializer.jar:/usr/local/src/jdk1.6.0_21/lib/tools.jar -Dant.home=/usr/share/ant -Dant.library.dir=/usr/share/ant/lib org.apache.tools.ant.launch.Launcher -cp
{noformat}

and then when I switch to the working ant:
{noformat}
exec /usr/local/src/jdk1.6.0_21/jre/bin/java -classpath /usr/local/src/apache-ant-1.8.2//lib/ant-launcher.jar -Dant.home=/usr/local/src/apache-ant-1.8.2/ -Dant.library.dir=/usr/local/src/apache-ant-1.8.2//lib org.apache.tools.ant.launch.Launcher -cp
{noformat}

> improve docs ivy verification output to explain classpath problems and mention --noconfig
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-3946
> URL: https://issues.apache.org/jira/browse/LUCENE-3946
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 3.6
> Reporter: Hoss Man
> Assignee: Hoss Man
> Fix For: 4.0
> Attachments: LUCENE-3946.patch
>
> Offshoot of LUCENE-3930, where Shawn reported...
> {quote}
> I can't get either branch_3x or trunk to build now, on a system that used to
> build branch_3x without complaint. It says that ivy is not available, even
> after doing ant ivy-bootstrap to download ivy into the home directory.
> Specifically I am trying to build solrj from trunk, but I can't even get ant
> in the root directory of the checkout to work. I'm on CentOS 6 with oracle
> jdk7 built using the city-fan.org SRPMs. Ant (1.7.1) and junit are installed
> from package repositories. Building a checkout of lucene_solr_3_5 on the same
> machine works fine.
> {quote}
> The root cause is that ant's global configs can be set up to ignore the user's
> personal lib dir. The suggested workaround is to run ant --noconfig, but we
> should also try to give the user feedback in our failure about exactly what
> classpath ant is currently using (because apparently ${java.class.path} is not
> actually it).
[jira] [Commented] (SOLR-3296) Explore alternatives to Commons CSV
[ https://issues.apache.org/jira/browse/SOLR-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244502#comment-13244502 ]

Michael McCandless commented on SOLR-3296:
------------------------------------------

bq. wrt commons-csv alternatives, it's too risky for little/no gain.

This confuses me: commons-csv is unreleased, while there are other
license-friendly packages (eg opencsv) that have been released for some time
(multiple releases), been tested in the field, had bugs found and fixed, etc.
Why use an unreleased package when released alternatives are available?

bq. I put a lot of effort into getting commons-csv up to snuff,

Wait: "a lot of effort" doing what? Did you have to modify commons-csv sources?
Or do you mean open issues w/ the commons devs to fix things/add test cases to
commons-csv sources (great!)...?

bq. Switching implementations would most likely result in a lot of regressions that we don't have tests for.

I'd expect the reverse, ie, it's more likely there are bugs in commons-csv
(it's not released and thus not heavily tested) than eg in opencsv. And if
somehow that's really the case (eg we have particular/unusual CSV parsing
requirements), we should have our own tests asserting so?

> Explore alternatives to Commons CSV
> ------------------------------------
>
> Key: SOLR-3296
> URL: https://issues.apache.org/jira/browse/SOLR-3296
> Project: Solr
> Issue Type: Improvement
> Components: Build
> Reporter: Chris Male
>
> In LUCENE-3930 we're implementing some less than ideal solutions to make
> available the unreleased version of commons-csv. We could remove these
> solutions if we didn't rely on this lib. So I think we should explore
> alternatives.
> I think [opencsv|http://opencsv.sourceforge.net/] is an alternative to
> consider; I've used it in many commercial projects. Bizarrely Commons-CSV's
> website says that Opencsv uses a BSD license, but this isn't the case: OpenCSV
> uses ASL2.
[jira] [Commented] (LUCENE-3939) ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243727#comment-13243727 ]

Michael McCandless commented on LUCENE-3939:
--------------------------------------------

bq. For example, if the first invocation of the method map is commented out (as below), then there is no exception thrown. In this case, the Comparator is still null.

This is because of sneakiness/trapiness in TreeSet (and maybe Java's type
erasure for generics), I think. Ie, on inserting only one object into it, it
does not need to cast that object to Comparable (there's nothing to compare
to). But on adding a 2nd object, it will try to cast.

> ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-3939
> URL: https://issues.apache.org/jira/browse/LUCENE-3939
> Project: Lucene - Java
> Issue Type: Bug
> Components: core/index
> Affects Versions: 3.0.2, 3.1, 3.4, 3.5
> Reporter: SHIN HWEI TAN
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> The method map in the SortedTermVectorMapper class does not check the
> parameter term for valid values. It throws ClassCastException when called with
> an invalid string for the parameter term (i.e., var3.map("*", (-1), null,
> null)). The exception thrown is due to an explicit cast (i.e., casting the
> return value of termToTVE.get(term) to type TermVectorEntry).
> Suggested fix -- replace the beginning of the method body for the class
> SortedTermVectorMapper by changing it like this:
> {noformat}
> public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
>   if (termToTVE.get(term) instanceof TermVectorEntry) {
>     TermVectorEntry entry = (TermVectorEntry) termToTVE.get(term);
>     ...
>   }
> }
> {noformat}
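The trap is easy to reproduce outside Lucene. A standalone illustration (not Lucene code; it shows the JDK 6 behavior the comment describes -- later JDKs type-check the very first key as well):

{code:java}
import java.util.TreeMap;

// With no Comparator supplied, TreeMap falls back to casting keys to
// Comparable.  On JDK 6 the first insert has nothing to compare
// against, so the ClassCastException only surfaces on the second.
public class TreeMapTrap {
  public static void main(String[] args) {
    TreeMap<Object, String> map = new TreeMap<Object, String>();
    map.put(new Object(), "first");   // JDK 6: succeeds, no comparison happens
    map.put(new Object(), "second");  // throws ClassCastException: Object is not Comparable
  }
}
{code}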
[jira] [Commented] (LUCENE-3938) Add query time parent child search
[ https://issues.apache.org/jira/browse/LUCENE-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243763#comment-13243763 ]

Michael McCandless commented on LUCENE-3938:
--------------------------------------------

I don't fully grok this yet :) ... but some initial questions:

I'm confused: when you say "parent child document", what does that mean...? I
thought there are parent documents and child documents, in the context of a
given join? Or do you mean "parent or child document"...?

Ie, it looks like your Query is free to match both parent and child
documents...? (Unlike index-time joins). But then you also have a
childrenQuery, which is only allowed to match docs in the child space...?

Minor: there's an @author tag in ParentChildCommand

Minor: maybe break out ParentChildHit into its own source file...?

> Add query time parent child search
> -----------------------------------
>
> Key: LUCENE-3938
> URL: https://issues.apache.org/jira/browse/LUCENE-3938
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/join
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3938.patch
>
> At the moment there is support for index time parent child search with two
> query implementations and a collector. The index time parent child search
> requires that documents are indexed in a block, which isn't ideal for
> updatability. For example, in the case of tv content and subtitles (both being
> separate documents), updating already indexed tv content with subtitles would
> then require re-indexing the subtitles as well.
> This issue focuses on the collector part for query time parent child search.
> I started a while back with implementing this. Basically a two pass search
> performs a parent child search. In the first pass the top N parent child
> documents are resolved. In the second pass the parent or top N children are
> resolved (depending on whether the hit is a parent or child) and are
> associated with the top N parent child relation documents. Patch will follow
> soon.
[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243782#comment-13243782 ]

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

Here's an example where we create a compound token with punctuation. I got this
from the Japanese Wikipedia export, with our MockCharFilter sometimes doubling
characters: we are at a position that has the characters 〇〇'''、''' after
it... that 〇 is this Unicode character:
http://www.fileformat.info/info/unicode/char/3007/index.htm

When Kuromoji extends from this position, both 〇 and 〇〇 are KNOWN, but then
we also extend by unknown 〇〇'''、''' (ie, 〇〇 plus only punctuation). Note
that 〇 is not considered punctuation by Kuromoji's isPunctuation method...
{noformat}
  + UNKNOWN word 〇〇'''、''' toPos=41 cost=21223 penalty=3400 toPos.idx=0
  + KNOWN word 〇〇 toPos=34 cost=9895 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=2766 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=5256 penalty=0 toPos.idx=1
{noformat}
And then on backtrace we make a compound token (UNKNOWN) for all of
〇〇'''、''', while the decompounded path keeps two separate 〇 tokens but
drops the '''、''' since it's all punctuation, thus creating inconsistent
offsets.

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-3940
> URL: https://issues.apache.org/jira/browse/LUCENE-3940
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
> I modified BaseTokenStreamTestCase to assert that the start/end offsets match
> for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the
> decompounding of a compound token has a punctuation token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie, the
> graph should look the same as if the punctuation tokens were not initially
> removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail to leave
> a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the last
> minute...
[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243785#comment-13243785 ]

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

OK here's one possible fix...

Right now, when we are glomming up an UNKNOWN token, we glom only as long as
the character class of each character is the same as the first character. What
if we also require that isPunct-ness is the same? That way we would never
create an UNKNOWN token mixing punct and non-punct...

I implemented that and the tests seem to pass w/ offset checking fully turned
on again...

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-3940
> URL: https://issues.apache.org/jira/browse/LUCENE-3940
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
> I modified BaseTokenStreamTestCase to assert that the start/end offsets match
> for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the
> decompounding of a compound token has a punctuation token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie, the
> graph should look the same as if the punctuation tokens were not initially
> removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail to leave
> a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the last
> minute...
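A minimal sketch of the proposed rule (all helper names here are hypothetical, not Kuromoji's actual internals):

{code:java}
// Sketch: when glomming characters into an UNKNOWN token, require each
// character to match the first one in *both* character class and
// punctuation-ness, so a single token never mixes punct and non-punct.
static int unknownTokenLength(char[] text, int start, int end) {
  int firstClass = characterClass(text[start]);        // hypothetical helper
  boolean firstIsPunct = isPunctuation(text[start]);   // hypothetical helper
  int pos = start + 1;
  while (pos < end
         && characterClass(text[pos]) == firstClass
         && isPunctuation(text[pos]) == firstIsPunct) {
    pos++;
  }
  return pos - start;  // length of the UNKNOWN token to emit
}
{code}

With the extra condition, a run like 〇〇'''、''' splits at the 〇〇/punctuation boundary instead of becoming one mixed token whose decompounded path has inconsistent offsets.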
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243102#comment-13243102 ] Michael McCandless commented on LUCENE-3932: bq. Is the space savings of delta encoding worth the processing time? You could write the .tii file to disk such that on open you could read it straight into a byte[]. This is actually what we do in 4.0's default codec (the index is an FST). It is tempting to do that in 3.x (if we were to do another 3.x release after 3.6) ... we'd need to alter other things as well, eg the term bytes are also delta-coded in the file but not in RAM. I'm curious how much larger it'd be if we stopped delta coding... for your case, how large is the byte[] in RAM (just call dataPagedBytes.getPointer(), just before we freeze it, and print that result) vs the tii on disk...? Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges Attachments: LUCENE-3932.trunk.patch, perf.csv We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
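As an aside, the bit estimate suggested in the description can also be computed with PackedInts.bitsRequired, which avoids getting the log10 arithmetic wrong; a sketch:
{noformat}
// Since the stored offsets are bounded by the .tii file length, size the
// writer for that up front instead of starting at 4 bits and resizing.
int bitEstimate = PackedInts.bitsRequired(tiiFileLength);
GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);
{noformat}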
[jira] [Commented] (LUCENE-3939) ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243108#comment-13243108 ] Michael McCandless commented on LUCENE-3939: I'm confused on how something that's not a TermVectorEntry can get into the termToTVE map... can you post a small test case showing this problem? ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper -- Key: LUCENE-3939 URL: https://issues.apache.org/jira/browse/LUCENE-3939 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0.2, 3.1, 3.4, 3.5 Reporter: SHIN HWEI TAN Original Estimate: 0.05h Remaining Estimate: 0.05h The method map in the SortedTermVectorMapper class does not check the parameter term for valid values. It throws ClassCastException when called with an invalid string for the parameter term (i.e., var3.map(*, (-1), null, null)). The exception thrown is due to an explicit cast (i.e., casting the return value of termToTVE.get(term) to type TermVectorEntry). Suggested Fixes: Replace the beginning of the method body for the class SortedTermVectorMapper by changing it like this: public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) { if(termToTVE.get(term) instanceof TermVectorEntry){ TermVectorEntry entry = (TermVectorEntry) termToTVE.get(term); ... } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243125#comment-13243125 ] Michael McCandless commented on LUCENE-3738: Thanks Uwe, I'll test! Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Blocker Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738-improvement.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243138#comment-13243138 ] Michael McCandless commented on LUCENE-3738: Alas, the results are now all over the place! And I went back to the prior patch and tried to reproduce the above results... and the results are still all over the place. I think we are chasing Java ghosts at this point... Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Blocker Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738-improvement.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243293#comment-13243293 ] Michael McCandless commented on LUCENE-3738: Sorry Uwe, that was exactly it: I don't know what to conclude from the perf runs anymore. But +1 for your new patch: it ought to be better since the code is simpler. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Blocker Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738-improvement.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243299#comment-13243299 ] Michael McCandless commented on LUCENE-3940: bq. StandardTokenizer doesnt leave holes when it drops punctuation, But is that really good? This means a PhraseQuery will match across end-of-sentence (.), semicolon, colon, comma, etc. (English examples..). I think tokenizers should throw away as little information as possible... we can always filter out such tokens in a later stage? For example, if a tokenizer created punct tokens (instead of silently discarding them), other token filters could make use of them in the mean time, eg a synonym rule for u.s.a. -> usa or maybe a dedicated English acronyms filter. We could then later filter them out, even not leaving holes, and have the same behavior that we have now? Are there non-English examples where you would want the PhraseQuery to match over punctuation...? EG, for Japanese, I assume we don't want PhraseQuery applying across periods/commas, like it will now? (Not sure about middle dot...? Others...?). When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole - Key: LUCENE-3940 URL: https://issues.apache.org/jira/browse/LUCENE-3940 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: LUCENE-3940.patch, LUCENE-3940.patch I modified BaseTokenStreamTestCase to assert that the start/end offsets match for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the decompounding of a compound token has a punctuation token that's dropped. In this case we should leave hole(s) so that the graph is intact, ie, the graph should look the same as if the punctuation tokens were not initially removed, but then a StopFilter had removed them. This also affects tokens that have no compound over them, ie we fail to leave a hole today when we remove the punctuation tokens. I'm not sure this is serious enough to warrant fixing in 3.6 at the last minute... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3930) nuke jars from source tree and use ivy
[ https://issues.apache.org/jira/browse/LUCENE-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242520#comment-13242520 ] Michael McCandless commented on LUCENE-3930: bq. In my opinion this is ready to go into trunk. I'll wait a bit for any feedback though. +1 ant test passes, after the one-time ant ivy-bootstrap. Thanks everyone! nuke jars from source tree and use ivy -- Key: LUCENE-3930 URL: https://issues.apache.org/jira/browse/LUCENE-3930 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir Assignee: Robert Muir Priority: Blocker Fix For: 3.6 Attachments: LUCENE-3930-skip-sources-javadoc.patch, LUCENE-3930-solr-example.patch, LUCENE-3930-solr-example.patch, LUCENE-3930.patch, LUCENE-3930.patch, LUCENE-3930.patch, LUCENE-3930__ivy_bootstrap_target.patch, ant_-verbose_clean_test.out.txt, noggit-commons-csv.patch, patch-jetty-build.patch As mentioned on the ML thread: switch jars to ivy mechanism?. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242541#comment-13242541 ] Michael McCandless commented on LUCENE-3932:
{quote}
utf8 -> utf16 is 7% of the time
utf16 -> utf8 is 16% of the time
writing vlong's is also 16% of the time
TermBuffer.read() is 17% of the time (24% if you include the call to utf8ToUtf16)
{quote}
Seems like if we decoded the .tii file and wrote the in-memory format directly (instead of going through SegmentTermEnum), we could get some of this back. The vLongs unfortunately need to be decoded/re-encoded because they are deltas in the file but absolutes in memory. But, eg the vInt docFreq could be a copyVInt method instead of readVInt then writeVInt, which should save a bit. bq. Trying with 3.4 gives a 4 second load time, most of the time spent in SegmentTermEnum.next(). OK, a bit faster than 3.5. But presumably 3.4 uses much more RAM after startup...? bq. Using the patch on trunk, load time goes from ~5 to ~2 seconds. Awesome, thanks for testing! Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges Attachments: LUCENE-3932.trunk.patch, perf.csv We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
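A sketch of the copyVInt idea (a hypothetical helper, not an existing DataInput/DataOutput method): since the high bit of each byte marks continuation, a vInt can be copied byte-for-byte without decoding and re-encoding:
{noformat}
static void copyVInt(DataInput in, DataOutput out) throws IOException {
  byte b;
  do {
    b = in.readByte();
    out.writeByte(b);
  } while ((b & 0x80) != 0); // high bit set means more bytes follow
}
{noformat}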
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242550#comment-13242550 ] Michael McCandless commented on LUCENE-3738: Removing the asserts apparently didn't change the perf... I can reproduce the slowdown in a separate test (before/after this commit):
{noformat}
            Task    QPS base  StdDev base    QPS vInt  StdDev vInt    Pct diff
          IntNRQ        7.11         0.89        6.73         0.58   -23% - 17%
         Prefix3       16.07         0.96       15.65         0.72   -12% -  8%
        Wildcard       20.14         0.91       19.67         0.77   -10% -  6%
        PKLookup      154.62         5.08      151.11         2.82    -7% -  2%
          Fuzzy1       85.24         1.53       83.87         1.18    -4% -  1%
          Fuzzy2       44.11         1.03       43.96         0.44    -3% -  3%
        SpanNear        3.23         0.11        3.22         0.07    -5% -  5%
  TermBGroup1M1P       42.35         0.49       42.43         1.43    -4% -  4%
         Respell       65.11         1.91       65.27         1.27    -4% -  5%
      AndHighMed       54.18         4.04       54.50         2.27   -10% - 13%
     TermGroup1M       31.27         0.35       31.46         0.63    -2% -  3%
    TermBGroup1M       45.01         0.33       45.37         1.42    -3% -  4%
     AndHighHigh       13.35         0.71       13.46         0.50    -7% - 10%
            Term       82.71         3.12       83.56         2.33    -5% -  7%
       OrHighMed       10.66         0.67       10.78         0.44    -8% - 12%
      OrHighHigh        7.08         0.42        7.19         0.26    -7% - 11%
    SloppyPhrase        5.11         0.24        5.20         0.31    -8% - 13%
          Phrase       11.14         0.75       11.40         0.50    -8% - 14%
{noformat}
But then Uwe made a patch (I'll attach) reducing the byte code for the unrolled methods:
{noformat}
            Task    QPS base  StdDev base    QPS vInt  StdDev vInt    Pct diff
        SpanNear        3.24         0.13        3.18         0.07    -7% -  4%
          Phrase       11.34         0.68       11.13         0.38   -10% -  7%
    SloppyPhrase        5.17         0.23        5.08         0.18    -9% -  6%
  TermBGroup1M1P       41.92         0.80       41.57         0.94    -4% -  3%
     TermGroup1M       30.74         0.68       30.81         0.96    -5% -  5%
            Term       80.87         3.52       81.29         2.05    -6% -  7%
    TermBGroup1M       43.94         0.93       44.17         1.32    -4% -  5%
      AndHighMed       53.71         2.62       54.21         1.97    -7% -  9%
     AndHighHigh       13.20         0.42       13.41         0.41    -4% -  8%
         Respell       65.37         2.70       66.53         3.29    -7% - 11%
          Fuzzy1       84.29         2.11       86.44         3.36    -3% -  9%
        PKLookup      149.81         4.20      153.87         9.46    -6% - 12%
      OrHighHigh        7.19         0.28        7.40         0.48    -7% - 13%
       OrHighMed       10.82         0.43       11.16         0.73    -7% - 14%
          Fuzzy2       43.72         0.96       45.24         2.03    -3% - 10%
        Wildcard       18.96         1.00       20.05         0.39    -1% - 13%
         Prefix3       14.96         0.83       15.89         0.27    -1% - 14%
          IntNRQ        5.89         0.58        6.95         0.17     4% - 34%
{noformat}
So... I think we should commit it! Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
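For reference, this is the compact loop form of vInt decoding that the unrolled methods replace; the unrolling trades this loop for straight-line code, which is why the size of the generated byte code matters for inlining:
{noformat}
public int readVInt() throws IOException {
  byte b = readByte();
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7F) << shift;
  }
  return i;
}
{noformat}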
[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
[ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241328#comment-13241328 ] Michael McCandless commented on LUCENE-3935: +1 Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method --- Key: LUCENE-3935 URL: https://issues.apache.org/jira/browse/LUCENE-3935 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Attachments: LUCENE-3935.patch I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected. This method is currently backed by a {{short[][]}}. The data stored here is a two-dimensional array with both dimensions fixed at 1316 elements. (The data is {{matrix.def}} in MeCab-IPADIC.) We can rewrite this to use a single one-dimensional array instead, and we will save at least one bounds check and a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache. I think this will be a nice optimization. Working on it... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
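A sketch of the proposed rewrite, with hypothetical field names: flattening the fixed 1316x1316 matrix into a single short[] drops one bounds check and one pointer dereference per lookup, and keeps the hot data contiguous:
{noformat}
private final short[] costs;      // forwardSize * backwardSize entries
private final int backwardSize;   // 1316 for MeCab-IPADIC

public short get(int forwardId, int backwardId) {
  // row-major layout: costs[forwardId][backwardId] becomes a single index
  return costs[forwardId * backwardSize + backwardId];
}
{noformat}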
[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField
[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241354#comment-13241354 ] Michael McCandless commented on LUCENE-3312: Hi Nikola, I think this plus LUCENE-3891 sounds great! The challenge is... we need a mentor for this project... volunteers? Break out StorableField from IndexableField --- Key: LUCENE-3312 URL: https://issues.apache.org/jira/browse/LUCENE-3312 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: Field Type branch In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own documents instead of the user-space impls we provide in oal.document. Similarly, with LUCENE-3309, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor. But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables -- one for the IndexableFields and one for the StorableFields. Either can be null. One downside is possible perf hit for fields that are both indexed and stored (ie, we visit them twice, lookup their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in the API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
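A hypothetical sketch of what the split could look like at the indexing boundary (names taken from this issue; this is not a real API yet):
{noformat}
// Indexing-time view and storage-time view become separate narrow APIs;
// a document is then just two iterables, either of which may be null.
interface DocConsumer {
  void addDocument(Iterable<? extends IndexableField> indexedFields,
                   Iterable<? extends StorableField> storedFields) throws IOException;
}
{noformat}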
[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
[ https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241357#comment-13241357 ] Michael McCandless commented on LUCENE-3907: Awesome! We just need a possible mentor here... volunteers...? Improve the Edge/NGramTokenizer/Filters --- Key: LUCENE-3907 URL: https://issues.apache.org/jira/browse/LUCENE-3907 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.0 Our ngram tokenizers/filters could use some love. EG, they output ngrams in multiple passes, instead of stacked, which messes up offsets/positions and requires too much buffering (can hit OOME for long tokens). They clip at 1024 chars (tokenizers) but don't (token filters). They split up surrogate pairs incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
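On the surrogate-pair point, the usual fix is to walk code points rather than chars when choosing gram boundaries; a small self-contained sketch:
{noformat}
// Advancing by Character.charCount(cp) never splits a surrogate pair.
static int countCodePoints(CharSequence s) {
  int count = 0;
  for (int i = 0; i < s.length(); ) {
    int cp = Character.codePointAt(s, i);
    i += Character.charCount(cp);
    count++;
  }
  return count;
}
{noformat}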
[jira] [Commented] (LUCENE-3936) Rename StringIndexDocValues to DocTermsIndexDocValues
[ https://issues.apache.org/jira/browse/LUCENE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241359#comment-13241359 ] Michael McCandless commented on LUCENE-3936: +1 Rename StringIndexDocValues to DocTermsIndexDocValues - Key: LUCENE-3936 URL: https://issues.apache.org/jira/browse/LUCENE-3936 Project: Lucene - Java Issue Type: Improvement Components: modules/other Reporter: Martijn van Groningen Fix For: 4.0 Attachments: LUCENE-3936.patch StringIndex doesn't exist any more in trunk, so the name DocTermsIndex should be used, and this is also what the class actually uses. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2000) Use covariant clone() return types
[ https://issues.apache.org/jira/browse/LUCENE-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241364#comment-13241364 ] Michael McCandless commented on LUCENE-2000: We now get a bunch of redundant cast warnings from this ... are there plans to fix that...? Use covariant clone() return types -- Key: LUCENE-2000 URL: https://issues.apache.org/jira/browse/LUCENE-2000 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Ryan McKinley Priority: Minor Fix For: 4.0 Attachments: LUCENE-2000-clone_covariance.patch, LUCENE-2000-clone_covariance.patch *Paul Cowan wrote in LUCENE-1257:* OK, thought I'd jump in and help out here with one of my Java 5 favourites. Haven't seen anyone discuss this, and don't believe any of the patches address this, so thought I'd throw a patch out there (against SVN HEAD @ revision 827821) which uses Java 5 covariant return types for (almost) all of the Object#clone() implementations in core. i.e. this: public Object clone() { changes to: public SpanNotQuery clone() { which lets us get rid of a whole bunch of now-unnecessary casts, so e.g. if (clone == null) clone = (SpanNotQuery) this.clone(); becomes if (clone == null) clone = this.clone(); Almost everything has been done and all downcasts removed, in core, with the exception of:
- Some SpanQuery stuff, where it's assumed that it's safe to cast the clone() of a SpanQuery to a SpanQuery - this can't be made covariant without declaring abstract SpanQuery clone() in SpanQuery itself, which breaks those SpanQuerys that don't declare their own clone()
- Some IndexReaders, e.g. DirectoryReader - we can't be more specific than changing .clone() to return IndexReader, because it returns the result of IndexReader.clone(boolean). We could use covariant types for THAT, which would work fine, but that didn't follow the pattern of the others so that could be a later commit.
Two changes were also made in contrib/, where not making the changes would have broken code by trying to widen IndexInput#clone() back out to returning Object, which is not permitted. contrib/ was otherwise left untouched. Let me know what you think, or if you have any other questions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241386#comment-13241386 ] Michael McCandless commented on LUCENE-3932: I agree net/net that change is good; we know the in-RAM image will be at least as large as the tii file so we should make a better guess up front. 3.x is currently in code freeze (for the 3.6.0 release), but I'll commit to trunk's preflex codec. Can you describe more about your index...? If your tii file is 66 MB, how many terms do you have...? 5 seconds is also a long startup time... what's the IO system like? Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2000) Use covariant clone() return types
[ https://issues.apache.org/jira/browse/LUCENE-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241417#comment-13241417 ] Michael McCandless commented on LUCENE-2000: Thanks Ryan! Use covariant clone() return types -- Key: LUCENE-2000 URL: https://issues.apache.org/jira/browse/LUCENE-2000 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Ryan McKinley Priority: Minor Fix For: 4.0 Attachments: LUCENE-2000-clone_covariance.patch, LUCENE-2000-clone_covariance.patch *Paul Cowan wrote in LUCENE-1257:* OK, thought I'd jump in and help out here with one of my Java 5 favourites. Haven't seen anyone discuss this, and don't believe any of the patches address this, so thought I'd throw a patch out there (against SVN HEAD @ revision 827821) which uses Java 5 covariant return types for (almost) all of the Object#clone() implementations in core. i.e. this: public Object clone() { changes to: public SpanNotQuery clone() { which lets us get rid of a whole bunch of now-unnecessary casts, so e.g. if (clone == null) clone = (SpanNotQuery) this.clone(); becomes if (clone == null) clone = this.clone(); Almost everything has been done and all downcasts removed, in core, with the exception of Some SpanQuery stuff, where it's assumed that it's safe to cast the clone() of a SpanQuery to a SpanQuery - this can't be made covariant without declaring abstract SpanQuery clone() in SpanQuery itself, which breaks those SpanQuerys that don't declare their own clone() Some IndexReaders, e.g. DirectoryReader - we can't be more specific than changing .clone() to return IndexReader, because it returns the result of IndexReader.clone(boolean). We could use covariant types for THAT, which would work fine, but that didn't follow the pattern of the others so that could be a later commit. Two changes were also made in contrib/, where not making the changes would have broken code by trying to widen IndexInput#clone() back out to returning Object, which is not permitted. contrib/ was otherwise left untouched. Let me know what you think, or if you have any other questions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1591) Enable bzip compression in benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241428#comment-13241428 ] Michael McCandless commented on LUCENE-1591: Note that enwiki-20110115-pages-articles.xml.bz2 also hits XERCESJ-1257 ... Enable bzip compression in benchmark Key: LUCENE-1591 URL: https://issues.apache.org/jira/browse/LUCENE-1591 Project: Lucene - Java Issue Type: Improvement Components: modules/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9, 3.1, 4.0 Attachments: LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, commons-compress-dev20090413.jar, commons-compress-dev20090413.jar bzip compression can aid the benchmark package by not requiring extracting bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false and in the relevant tasks either decompress the input file or compress the output file using the bzip streams. It will add a dependency on ant.jar which contains two classes similar to GZIPOutputStream and GZIPInputStream which compress/decompress files using the bzip algorithm. bzip is known to be superior in its compression performance to the gzip algorithm (~20% better compression), although it does the compression/decompression a bit slower. I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes, so it can be inherited by all sub-classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3937) Workaround the XERCES-J bug in Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241438#comment-13241438 ] Michael McCandless commented on LUCENE-3937: LUCENE-1591 is when we first tripped on the XERCESJ-1257 bug... and the bug also happens on the enwiki-20110115-pages-articles.xml.bz2 export. Great idea to workaround Xercesj's bug by using the JVM to decode UTF8, instead of Xercesj... I'll test this patch now! Workaround the XERCES-J bug in Benchmark Key: LUCENE-3937 URL: https://issues.apache.org/jira/browse/LUCENE-3937 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Attachments: LUCENE-3937.patch In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched code and the source of EnwikiContentSource, we can simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If not, it will no longer work (because the XML parser cannot switch encoding if it only has a Reader). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
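A sketch of the workaround (variable names are hypothetical): let the JVM decode UTF-8, configured to throw on malformed input, and hand the parser a Reader so Xerces' own UTF8Reader is never used:
{noformat}
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of
    .onUnmappableCharacter(CodingErrorAction.REPORT); // silently replacing
Reader reader = new BufferedReader(new InputStreamReader(is, decoder));
// the parser can no longer switch encodings, so the input must be UTF-8
saxParser.parse(new InputSource(reader), handler);
{noformat}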
[jira] [Commented] (LUCENE-3937) Workaround the XERCES-J bug in Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241445#comment-13241445 ] Michael McCandless commented on LUCENE-3937: Note: I just ran benchmark's conf/extractWikipedia.alg task on the XML export... when XERCESJ-1257 strikes you get this:
{noformat}
...
[java] 936.83 sec -- main Wrote 2801000 line docs
[java] 937.04 sec -- main Wrote 2802000 line docs
[java] 937.27 sec -- main Wrote 2803000 line docs
[java] 937.53 sec -- main Wrote 2804000 line docs
[java] 937.79 sec -- main Wrote 2805000 line docs
[java] 938.04 sec -- main Wrote 2806000 line docs
[java] 938.35 sec -- main Wrote 2807000 line docs
[java] 938.65 sec -- main Wrote 2808000 line docs
[java] 938.88 sec -- main Wrote 2809000 line docs
[java] 939.09 sec -- main Wrote 2810000 line docs
[java] 939.09 sec -- main Wrote 2810000 line docs
[java] Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
[java] at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:198)
[java] at java.lang.Thread.run(Thread.java:619)
[java]
[java] ### D O N E !!! ###
[java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
[java]
[java] at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
[java] at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
[java] at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
[java] at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[java] at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[java] at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[java] at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[java] at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
[java] at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:175)
[java] ... 1 more
[java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
[java] at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
[java] at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
[java] at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[java] at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
[java] ... 8 more
{noformat}
Workaround the XERCES-J bug in Benchmark Key: LUCENE-3937 URL: https://issues.apache.org/jira/browse/LUCENE-3937 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Attachments: LUCENE-3937.patch In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched code and the source of EnwikiContentSource, we can simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If not, it will no longer work (because the XML parser cannot switch encoding if it only has a Reader). -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3937) Workaround the XERCES-J bug in Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241486#comment-13241486 ] Michael McCandless commented on LUCENE-3937: OK with this patch the decode of enwiki-20110115 finished! I agree we should tell the decoder to throw an exception on any problems... Workaround the XERCES-J bug in Benchmark Key: LUCENE-3937 URL: https://issues.apache.org/jira/browse/LUCENE-3937 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Attachments: LUCENE-3937.patch In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched code and the source of EnwikiContentSource, we can simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If not, it will no longer work (because the XML parser cannot switch encoding if it only has a Reader). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241497#comment-13241497 ] Michael McCandless commented on LUCENE-3932: Nice. I'd love to know how trunk handles all these terms (we have a more memory efficient terms dict/index in 4.0). bq. After the change the big time waste is converting the terms from utf8 to utf16 when reading from the .tii file, and then back to utf8 when writing to the in memory store. What percentage of the time is spent on the decode/encode (after fixing the initial bitEstimate)? That is very silly... fixing that is a somewhat deeper change though. I guess we'd need to read the .tii file directly (not use SegmentTermEnum), and then copy the UTF8 bytes straight without going through UTF16... Do you have comparisons with pre-3.5 (before we cutover to this more RAM-efficient (but CPU heavy on load) terms index)? Presumably that used less CPU on init, but more RAM held for the lifetime of the reader...? Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
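A sketch of skipping the UTF16 round trip for the term bytes (hypothetical; the real change would have to read the .tii directly rather than via SegmentTermEnum):
{noformat}
// in: DataInput over the .tii file; out: DataOutput building the in-RAM image.
// Copy the term suffix bytes as-is instead of utf8ToUtf16 then utf16ToUtf8.
int suffixLength = in.readVInt();
out.writeVInt(suffixLength);
out.copyBytes(in, suffixLength);
{noformat}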
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240527#comment-13240527 ] Michael McCandless commented on LUCENE-3892: That's great Han, I'll have a look. I can be a mentor for this... Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) - Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.0 On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
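For anyone picking this up, a minimal frame-of-reference (FOR) encode sketch; PFOR adds exception slots for outliers on top of this. The PackedInts calls are assumptions about the trunk API of the time:
{noformat}
// Encode a block of non-negative ints (docID deltas would be computed
// before this step): find the max, then pack everything with just enough bits.
static void forEncode(int[] block, DataOutput out) throws IOException {
  int max = 0;
  for (int v : block) max = Math.max(max, v);
  int bits = PackedInts.bitsRequired(max);
  out.writeVInt(bits);
  PackedInts.Writer writer = PackedInts.getWriter(out, block.length, bits);
  for (int v : block) writer.add(v);
  writer.finish();
}
{noformat}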
[jira] [Commented] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239447#comment-13239447 ] Michael McCandless commented on SOLR-3282: -- This sounds like a fabulous test! I wonder if we can somehow make this easily runnable on demand (eg, like Test2BTerms), assuming you have the prereqs installed locally (eg Japanese Wikipedia export). Perform Kuromoji/Japanese stability test before 3.6 freeze -- Key: SOLR-3282 URL: https://issues.apache.org/jira/browse/SOLR-3282 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Assignee: Christian Moen Kuromoji might be used by many and also in mission critical systems. I'd like to run a stability test before we freeze 3.6. My thinking is to test the out-of-the-box configuration using fieldtype {{text_ja}} as follows: # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a never ending loop # Simultaneously run many tens of thousands typical Japanese queries against the index at 3-5 queries per second with highlighting turned on While Solr is indexing and searching, I'd like to verify that: * Indexing and queries are working as expected * Memory and heap usage looks stable over time * Garbage collection is overall low over time -- no Full-GC issues I'll post findings and results to this JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239667#comment-13239667 ] Michael McCandless commented on LUCENE-3738: +1 to remove those asserts... let's see if this fixes the slowdown the nightly builds hit on 3/18: http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3076) Solr should support block joins
[ https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238239#comment-13238239 ] Michael McCandless commented on SOLR-3076: -- {quote} 2. Do you agree with overall approach to deliver straightforward QP with explicit joining syntax? Or you object and insist on entity-relationship-schema approach? 3. What's is the level of uncertainty you have about the current QP syntax? What's your main concern and what's the way to improve it? {quote} Well, stepping back, my concern is still that I don't think there should be any QP syntax to express block joins. These are joins determined at indexing time, and compiled into the index, and so the only remaining query-time freedom is which fields you want to search against (something QP can already understand, ie field:text syntax). From that fields list the required joins are implied. I can't imagine users learning/typing the sort of syntax we are discussing here. It's true there are exceptional cases (Hoss's size field that's on both parent and child docs), but, that's the exception not the rule; I don't think we should design things (APIs, QP syntax) around exceptional cases. And, I think such an exception should be handled by some sort of field aliasing (book_page_count vs chapter_page_count). For query-time join, which is fully flexible, I agree the QP must (and already does) include join syntax, ie be more like SQL, where you can express arbitrary on-the-fly joins. But, at the same time, the 'users' of Solr's QP syntax may not be the end user, ie, the app's front end may very well construct these complex join expressions and so it's really the developers of that search app writing these join queries. So perhaps it's fine to add crazy-expert syntax that end users would rarely use but search app developers might...? All this being said, I defer to Hoss (and other committers more experienced w/ Solr QP issues) here... if they all feel this added QP syntax makes sense then let's do it! Solr should support block joins --- Key: SOLR-3076 URL: https://issues.apache.org/jira/browse/SOLR-3076 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Attachments: SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, child-bjqparser.patch, parent-bjq-qparser.patch, parent-bjq-qparser.patch, solrconf-bjq-erschema-snippet.xml, tochild-bjq-filtered-search-fix.patch Lucene has the ability to do block joins, we should add it to Solr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3923) fail the build on wrong svn:eol-style
[ https://issues.apache.org/jira/browse/LUCENE-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238355#comment-13238355 ] Michael McCandless commented on LUCENE-3923: +1 And, ideally, ant test as well... fail the build on wrong svn:eol-style - Key: LUCENE-3923 URL: https://issues.apache.org/jira/browse/LUCENE-3923 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir I'm tired of fixing this before releases. Jenkins should detect and fail on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3873) tie MockGraphTokenFilter into all analyzers tests
[ https://issues.apache.org/jira/browse/LUCENE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238398#comment-13238398 ] Michael McCandless commented on LUCENE-3873: LUCENE-3848 has the MockGraphTokenFilter patch... tie MockGraphTokenFilter into all analyzers tests - Key: LUCENE-3873 URL: https://issues.apache.org/jira/browse/LUCENE-3873 Project: Lucene - Java Issue Type: Task Components: modules/analysis Reporter: Robert Muir Mike made a MockGraphTokenFilter on LUCENE-3848. Many filters currently aren't tested with anything but a simple tokenstream. We should test them with this, too, it might find bugs (zero-length terms, stacked terms/synonyms, etc) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3659) Improve Javadocs of RAMDirectory to document its limitations and add improvements to make it more GC friendly on large indexes
[ https://issues.apache.org/jira/browse/LUCENE-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238402#comment-13238402 ] Michael McCandless commented on LUCENE-3659: This looks great Uwe! I'm a little worried about the tiny file case; you're checking for SEGMENTS_* now, but many other files can be much smaller than 1/64th of the estimated segment size. I wonder if we should improve IOContext to hold the [rough] estimated file size (not just overall segment size)... the thing is that's sort of a hassle on codec impls. Or: maybe, on closing the ROS/RAMFile, we can downsize the final buffer (yes, this means copying the bytes, but that cost is vanishingly small as the RAMDir grows). Then tiny files stay tiny, though they are still [relatively] costly to create... I don't think RAMDir.createOutput should publish the RAMFile until the ROS is closed? Ie, you are not allowed to openInput on something still opened with createOutput in any Lucene Dir impl..? This would allow us to make RAMFile frozen (eg if ROS holds its own buffers and then creates RAMFile on close), which requires no sync when reading? I also don't think RAMFile should be public, ie, the only way to make changes to a file stored in a RAMDir is via RAMOutputStream. We can do this separately... Maybe we should pursue a growing buffer size...? Ie, where each newly added buffer is bigger than the one before (like ArrayUtil.oversize's growth function)... I realize that adds complexity (RAMInputStream.seek is more fun), but this would let tiny files use tiny RAM and huge files use few buffers. Ie, RAMDir would scale up and scale down well. Separately: I noticed we still have IndexOutput.setLength, but, nobody calls it anymore I think? (In 3.x we call this when creating a CFS). Maybe we should remove it... Improve Javadocs of RAMDirectory to document its limitations and add improvements to make it more GC friendly on large indexes -- Key: LUCENE-3659 URL: https://issues.apache.org/jira/browse/LUCENE-3659 Project: Lucene - Java Issue Type: Task Affects Versions: 3.5, 4.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3659.patch, LUCENE-3659.patch, LUCENE-3659.patch Spinoff from several dev@lao issues: - [http://mail-archives.apache.org/mod_mbox/lucene-dev/201112.mbox/%3C001001ccbf1c%2471845830%24548d0890%24%40thetaphi.de%3E] - issue LUCENE-3653 The use cases for RAMDirectory are very limited and to prevent users from using it for e.g. loading a 50 Gigabyte index from a file on disk, we should improve the javadocs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
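A sketch of the growing-buffer idea (the growth factor here is a placeholder; ArrayUtil.oversize uses a more refined function):
{noformat}
// Each newly added buffer is larger than the previous one, so tiny files
// stay tiny while huge files need only a logarithmic number of buffers.
static int nextBufferSize(int current, int maxBufferSize) {
  int next = current + (current >> 1); // grow by ~1.5x
  return Math.min(next, maxBufferSize);
}
{noformat}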
[jira] [Commented] (LUCENE-3873) tie MockGraphTokenFilter into all analyzers tests
[ https://issues.apache.org/jira/browse/LUCENE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238422#comment-13238422 ] Michael McCandless commented on LUCENE-3873: I agree we can use it in specific places for starters... The patch on LUCENE-3848 mixes in TokenStream to Automaton and MockGraphTokenFilter; I'll split that apart and only commit MockGraphTokenFilter here. One problem is... MockGraphTokenFilter isn't setting offsets currently. I think to do this correctly it needs to buffer up pending input tokens until it's reached the posLength it wants to output for a random token, and then set the offsets accordingly. tie MockGraphTokenFilter into all analyzers tests - Key: LUCENE-3873 URL: https://issues.apache.org/jira/browse/LUCENE-3873 Project: Lucene - Java Issue Type: Task Components: modules/analysis Reporter: Robert Muir Assignee: Michael McCandless Mike made a MockGraphTokenFilter on LUCENE-3848. Many filters currently aren't tested with anything but a simple tokenstream. We should test them with this, too; it might find bugs (zero-length terms, stacked terms/synonyms, etc.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
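The buffering Mike describes could look roughly like the following; purely a sketch with a hypothetical helper, not the committed fix. The point is that coverage is decided by position increments (not token counts), and the synthetic token's offsets must come from the real input tokens it spans:

{code}
import java.util.List;
import org.apache.lucene.analysis.Token;

// Sketch only: given input tokens already buffered while generating a
// synthetic token, compute the offsets of the span that token covers.
static int[] offsetsForSyntheticToken(List<Token> buffered, int posLength) {
  int covered = 0;
  int end = -1;
  for (Token t : buffered) {
    covered += t.getPositionIncrement();
    end = t.endOffset();
    if (covered >= posLength) {
      break;  // buffered span now covers the synthetic token's posLength
    }
  }
  return new int[] { buffered.get(0).startOffset(), end };
}
{code}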
[jira] [Commented] (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237893#comment-13237893 ] Michael McCandless commented on LUCENE-1410: bq. Out of curiosity, is the PFOR effort dead? Nothing in open source is ever dead! (Well, rarely...). It's just that nobody has picked this up again and pushed it to a committable state. I think now that we have no more bulk API in trunk, it may not be that much work to finish... though there could easily be surprises. I opened LUCENE-3892 to do exactly this, as a Google Summer of Code project. PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: core/index Reporter: Paul Elschot Priority: Minor Fix For: Bulk Postings branch Attachments: LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java, autogen.tgz, for-summary.txt Original Estimate: 21,840h Remaining Estimate: 21,840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
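For reference, the core idea of Patched Frame of Reference is simple even if a committable implementation is not: pack each value of a block into b bits, and store the values that don't fit as "patch" exceptions. A toy sketch (real implementations do actual bit packing and chain exceptions through the placeholder slots):

{code}
import java.util.ArrayList;
import java.util.List;

// Toy PFOR encoder: values that fit in b bits go into the packed stream;
// overflowing values become (position, value) exceptions patched back in
// at decode time.  Assumes non-negative input values.
static List<int[]> pforEncode(int[] block, int b, int[] packed) {
  final int max = (1 << b) - 1;
  List<int[]> exceptions = new ArrayList<int[]>();
  for (int i = 0; i < block.length; i++) {
    if (block[i] <= max) {
      packed[i] = block[i];
    } else {
      packed[i] = 0;                              // placeholder
      exceptions.add(new int[] { i, block[i] });  // the "patch"
    }
  }
  return exceptions;
}
{code}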
[jira] [Commented] (LUCENE-3581) IndexReader#isCurrent() should return true on a NRT reader if no deletes are applied and only deletes are present in IW
[ https://issues.apache.org/jira/browse/LUCENE-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237926#comment-13237926 ] Michael McCandless commented on LUCENE-3581: This need not block 3.6.0, right? We are returning false when we could return true from isCurrent, but this just means the app will go through the reopen when it didn't have to...? Ie, relatively minor? IndexReader#isCurrent() should return true on a NRT reader if no deletes are applied and only deletes are present in IW --- Key: LUCENE-3581 URL: https://issues.apache.org/jira/browse/LUCENE-3581 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.5, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 3.6, 4.0 I keep forgetting about this; I better open an issue. If you have a NRT reader without deletes applied it should in fact return true on IR#isCurrent() if the IW only has deletes in its buffer, ie. no documents were updated/added since the NRT reader was opened. Currently if there is a delete coming in we force a reopen which does nothing since deletes are not applied anyway. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
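The reopen pattern affected here, roughly (a sketch against the 3.x-era API): a false negative from isCurrent() only costs an openIfChanged call that may hand back an equivalent reader, which is why this is relatively minor:

{code}
// Sketch: NRT reader refresh loop.  If isCurrent() wrongly returns false,
// the only penalty is an unnecessary openIfChanged.
IndexReader reader = IndexReader.open(writer, true);  // NRT, applyAllDeletes
// ... later, on refresh:
if (!reader.isCurrent()) {  // may be a false negative per this issue
  IndexReader newReader = IndexReader.openIfChanged(reader);
  if (newReader != null) {
    reader.close();
    reader = newReader;
  }
}
{code}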
[jira] [Commented] (LUCENE-3919) more thorough testing of analysis chains
[ https://issues.apache.org/jira/browse/LUCENE-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237994#comment-13237994 ] Michael McCandless commented on LUCENE-3919: Awesome! more thorough testing of analysis chains Key: LUCENE-3919 URL: https://issues.apache.org/jira/browse/LUCENE-3919 Project: Lucene - Java Issue Type: Task Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3919.patch In Lucene we essentially test each analysis component separately. We also give some good testing to the example Analyzers we provide that combine them. But we don't test various combinations that are possible, which is bad because it doesn't test possibilities for custom analyzers (especially since lots of Solr users etc. define their own). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3909) Move Kuromoji to analysis.ja and introduce Japanese* naming
[ https://issues.apache.org/jira/browse/LUCENE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237494#comment-13237494 ] Michael McCandless commented on LUCENE-3909: +1 Move Kuromoji to analysis.ja and introduce Japanese* naming --- Key: LUCENE-3909 URL: https://issues.apache.org/jira/browse/LUCENE-3909 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Lucene/Solr 3.6 and 4.0 will get out-of-the-box Japanese language support through {{KuromojiAnalyzer}}, {{KuromojiTokenizer}} and various other filters. These filters currently live in {{org.apache.lucene.analysis.kuromoji}}. I'm proposing that we move Kuromoji to a new Japanese package {{org.apache.lucene.analysis.ja}} in line with how other languages are organized. As part of this, I also think we should rename {{KuromojiAnalyzer}} to {{JapaneseAnalyzer}}, etc. to further align naming to our conventions by making it very clear that these analyzers are for Japanese. (As much as I like the name Kuromoji, I think Japanese is more fitting.) A potential issue I see with this that I'd like to raise and get feedback on, is that end-users in Japan and elsewhere who use lucene-gosen could have issues after an upgrade since lucene-gosen is in fact releasing its analyzers under the {{org.apache.lucene.analysis.ja}} namespace (and we'd have a name clash). I believe users should have the freedom to choose whichever Japanese analyzer, filter, etc. they'd like to use, and I don't want to propose a name change that just creates unnecessary problems for users, but I think the naming proposed above is most fitting for a Lucene/Solr release. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237545#comment-13237545 ] Michael McCandless commented on LUCENE-3913: I forgot to say: patch is against 3.x. HTMLStripCharFilter produces invalid final offset - Key: LUCENE-3913 URL: https://issues.apache.org/jira/browse/LUCENE-3913 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3913.patch Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237547#comment-13237547 ] Michael McCandless commented on LUCENE-3911: Looks great! improve BaseTokenStreamTestCase random string generation Key: LUCENE-3911 URL: https://issues.apache.org/jira/browse/LUCENE-3911 Project: Lucene - Java Issue Type: Task Components: general/test Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3911.patch, LUCENE-3911.patch Most analysis tests use MockTokenizer (which splits on whitespace), but it's rare that we generate a string with 'many tokens'. So I think we should try to generate more realistic test strings. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237548#comment-13237548 ] Michael McCandless commented on LUCENE-3913: Good idea! I'll fix that test case. Here's the failure output: {noformat} [junit] - Standard Error - [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=testOddHTMLString -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs=-Dfile.encoding=UTF-8 [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs=-Dfile.encoding=UTF-8 [junit] NOTE: test params are: locale=zh_SG, timezone=Europe/Minsk [junit] NOTE: all tests run in this JVM: [junit] [HTMLStripCharFilterTest] [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=163214064,total=189988864 [junit] - --- [junit] Testcase: testOddHTMLString(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest): FAILED [junit] finalOffset expected:20 but was:19 [junit] junit.framework.AssertionFailedError: finalOffset expected:20 but was:19 [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$3.addError(JUnitTestRunner.java:975) [junit] at junit.framework.TestResult.addError(TestResult.java:38) [junit] at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51) [junit] at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:100) [junit] at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:41) [junit] at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:97) [junit] at org.junit.internal.runners.model.EachTestNotifier.addFailure(EachTestNotifier.java:26) [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:267) [junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:146) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50) [junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) [junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) [junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) [junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) [junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74) [junit] at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:36) [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67) [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18) [junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:300) [junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911) [junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768) [junit] Caused by: java.lang.AssertionError: finalOffset expected:20 but was:19 [junit] at org.junit.Assert.fail(Assert.java:93) [junit] at org.junit.Assert.failNotEquals(Assert.java:647) [junit] at org.junit.Assert.assertEquals(Assert.java:128) [junit] at org.junit.Assert.assertEquals(Assert.java:472) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:182) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:574) [junit] at org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testOddHTMLString(HTMLStripCharFilterTest.java:550) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237652#comment-13237652 ] Michael McCandless commented on LUCENE-3913: Awesome, thanks Steve! HTMLStripCharFilter produces invalid final offset - Key: LUCENE-3913 URL: https://issues.apache.org/jira/browse/LUCENE-3913 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Steven Rowe Fix For: 3.6, 4.0 Attachments: LUCENE-3913.patch, LUCENE-3913.patch Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3268) remove write access to source tree (chmod 555) when running tests in Jenkins
[ https://issues.apache.org/jira/browse/SOLR-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236500#comment-13236500 ] Michael McCandless commented on SOLR-3268: -- +1 remove write access to source tree (chmod 555) when running tests in Jenkins --- Key: SOLR-3268 URL: https://issues.apache.org/jira/browse/SOLR-3268 Project: Solr Issue Type: Bug Reporter: Robert Muir Fix For: 3.6, 4.0 Some tests are currently creating files under the source tree. This causes a lot of problems: it makes my checkout look dirty after running 'ant test' and I have to clean up. I opened an issue for this a month and a half ago for solrj/src/test-files/solrj/solr/shared/test-solr.xml (SOLR-3112), but now we have a second file (core/src/test-files/solr/conf/elevate-data-distrib.xml). So I think Hudson needs to chmod these src directories to 555, so that Solr tests that do this will fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3905) BaseTokenStreamTestCase should test analyzers on real-ish content
[ https://issues.apache.org/jira/browse/LUCENE-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236806#comment-13236806 ] Michael McCandless commented on LUCENE-3905: The ngram filters are unfortunately not OK: they use up tons of RAM when you send random/big tokens through them, because they don't have the same 1024 character limit... I think we should open a new issue for them... in fact I think repairing them could make a good GSoC! BaseTokenStreamTestCase should test analyzers on real-ish content - Key: LUCENE-3905 URL: https://issues.apache.org/jira/browse/LUCENE-3905 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3905.patch We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3905) BaseTokenStreamTestCase should test analyzers on real-ish content
[ https://issues.apache.org/jira/browse/LUCENE-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236820#comment-13236820 ] Michael McCandless commented on LUCENE-3905: OK I opened LUCENE-3907 for ngram love... BaseTokenStreamTestCase should test analyzers on real-ish content - Key: LUCENE-3905 URL: https://issues.apache.org/jira/browse/LUCENE-3905 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3905.patch We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3897) KuromojiTokenizer fails with large docs
[ https://issues.apache.org/jira/browse/LUCENE-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235493#comment-13235493 ] Michael McCandless commented on LUCENE-3897: Thanks Christian! KuromojiTokenizer fails with large docs --- Key: LUCENE-3897 URL: https://issues.apache.org/jira/browse/LUCENE-3897 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Assignee: Christian Moen Fix For: 3.6, 4.0 Attachments: LUCENE-3897.patch just shoving largeish random docs triggers asserts like: {noformat} [junit] Caused by: java.lang.AssertionError: backPos=4100 vs lastBackTracePos=5120 [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.backtrace(KuromojiTokenizer.java:907) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.parse(KuromojiTokenizer.java:756) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.incrementToken(KuromojiTokenizer.java:403) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:404) {noformat} But, you get no seed... I'll commit the test case and @Ignore it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3887) 'ant javadocs' should fail if a package is missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235574#comment-13235574 ] Michael McCandless commented on LUCENE-3887: You can also just run the javadoc checker directly in a source checkout, like this: {noformat} python -u dev-tools/scripts/checkJavaDocs.py /lucene/3x/lucene/build {noformat} You have to ant javadocs first yourself. Right now it only checks for missing sentences in the package-summary.html... I'll see if I can fix it to also detect missing package.html's... Here's what it reports on 3.x right now: {noformat} /lucene/3x/lucene/build/docs/api/contrib-highlighter/org/apache/lucene/search/highlight/package-summary.html missing: TokenStreamFromTermPositionVector /lucene/3x/lucene/build/docs/api/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html missing: BoundaryScanner missing: BaseFragmentsBuilder missing: FieldFragList.WeightedFragInfo missing: FieldFragList.WeightedFragInfo.SubInfo missing: FieldPhraseList.WeightedPhraseInfo missing: FieldPhraseList.WeightedPhraseInfo.Toffs missing: FieldQuery.QueryPhraseMap missing: FieldTermStack.TermInfo missing: ScoreOrderFragmentsBuilder.ScoreComparator missing: SimpleBoundaryScanner /lucene/3x/lucene/build/docs/api/contrib-spatial/org/apache/lucene/spatial/tier/package-summary.html missing: DistanceHandler.Precision /lucene/3x/lucene/build/docs/api/contrib-spellchecker/org/apache/lucene/search/suggest/package-summary.html missing: Lookup.LookupPriorityQueue /lucene/3x/lucene/build/docs/api/contrib-spellchecker/org/apache/lucene/search/suggest/jaspell/package-summary.html missing: JaspellLookup /lucene/3x/lucene/build/docs/api/contrib-spellchecker/org/apache/lucene/search/suggest/tst/package-summary.html missing: TSTAutocomplete missing: TSTLookup /lucene/3x/lucene/build/docs/api/contrib-pruning/org/apache/lucene/index/pruning/package-summary.html missing: CarmelTopKTermPruningPolicy.ByDocComparator missing: CarmelUniformTermPruningPolicy.ByDocComparator /lucene/3x/lucene/build/docs/api/contrib-facet/org/apache/lucene/facet/taxonomy/writercache/lru/package-summary.html missing: LruTaxonomyWriterCache.LRUType /lucene/3x/lucene/build/docs/api/contrib-facet/org/apache/lucene/facet/index/package-summary.html missing: FacetsPayloadProcessorProvider.FacetsDirPayloadProcessor /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/store/package-summary.html missing: FSDirectory.FSIndexOutput missing: NIOFSDirectory.NIOFSIndexInput missing: RAMFile missing: SimpleFSDirectory.SimpleFSIndexInput missing: SimpleFSDirectory.SimpleFSIndexInput.Descriptor /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/index/package-summary.html missing: MergePolicy.MergeAbortedException /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/search/package-summary.html missing: FieldCache.CreationPlaceholder missing: FieldComparator.NumericComparator<T extends Number> missing: FieldValueHitQueue.Entry missing: QueryTermVector missing: ScoringRewrite<Q extends Query> missing: SpanFilterResult.PositionInfo missing: SpanFilterResult.StartEnd missing: TimeLimitingCollector.TimerThread /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/util/package-summary.html missing: ByteBlockPool.Allocator missing: ByteBlockPool.DirectAllocator missing: ByteBlockPool.DirectTrackingAllocator missing: BytesRefHash.BytesStartArray missing: BytesRefHash.DirectBytesStartArray missing: BytesRefIterator.EmptyBytesRefIterator missing: 
DoubleBarrelLRUCache.CloneableKey missing: OpenBitSetDISI missing: PagedBytes.Reader missing: UnicodeUtil.UTF16Result missing: UnicodeUtil.UTF8Result /lucene/3x/lucene/build/docs/api/contrib-analyzers/org/tartarus/snowball/package-summary.html missing: Among missing: TestApp /lucene/3x/lucene/build/docs/api/contrib-xml-query-parser/org/apache/lucene/xmlparser/package-summary.html missing: FilterBuilder missing: CorePlusExtensionsParser missing: DOMUtils missing: FilterBuilderFactory missing: QueryBuilderFactory missing: ParserException /lucene/3x/lucene/build/docs/api/contrib-xml-query-parser/org/apache/lucene/xmlparser/builders/package-summary.html missing: SpanQueryBuilder missing: BooleanFilterBuilder missing: BooleanQueryBuilder missing: BoostingQueryBuilder missing: BoostingTermBuilder missing: ConstantScoreQueryBuilder missing: DuplicateFilterBuilder missing: FilteredQueryBuilder missing: FuzzyLikeThisQueryBuilder missing: LikeThisQueryBuilder missing: MatchAllDocsQueryBuilder missing: RangeFilterBuilder missing: SpanBuilderBase missing: SpanFirstBuilder missing: SpanNearBuilder missing: SpanNotBuilder missing: SpanOrBuilder missing: SpanOrTermsBuilder missing: SpanQueryBuilderFactory missing: SpanTermBuilder
[jira] [Commented] (LUCENE-3887) 'ant javadocs' should fail if a package is missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235635#comment-13235635 ] Michael McCandless commented on LUCENE-3887: OK I committed the basic checking for smoke tester... I'll leave this open for having ant javadocs fail when things are missing... 'ant javadocs' should fail if a package is missing a package.html - Key: LUCENE-3887 URL: https://issues.apache.org/jira/browse/LUCENE-3887 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir Attachments: LUCENE-3887.patch, LUCENE-3887.patch While reviewing the javadocs I noticed many packages are missing a basic package.html. For 3.x I committed some package.html files where they were missing (I will port forward to trunk). I think all packages should have this... really all public/protected classes/methods/constants, but this would be a good step. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3898) possible SynonymFilter bug: hudson fail
[ https://issues.apache.org/jira/browse/LUCENE-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234263#comment-13234263 ] Michael McCandless commented on LUCENE-3898: I can't provoke this failure yet... (just beasting the test). possible SynonymFilter bug: hudson fail --- Key: LUCENE-3898 URL: https://issues.apache.org/jira/browse/LUCENE-3898 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Assignee: Michael McCandless See https://builds.apache.org/job/Lucene-trunk/1867/consoleText (no seed) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3896) CharTokenizer has bugs for large documents.
[ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234397#comment-13234397 ] Michael McCandless commented on LUCENE-3896: Thanks Rob! CharTokenizer has bugs for large documents. --- Key: LUCENE-3896 URL: https://issues.apache.org/jira/browse/LUCENE-3896 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Priority: Blocker Fix For: 3.6, 4.0 Attachments: LUCENE-3896.patch, LUCENE-3896.patch, LUCENE-3896.patch Initially found by hudson from additional testing added in LUCENE-3894, but currently not reproducable (see LUCENE-3895). But its easy to reproduce for a simple single-threaded case in TestDuelingAnalyzers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3899) Evil up MockDirectoryWrapper.checkIndexOnClose
[ https://issues.apache.org/jira/browse/LUCENE-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234400#comment-13234400 ] Michael McCandless commented on LUCENE-3899: +1 More evilness! Evil up MockDirectoryWrapper.checkIndexOnClose -- Key: LUCENE-3899 URL: https://issues.apache.org/jira/browse/LUCENE-3899 Project: Lucene - Java Issue Type: Test Reporter: Robert Muir Fix For: 3.6, 4.0 Attachments: LUCENE-3899.patch MockDirectoryWrapper checks any indexes tests create on close(), if they exist. The problem is the logic it uses to determine if an index exists could mask real bugs (e.g. segments file corruption): {code} if (DirectoryReader.indexExists(this)) { ... // evil stuff like crash() ... _TestUtil.checkIndex(this); } {code} and for reference DirectoryReader.indexExists is: {code} try { new SegmentInfos().read(directory); return true; } catch (IOException ioe) { return false; } {code} So if there are segments file problems, we just silently do no checkIndex. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3778) Create a grouping convenience class
[ https://issues.apache.org/jira/browse/LUCENE-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234445#comment-13234445 ] Michael McCandless commented on LUCENE-3778: +1 Create a grouping convenience class --- Key: LUCENE-3778 URL: https://issues.apache.org/jira/browse/LUCENE-3778 Project: Lucene - Java Issue Type: Improvement Components: modules/grouping Reporter: Martijn van Groningen Fix For: 4.0 Attachments: LUCENE-3778.patch, LUCENE-3778.patch, LUCENE-3778.patch, LUCENE-3778.patch Currently the grouping module has many collector classes with a lot of different options per class. I think it would be a good idea to have a GroupUtil (Or another name?) convenience class. I think this could be a builder, because of the many options (sort,sortWithinGroup,groupOffset,groupCount and more) and implementations (term/dv/function) grouping has. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3900) Make BaseTokenStreamTestCase.checkRandomData more debuggable
[ https://issues.apache.org/jira/browse/LUCENE-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234446#comment-13234446 ] Michael McCandless commented on LUCENE-3900: +1! Make BaseTokenStreamTestCase.checkRandomData more debuggable Key: LUCENE-3900 URL: https://issues.apache.org/jira/browse/LUCENE-3900 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir This thing has gotten meaner recently, but if it fails, it can be tough to debug. I feel like usually we just look at whatever analyzer failed, and completely review the code and look for any smells until it passes :) So I think instead we can possibly make this easier if this does something like: {code} try { ...checks... } catch (Throwable t) { BaseTokenException e = new BaseTokenException(randomInputUsed, randomParameter1, randomParameter2); e.initCause(t); throw e; } {code} Then you could have a useful exception with the input string that caused the fail, information about whether or not charfilter/mockreaderwrapper/whatever were used, etc, as well as the initial problem as root cause. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3897) KuromojiTokenizer fails with large docs
[ https://issues.apache.org/jira/browse/LUCENE-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234455#comment-13234455 ] Michael McCandless commented on LUCENE-3897: I think the problem is when we force a backtrace (if it's >= 1024 chars since the last backtrace)... I think we are not correctly pruning all paths in this case. Unlike the natural backtrace, which happens whenever there is only 1 path (ie the parsing is unambiguous from that point backwards), the forced backtrace may have more than one live path. Have to mull how to fix... KuromojiTokenizer fails with large docs --- Key: LUCENE-3897 URL: https://issues.apache.org/jira/browse/LUCENE-3897 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Fix For: 3.6, 4.0 just shoving largeish random docs triggers asserts like: {noformat} [junit] Caused by: java.lang.AssertionError: backPos=4100 vs lastBackTracePos=5120 [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.backtrace(KuromojiTokenizer.java:907) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.parse(KuromojiTokenizer.java:756) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.incrementToken(KuromojiTokenizer.java:403) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:404) {noformat} But, you get no seed... I'll commit the test case and @Ignore it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2788) Make CharFilter reusable
[ https://issues.apache.org/jira/browse/LUCENE-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234556#comment-13234556 ] Michael McCandless commented on LUCENE-2788: +1 I really like the approach here (just using FilterReader instead of our own new class). Since the back-compat is going to be tricky... maybe we should first commit this patch to trunk? Make CharFilter reusable Key: LUCENE-2788 URL: https://issues.apache.org/jira/browse/LUCENE-2788 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Robert Muir Priority: Minor Attachments: LUCENE-2788.patch The CharFilter API lets you wrap a Reader, altering the contents before the Tokenizer sees them. It also allows you to correct the offsets so this is transparent to highlighting. One problem is that the API isn't reusable; if you have a lot of short documents it's going to be inefficient. Additionally there is some unnecessary wrapping in Tokenizer (see the CharReader.get in the ctor, but *not* in reset(Reader)!!!) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
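To illustrate why FilterReader is a natural fit (a minimal stand-in, not the actual patch): a char filter is just a Reader wrapping another Reader, so once reuse works at the Reader level no extra machinery is needed. Offset correction, which real CharFilters also do, is omitted here:

{code}
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Minimal char-filter-like Reader: rewrites chars before the Tokenizer
// sees them.  A real CharFilter would also implement offset correction.
class UpperCasingReader extends FilterReader {
  UpperCasingReader(Reader in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int c = in.read();
    return c == -1 ? -1 : Character.toUpperCase(c);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = in.read(cbuf, off, len);
    for (int i = 0; i < n; i++) {
      cbuf[off + i] = Character.toUpperCase(cbuf[off + i]);
    }
    return n;
  }
}
{code}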
[jira] [Commented] (LUCENE-3893) TermsFilter should use AutomatonQuery
[ https://issues.apache.org/jira/browse/LUCENE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233522#comment-13233522 ] Michael McCandless commented on LUCENE-3893: LUCENE-3832 should also be done for this... TermsFilter should use AutomatonQuery - Key: LUCENE-3893 URL: https://issues.apache.org/jira/browse/LUCENE-3893 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 I think we could see perf gains if TermsFilter sorted the terms, built a minimal automaton, and used TermsEnum.intersect to visit the terms... This idea came up on the dev list recently. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
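The shape of the idea, hedged (the names approximate the trunk automaton APIs of the time, and filterTerms/field/reader are assumed inputs): sort the terms, build a minimal automaton accepting exactly that set, and let the codec drive the enum so non-matching terms are skipped:

{code}
// Sketch: intersect a sorted term set with the terms dictionary.
List<BytesRef> sorted = new ArrayList<BytesRef>(filterTerms);
Collections.sort(sorted);                              // makeStringUnion wants sorted input
Automaton a = BasicAutomata.makeStringUnion(sorted);   // minimal automaton over the set
CompiledAutomaton compiled = new CompiledAutomaton(a);
TermsEnum te = compiled.getTermsEnum(reader.terms(field));
BytesRef term;
while ((term = te.next()) != null) {
  // visit only the matching terms; collect their docs into the filter's bits
}
{code}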
[jira] [Commented] (LUCENE-3887) 'ant javadocs' should fail if a package is missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233544#comment-13233544 ] Michael McCandless commented on LUCENE-3887: +1 It shouldn't be the RM who must do this on release... 'ant javadocs' should fail if a package is missing a package.html - Key: LUCENE-3887 URL: https://issues.apache.org/jira/browse/LUCENE-3887 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir While reviewing the javadocs I noticed many packages are missing a basic package.html. For 3.x I committed some package.html files where they were missing (I will port forward to trunk). I think all packages should have this... really all public/protected classes/methods/constants, but this would be a good step. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3889) Remove/Uncommit SegmentingTokenizerBase
[ https://issues.apache.org/jira/browse/LUCENE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233545#comment-13233545 ] Michael McCandless commented on LUCENE-3889: +1 Remove/Uncommit SegmentingTokenizerBase --- Key: LUCENE-3889 URL: https://issues.apache.org/jira/browse/LUCENE-3889 Project: Lucene - Java Issue Type: Task Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3889.patch I added this class in LUCENE-3305 to support analyzers like Kuromoji, but Kuromoji no longer needs it as of LUCENE-3767. So now nothing uses it. I think we should uncommit before releasing, svn doesn't forget so we can add this back if we want to refactor something like Thai or Smartcn to use it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil
[ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233886#comment-13233886 ] Michael McCandless commented on LUCENE-3894: I think that new read method needs to use the incoming offset (ie, pass location + offset, not location, as 2nd arg to input.read)? Does testHugeDoc then pass? Make BaseTokenStreamTestCase a bit more evil Key: LUCENE-3894 URL: https://issues.apache.org/jira/browse/LUCENE-3894 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3894.patch, LUCENE-3894.patch, LUCENE-3894.patch Throw an exception from the Reader while tokenizing, stop after not consuming all tokens, sometimes spoon-feed chars from the reader... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
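The offset bug reads clearly in a tiny spoon-feeding Reader (a simplified stand-in for the test wrapper, not the actual patch): the second argument when delegating to the wrapped read must honor the caller's offset:

{code}
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Spoon-feeds at most one char per read call, as the evil tests do.
class SpoonFeedingReader extends FilterReader {
  SpoonFeedingReader(Reader in) {
    super(in);
  }

  @Override
  public int read(char[] cbuf, int offset, int len) throws IOException {
    // The fix under discussion: write into cbuf at the caller's offset,
    // not at some internal location, when delegating.
    return in.read(cbuf, offset, Math.min(1, len));
  }
}
{code}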
[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil
[ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233945#comment-13233945 ] Michael McCandless commented on LUCENE-3894: Thanks Rob! Make BaseTokenStreamTestCase a bit more evil Key: LUCENE-3894 URL: https://issues.apache.org/jira/browse/LUCENE-3894 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3894.patch, LUCENE-3894.patch, LUCENE-3894.patch Throw an exception from the Reader while tokenizing, stop after not consuming all tokens, sometimes spoon-feed chars from the reader... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3076) Solr should support block joins
[ https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232938#comment-13232938 ] Michael McCandless commented on SOLR-3076: -- Hi Mikhail, I've committed fixes for the filtering issues you found in ToChildBJQ, I think...? Are you still seeing issues? I'm unsure of the QP syntax for BJQ... Solr should support block joins --- Key: SOLR-3076 URL: https://issues.apache.org/jira/browse/SOLR-3076 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Attachments: SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, child-bjqparser.patch, parent-bjq-qparser.patch, parent-bjq-qparser.patch, solrconf-bjq-erschema-snippet.xml, tochild-bjq-filtered-search-fix.patch Lucene has the ability to do block joins, we should add it to Solr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232272#comment-13232272 ] Michael McCandless commented on LUCENE-3738: bq. In my opinion, we should unroll all readVInt/readVLong loops so all behave 100% identical! +1 Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232336#comment-13232336 ] Michael McCandless commented on LUCENE-3738: +1 Looks awesome Uwe! Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231962#comment-13231962 ] Michael McCandless commented on LUCENE-3738: bq. The check is only omitted in the unrolled loop, the for-loop still contains the check. I'm confused... I don't see how/where BufferedIndexInput.readVLong is checking for negative result now...? Are you proposing adding an if into that method? That's what I don't want to do... eg, readVLong is called 3 times per term we decode (Lucene40 codec); it's a very low level API... other codecs may very well call it more often. I don't think we should add an if inside BII.readVLong. Or maybe you are saying you just want the unrolled code to handle the negative vLong case (ie, unroll the currently missing 10th cycle), and not add an if to BufferedIndexInput.readVLong? And then for free we can add a real if (not assert) if that 10th cycle is hit? (ie, if we get to that 10th byte, throw an exception). I think that makes sense! bq. there are other asserts in the index reading code at places completely outside any loops, executed only once when the index is opened. +1 to make those real checks, as long as the cost is vanishingly small. bq. which is also a security issue when you e.g. download indexes through network connections and a man in the middle modifies the stream. I don't think it's our job to protect against / detect that. bq. Disk IO can produce wrong data. True, but all bets are off if that happens: you're gonna get all sorts of crazy exceptions out of Lucene. We are not a filesystem. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
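For reference, the loop form under discussion (matching DataInput's readVLong of the time to the best of my knowledge, minus the assert): 7 payload bits per byte, high bit as continuation. Nine bytes cover 63 bits; only a negative long would need a 10th byte, which is where the proposed real check can live:

{code}
// Loop form of readVLong; the unrolled version can either add the missing
// 10th step or turn reaching that 10th byte into a real exception for free.
public long readVLong() throws IOException {
  byte b = readByte();
  long i = b & 0x7FL;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7FL) << shift;
  }
  return i;
}
{code}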
[jira] [Commented] (LUCENE-3870) VarDerefBytesImpl doc values prefix length may fall across two pages
[ https://issues.apache.org/jira/browse/LUCENE-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231033#comment-13231033 ] Michael McCandless commented on LUCENE-3870: +1, looks good Simon! Just remember to remove that sop... VarDerefBytesImpl doc values prefix length may fall across two pages Key: LUCENE-3870 URL: https://issues.apache.org/jira/browse/LUCENE-3870 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3870.patch, LUCENE-3870.patch The VarDerefBytesImpl doc values impl encodes the unique byte[] with a prefix (1 or 2 bytes) first, followed by the bytes, so that it can use PagedBytes.fillSliceWithPrefix. It does this itself rather than using PagedBytes.copyUsingLengthPrefix... The problem is, it can write an invalid 2-byte prefix spanning two blocks (ie, last byte of block N and first byte of block N+1), which fillSliceWithPrefix won't decode correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
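The 1-or-2-byte prefix at issue, sketched (paraphrasing the scheme, not the actual code): the high bit of the first byte flags the 2-byte form, which is exactly what breaks when those two bytes land in different pages:

{code}
// Sketch of the length prefix written before each unique byte[]:
static void writeLengthPrefix(byte[] page, int upto, int length) {
  if (length < 128) {
    page[upto] = (byte) length;                  // 1-byte form, high bit clear
  } else {
    page[upto] = (byte) (0x80 | (length >> 8));  // 2-byte form, high bit set
    page[upto + 1] = (byte) length;              // low 8 bits
    // The bug: if only one byte is left in this page, these two bytes
    // straddle a page boundary and fillSliceWithPrefix mis-reads the length.
  }
}
{code}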
[jira] [Commented] (LUCENE-3876) TestIndexWriterExceptions fails (reproducible)
[ https://issues.apache.org/jira/browse/LUCENE-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231035#comment-13231035 ] Michael McCandless commented on LUCENE-3876: Hmm, I think we need a separate check in FreqProxTermsWriterPerField? Ie, that class is private to the indexing chain; it's like a codec, that's used to buffer postings in RAM until we write them to the real codec, and in theory an app could swap in a different indexing chain that didn't steal a bit from the posDelta... TestIndexWriterExceptions fails (reproducible) -- Key: LUCENE-3876 URL: https://issues.apache.org/jira/browse/LUCENE-3876 Project: Lucene - Java Issue Type: Bug Reporter: Dawid Weiss Priority: Minor Fix For: 4.0 {noformat} ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testIllegalPositions -Dtests.seed=-228094d3d2f35cf2:-496e33eec9bbd57c:36a1c54f4e1bb32 -Dargs=-Dfile.encoding=UTF-8 [junit] junit.framework.AssertionFailedError: position=-2 lastPosition=0 [junit] at org.apache.lucene.codecs.lucene40.Lucene40PostingsWriter.addPosition(Lucene40PostingsWriter.java:215) [junit] at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:519) [junit] at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:92) [junit] at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117) [junit] at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53) [junit] at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81) [junit] at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:475) [junit] at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422) [junit] at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:553) [junit] at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2640) [junit] at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2616) [junit] at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:851) [junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:810) [junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:774) [junit] at org.apache.lucene.index.TestIndexWriterExceptions.testIllegalPositions(TestIndexWriterExceptions.java:1517) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] at java.lang.reflect.Method.invoke(Method.java:597) [junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) [junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) [junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) [junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) [junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:729) [junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:645) [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:22) [junit] at 
org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:556) [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:51) [junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:618) [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18) [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) [junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:164) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) [junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) [junit] at
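The separate check Mike suggests would live in the in-RAM buffering layer, so a broken analyzer fails immediately instead of producing the flush-time assertion above. A minimal, self-contained sketch of that idea, with hypothetical names rather than the actual FreqProxTermsWriterPerField internals:
{code:java}
// Illustrative sketch only: validate positions while buffering postings in RAM,
// before they ever reach the real codec (Lucene40PostingsWriter above).
final class BufferedPostingsSketch {
  private int lastPosition;

  void addPosition(int positionIncrement) {
    int position = lastPosition + positionIncrement;  // may overflow negative
    if (positionIncrement < 0 || position < 0) {
      throw new IllegalArgumentException(
          "position=" + position + " (increment=" + positionIncrement
          + ", lastPosition=" + lastPosition
          + "): the analyzer produced a broken positionIncrement");
    }
    lastPosition = position;
    // ... buffer the position delta in RAM, as the indexing chain does ...
  }
}
{code}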
[jira] [Commented] (LUCENE-3877) Lucene should not call System.out.println
[ https://issues.apache.org/jira/browse/LUCENE-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231090#comment-13231090 ] Michael McCandless commented on LUCENE-3877: I think it's fine if tests write to the std streams, but not core Lucene code (lucene/core/src/java/*)? Lucene should not call System.out.println - Key: LUCENE-3877 URL: https://issues.apache.org/jira/browse/LUCENE-3877 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 We seem to have accumulated a few random sops... Eg, PairOutputs.java (oal.util.fst) and MultiDocValues.java, at least. Can we somehow detect (eg, have a test failure) if we accidentally leave errant System.out.println's (leftover from debugging)...? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
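One possible way to "somehow detect" this, sketched as a standalone scanner over lucene/core/src/java (the path and the patterns are assumptions; this is not an existing build rule):
{code:java}
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Walks the core source tree and reports stray std-stream calls. As a
// command-line tool it deliberately prints its findings to stdout.
public class FindStraySops {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get("lucene/core/src/java");
    try (Stream<Path> files = Files.walk(root)) {
      files.filter(p -> p.toString().endsWith(".java"))
           .forEach(FindStraySops::check);
    }
  }

  static void check(Path file) {
    try {
      int lineNo = 0;
      for (String line : Files.readAllLines(file)) {
        lineNo++;
        if (line.contains("System.out.print") || line.contains("System.err.print")) {
          System.out.println(file + ":" + lineNo + ": " + line.trim());
        }
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
{code}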
[jira] [Commented] (LUCENE-3876) TestIndexWriterExceptions fails (reproducible)
[ https://issues.apache.org/jira/browse/LUCENE-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231107#comment-13231107 ] Michael McCandless commented on LUCENE-3876: +1 TestIndexWriterExceptions fails (reproducible) -- Key: LUCENE-3876 URL: https://issues.apache.org/jira/browse/LUCENE-3876 Project: Lucene - Java Issue Type: Bug Reporter: Dawid Weiss Priority: Minor Fix For: 3.6, 4.0 Attachments: LUCENE-3876.patch, LUCENE-3876_test.patch
{noformat}
ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testIllegalPositions -Dtests.seed=-228094d3d2f35cf2:-496e33eec9bbd57c:36a1c54f4e1bb32 -Dargs=-Dfile.encoding=UTF-8
[junit] junit.framework.AssertionFailedError: position=-2 lastPosition=0
[junit] at org.apache.lucene.codecs.lucene40.Lucene40PostingsWriter.addPosition(Lucene40PostingsWriter.java:215)
[junit] at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:519)
[junit] at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:92)
[junit] at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
[junit] at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
[junit] at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
[junit] at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:475)
[junit] at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
[junit] at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:553)
[junit] at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2640)
[junit] at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2616)
[junit] at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:851)
[junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:810)
[junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:774)
[junit] at org.apache.lucene.index.TestIndexWriterExceptions.testIllegalPositions(TestIndexWriterExceptions.java:1517)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
[junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
[junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
[junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
[junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
[junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
[junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:729)
[junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:645)
[junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:22)
[junit] at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:556)
[junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:51)
[junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:618)
[junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18)
[junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:164)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
[junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
[junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
[junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
[junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
[junit] at
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231295#comment-13231295 ] Michael McCandless commented on LUCENE-3738: Hmm... I think we should think about it more. Ie, we apparently never write a negative vLong today... and I'm not sure we should start allowing it...? Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in place since the beginning, we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
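For context on the "5 bytes" trap: a varint emits 7 payload bits per byte and keeps going while high bits remain set, and a negative number always has its sign bit set, so every group gets written. A small demo of the size logic (simplified from the shape of writeVInt, not the actual Lucene source):
{code:java}
// Counts how many bytes the standard vInt loop would emit.
public class VIntSizeDemo {
  static int vIntByteCount(int i) {
    int count = 1;
    while ((i & ~0x7F) != 0) {  // more than 7 significant bits remain
      i >>>= 7;                 // unsigned shift; a negative int needs all 5 groups
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(vIntByteCount(127));  // 1 byte
    System.out.println(vIntByteCount(128));  // 2 bytes
    System.out.println(vIntByteCount(-1));   // 5 bytes: the trap
    // A negative long balloons the same way: the issue's "even more trappy" 9 bytes.
  }
}
{code}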
[jira] [Commented] (LUCENE-3878) CheckIndex should check deleted documents too
[ https://issues.apache.org/jira/browse/LUCENE-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231296#comment-13231296 ] Michael McCandless commented on LUCENE-3878: +1 CheckIndex should check deleted documents too - Key: LUCENE-3878 URL: https://issues.apache.org/jira/browse/LUCENE-3878 Project: Lucene - Java Issue Type: Task Affects Versions: 4.0 Reporter: Robert Muir Fix For: 4.0 In 4.0 livedocs are passed down to the enums, thus deleted docs are not so special. So I think CheckIndex should not pass the livedocs down to the enums when checking; it should pass livedocs=null and check all the postings. It already does this separately to collect stats, I think, to compare against the term/collection statistics? But we should just clean this up and only use one enum. For example LUCENE-3876 is a case where we were actually making a corrupt index (a position was negative), but because the document in question was deleted, CheckIndex didn't detect this. This could have caused problems if someone just passed null for livedocs (maybe they are doing something where it's not so important to take deletions into account) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
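The "only use one enum" idea, sketched against the Lucene 4.x flex API (Fields/Terms/TermsEnum/DocsEnum); treat this as illustrative, not as CheckIndex's actual code:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;

final class CheckAllPostingsSketch {
  static void checkPostings(AtomicReader reader) throws IOException {
    Fields fields = reader.fields();
    for (String field : fields) {
      TermsEnum termsEnum = fields.terms(field).iterator(null);
      while (termsEnum.next() != null) {
        // Pass liveDocs=null so postings of deleted docs are verified too.
        DocsEnum docs = termsEnum.docs(null, null);
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          // ... verify docID/freq (and positions, via docsAndPositions) here ...
        }
      }
    }
  }
}
{code}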
[jira] [Commented] (LUCENE-3877) Lucene should not call System.out.println
[ https://issues.apache.org/jira/browse/LUCENE-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231307#comment-13231307 ] Michael McCandless commented on LUCENE-3877: I removed the std prints in lucene/core/src/java that I could find on quick grepping. I'll leave this open so we can somehow automatically catch this... Lucene should not call System.out.println - Key: LUCENE-3877 URL: https://issues.apache.org/jira/browse/LUCENE-3877 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 We seem to have accumulated a few random sops... Eg, PairOutputs.java (oal.util.fst) and MultiDocValues.java, at least. Can we somehow detect (eg, have a test failure) if we accidentally leave errant System.out.println's (leftover from debugging)...? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231309#comment-13231309 ] Michael McCandless commented on LUCENE-3738: {quote} I don't see how we can avoid negative vInts. I think it's ok to be inconsistent with vLong, but it should not be something we assert only at read-time. It should be asserted on write so that problems are found immediately. {quote} +1 I think we are stuck with negative vInts, as trappy as they are (5 bytes!!). Let's not make it worse by allowing negative vLongs. But let's assert that at write time (and read time)... I think inconsistency here is the lesser evil. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in place since the beginning, we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
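The write-time assert being agreed on here, in a simplified standalone form (modeled on the shape of DataOutput.writeVLong, not the actual source):
{code:java}
import java.io.ByteArrayOutputStream;

public class VarLongWriterSketch {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  public void writeVLong(long i) {
    // Catch the problem immediately at write time, per the quoted comment:
    assert i >= 0 : "negative vLongs are not supported: " + i;
    while ((i & ~0x7FL) != 0L) {
      out.write((byte) ((i & 0x7FL) | 0x80L));  // 7 payload bits + continuation bit
      i >>>= 7;
    }
    out.write((byte) i);
  }
}
{code}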
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231470#comment-13231470 ] Michael McCandless commented on LUCENE-3738: bq. If we disallow, it should be a hard check (no assert), as the data is coming from a file (and somebody could use a hex editor). The reader will crash later... Hmm, I don't think we should do that. If you go and edit your index with a hex editor... there are no guarantees on what may ensue! bq. Mike: If you fix the unrolled loops, please also add the checks to the other implementations in Buffered* and so on. I don't think the unrolled loops or other impls of write/readVLong are wrong? The javadocs state clearly that negatives are not supported. All we're doing here is adding an assert to back up that javadoc statement. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in place since the beginning, we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3872) Index changes are lost if you call prepareCommit() then close()
[ https://issues.apache.org/jira/browse/LUCENE-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230307#comment-13230307 ] Michael McCandless commented on LUCENE-3872: Well, we could also easily allow skipping the call to commit... in this case IW.close would detect the missing call to commit, call commit to finish the prepared commit, and then commit again to save any changes done after the prepareCommit and before close. Index changes are lost if you call prepareCommit() then close() --- Key: LUCENE-3872 URL: https://issues.apache.org/jira/browse/LUCENE-3872 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3872.patch, LUCENE-3872.patch You are supposed to call commit() after calling prepareCommit(), but... if you forget, and call close() after prepareCommit() without calling commit(), then any changes done after the prepareCommit() are silently lost (including adding/deleting docs, but also any completed merges). Spinoff from java-user thread "lots of .cfs (compound files) in the index directory" from Tim Bogaert. I think to fix this, IW.close should throw an IllegalStateException if prepareCommit() was called with no matching call to commit(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
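For reference, the two-phase contract under discussion, shown as plain usage of the public IndexWriter API (illustrative; error handling simplified):
{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;

final class TwoPhaseCommitSketch {
  static void commitAndClose(IndexWriter writer) throws IOException {
    writer.prepareCommit();  // phase 1: write, but do not yet publish, the commit
    try {
      // ... coordinate with any other transactional resources here ...
      writer.commit();       // phase 2: publish the prepared commit
    } catch (Throwable t) {
      writer.rollback();     // abandon the prepared commit (this also closes the writer)
      throw t;
    }
    // Safe: the prepared commit was completed, so close() cannot silently drop
    // it -- the trap this issue describes is skipping the commit() call above.
    writer.close();
  }
}
{code}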
[jira] [Commented] (LUCENE-3848) BaseTokenStreamTestCase should fail if TokenStream starts with posinc=0
[ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230341#comment-13230341 ] Michael McCandless commented on LUCENE-3848: +1 BaseTokenStreamTestCase should fail if TokenStream starts with posinc=0 --- Key: LUCENE-3848 URL: https://issues.apache.org/jira/browse/LUCENE-3848 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3848-MockGraphTokenFilter.patch, LUCENE-3848.patch, LUCENE-3848.patch It is meaningless for a tokenstream to start with posinc=0. It's also caused problems and hairiness in the indexer (LUCENE-1255, LUCENE-1542), and it makes senseless tokenstreams. We should add a check and fix any that do this. Furthermore the same bug can exist in removing-filters if they have enablePositionIncrements=false. I think this option is useful: but it shouldn't mean 'allow broken tokenstream', it just means we don't add gaps. If you remove tokens with enablePositionIncrements=false it should not cause the TS to start with positionIncrement=0, and it shouldn't 'restructure' the tokenstream (e.g. moving synonyms on top of a different word). It should just not add any 'holes'. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
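The proposed check, sketched in the style of a test helper (not the actual BaseTokenStreamTestCase code):
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class FirstPosIncCheck {
  // Fails if the very first token claims posinc=0: there is no previous
  // position for it to stack onto, so a leading 0 is meaningless.
  static void assertFirstPositionIncrement(TokenStream ts) throws IOException {
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    if (ts.incrementToken() && posIncAtt.getPositionIncrement() < 1) {
      throw new AssertionError("first token must have positionIncrement >= 1, got "
          + posIncAtt.getPositionIncrement());
    }
    ts.end();
    ts.close();
  }
}
{code}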
[jira] [Commented] (LUCENE-3874) bogus positions create a corrupt index
[ https://issues.apache.org/jira/browse/LUCENE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230402#comment-13230402 ] Michael McCandless commented on LUCENE-3874: +1 Crazy we don't catch this already... bogus positions create a corrupt index --- Key: LUCENE-3874 URL: https://issues.apache.org/jira/browse/LUCENE-3874 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3874.patch, LUCENE-3874_test.patch It's pretty common for positionIncrement to overflow; this happens really easily if people write analyzers that don't clearAttributes(). It used to be the case (and perhaps still is in 3.x, I didn't check) that IW would throw an exception if this happened. But I couldn't find the code checking this; I wrote a test and it makes a corrupt index... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
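How the overflow typically arises, as a deliberately broken example (hypothetical, not the test from this issue): without clearAttributes(), attribute state leaks from one token to the next, so a large increment accumulates until the position wraps negative, much like the position=-2 in the LUCENE-3876 trace above:
{code:java}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class BrokenTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int count;

  @Override
  public boolean incrementToken() {
    if (count++ >= 2) return false;
    // BUG: no clearAttributes() call here, so the huge increment set for the
    // first token is still in effect for the second; the accumulated position
    // (Integer.MAX_VALUE + Integer.MAX_VALUE) overflows int and goes negative.
    termAtt.setEmpty().append("tok").append(Integer.toString(count));
    if (count == 1) {
      posIncAtt.setPositionIncrement(Integer.MAX_VALUE);
    }
    return true;
  }
}
{code}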