[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 118 - Failure

2011-08-02 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/118/

1 tests failed.
REGRESSION:  
org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest.testMergeIndexesByCoreName

Error Message:
No such core: core1

Stack Trace:
org.apache.solr.common.SolrException: No such core: core1
	at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:104)
	at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.setupCores(MergeIndexesExampleTestBase.java:90)
	at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.testMergeIndexesByCoreName(MergeIndexesExampleTestBase.java:145)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)




Build Log (for compile errors):
[...truncated 12215 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-02 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076069#comment-13076069
 ] 

Christian Moen commented on LUCENE-3305:


Hello again, Simon.  I've filed the paperwork and copied you on email.  Hope 
you're enjoying your vacation!

 Kuromoji code donation - a new Japanese morphological analyzer
 --

 Key: LUCENE-3305
 URL: https://issues.apache.org/jira/browse/LUCENE-3305
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Christian Moen
Assignee: Simon Willnauer
 Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
 kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
 kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz


 Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
 morphological analyzer to the Apache Software Foundation in the hope that it 
 will be useful to Lucene and Solr users in Japan and elsewhere.
 The project was started in 2010 because we couldn't find any high-quality, 
 actively maintained and easy-to-use Java-based Japanese morphological 
 analyzers, and these became the main design goals for Kuromoji.
 Kuromoji also has a segmentation mode that is particularly useful for search, 
 which we hope will interest Lucene and Solr users.  Compound-nouns, such as 
 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
 segmented as one token with most analyzers.  As a result, a search for 空港 
 (airport) or 新聞 (newspaper) will not give you a hit in these words.  Kuromoji 
 can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what 
 you would want for search, and you'll get a hit.
 We also wanted to make sure the technology has a license that makes it 
 compatible with other Apache Software Foundation software to maximize its 
 usefulness.  Kuromoji has an Apache License 2.0 and all code is currently 
 owned by Atilika Inc.  The software has been developed by my good friend and 
 ex-colleague Masaru Hasegawa and myself.
 Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
 its license terms are described in NOTICE.txt.
 I'll upload code distributions and their corresponding hashes and I'd very 
 much like to start the code grant process.  I'm also happy to provide patches 
 to integrate Kuromoji into the codebase, if you prefer that.
 Please advise on how you'd like me to proceed with this.  Thank you.
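The recall point above can be shown with a toy illustration; plain Java sets stand in for an inverted index here, and this is not the Kuromoji API:

```java
import java.util.Set;

public class CompoundRecall {
    public static void main(String[] args) {
        // One-token indexing: the compound 関西国際空港 is a single term.
        Set<String> compoundIndex = Set.of("関西国際空港");
        // Search-mode segmentation: the compound is split into its parts.
        Set<String> searchModeIndex = Set.of("関西", "国際", "空港");

        System.out.println(compoundIndex.contains("空港"));   // false: no hit for "airport"
        System.out.println(searchModeIndex.contains("空港")); // true: hit
    }
}
```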

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (LUCENE-3343) Comparison operators <, <=, >, >= and == support as RangeQuery syntax in QueryParser

2011-08-02 Thread Olivier Favre (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076091#comment-13076091
 ] 

Olivier Favre commented on LUCENE-3343:
---

Great, thanks!
No blockers for 3x?

 Comparison operators <, <=, >, >= and == support as RangeQuery syntax in 
 QueryParser
 

 Key: LUCENE-3343
 URL: https://issues.apache.org/jira/browse/LUCENE-3343
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/queryparser
Reporter: Olivier Favre
Assignee: Adriano Crestani
Priority: Minor
  Labels: parser, query
 Fix For: 3.4, 4.0

 Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 To offer better interoperability with other search engines and to provide an 
 easier and more straightforward syntax,
 the operators <, <=, >, >= and == should be available to express an open range 
 query.
 They should at least work for numeric queries.
 '==' can be made a synonym for ':'.
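As a sketch of the proposal, each operator maps naturally onto classic range-query syntax. The mapping below is an assumption based on that syntax (curly braces exclude the endpoint, square brackets include it), not the attached patch:

```java
public class CompOps {
    // Map a comparison operator to the equivalent classic range-query syntax.
    static String toRange(String op, String value) {
        switch (op) {
            case "<":  return "{* TO " + value + "}";
            case "<=": return "[* TO " + value + "]";
            case ">":  return "{" + value + " TO *}";
            case ">=": return "[" + value + " TO *]";
            case "==": return value;                  // plain field:value match
            default:   throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println("price:" + toRange(">=", "10")); // price:[10 TO *]
    }
}
```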




[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities

2011-08-02 Thread David Mark Nemeskey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mark Nemeskey updated LUCENE-3220:


Attachment: LUCENE-3220.patch

Added norm decoding table to EasySimilarity, and removed sumTotalFreq. Sorry I 
could only upload this patch now, but I didn't have time to work on Lucene in 
the last week.

As far as I can see, all the problems you mentioned have been corrected, so 
maybe we can go on with the review?

 Implement various ranking models as Similarities
 

 Key: LUCENE-3220
 URL: https://issues.apache.org/jira/browse/LUCENE-3220
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/search
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
  Labels: gsoc
 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we 
 can finally work on implementing the standard ranking models. Currently DFR, 
 BM25 and LM are on the menu.
 Done:
  * {{EasyStats}}: contains all statistics that might be relevant for a 
 ranking algorithm
  * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the 
 DocScorers and as much implementation detail as possible
  * _BM25_: the current mock implementation might be OK
  * _LM_
  * _DFR_
  * The so-called _Information-Based Models_




[jira] [Created] (LUCENE-3357) Unit and integration test cases for the new Similarities

2011-08-02 Thread David Mark Nemeskey (JIRA)
Unit and integration test cases for the new Similarities


 Key: LUCENE-3357
 URL: https://issues.apache.org/jira/browse/LUCENE-3357
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/query/scoring
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
Priority: Minor
 Fix For: flexscoring branch


Write test cases to test the new Similarities added in 
[LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of 
test cases will be created:
 * unit tests, in which mock statistics are provided to the Similarities and 
the score is validated against hand calculations;
 * integration tests, in which a small collection is indexed and then searched 
using the Similarities.

Performance tests will be performed in a separate issue.
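The unit-test style described above can be sketched as follows. The scoring function here is only a stand-in (roughly Lucene's classic tf/idf shape), not one of the new Similarities; the point is feeding fixed mock statistics and checking against a hand calculation:

```java
public class SimilarityUnitTestSketch {
    // Stand-in scoring function fed with mock statistics instead of an index.
    static double score(int docFreq, int numDocs, int tf) {
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        return Math.sqrt(tf) * idf * idf;
    }

    public static void main(String[] args) {
        // Hand calculation: docFreq=1, numDocs=4, tf=4
        //   idf   = ln(4/2) + 1 = ln 2 + 1
        //   score = sqrt(4) * idf^2 = 2 * (ln 2 + 1)^2  (about 5.73)
        System.out.println(score(1, 4, 4));
    }
}
```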




[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities

2011-08-02 Thread David Mark Nemeskey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mark Nemeskey updated LUCENE-3220:


Component/s: core/query/scoring
 Labels: gsoc gsoc2011  (was: gsoc)

 Implement various ranking models as Similarities
 

 Key: LUCENE-3220
 URL: https://issues.apache.org/jira/browse/LUCENE-3220
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/query/scoring, core/search
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
  Labels: gsoc, gsoc2011
 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we 
 can finally work on implementing the standard ranking models. Currently DFR, 
 BM25 and LM are on the menu.
 Done:
  * {{EasyStats}}: contains all statistics that might be relevant for a 
 ranking algorithm
  * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the 
 DocScorers and as much implementation detail as possible
  * _BM25_: the current mock implementation might be OK
  * _LM_
  * _DFR_
  * The so-called _Information-Based Models_




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

2011-08-02 Thread David Mark Nemeskey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mark Nemeskey updated LUCENE-3357:


Labels: gsoc gsoc2011 test  (was: gsoc gsoc2011)

 Unit and integration test cases for the new Similarities
 

 Key: LUCENE-3357
 URL: https://issues.apache.org/jira/browse/LUCENE-3357
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/query/scoring
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
Priority: Minor
  Labels: gsoc, gsoc2011, test
 Fix For: flexscoring branch


 Write test cases to test the new Similarities added in 
 [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of 
 test cases will be created:
  * unit tests, in which mock statistics are provided to the Similarities and 
 the score is validated against hand calculations;
  * integration tests, in which a small collection is indexed and then 
 searched using the Similarities.
 Performance tests will be performed in a separate issue.




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

2011-08-02 Thread David Mark Nemeskey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mark Nemeskey updated LUCENE-3357:


Labels: gsoc gsoc2011  (was: )

 Unit and integration test cases for the new Similarities
 

 Key: LUCENE-3357
 URL: https://issues.apache.org/jira/browse/LUCENE-3357
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/query/scoring
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
Priority: Minor
  Labels: gsoc, gsoc2011, test
 Fix For: flexscoring branch


 Write test cases to test the new Similarities added in 
 [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of 
 test cases will be created:
  * unit tests, in which mock statistics are provided to the Similarities and 
 the score is validated against hand calculations;
  * integration tests, in which a small collection is indexed and then 
 searched using the Similarities.
 Performance tests will be performed in a separate issue.




[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv

2011-08-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076132#comment-13076132
 ] 

Uwe Schindler commented on LUCENE-3335:
---

@Shay: Sorry, I did not want to be too italian :-) I just wanted to ensure that 
such configurations, which lead to bugs in JVMs, get reported to us. It would 
also help us respond more quickly to such bug reports, like the one we already 
got 2 months ago (which nobody was able to reproduce, as we did not know that 
the user had used aggressive opts).

 jrebug causes porter stemmer to sigsegv
 ---

 Key: LUCENE-3335
 URL: https://issues.apache.org/jira/browse/LUCENE-3335
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 
 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, 
 3.3, 3.4, 4.0
 Environment: - JDK 7 Preview Release, GA (may also affect update _1, 
 targeted fix is JDK 1.7.0_2)
 - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts
Reporter: Robert Muir
Assignee: Robert Muir
  Labels: Java7
 Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, 
 patch-0uwe.patch


 happens easily on java7: ant test -Dtestcase=TestPorterStemFilter 
 -Dtests.iter=100
 might happen on 1.6.0_u26 too, a user reported something that looks like the 
 same bug already:
 http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm




[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076156#comment-13076156
 ] 

Robert Muir commented on LUCENE-3335:
-

I don't think there is any sense in this, who cares?

We reported this crash to Oracle in plenty of time, and the *worse* 
wrong-results bug has been open since May 13: 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738, but Oracle decided 
not to fix that either.


 jrebug causes porter stemmer to sigsegv
 ---

 Key: LUCENE-3335
 URL: https://issues.apache.org/jira/browse/LUCENE-3335
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 
 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, 
 3.3, 3.4, 4.0
 Environment: - JDK 7 Preview Release, GA (may also affect update _1, 
 targeted fix is JDK 1.7.0_2)
 - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts
Reporter: Robert Muir
Assignee: Robert Muir
  Labels: Java7
 Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, 
 patch-0uwe.patch


 happens easily on java7: ant test -Dtestcase=TestPorterStemFilter 
 -Dtests.iter=100
 might happen on 1.6.0_u26 too, a user reported something that looks like the 
 same bug already:
 http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm




[jira] [Commented] (LUCENE-3220) Implement various ranking models as Similarities

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076171#comment-13076171
 ] 

Robert Muir commented on LUCENE-3220:
-

Hi David, I was thinking that for the norm, we could store it like 
DefaultSimilarity does. This would make it especially convenient, as you could 
easily use these similarities with the exact same index as one using Lucene's 
default scoring. Also I think (not sure!) that by using 1/sqrt we will get 
better quantization from SmallFloat?

{noformat}
  public byte computeNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
      numTerms = state.getLength() - state.getNumOverlap();
    else
      numTerms = state.getLength();
    return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
  }
{noformat}

For the computations, you have to 'undo' the sqrt() to get the quantized 
length, but that's OK since it's only done up-front a single time and 
tableized, so it won't slow anything down.
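The "undo and tableize" step might look like the sketch below. The byte decoder here is a crude stand-in for Lucene's SmallFloat encoding, so the numbers are illustrative only; the point is that the sqrt is undone 256 times up front, never per document:

```java
public class NormLengthTable {
    // Stand-in decoder: maps an encoded norm byte back to a float in (0, 1].
    static float decodeByte(int b) { return (b & 0xFF) / 255f; }

    static final float[] LENGTH = new float[256];
    static {
        LENGTH[0] = Float.POSITIVE_INFINITY;      // norm 0: length unrecoverable
        for (int i = 1; i < 256; i++) {
            float norm = decodeByte(i);           // norm ~ boost / sqrt(length)
            LENGTH[i] = 1f / (norm * norm);       // undo the sqrt: length = 1 / norm^2
        }
    }

    public static void main(String[] args) {
        System.out.println(LENGTH[255]);          // norm 1.0 -> length 1.0
    }
}
```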


 Implement various ranking models as Similarities
 

 Key: LUCENE-3220
 URL: https://issues.apache.org/jira/browse/LUCENE-3220
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/query/scoring, core/search
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
  Labels: gsoc, gsoc2011
 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we 
 can finally work on implementing the standard ranking models. Currently DFR, 
 BM25 and LM are on the menu.
 Done:
  * {{EasyStats}}: contains all statistics that might be relevant for a 
 ranking algorithm
  * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the 
 DocScorers and as much implementation detail as possible
  * _BM25_: the current mock implementation might be OK
  * _LM_
  * _DFR_
  * The so-called _Information-Based Models_




[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076177#comment-13076177
 ] 

Robert Muir commented on LUCENE-3030:
-

This is awesome, i really like adding the intersect() hook!

Thanks for making a branch, I will check it out and try to dive in to help with 
some of this  :)

One trivial thing we might want to do is to add the logic currently in AQ's 
ctor to CA, so that you ask CA for its TermsEnum. This way, if it can be 
accomplished with a simpler enum, like just terms.iterator() or 
PrefixTermsEnum etc., we get that optimization always.

 Block tree terms dict & index
 -

 Key: LUCENE-3030
 URL: https://issues.apache.org/jira/browse/LUCENE-3030
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
 LUCENE-3030.patch


 Our default terms index today breaks terms into blocks of fixed size
 (ie, every 32 terms is a new block), and then we build an index on top
 of that (holding the start term for each block).
 But, it should be better to instead break terms according to how they
 share prefixes.  This results in variable sized blocks, but means
 within each block we maximize the shared prefix and minimize the
 resulting terms index.  It should also be a speedup for terms dict
 intensive queries because the terms index becomes a true prefix
 trie, and can be used to fast-fail on term lookup (ie returning
 NOT_FOUND without having to seek/scan a terms block).
 Having a true prefix trie should also enable much faster intersection
 with automaton (but this will be a new issue).
 I've made an initial impl for this (called
 BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
 of nocommits, and hairy code, but tests pass (at least once!).
 I made two new codecs, temporarily called StandardTree, PulsingTree,
 that are just like their counterparts but use this new terms dict.
 I added a new exactOnly boolean to TermsEnum.seek.  If that's true
 and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
 enum is unpositioned (ie you should not call next(), docs(), etc.).
 In this approach the index and dict are tightly connected, so it does
 not support a pluggable index impl like BlockTermsWriter/Reader.
 Blocks are stored on certain nodes of the prefix trie, and can contain
 both terms and pointers to sub-blocks (ie, if the block is not a leaf
 block).  So there are two trees, tied to one another -- the index
 trie, and the blocks.  Only certain nodes in the trie map to a block
 in the block tree.
 I think this algorithm is similar to burst tries
 (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
 except it allows terms to be stored on inner blocks (not just leaf
 blocks).  This is important for Lucene because an [accidental]
 adversary could produce a terms dict with way too many blocks (way
 too much RAM used by the terms index).  Still, with my current patch,
 an adversary can produce too-big blocks... which we may need to fix,
 by letting the terms index not be a true prefix trie on it's leaf
 edges.
 Exactly how the blocks are picked can be factored out as its own
 policy (but I haven't done that yet).  Then, burst trie is one policy,
 my current approach is another, etc.  The policy can be tuned to
 the terms' expected distribution, eg if it's a primary key field and
 you only use base 10 for each character then you want block sizes of
 size 10.  This can make a sizable difference on lookup cost.
 I modified the FST Builder to allow for a plugin that freezes the
 tail (changed suffix) of each added term, because I use this to find
 the blocks.
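The variable-sized-block idea can be illustrated with a toy sketch. Grouping here is by a single shared first character, which is far simpler than the real BlockTree policy, but it shows how prefix sharing yields blocks of varying size instead of fixed 32-term blocks:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixBlocksSketch {
    // Group consecutive sorted terms that share their first character into
    // one block, yielding variable-sized blocks.
    static List<List<String>> blocksByFirstChar(List<String> sortedTerms) {
        List<List<String>> blocks = new ArrayList<>();
        for (String term : sortedTerms) {
            List<String> last = blocks.isEmpty() ? null : blocks.get(blocks.size() - 1);
            if (last == null || last.get(0).charAt(0) != term.charAt(0)) {
                last = new ArrayList<>();
                blocks.add(last);
            }
            last.add(term);
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(blocksByFirstChar(List.of("ant", "apple", "bat")));
        // [[ant, apple], [bat]]
    }
}
```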




[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076178#comment-13076178
 ] 

Robert Muir commented on LUCENE-3030:
-

Also, we should measure whether a prefix automaton with intersect() is faster 
than PrefixTermsEnum (I suspect it could be!)

If it is, we might want to stop rewriting to PrefixTermsEnum and instead 
change PrefixQuery to extend AutomatonQuery too.

 Block tree terms dict & index
 -

 Key: LUCENE-3030
 URL: https://issues.apache.org/jira/browse/LUCENE-3030
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
 LUCENE-3030.patch


 Our default terms index today breaks terms into blocks of fixed size
 (ie, every 32 terms is a new block), and then we build an index on top
 of that (holding the start term for each block).
 But, it should be better to instead break terms according to how they
 share prefixes.  This results in variable sized blocks, but means
 within each block we maximize the shared prefix and minimize the
 resulting terms index.  It should also be a speedup for terms dict
 intensive queries because the terms index becomes a true prefix
 trie, and can be used to fast-fail on term lookup (ie returning
 NOT_FOUND without having to seek/scan a terms block).
 Having a true prefix trie should also enable much faster intersection
 with automaton (but this will be a new issue).
 I've made an initial impl for this (called
 BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
 of nocommits, and hairy code, but tests pass (at least once!).
 I made two new codecs, temporarily called StandardTree, PulsingTree,
 that are just like their counterparts but use this new terms dict.
 I added a new exactOnly boolean to TermsEnum.seek.  If that's true
 and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
 enum is unpositioned (ie you should not call next(), docs(), etc.).
 In this approach the index and dict are tightly connected, so it does
 not support a pluggable index impl like BlockTermsWriter/Reader.
 Blocks are stored on certain nodes of the prefix trie, and can contain
 both terms and pointers to sub-blocks (ie, if the block is not a leaf
 block).  So there are two trees, tied to one another -- the index
 trie, and the blocks.  Only certain nodes in the trie map to a block
 in the block tree.
 I think this algorithm is similar to burst tries
 (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
 except it allows terms to be stored on inner blocks (not just leaf
 blocks).  This is important for Lucene because an [accidental]
 adversary could produce a terms dict with way too many blocks (way
 too much RAM used by the terms index).  Still, with my current patch,
 an adversary can produce too-big blocks... which we may need to fix,
 by letting the terms index not be a true prefix trie on it's leaf
 edges.
 Exactly how the blocks are picked can be factored out as its own
 policy (but I haven't done that yet).  Then, burst trie is one policy,
 my current approach is another, etc.  The policy can be tuned to
 the terms' expected distribution, eg if it's a primary key field and
 you only use base 10 for each character then you want block sizes of
 size 10.  This can make a sizable difference on lookup cost.
 I modified the FST Builder to allow for a plugin that freezes the
 tail (changed suffix) of each added term, because I use this to find
 the blocks.




[jira] [Created] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents

2011-08-02 Thread Markus Jelsma (JIRA)
!frange with query($qq) sets score=1.0f for all returned documents
--

 Key: SOLR-2689
 URL: https://issues.apache.org/jira/browse/SOLR-2689
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 3.4
Reporter: Markus Jelsma
 Fix For: 3.4, 4.0


Consider the following queries; both query the default field for 'test' and 
return the document digest and score (I don't seem to be able to get only the 
score; fl=score returns all fields):

This is a normal query and yields normal results with proper scores:

{code}
q=test&fl=digest,score
{code}

{code}
<result name="response" numFound="227763" start="0" maxScore="4.952673">
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
  </doc>
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
  </doc>
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
  </doc>
</result>
{code}

This query uses frange with query() to limit the number of returned documents. 
When using multiple search terms I can indeed cut off the result set, but in 
the end all returned documents have score=1.0f. The final result set cannot be 
sorted by score anymore; it seems to be returned in the order of Lucene docIds.

{code}
q={!frange l=1.23}query($qq)&qq=test&fl=digest,score
{code}

{code}
<result name="response" numFound="227763" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
  </doc>
</result>
{code}




[jira] [Resolved] (SOLR-1032) CSV loader to support literal field values

2011-08-02 Thread Erik Hatcher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Hatcher resolved SOLR-1032.


   Resolution: Fixed
Fix Version/s: 4.0
 Assignee: Erik Hatcher

 CSV loader to support literal field values
 --

 Key: SOLR-1032
 URL: https://issues.apache.org/jira/browse/SOLR-1032
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 1.3
Reporter: Erik Hatcher
Assignee: Erik Hatcher
Priority: Minor
 Fix For: 4.0

 Attachments: SOLR-1032.patch, SOLR-1032.patch


 It would be very handy if the CSV loader could handle a literal field 
 mapping, like the extracting request handler does.  For example, in a 
 scenario where you have multiple datasources (some data from a DB, some from 
 file crawls, and some from CSV) it is nice to add a field to every document 
 that specifies the data source.  This is easily done with DIH with a template 
 transformer, and Solr Cell with ext.literal.datasource=, but impossible 
 currently with CSV.




[jira] [Commented] (SOLR-2525) Date Faceting or Range Faceting with offset doesn't convert timezone

2011-08-02 Thread David (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076294#comment-13076294
 ] 

David commented on SOLR-2525:
-

Timezone needs to be taken into account when doing date math. Currently it 
isn't: DateMathParser instances are told to use UTC. This is a huge issue when 
it comes to faceting. Depending on your timezone, daylight-savings changes the 
length of a month. A facet gap of +1MONTH is different depending on the 
timezone and the time of the year.

I believe the issue is very simple to fix. There are three places in the code 
where DateMathParser is created. All three are configured with the timezone 
being UTC. If a user could specify the TimeZone to pass into DateMathParser, 
this faceting issue would be resolved.
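The DST effect described above can be reproduced with plain java.util.Calendar (an illustrative sketch, not Solr's actual DateMathParser code; the class and method names here are invented):

```java
import java.util.Calendar;
import java.util.TimeZone;

public class GapLength {
    // Hours spanned by a "+1MONTH" gap starting 2011-03-01T00:00 local time in tzId.
    static long monthHours(String tzId) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(tzId));
        cal.clear();
        cal.set(2011, Calendar.MARCH, 1, 0, 0, 0);
        long start = cal.getTimeInMillis();
        cal.add(Calendar.MONTH, 1);  // what a timezone-aware "+1MONTH" would compute
        return (cal.getTimeInMillis() - start) / 3_600_000L;
    }

    public static void main(String[] args) {
        // March 2011: US DST starts Mar 13, so Chicago's "month" is one hour shorter.
        System.out.println("UTC:             " + monthHours("UTC"));              // 744
        System.out.println("America/Chicago: " + monthHours("America/Chicago"));  // 743
    }
}
```

The two timezones produce bucket boundaries one hour apart, which is exactly why documents near a range edge land in the wrong facet bucket.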

 Date Faceting or Range Faceting with offset doesn't convert timezone
 

 Key: SOLR-2525
 URL: https://issues.apache.org/jira/browse/SOLR-2525
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis, search
Affects Versions: 3.1
 Environment: Solr 3.1 
 Windows 2008 RC2 Server 
 Java 6
 Running on Jetty
Reporter: Rohit Gupta
  Labels: date, facet

 I am trying to facet based on date field and apply user timezone offset so 
 that the faceted results are in user timezone. My faceted result is given 
 below,
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">6</int>
     <lst name="params">
       <str name="facet">true</str>
       <str name="q">icici</str>
       <str name="facet.range.start">2011-05-02T00:00:00Z+330MINUTES</str>
       <str name="facet.range">createdOnGMTDate</str>
       <str name="facet.range.end">2011-05-18T00:00:00Z</str>
       <str name="facet.range.gap">+1DAY</str>
     </lst>
   </lst>
   <lst name="facet_counts">
     <lst name="facet_ranges">
       <lst name="createdOnGMTDate">
         <lst name="counts">
           <int name="2011-05-02T05:30:00Z">4</int>
           <int name="2011-05-03T05:30:00Z">63</int>
           <int name="2011-05-04T05:30:00Z">0</int>
           <int name="2011-05-05T05:30:00Z">0</int>
           ..
         </lst>
         <str name="gap">+1DAY</str>
         <date name="start">2011-05-02T05:30:00Z</date>
         <date name="end">2011-05-18T05:30:00Z</date>
       </lst>
     </lst>
   </lst>
 </response>
 Now notice that the response shows 4 records for the 2nd of May 2011, which 
 falls in the IST timezone (+330MINUTES); but when I try to get the results I 
 see there is only 1 result for the 2nd. Why is this happening?
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">5</int>
     <lst name="params">
       <str name="sort">createdOnGMTDate asc</str>
       <str name="fl">createdOnGMT,createdOnGMTDate,twtText</str>
       <str name="fq">createdOnGMTDate:[2011-05-01T00:00:00Z+330MINUTES TO *]</str>
       <str name="q">icici</str>
     </lst>
   </lst>
   <result name="response" numFound="67" start="0">
     <doc>
       <str name="createdOnGMT">Mon, 02 May 2011 16:27:05+</str>
       <date name="createdOnGMTDate">2011-05-02T16:27:05Z</date>
       <str name="twtText">#TechStrat615. Infosys (business soln &amp; IT
         outsourcer) manages damages with new chairman K.Kamath (ex ICICI
         Bank chairman) to begin Aug 21.</str>
     </doc>
     <doc>
       <str name="createdOnGMT">Mon, 02 May 2011 19:00:44+</str>
       <date name="createdOnGMTDate">2011-05-02T19:00:44Z</date>
       <str name="twtText">how to get icici mobile banking</str>
     </doc>
     <doc>
       <str name="createdOnGMT">Tue, 03 May 2011 01:53:05+</str>
       <date name="createdOnGMTDate">2011-05-03T01:53:05Z</date>
       <str name="twtText">ICICI BANK LTD, L. M. MIRAJ branch in SANGLI,
         MAHARASHTRA. IFSC Code: ICIC0006537, MICR Code: ...
         http://bit.ly/fJCuWl #ifsc #micr #bank</str>
     </doc>
     <doc>
       <str name="createdOnGMT">Tue, 03 May 2011 01:53:05+</str>
       <date name="createdOnGMTDate">2011-05-03T01:53:05Z</date>
       <str name="twtText">ICICI BANK LTD, L. M. MIRAJ branch in SANGLI,
         MAHARASHTRA. IFSC Code: ICIC0006537, MICR Code: ...
   

[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2011-08-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076301#comment-13076301
 ] 

Shawn Heisey commented on SOLR-1972:


Hoss, the patch isn't my work, I just modified it to support a 100th percentile 
and reattached it.  I am only just now beginning to learn Java.  Although I 
have some clue what you're saying with static methods, actually doing it 
properly within a larger work like Solr is something I won't be able to do yet.

 Need additional query stats in admin interface - median, 95th and 99th 
 percentile
 -

 Key: SOLR-1972
 URL: https://issues.apache.org/jira/browse/SOLR-1972
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Shawn Heisey
Priority: Minor
 Attachments: SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, 
 SOLR-1972.patch, elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
 elyograg-1972-trunk.patch, elyograg-1972-trunk.patch


 I would like to see more detailed query statistics from the admin GUI.  This 
 is what you can get now:
 requests : 809
 errors : 0
 timeouts : 0
 totalTime : 70053
 avgTimePerRequest : 86.59209
 avgRequestsPerSecond : 0.8148785 
 I'd like to see more data on the time per request - median, 95th percentile, 
 99th percentile, and any other statistical function that makes sense to 
 include.  In my environment, the first bunch of queries after startup tend to 
 take several seconds each.  I find that the average value tends to be useless 
 until it has several thousand queries under its belt and the caches are 
 thoroughly warmed.  The statistical functions I have mentioned would quickly 
 eliminate the influence of those initial slow queries.
 The system will have to store individual data about each query.  I don't know 
 if this is something Solr does already.  It would be nice to have a 
 configurable count of how many of the most recent data points are kept, to 
 control the amount of memory the feature uses.  The default value could be 
 something like 1024 or 4096.
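 The bounded buffer of recent request times proposed above could be sketched 
 like this (hypothetical code; QueryTimeStats and its methods are invented 
 names, not part of Solr):

```java
import java.util.Arrays;

public class QueryTimeStats {
    private final long[] ring;        // most recent N request times, in ms
    private int count = 0, next = 0;  // valid entries and next write slot

    QueryTimeStats(int capacity) { ring = new long[capacity]; }

    // Record one request time; the oldest entry is overwritten once full.
    synchronized void record(long ms) {
        ring[next] = ms;
        next = (next + 1) % ring.length;
        if (count < ring.length) count++;
    }

    // p in (0, 100]: percentile(50) = median, percentile(95), percentile(99).
    synchronized long percentile(double p) {
        long[] sorted = Arrays.copyOf(ring, count);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * count) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        QueryTimeStats s = new QueryTimeStats(1024);
        for (long t = 1; t <= 100; t++) s.record(t);  // request times 1..100 ms
        System.out.println(s.percentile(50));  // 50
        System.out.println(s.percentile(95));  // 95
        System.out.println(s.percentile(99));  // 99
    }
}
```

 Because only the last N samples are kept, the initial slow cache-cold 
 queries age out of the statistics, unlike a running average.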




[jira] [Created] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread David (JIRA)
Date Faceting or Range Faceting doesn't take timezone into account.
---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.3
Reporter: David


Timezone needs to be taken into account when doing date math. Currently it 
isn't. DateMathParser instances created are always being constructed with UTC. 
This is a huge issue when it comes to faceting. Depending on your timezone, 
daylight-savings changes the length of a month. A facet gap of +1MONTH is 
different depending on the timezone and the time of the year.

I believe the issue is very simple to fix. There are three places in the code 
where DateMathParser is created. All three are configured with the timezone 
being UTC. If a user could specify the TimeZone to pass into DateMathParser, 
this faceting issue would be resolved.

Though it would be nice if we could always specify the timezone DateMathParser 
uses (since date math DOES depend on timezone), it's really only essential that 
we can affect the DateMathParser that SimpleFacets uses when dealing with the 
gap of the date facets.

Another solution is to expand the syntax of the expressions DateMathParser 
understands. For example, we could allow (?timeZone=VALUE) to be added 
anywhere within an expression. VALUE would be the id of the timezone. When 
DateMathParser reads this in, it sets the timezone on the Calendar it is using.

Two examples:
- (?timeZone=America/Chicago)NOW/YEAR
- (?timeZone=America/Chicago)+1MONTH
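Peeling such a prefix off an expression could look like this (a sketch of the proposed, not-yet-existing syntax; TzPrefix is an invented name):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TzPrefix {
    // Matches a leading "(?timeZone=ID)" marker on a date-math expression.
    private static final Pattern TZ = Pattern.compile("^\\(\\?timeZone=([^)]+)\\)");

    // Returns { timezoneId, remainingExpression }, defaulting to UTC
    // when no marker is present (today's behavior).
    static String[] split(String expr) {
        Matcher m = TZ.matcher(expr);
        if (m.find()) return new String[] { m.group(1), expr.substring(m.end()) };
        return new String[] { "UTC", expr };
    }

    public static void main(String[] args) {
        String[] r = split("(?timeZone=America/Chicago)NOW/YEAR");
        System.out.println(r[0] + " | " + r[1]);  // America/Chicago | NOW/YEAR
    }
}
```

The extracted id would then be handed to the Calendar that DateMathParser uses, leaving expressions without the marker unchanged.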

I would be more than happy to modify DateMathParser and provide a patch. I just 
need a committer to agree this needs to be resolved, and a decision needs to be 
made on the syntax used.


Thanks!
David





[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076307#comment-13076307
 ] 

Yonik Seeley commented on SOLR-2690:


Although this probably isn't a bug, I agree that handling timezones somehow 
would be nice.
We just need to think very carefully about the API so we can support it long 
term.

One immediate thought I had was that it would be a pain to specify the timezone 
everywhere.  Even a simple range query would need to specify it twice:
my_date:[(?timeZone=America/Chicago)NOW/YEAR TO 
(?timeZone=America/Chicago)+1MONTH]

So one possible alternative that needs more thought is a TZ request parameter 
that would apply by default to things that are date related.


 Date Faceting or Range Faceting doesn't take timezone into account.
 ---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.3
Reporter: David
   Original Estimate: 3h
  Remaining Estimate: 3h

 Timezone needs to be taken into account when doing date math. Currently it 
 isn't. DateMathParser instances created are always being constructed with 
 UTC. This is a huge issue when it comes to faceting. Depending on your 
 timezone, daylight-savings changes the length of a month. A facet gap of 
 +1MONTH is different depending on the timezone and the time of the year.
 I believe the issue is very simple to fix. There are three places in the code 
 where DateMathParser is created. All three are configured with the timezone 
 being UTC. If a user could specify the TimeZone to pass into DateMathParser, 
 this faceting issue would be resolved.
 Though it would be nice if we could always specify the timezone 
 DateMathParser uses (since date math DOES depend on timezone), it's really 
 only essential that we can affect the DateMathParser that SimpleFacets uses 
 when dealing with the gap of the date facets.
 Another solution is to expand the syntax of the expressions DateMathParser 
 understands. For example, we could allow (?timeZone=VALUE) to be added 
 anywhere within an expression. VALUE would be the id of the timezone. When 
 DateMathParser reads this in, it sets the timezone on the Calendar it is 
 using.
 Two examples:
 - (?timeZone=America/Chicago)NOW/YEAR
 - (?timeZone=America/Chicago)+1MONTH
 I would be more than happy to modify DateMathParser and provide a patch. I 
 just need a committer to agree this needs to be resolved, and a decision 
 needs to be made on the syntax used.
 Thanks!
 David




[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread David (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078292#comment-13078292
 ] 

David commented on SOLR-2690:
-

Good point.

Also, this isn't a bug but if we want a complete solution, we really need a way 
to specify times in other timezones.

If I want midnight in Central time zone I shouldn't have to write: 
2011-01-01T06:00:00Z
(Note I wrote 6:00 not 0:00)
I believe only DateField would have to be modified to make it possible to 
specify timezone.

For a complete example, if I wanted to facet blog posts by the date posted, 
bucketed by month:

facet.date=blogPostDate
facet.date.start=2011-01-01T00:00:00
facet.date.end=2012-01-01T00:00:00
facet.date.gap=+1MONTH
timezone=America/Chicago

Currently you would need to do the following, which actually gives close to 
correct results but not exact. Again, the problem is that the gap of +1MONTH 
doesn't take daylight savings into account, so blog posts on the edge of 
ranges are counted in the wrong range.

facet.date=blogPostDate
facet.date.start=2011-01-01T06:00:00Z
facet.date.end=2012-01-01T06:00:00Z
facet.date.gap=+1MONTH


 Date Faceting or Range Faceting doesn't take timezone into account.
 ---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.3
Reporter: David
   Original Estimate: 3h
  Remaining Estimate: 3h

 Timezone needs to be taken into account when doing date math. Currently it 
 isn't. DateMathParser instances created are always being constructed with 
 UTC. This is a huge issue when it comes to faceting. Depending on your 
 timezone, daylight-savings changes the length of a month. A facet gap of 
 +1MONTH is different depending on the timezone and the time of the year.
 I believe the issue is very simple to fix. There are three places in the code 
 where DateMathParser is created. All three are configured with the timezone 
 being UTC. If a user could specify the TimeZone to pass into DateMathParser, 
 this faceting issue would be resolved.
 Though it would be nice if we could always specify the timezone 
 DateMathParser uses (since date math DOES depend on timezone), it's really 
 only essential that we can affect the DateMathParser that SimpleFacets uses 
 when dealing with the gap of the date facets.
 Another solution is to expand the syntax of the expressions DateMathParser 
 understands. For example, we could allow (?timeZone=VALUE) to be added 
 anywhere within an expression. VALUE would be the id of the timezone. When 
 DateMathParser reads this in, it sets the timezone on the Calendar it is 
 using.
 Two examples:
 - (?timeZone=America/Chicago)NOW/YEAR
 - (?timeZone=America/Chicago)+1MONTH
 I would be more than happy to modify DateMathParser and provide a patch. I 
 just need a committer to agree this needs to be resolved, and a decision 
 needs to be made on the syntax used.
 Thanks!
 David




[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 124 - Failure

2011-08-02 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/124/

1 tests failed.
REGRESSION:  org.apache.solr.TestDistributedSearch.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:639)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:99)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:174)




Build Log (for compile errors):
[...truncated 11177 lines...]






[jira] [Updated] (LUCENE-3030) Block tree terms dict index

2011-08-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3030:
---

Attachment: BlockTree.png

The block tree terms dict seems to be working... all tests pass w/
StandardTree codec.  There's still more to do (many nocommits), but, I
think the perf results should be close to what I finally commit:

||Task||QPS base||StdDev base||QPS blocktree||StdDev blocktree||Pct diff||
|IntNRQ|11.58|1.37|10.11|1.77|{color:red}35%{color}-{color:green}16%{color}|
|Term|106.65|3.24|98.84|4.97|{color:red}14%{color}-{color:green}0%{color}|
|Prefix3|30.83|1.36|28.64|2.42|{color:red}18%{color}-{color:green}5%{color}|
|OrHighHigh|5.85|0.15|5.44|0.28|{color:red}14%{color}-{color:green}0%{color}|
|OrHighMed|19.25|0.62|17.91|0.86|{color:red}14%{color}-{color:green}0%{color}|
|Phrase|9.37|0.42|8.87|0.10|{color:red}10%{color}-{color:green}0%{color}|
|TermBGroup1M|44.02|0.90|42.76|1.08|{color:red}7%{color}-{color:green}1%{color}|
|TermGroup1M|37.68|0.65|36.95|0.74|{color:red}5%{color}-{color:green}1%{color}|
|TermBGroup1M1P|47.16|2.94|46.36|0.16|{color:red}7%{color}-{color:green}5%{color}|
|SpanNear|5.60|0.35|5.55|0.29|{color:red}11%{color}-{color:green}11%{color}|
|SloppyPhrase|3.36|0.16|3.34|0.04|{color:red}6%{color}-{color:green}5%{color}|
|Wildcard|35.15|1.30|35.05|2.42|{color:red}10%{color}-{color:green}10%{color}|
|AndHighHigh|10.71|0.22|10.99|0.22|{color:red}1%{color}-{color:green}6%{color}|
|AndHighMed|51.15|1.44|54.31|1.84|{color:green}0%{color}-{color:green}12%{color}|
|Fuzzy1|31.63|0.55|66.15|1.35|{color:green}101%{color}-{color:green}117%{color}|
|PKLookup|40.00|0.75|84.93|5.49|{color:green}94%{color}-{color:green}130%{color}|
|Fuzzy2|33.78|0.82|89.59|2.46|{color:green}151%{color}-{color:green}179%{color}|
|Respell|23.56|1.15|70.89|1.77|{color:green}179%{color}-{color:green}224%{color}|

This is for a multi-segment index, 10 M wikipedia docs, using luceneutil.

These are huge speedups for the terms-dict intensive queries!

The two FuzzyQuerys and Respell get the speedup from the directly
implemented intersect method, and the PKLookup gets gains because it
can often avoid seeking since block tree's terms index can sometimes
rule out terms by their prefix (though, this relies on the PK terms
being predictable -- I use %09d w/ a counter, now; if you instead
used something more random looking (GUIDs) I don't think we'd see
gains).
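The fast-fail idea can be illustrated with a toy character trie (a simplification: the real terms index is an FST-based block tree, and PrefixTrie here is an invented name):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal prefix-trie "fast fail": if no indexed term continues along the
// probe's path, lookup can return NOT_FOUND without seeking or scanning
// any terms block.
public class PrefixTrie {
    private final Map<Character, PrefixTrie> kids = new HashMap<>();
    private boolean isTerm;

    void add(String term) {
        PrefixTrie node = this;
        for (char c : term.toCharArray())
            node = node.kids.computeIfAbsent(c, k -> new PrefixTrie());
        node.isTerm = true;
    }

    // Returns false as soon as the trie path runs out -- no seek needed.
    boolean seekExact(String term) {
        PrefixTrie node = this;
        for (char c : term.toCharArray()) {
            node = node.kids.get(c);
            if (node == null) return false;   // fast fail on the prefix
        }
        return node.isTerm;
    }

    public static void main(String[] args) {
        PrefixTrie t = new PrefixTrie();
        t.add("apple"); t.add("apply");
        System.out.println(t.seekExact("apple"));   // true
        System.out.println(t.seekExact("banana"));  // false, fails on 'b'
    }
}
```

Predictable keys such as zero-padded counters cluster under shared prefixes, so most misses die on the first character or two; random-looking keys like GUIDs spread out and defeat this pruning.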


 Block tree terms dict & index
 -

 Key: LUCENE-3030
 URL: https://issues.apache.org/jira/browse/LUCENE-3030
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
 LUCENE-3030.patch, LUCENE-3030.patch


 Our default terms index today breaks terms into blocks of fixed size
 (ie, every 32 terms is a new block), and then we build an index on top
 of that (holding the start term for each block).
 But, it should be better to instead break terms according to how they
 share prefixes.  This results in variable sized blocks, but means
 within each block we maximize the shared prefix and minimize the
 resulting terms index.  It should also be a speedup for terms dict
 intensive queries because the terms index becomes a true prefix
 trie, and can be used to fast-fail on term lookup (ie returning
 NOT_FOUND without having to seek/scan a terms block).
 Having a true prefix trie should also enable much faster intersection
 with automaton (but this will be a new issue).
 I've made an initial impl for this (called
 BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
 of nocommits, and hairy code, but tests pass (at least once!).
 I made two new codecs, temporarily called StandardTree, PulsingTree,
 that are just like their counterparts but use this new terms dict.
 I added a new exactOnly boolean to TermsEnum.seek.  If that's true
 and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
 enum is unpositioned (ie you should not call next(), docs(), etc.).
 In this approach the index and dict are tightly connected, so it does
 not support a pluggable index impl like BlockTermsWriter/Reader.
 Blocks are stored on certain nodes of the prefix trie, and can contain
 both terms and pointers to sub-blocks (ie, if the block is not a leaf
 block).  So there are two trees, tied to one another -- the index
 trie, and the blocks.  Only certain nodes in the trie map to a block
 in the block tree.
 I think this algorithm is similar to burst tries
 (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
 except it allows terms to be stored on inner blocks (not just leaf
 blocks).  This is important for Lucene because an [accidental]
 adversary could produce a terms dict with way too many 

[jira] [Commented] (LUCENE-3030) Block tree terms dict index

2011-08-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078337#comment-13078337
 ] 

Michael McCandless commented on LUCENE-3030:


Here's the graph of the results:

!BlockTree.png!


 Block tree terms dict & index
 -

 Key: LUCENE-3030
 URL: https://issues.apache.org/jira/browse/LUCENE-3030
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
 LUCENE-3030.patch, LUCENE-3030.patch


 Our default terms index today breaks terms into blocks of fixed size
 (ie, every 32 terms is a new block), and then we build an index on top
 of that (holding the start term for each block).
 But, it should be better to instead break terms according to how they
 share prefixes.  This results in variable sized blocks, but means
 within each block we maximize the shared prefix and minimize the
 resulting terms index.  It should also be a speedup for terms dict
 intensive queries because the terms index becomes a true prefix
 trie, and can be used to fast-fail on term lookup (ie returning
 NOT_FOUND without having to seek/scan a terms block).
 Having a true prefix trie should also enable much faster intersection
 with automaton (but this will be a new issue).
 I've made an initial impl for this (called
 BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
 of nocommits, and hairy code, but tests pass (at least once!).
 I made two new codecs, temporarily called StandardTree, PulsingTree,
 that are just like their counterparts but use this new terms dict.
 I added a new exactOnly boolean to TermsEnum.seek.  If that's true
 and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
 enum is unpositioned (ie you should not call next(), docs(), etc.).
 In this approach the index and dict are tightly connected, so it does
 not support a pluggable index impl like BlockTermsWriter/Reader.
 Blocks are stored on certain nodes of the prefix trie, and can contain
 both terms and pointers to sub-blocks (ie, if the block is not a leaf
 block).  So there are two trees, tied to one another -- the index
 trie, and the blocks.  Only certain nodes in the trie map to a block
 in the block tree.
 I think this algorithm is similar to burst tries
 (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
 except it allows terms to be stored on inner blocks (not just leaf
 blocks).  This is important for Lucene because an [accidental]
 adversary could produce a terms dict with way too many blocks (way
 too much RAM used by the terms index).  Still, with my current patch,
 an adversary can produce too-big blocks... which we may need to fix,
 by letting the terms index not be a true prefix trie on its leaf
 edges.
 Exactly how the blocks are picked can be factored out as its own
 policy (but I haven't done that yet).  Then, burst trie is one policy,
 my current approach is another, etc.  The policy can be tuned to
 the terms' expected distribution, eg if it's a primary key field and
 you only use base 10 for each character then you want block sizes of
 size 10.  This can make a sizable difference on lookup cost.
 I modified the FST Builder to allow for a plugin that freezes the
 tail (changed suffix) of each added term, because I use this to find
 the blocks.




[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread David Schlotfeldt (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078352#comment-13078352
 ] 

David Schlotfeldt commented on SOLR-2690:
-

By extending FacetComponent (and having to resort to reflection) I added: 
facet.date.gap.tz

The new parameter only affects the gap. The math done when processing the gap 
is the largest issue when it comes to date faceting, in my mind.

I would be more than happy to provide a patch to add this feature.

No, this doesn't address all timezone issues, but at least it would address the 
main issue that makes date faceting, in my eyes, completely useless. I bet 
there are 100s of people out there using date faceting who don't realize it 
does NOT give correct results :)




 Date Faceting or Range Faceting doesn't take timezone into account.
 ---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.3
Reporter: David Schlotfeldt
   Original Estimate: 3h
  Remaining Estimate: 3h

 Timezone needs to be taken into account when doing date math. Currently it 
 isn't. DateMathParser instances created are always being constructed with 
 UTC. This is a huge issue when it comes to faceting. Depending on your 
 timezone, daylight-savings changes the length of a month. A facet gap of 
 +1MONTH is different depending on the timezone and the time of the year.
 I believe the issue is very simple to fix. There are three places in the code 
 where DateMathParser is created. All three are configured with the timezone 
 being UTC. If a user could specify the TimeZone to pass into DateMathParser, 
 this faceting issue would be resolved.
 Though it would be nice if we could always specify the timezone 
 DateMathParser uses (since date math DOES depend on timezone), it's really 
 only essential that we can affect the DateMathParser that SimpleFacets uses 
 when dealing with the gap of the date facets.
 Another solution is to expand the syntax of the expressions DateMathParser 
 understands. For example, we could allow (?timeZone=VALUE) to be added 
 anywhere within an expression. VALUE would be the id of the timezone. When 
 DateMathParser reads this in, it sets the timezone on the Calendar it is 
 using.
 Two examples:
 - (?timeZone=America/Chicago)NOW/YEAR
 - (?timeZone=America/Chicago)+1MONTH
 I would be more than happy to modify DateMathParser and provide a patch. I 
 just need a committer to agree this needs to be resolved, and a decision 
 needs to be made on the syntax used.
 Thanks!
 David




[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078353#comment-13078353
 ] 

Robert Muir commented on SOLR-2688:
---

I'll work up a patch; I might tweak the example a bit for the time being, as 
I'd like to err on the side of performance.

Note: with LUCENE-3030, Mike has really sped this guy up again.

 switch solr 4.0 example to DirectSpellChecker
 -

 Key: SOLR-2688
 URL: https://issues.apache.org/jira/browse/SOLR-2688
 Project: Solr
  Issue Type: Improvement
  Components: spellchecker
Affects Versions: 4.0
Reporter: Robert Muir

 For discussion: we might want to switch the Solr 4.0 example to use 
 DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing.




[jira] [Commented] (LUCENE-3343) Comparison operators <, <=, >, >= and = support as RangeQuery syntax in QueryParser

2011-08-02 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078372#comment-13078372
 ] 

Adriano Crestani commented on LUCENE-3343:
--

Hi Oliver,

I was only able to make your patch work when I merged it with LUCENE-3338; 
however, LUCENE-3338 is only available for trunk, not 3x. I will need to figure 
out some other way to make it work on 3x. I plan to work on this soon, probably 
next weekend.

 Comparison operators <, <=, >, >= and = support as RangeQuery syntax in 
 QueryParser
 

 Key: LUCENE-3343
 URL: https://issues.apache.org/jira/browse/LUCENE-3343
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/queryparser
Reporter: Olivier Favre
Assignee: Adriano Crestani
Priority: Minor
  Labels: parser, query
 Fix For: 3.4, 4.0

 Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 To offer better interoperability with other search engines and to provide an 
 easier and more straight forward syntax,
 the operators <, <=, >, >= and = should be available to express an open range 
 query.
 They should at least work for numeric queries.
 '=' can be made a synonym for ':'.
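A rewrite along these lines could map each operator onto the existing range syntax (a hypothetical sketch; CompToRange is an invented name, and the actual patch may differ):

```java
public class CompToRange {
    // Hypothetical rewrite of "field <op> value" into Lucene range syntax:
    // exclusive bounds use {...}, inclusive bounds use [...].
    static String rewrite(String field, String op, String value) {
        switch (op) {
            case "<":  return field + ":{* TO " + value + "}";
            case "<=": return field + ":[* TO " + value + "]";
            case ">":  return field + ":{" + value + " TO *}";
            case ">=": return field + ":[" + value + " TO *]";
            case "=":  return field + ":" + value;   // '=' as a synonym for ':'
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("price", ">=", "10"));   // price:[10 TO *]
        System.out.println(rewrite("price", "<", "100"));   // price:{* TO 100}
    }
}
```

Since each comparison desugars to an open-ended range, numeric fields get the feature essentially for free once the parser recognizes the tokens.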




[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 125 - Still Failing

2011-08-02 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/125/

1 tests failed.
REGRESSION:  org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration

Error Message:
expected:<2> but was:<3>

Stack Trace:
junit.framework.AssertionFailedError: expected:<2> but was:<3>
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)
at 
org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:198)




Build Log (for compile errors):
[...truncated 11154 lines...]






[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-08-02 Thread James Dyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078384#comment-13078384
 ] 

James Dyer commented on SOLR-2382:
--

Lance,

I do not have any scientific benchmarks, but I can tell you how we use 
BerkleyBackedCache and how it performs for us.  

In our main app, we fully re-index all our data every night (13+ million 
records).  It's basically a 2-step process.  First we run ~50 DIH handlers, each 
of which builds a cache from databases, flat files, etc.  The caches partition 
the data 8-ways.  Then a master DIH script does all the joining, runs 
transformers on the data, etc.  We have all 8 invocations of this same master 
DIH config running simultaneously indexing to the same Solr core, so each DIH 
invocation is processing 1.6 million records directly out of caches, doing all 
the 1-many joins, running transformer code, indexing, etc.  This takes 1-1/2 
hours, so maybe 250-300 solr records get added per second.  We're using fast 
local disks configured with raid-0 on an 8-core 64gb server.  This app is 
running solr 1.4, using the original version of this patch, prior to my 
front-porting it to trunk.  No doubt some of the time is spent contending for 
the Lucene index as all 8 DIH invocations are indexing at the same time.

We also have another app that uses Solr4.0 with the patch I originally posted 
back in February, sharing hardware with the main app.  This one has about 10 
entities and uses a simple 1-dih-handler configuration.  The parent entity 
drives directly off the database while all the child entities use 
SqlEntityProcessor with BerkleyBackedCache.  There are only 25,000 fairly 
narrow records and we can re-index everything in about 10 minutes.  This 
includes database time, indexing, running transformers, etc in addition to the 
cache overhead.

The inspiration for this was that we were converting off of Endeca and we were 
relying on Endeca's Forge program to join & denormalize all of the data.  
Forge has a very fast disk-backed caching mechanism and I needed to match that 
performance with DIH.  I'm pretty sure what we have here surpasses Forge.  And 
we also get a big bonus in that it lets you persist caches and use them as a 
subsequent input.  With Forge, we had to output the data into huge delimited 
text files and then use that as input for the next step...

Hope this information gives you some idea if this will work for your use case.
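
For context, the general shape of a cached child entity under this patch looks 
something like the sketch below. This is a hypothetical config: the attribute 
names (cacheImpl, cacheKey, cacheLookup) reflect my reading of the patch and 
could differ from the committed version, and BerkleyBackedCache itself is 
distributed separately:

```xml
<!-- Hypothetical DIH sketch; attribute names may differ from the final patch -->
<document>
  <entity name="parent" processor="SqlEntityProcessor"
          query="SELECT id, title FROM parent">
    <!-- Child rows are read once into the pluggable cache, then joined
         to each parent via cacheKey/cacheLookup instead of an n+1 select. -->
    <entity name="child" processor="SqlEntityProcessor"
            query="SELECT parent_id, value FROM child"
            cacheImpl="BerkleyBackedCache"
            cacheKey="parent_id"
            cacheLookup="parent.id"/>
  </entity>
</document>
```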

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
 SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
 SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
  1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separate from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data 

[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078400#comment-13078400
 ] 

Simon Willnauer commented on LUCENE-3030:
-

bq. These are huge speedups for the terms-dict intensive queries!
oh boy! This is awesome!

 Block tree terms dict & index
 -

 Key: LUCENE-3030
 URL: https://issues.apache.org/jira/browse/LUCENE-3030
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
 LUCENE-3030.patch, LUCENE-3030.patch


 Our default terms index today breaks terms into blocks of fixed size
 (ie, every 32 terms is a new block), and then we build an index on top
 of that (holding the start term for each block).
 But, it should be better to instead break terms according to how they
 share prefixes.  This results in variable sized blocks, but means
 within each block we maximize the shared prefix and minimize the
 resulting terms index.  It should also be a speedup for terms dict
 intensive queries because the terms index becomes a true prefix
 trie, and can be used to fast-fail on term lookup (ie returning
 NOT_FOUND without having to seek/scan a terms block).
 Having a true prefix trie should also enable much faster intersection
 with automaton (but this will be a new issue).
 I've made an initial impl for this (called
 BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
 of nocommits, and hairy code, but tests pass (at least once!).
 I made two new codecs, temporarily called StandardTree, PulsingTree,
 that are just like their counterparts but use this new terms dict.
 I added a new exactOnly boolean to TermsEnum.seek.  If that's true
 and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
 enum is unpositioned (ie you should not call next(), docs(), etc.).
 In this approach the index and dict are tightly connected, so it does
 not support a pluggable index impl like BlockTermsWriter/Reader.
 Blocks are stored on certain nodes of the prefix trie, and can contain
 both terms and pointers to sub-blocks (ie, if the block is not a leaf
 block).  So there are two trees, tied to one another -- the index
 trie, and the blocks.  Only certain nodes in the trie map to a block
 in the block tree.
 I think this algorithm is similar to burst tries
 (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
 except it allows terms to be stored on inner blocks (not just leaf
 blocks).  This is important for Lucene because an [accidental]
 adversary could produce a terms dict with way too many blocks (way
 too much RAM used by the terms index).  Still, with my current patch,
 an adversary can produce too-big blocks... which we may need to fix,
 by letting the terms index not be a true prefix trie on it's leaf
 edges.
 Exactly how the blocks are picked can be factored out as its own
 policy (but I haven't done that yet).  Then, burst trie is one policy,
 my current approach is another, etc.  The policy can be tuned to
 the terms' expected distribution, eg if it's a primary key field and
 you only use base 10 for each character then you want block sizes of
 size 10.  This can make a sizable difference on lookup cost.
 I modified the FST Builder to allow for a plugin that freezes the
 tail (changed suffix) of each added term, because I use this to find
 the blocks.
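
As a rough illustration of the block-splitting idea (a toy policy, not the 
patch's actual BlockTreeTermsWriter logic), the following sketch recursively 
groups a sorted term list into variable-sized blocks by shared prefix:

```java
import java.util.*;

public class BlockPolicyDemo {
    // Toy policy: keep groups of at most maxBlock terms; otherwise split
    // the group by the next prefix character and recurse, so each block
    // ends up holding terms that share a common prefix.
    static List<List<String>> buildBlocks(List<String> sortedTerms, int depth, int maxBlock) {
        if (sortedTerms.size() <= maxBlock) {
            return Collections.singletonList(sortedTerms);
        }
        Map<String, List<String>> groups = new TreeMap<>();
        for (String t : sortedTerms) {
            String key = t.substring(0, Math.min(depth + 1, t.length()));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
        }
        List<List<String>> blocks = new ArrayList<>();
        for (List<String> group : groups.values()) {
            blocks.addAll(buildBlocks(group, depth + 1, maxBlock));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("auth", "author", "auto", "car", "cat", "cod", "code");
        // Prints [[auth, author, auto], [car, cat], [cod, code]]
        System.out.println(buildBlocks(terms, 0, 3));
    }
}
```

A real policy also has to bound the worst cases (the adversarial too-big blocks 
mentioned above), which this toy version does not attempt.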

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities

2011-08-02 Thread David Mark Nemeskey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mark Nemeskey updated LUCENE-3220:


Attachment: LUCENE-3220.patch

EasySimilarity now computes norms in the same way as DefaultSimilarity.

Actually not exactly the same way, as I have not yet added the discountOverlaps 
property. I think it would be a good idea for EasySimilarity as well (it is for 
phrases, right?). What do you reckon?

I also wrote a quick test to see which norm (length directly or 1/sqrt) is 
closer to the original value and it seems that the direct one is usually much 
closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know 
it is much more important that the new Similarities can use existing indices.

 Implement various ranking models as Similarities
 

 Key: LUCENE-3220
 URL: https://issues.apache.org/jira/browse/LUCENE-3220
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/query/scoring, core/search
Affects Versions: flexscoring branch
Reporter: David Mark Nemeskey
Assignee: David Mark Nemeskey
  Labels: gsoc, gsoc2011
 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
 LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we 
 can finally work on implementing the standard ranking models. Currently DFR, 
 BM25 and LM are on the menu.
 Done:
  * {{EasyStats}}: contains all statistics that might be relevant for a 
 ranking algorithm
  * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the 
 DocScorers and as much implementation detail as possible
  * _BM25_: the current mock implementation might be OK
  * _LM_
  * _DFR_
  * The so-called _Information-Based Models_

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[Lucene.Net] [jira] [Resolved] (LUCENENET-404) Improve brand logo design

2011-08-02 Thread Troy Howard (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENENET-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Troy Howard resolved LUCENENET-404.
---

Resolution: Fixed

Uploaded the artifacts in r1153264

 Improve brand logo design
 -

 Key: LUCENENET-404
 URL: https://issues.apache.org/jira/browse/LUCENENET-404
 Project: Lucene.Net
  Issue Type: Sub-task
  Components: Project Infrastructure
Reporter: Troy Howard
Assignee: Troy Howard
Priority: Minor
  Labels: branding, logo
 Attachments: lucene-alternates.jpg, lucene-medium.png, 
 lucene-net-logo-display.jpg


 The existing Lucene.Net logo leaves a lot to be desired. We'd like a new logo 
 that is modern and well designed. 
 To implement this, Troy is coordinating with StackOverflow/StackExchange to 
 manage a logo design contest, the results of which will be our new logo 
 design. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread David Schlotfeldt (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078419#comment-13078419
 ] 

David Schlotfeldt commented on SOLR-2690:
-

Okay I've modified my code to now take facet.date.tz instead. The time zone 
now affects the facet's start, end and gap values.

 Date Faceting or Range Faceting doesn't take timezone into account.
 ---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.3
Reporter: David Schlotfeldt
   Original Estimate: 3h
  Remaining Estimate: 3h

 Timezone needs to be taken into account when doing date math. Currently it 
 isn't. DateMathParser instances created are always being constructed with 
 UTC. This is a huge issue when it comes to faceting. Depending on your 
 timezone day-light-savings changes the length of a month. A facet gap of 
 +1MONTH is different depending on the timezone and the time of the year.
 I believe the issue is very simple to fix. There are three places in the code 
 DateMathParser is created. All three are configured with the timezone being 
 UTC. If a user could specify the TimeZone to pass into DateMathParser this 
 faceting issue would be resolved.
 Though it would be nice if we could always specify the timezone 
 DateMathParser uses (since date math DOES depend on timezone), it's really only 
 essential that we can affect the DateMathParser that SimpleFacets uses when 
 dealing with the gap of the date facets.
 Another solution is to expand the syntax of the expressions DateMathParser 
 understands. For example we could allow (?timeZone=VALUE) to be added 
 anywhere within an expression. VALUE would be the id of the timezone. When 
 DateMathParser reads this in sets the timezone on the Calendar it is using.
 Two examples:
 - (?timeZone=America/Chicago)NOW/YEAR
 - (?timeZone=America/Chicago)+1MONTH
 I would be more than happy to modify DateMathParser and provide a patch. I 
 just need a committer to agree this needs to be resolved, and a decision needs 
 to be made on the syntax used.
 Thanks!
 David
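
To make the month-length point concrete, here is a small stand-alone java.time 
sketch (independent of Solr's DateMathParser) showing that a one-month gap 
starting 2011-03-01 covers 743 hours in America/Chicago but 744 hours in UTC, 
because of the DST transition on March 13:

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class MonthGapDemo {
    public static void main(String[] args) {
        // March 2011 in a zone with a DST spring-forward: one wall-clock month
        // is only 30 days and 23 hours of real time.
        ZonedDateTime chiStart = ZonedDateTime.of(2011, 3, 1, 0, 0, 0, 0, ZoneId.of("America/Chicago"));
        long chiHours = Duration.between(chiStart.toInstant(), chiStart.plusMonths(1).toInstant()).toHours();

        // The same wall-clock month in UTC is exactly 31 * 24 hours.
        ZonedDateTime utcStart = ZonedDateTime.of(2011, 3, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        long utcHours = Duration.between(utcStart.toInstant(), utcStart.plusMonths(1).toInstant()).toHours();

        System.out.println(chiHours); // 743
        System.out.println(utcHours); // 744
    }
}
```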

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-2143) Add OpenSearch resources to the bundled example

2011-08-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned SOLR-2143:
-

Assignee: (was: Grant Ingersoll)

 Add OpenSearch resources to the bundled example 
 

 Key: SOLR-2143
 URL: https://issues.apache.org/jira/browse/SOLR-2143
 Project: Solr
  Issue Type: Wish
  Components: documentation
Affects Versions: 4.0
 Environment: N/A
Reporter: Rich Cariens
 Fix For: 4.0

 Attachments: SOLR-2143.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 Guidance & samples on how to make a Solr instance OpenSearch-compliant feels 
 like it would add value to the user community.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS

2011-08-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078434#comment-13078434
 ] 

Grant Ingersoll commented on LUCENE-2748:
-

I wonder if the best thing to do here is to simply start fresh and clean, 
leave all existing content up as is, and link to it as the old content.

 Convert all Lucene web properties to use the ASF CMS
 

 Key: LUCENE-2748
 URL: https://issues.apache.org/jira/browse/LUCENE-2748
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 The new CMS has a lot of nice features (and some kinks to still work out) and 
 Forrest just doesn't cut it anymore, so we should move to the ASF CMS: 
 http://apache.org/dev/cms.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents

2011-08-02 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078451#comment-13078451
 ] 

Otis Gospodnetic commented on SOLR-2689:


Markus - I can't even tell this frange call cuts-off any of the hits - you have 
numFound=227763 in both examples above.  Am I missing something? :)

 !frange with query($qq) sets score=1.0f for all returned documents
 --

 Key: SOLR-2689
 URL: https://issues.apache.org/jira/browse/SOLR-2689
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 3.4
Reporter: Markus Jelsma
 Fix For: 3.4, 4.0


 Consider the following queries, both query the default field for 'test' and 
 return the document digest and score (i don't seem to be able to get only 
 score, fl=score returns all fields):
 This is a normal query and yields normal results with proper scores:
 {code}
 q=test&fl=digest,score
 {code}
 {code}
 <result name="response" numFound="227763" start="0" maxScore="4.952673">
   <doc>
     <float name="score">4.952673</float>
     <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
   </doc>
   <doc>
     <float name="score">4.952673</float>
     <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
   </doc>
   <doc>
     <float name="score">4.952673</float>
     <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
   </doc>
 </result>
 {code}
 This query uses frange with query() to limit the number of returned 
 documents. When using multiple search terms i can indeed cut-off the result 
 set but in the end all returned documents have score=1.0f. The final result 
 set cannot be sorted by score anymore. The result set seems to be returned in 
 the order of Lucene docId's.
 {code}
 q={!frange l=1.23}query($qq)&qq=test&fl=digest,score
 {code}
 {code}
 <result name="response" numFound="227763" start="0" maxScore="1.0">
   <doc>
     <float name="score">1.0</float>
     <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
   </doc>
   <doc>
     <float name="score">1.0</float>
     <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
   </doc>
   <doc>
     <float name="score">1.0</float>
     <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
   </doc>
 </result>
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Did solr.xml persistence brake?

2011-08-02 Thread Yury Kats
Hi,

With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command,
the solr.xml gets overwritten with only the last core, repeated as many times
as there are cores.

It used to work fine with a trunk build from a couple of months ago, so it 
looks like
something broke solr.xml persistence. Can it be related to SOLR-2331?

I'm running SolrCloud, using:
-Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun

I'm starting Solr with four cores listed in solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="master1">
    <core name="master1" instanceDir="master1" shard="shard1" collection="hcpconf" />
    <core name="master2" instanceDir="master2" shard="shard2" collection="hcpconf" />
    <core name="slave1" instanceDir="slave1" shard="shard1" collection="hcpconf" />
    <core name="slave2" instanceDir="slave2" shard="shard2" collection="hcpconf" />
  </cores>
</solr>

I then issue a PERSIST request:
http://localhost:8983/solr/admin/cores?action=PERSIST

And the solr.xml turns into:

<solr persistent="true">
  <cores defaultCoreName="master1" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983" hostContext="solr">
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
  </cores>
</solr>
/solr

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2011-08-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned SOLR-1692:
-

Assignee: (was: Grant Ingersoll)

 CarrotClusteringEngine produce summary does nothing
 ---

 Key: SOLR-1692
 URL: https://issues.apache.org/jira/browse/SOLR-1692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Grant Ingersoll
 Fix For: 3.4, 4.0

 Attachments: SOLR-1692.patch


 In the CarrotClusteringEngine, the produceSummary option does nothing, as the 
 results of doing the highlighting are just ignored.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents

2011-08-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078467#comment-13078467
 ] 

Markus Jelsma commented on SOLR-2689:
-

You are right, it's because both examples use one search term and thus all hits 
have the same score. The problem shows up when you use multiple terms, so that 
not all scores are identical. I'll provide a better description and example 
next week when I get back.

 !frange with query($qq) sets score=1.0f for all returned documents
 --

 Key: SOLR-2689
 URL: https://issues.apache.org/jira/browse/SOLR-2689
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 3.4
Reporter: Markus Jelsma
 Fix For: 3.4, 4.0



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents

2011-08-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078522#comment-13078522
 ] 

Hoss Man commented on SOLR-2689:


I don't really understand why this is a bug?

frange is the FunctionRangeQParserPlugin which produces 
ConstantScoreRangeQueries -- it doesn't matter when/how/why it's used (or that 
the function it's wrapping comes from an arbitrary query), it always produces 
range queries that generate constant scores.
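
If the goal is to cut off low-scoring hits while still keeping the real 
relevance scores, one possible approach (a sketch, not verified against this 
exact report) is to leave the scoring query in q and move the frange into a 
filter query, since filters do not contribute to the score:

```
q=test&fq={!frange l=1.23}query($qq)&qq=test&fl=digest,score
```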

 !frange with query($qq) sets score=1.0f for all returned documents
 --

 Key: SOLR-2689
 URL: https://issues.apache.org/jira/browse/SOLR-2689
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 3.4
Reporter: Markus Jelsma
 Fix For: 3.4, 4.0



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)

2011-08-02 Thread Yury Kats (JIRA)
solr.xml persistence is broken for multicore (by SOLR-2331)
---

 Key: SOLR-2691
 URL: https://issues.apache.org/jira/browse/SOLR-2691
 Project: Solr
  Issue Type: Bug
  Components: multicore
Affects Versions: 4.0
Reporter: Yury Kats
Priority: Critical


With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command,
the solr.xml gets overwritten with only the last core, repeated as many times
as there are cores.

It used to work fine with a trunk build from a couple of months ago, so it 
looks like
something broke solr.xml persistence. 

It appears to have been introduced by SOLR-2331:
CoreContainer#persistFile creates the map for core attributes (coreAttribs) 
outside 
of the loop that iterates over cores. Therefore, all cores reuse the same map 
of attributes
and hence only the values from the last core are preserved and used for all 
cores in the list.

I'm running SolrCloud, using:
-Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun

I'm starting Solr with four cores listed in solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="master1">
    <core name="master1" instanceDir="master1" shard="shard1" collection="hcpconf" />
    <core name="master2" instanceDir="master2" shard="shard2" collection="hcpconf" />
    <core name="slave1" instanceDir="slave1" shard="shard1" collection="hcpconf" />
    <core name="slave2" instanceDir="slave2" shard="shard2" collection="hcpconf" />
  </cores>
</solr>

I then issue a PERSIST request:
http://localhost:8983/solr/admin/cores?action=PERSIST

And the solr.xml turns into:

<solr persistent="true">
  <cores defaultCoreName="master1" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983" hostContext="solr">
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
  </cores>
</solr>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2331) Refactor CoreContainer's SolrXML serialization code and improve testing

2011-08-02 Thread Yury Kats (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078526#comment-13078526
 ] 

Yury Kats commented on SOLR-2331:
-

Looks like this introduced a regression in solr.xml persistence. 
See SOLR-2691.

 Refactor CoreContainer's SolrXML serialization code and improve testing
 ---

 Key: SOLR-2331
 URL: https://issues.apache.org/jira/browse/SOLR-2331
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 4.0

 Attachments: SOLR-2331-fix-windows-file-deletion-failure.patch, 
 SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331.patch


 CoreContainer has enough code in it - I'd like to factor out the solr.xml 
 serialization code into SolrXMLSerializer or something - which should make 
 testing it much easier and lightweight.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)

2011-08-02 Thread Yury Kats (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yury Kats updated SOLR-2691:


Attachment: jira2691.patch

Patch: create the map of attributes inside the loop.
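
The pitfall is plain aliasing; this stand-alone sketch (not the actual 
CoreContainer code, names are illustrative) reproduces why a single map created 
outside the loop makes every persisted core carry the last core's attributes:

```java
import java.util.*;

public class SharedMapBugDemo {
    public static void main(String[] args) {
        List<Map<String, String>> persisted = new ArrayList<>();

        // Buggy pattern: one attribute map created outside the loop means
        // every list entry aliases the same map, so the last core's values win.
        Map<String, String> coreAttribs = new HashMap<>();
        for (String core : new String[] {"master1", "master2", "slave1", "slave2"}) {
            coreAttribs.put("name", core);
            persisted.add(coreAttribs);
        }
        System.out.println(persisted.get(0).get("name")); // slave2, not master1

        // Fix (as in the patch): allocate a fresh map per iteration.
        persisted.clear();
        for (String core : new String[] {"master1", "master2", "slave1", "slave2"}) {
            Map<String, String> attribs = new HashMap<>();
            attribs.put("name", core);
            persisted.add(attribs);
        }
        System.out.println(persisted.get(0).get("name")); // master1
    }
}
```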

 solr.xml persistence is broken for multicore (by SOLR-2331)
 ---

 Key: SOLR-2691
 URL: https://issues.apache.org/jira/browse/SOLR-2691
 Project: Solr
  Issue Type: Bug
  Components: multicore
Affects Versions: 4.0
Reporter: Yury Kats
Priority: Critical
 Attachments: jira2691.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Did solr.xml persistence break?

2011-08-02 Thread Yury Kats
On 8/2/2011 5:42 PM, Yury Kats wrote:

 It used to work fine with a trunk build from a couple of months ago, so it 
 looks like
 something broke solr.xml persistence. Can it be related to SOLR-2331?

Looking at the code, it does seem like a regression from SOLR-2331.
CoreContainer#persistFile creates the map for core attributes (coreAttribs) 
outside
of the loop that iterates over cores. Therefore, all cores reuse the same map 
of attributes
and hence only the values from the last core are preserved and used for all 
cores in the list.
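The bug pattern is easy to demonstrate in isolation. Below is a minimal, hypothetical sketch (my own illustration, not the actual CoreContainer code) contrasting a map created once outside the loop with a fresh map per iteration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedMapBug {
    // Buggy pattern: one map created outside the loop is reused for every core,
    // so each iteration overwrites the previous core's attributes.
    static List<Map<String, String>> persistBuggy(String[] coreNames) {
        List<Map<String, String>> persisted = new ArrayList<>();
        Map<String, String> coreAttribs = new HashMap<>(); // created once -- the bug
        for (String name : coreNames) {
            coreAttribs.put("name", name);
            persisted.add(coreAttribs); // every entry aliases the same map
        }
        return persisted;
    }

    // Fixed pattern: a fresh map per iteration, so each core keeps its own values.
    static List<Map<String, String>> persistFixed(String[] coreNames) {
        List<Map<String, String>> persisted = new ArrayList<>();
        for (String name : coreNames) {
            Map<String, String> coreAttribs = new HashMap<>(); // new map per core
            coreAttribs.put("name", name);
            persisted.add(coreAttribs);
        }
        return persisted;
    }

    public static void main(String[] args) {
        String[] cores = {"master1", "master2", "slave1", "slave2"};
        System.out.println(persistBuggy(cores)); // four copies of the last core's attributes
        System.out.println(persistFixed(cores)); // four distinct cores
    }
}
```

With the buggy version, every persisted entry ends up showing the last core, which is exactly the repeated-slave2 symptom above.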

I opened SOLR-2691 to track and attached a patch.

Would appreciate a quick look from a committer. Thanks!


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Did solr.xml persistence break?

2011-08-02 Thread Chris Hostetter

: I opened SOLR-2691 to track and attached a patch.
: 
: Would appreciate a quick look from a committer. Thanks!

I'm not too familiar with that code, but i can definitely reproduce the 
bug ... i'll take a look at the existing tests and see if i can help out 
with your patch.


-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread David Schlotfeldt (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078538#comment-13078538
 ] 

David Schlotfeldt commented on SOLR-2690:
-

Being able to specify dates in timezones other than GMT+0 isn't a problem. It 
would just be nice, but we can ignore that.

The time zone the DateMathParser is configured with is the issue (which it 
sounds like you understand). My solution, which changes the timezone 
DateMathParser is constructed with in SimpleFacets to parse start, end and gap, 
isn't ideal. I went this route because I don't want to run a custom-built Solr 
-- my solution allowed me to fix the bug by simply replacing the facet 
SearchComponent. Affecting all DateMathParsers created for the length of the 
request is what is really needed (which is what you said). I like your approach.

It sounds like we are on the same page.

So, can we get this added? :)

Without the time zone affecting DateMathParser, date faceting is useless (at 
least for 100% of the situations I would use it for).

By the way, I'm glad to see how many responses there have been. I'm happy to 
see how active this project is. :)
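To make the +1MONTH point concrete, here is a small self-contained sketch (my own illustration, not Solr code) using java.util.Calendar: the same one-month step covers a different span of real time in a DST-observing zone than in UTC.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class MonthGapDemo {
    // Length in milliseconds of a "+1MONTH" step starting at
    // 2011-03-01T00:00 local time in the given zone.
    static long monthGapMillis(String zoneId) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone(zoneId));
        cal.clear();
        cal.set(2011, Calendar.MARCH, 1, 0, 0, 0);
        long start = cal.getTimeInMillis();
        cal.add(Calendar.MONTH, 1); // the same kind of step DateMathParser performs
        return cal.getTimeInMillis() - start;
    }

    public static void main(String[] args) {
        long utc = monthGapMillis("UTC");                 // exactly 31 days
        long chicago = monthGapMillis("America/Chicago"); // one hour shorter: DST began Mar 13, 2011
        System.out.println(utc - chicago); // 3600000
    }
}
```

A facet gap computed in UTC is therefore an hour off for a Chicago user whenever the gap crosses a DST transition.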

 Date Faceting or Range Faceting doesn't take timezone into account.
 ---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.3
Reporter: David Schlotfeldt
   Original Estimate: 3h
  Remaining Estimate: 3h

 Timezone needs to be taken into account when doing date math. Currently it 
 isn't: DateMathParser instances are always constructed with UTC. This is a 
 huge issue when it comes to faceting. Depending on your timezone, daylight 
 saving time changes the length of a month. A facet gap of +1MONTH is 
 different depending on the timezone and the time of the year.
 I believe the issue is very simple to fix. There are three places in the code 
 where DateMathParser is created. All three are configured with the timezone 
 UTC. If a user could specify the TimeZone to pass into DateMathParser, this 
 faceting issue would be resolved.
 Though it would be nice if we could always specify the timezone 
 DateMathParser uses (since date math DOES depend on timezone), it's really 
 only essential that we can affect the DateMathParser that SimpleFacets uses 
 when dealing with the gap of the date facets.
 Another solution is to expand the syntax of the expressions DateMathParser 
 understands. For example, we could allow (?timeZone=VALUE) to be added 
 anywhere within an expression, where VALUE is the id of the timezone. When 
 DateMathParser reads this, it sets the timezone on the Calendar it is using.
 Two examples:
 - (?timeZone=America/Chicago)NOW/YEAR
 - (?timeZone=America/Chicago)+1MONTH
 I would be more than happy to modify DateMathParser and provide a patch. I 
 just need a committer to agree this needs to be resolved, and a decision 
 needs to be made on the syntax used.
 Thanks!
 David




CHANGES.txt for modules

2011-08-02 Thread Adriano Crestani
I can see the descriptions of changes made to the modules are still in
contrib/CHANGES.txt. Are they going to be moved to a modules/CHANGES.txt
in the future?


[jira] [Commented] (LUCENE-2979) Simplify configuration API of contrib Query Parser

2011-08-02 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078549#comment-13078549
 ] 

Adriano Crestani commented on LUCENE-2979:
--

Hi Phillipe,

Thanks for the patch. I just applied your patch for 3x. It looks good.

Since you removed TestAttributes, can you create another JUnit test to check 
whether the configuration is updated when an attribute (like CharTermAttribute) 
is updated? That is basically the new functionality of the newly deprecated 
query parser attributes.

 Simplify configuration API of contrib Query Parser
 --

 Key: LUCENE-2979
 URL: https://issues.apache.org/jira/browse/LUCENE-2979
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/other
Affects Versions: 2.9, 3.0
Reporter: Adriano Crestani
Assignee: Adriano Crestani
  Labels: api-change, gsoc, gsoc2011, lucene-gsoc-11, mentor
 Fix For: 3.4, 4.0

 Attachments: LUCENE-2979_phillipe_ramalho_2.patch, 
 LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_3.patch, 
 LUCENE-2979_phillipe_ramalho_4_3x.patch, 
 LUCENE-2979_phillipe_ramalho_4_trunk.patch, 
 LUCENE-2979_phillipe_reamalho.patch


 The current configuration API is very complicated and inherits the concept 
 used by the Attribute API to store token information in token streams. However, 
 the requirements for the two (QP config and token streams) are not the same, so 
 they shouldn't be using the same mechanism.
 I propose to simplify the QP config and make it less scary for people intending 
 to use the contrib QP. The task is not difficult; it will just require a lot of 
 code change and figuring out the best way to do it. That's why it's a good 
 candidate for a GSoC project.
 I would like to hear good proposals about how to make the API more friendly 
 and less scary :)




[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)

2011-08-02 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-2691:
--

Fix Version/s: 4.0
 Assignee: Mark Miller

 solr.xml persistence is broken for multicore (by SOLR-2331)
 ---

 Key: SOLR-2691
 URL: https://issues.apache.org/jira/browse/SOLR-2691
 Project: Solr
  Issue Type: Bug
  Components: multicore
Affects Versions: 4.0
Reporter: Yury Kats
Assignee: Mark Miller
Priority: Critical
 Fix For: 4.0

 Attachments: jira2691.patch






[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)

2011-08-02 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-2691:
---

Attachment: SOLR-2691.patch

patch of persistence tests at the CoreContainer level (since that's where the 
bug was)  that incorporates Yury's fix.

the assertions could definitely be beefed up to sanity-check more aspects of 
the serialization, and we should really also be testing that loading works and 
parses all of these things back in the expected way as well, but it's a 
start.

The thing that's currently hanging me up is that somehow the test is leaking a 
SolrIndexSearcher reference.  I thought maybe it was because of the SolrCores i 
was creating+registering and then ignoring, but if i try to close them i get an 
error about too many decrefs instead.

I'll let miller figure it out
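For readers unfamiliar with the refcounting involved: a generic sketch of the discipline (my own illustration, not Solr's actual RefCounted/SolrIndexSearcher classes). Releasing an object more times than it was referenced produces exactly the kind of "too many decrefs" failure described above:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedResource {
    private final AtomicInteger refcount = new AtomicInteger(1); // creator holds one reference
    boolean closed = false;

    public void incref() {
        refcount.incrementAndGet();
    }

    public void decref() {
        int remaining = refcount.decrementAndGet();
        if (remaining == 0) {
            closed = true; // last holder released: free the underlying resource here
        } else if (remaining < 0) {
            // more releases than references were ever taken
            throw new IllegalStateException("too many decrefs: " + remaining);
        }
    }
}
```

Forgetting to release a reference leaks the resource; releasing one you never took underflows the count, which is the symmetric error.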

 solr.xml persistence is broken for multicore (by SOLR-2331)
 ---

 Key: SOLR-2691
 URL: https://issues.apache.org/jira/browse/SOLR-2691
 Project: Solr
  Issue Type: Bug
  Components: multicore
Affects Versions: 4.0
Reporter: Yury Kats
Assignee: Mark Miller
Priority: Critical
 Fix For: 4.0

 Attachments: SOLR-2691.patch, jira2691.patch






[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.

2011-08-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078578#comment-13078578
 ] 

David Smiley commented on SOLR-2690:


Hoss, thanks for elaborating on the distinction between the date literal and 
the DateMath timezone. I was conflating these issues in my mind -- silly me.

 Date Faceting or Range Faceting doesn't take timezone into account.
 ---

 Key: SOLR-2690
 URL: https://issues.apache.org/jira/browse/SOLR-2690
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.3
Reporter: David Schlotfeldt
   Original Estimate: 3h
  Remaining Estimate: 3h





[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 129 - Failure

2011-08-02 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/129/

2 tests failed.
REGRESSION:  
org.apache.solr.client.solrj.embedded.MultiCoreExampleJettyTest.testMultiCore

Error Message:
Severe errors in solr configuration.  Check your log files for more detailed 
information on what may be wrong.  
- 
java.lang.RuntimeException: java.io.FileNotFoundException: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk-java7/checkout/solr/example/multicore/core0/data/index/org.apache.solr.core.RefCntRamDirectory@7e96f890
 lockFactory=org.apache.lucene.store.simplefslockfact...@4b905345-write.lock 
(No such file or directory)  at 
org.apache.solr.core.SolrCore.initIndex(SolrCore.java:392)  at 
org.apache.solr.core.SolrCore.init(SolrCore.java:562)  at 
org.apache.solr.core.SolrCore.init(SolrCore.java:509)  at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)  at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
  at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)  at 
org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)  at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)  at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)  
at 
org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
  at 
org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
  at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
  at 
org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)  
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner.start(Jet  Severe errors 
in solr configuration.  Check your log files for more detailed information on 
what may be wrong.  
- 
java.lang.RuntimeException: java.io.FileNotFoundException: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk-java7/checkout/solr/example/multicore/core0/data/index/org.apache.solr.core.RefCntRamDirectory@7e96f890
 lockFactory=org.apache.lucene.store.simplefslockfact...@4b905345-write.lock 
(No such file or directory)  at 
org.apache.solr.core.SolrCore.initIndex(SolrCore.java:392)  at 
org.apache.solr.core.SolrCore.init(SolrCore.java:562)  at 
org.apache.solr.core.SolrCore.init(SolrCore.java:509)  at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)  at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
  at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)  at 
org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)  at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)  at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)  
at 
org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
  at 
org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
  at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
  at 
org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)  
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner.start(Jet  request: 
http://localhost:27238/example/core0/update?commit=true&waitSearcher=true&wt=javabin&version=2

Stack Trace:


request: 
http://localhost:27238/example/core0/update?commit=true&waitSearcher=true&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:434)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:104)
at 

[jira] [Created] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

2011-08-02 Thread Trejkaz (JIRA)
StandardTokenizer disposes of Hiragana combining mark dakuten instead of 
attaching it to the character it belongs to


 Key: LUCENE-3358
 URL: https://issues.apache.org/jira/browse/LUCENE-3358
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.3
Reporter: Trejkaz


Lucene 3.3 (possibly 3.1 onwards) exhibits less than great behaviour for 
tokenising hiragana, if combining marks are in use.

Here's a unit test:

{code}
@Test
public void testHiraganaWithCombiningMarkDakuten() throws Exception
{
    // Hiragana 'SA' followed by the combining mark dakuten
    TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));

    // Should be kept together.
    List<String> expectedTokens = Arrays.asList("\u3055\u3099");
    List<String> actualTokens = new LinkedList<String>();
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken())
    {
        actualTokens.add(term.toString());
    }

    assertEquals("Wrong tokens", expectedTokens, actualTokens);
}
{code}

This code fails with:
{noformat}
java.lang.AssertionError: Wrong tokens expected:[ざ] but was:[さ]
{noformat}

It seems as if the tokeniser is throwing away the combining mark entirely.

3.0's behaviour was also undesirable:
{noformat}
java.lang.AssertionError: Wrong tokens expected:[ざ] but was:[さ, ゙]
{noformat}

But at least the token was there, so it was possible to write a filter to work 
around the issue.

Katakana seems to be avoiding this particular problem, because all katakana and 
combining marks found in a single run seem to be lumped into a single token 
(this is a problem in its own right, but I'm not sure if it's really a bug.)
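One possible workaround outside the tokenizer (my suggestion, not something proposed in this report) is to NFC-normalize the input first, so the combining dakuten is composed into a precomposed code point before tokenization:

```java
import java.text.Normalizer;

public class DakutenNfc {
    public static void main(String[] args) {
        // U+3055 (さ) + combining dakuten U+3099 composes to U+3056 (ざ) under NFC,
        // leaving no standalone combining mark for the tokenizer to discard.
        String decomposed = "\u3055\u3099";
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals("\u3056")); // prints true
    }
}
```

This only helps for sequences that have a precomposed form; combining marks with no NFC composition would still hit the tokenizer behaviour described above.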

