RE: Need help building JCC on windows
Dear Baseer,

I've never tried with Mingw32 but succeeded in a build with Microsoft Visual Studio (Express 9.0) on Win7/32bit. Eggs are available for download at http://code.google.com/p/pylucene-win32-binary/downloads/list

If you're only looking for JCC - version 2.6 is available for download. If you need a more recent version I can try to build and upload. Of course that doesn't help you with the Mingw issue. Maybe someone else has more experience with that one...

Regards, Thomas

--
OrbiTeam Software GmbH & Co. KG
Endenicher Allee 35
53121 Bonn
Germany
http://www.orbiteam.de

-----Original Message-----
From: Baseer Khan [mailto:bas...@yahoo.com]
Sent: Thursday, May 19, 2011 6:26 AM
To: pylucene-dev@lucene.apache.org
Subject: Need help building JCC on windows

I can't seem to resolve the undefined reference issue while building JCC on Windows 7 using the Mingw32 compiler. Any help?

C:\glassfish3\jdk
running install
running bdist_egg
running egg_info
writing JCC.egg-info\PKG-INFO
writing top-level names to JCC.egg-info\top_level.txt
writing dependency_links to JCC.egg-info\dependency_links.txt
reading manifest template 'MANIFEST.in'
writing manifest file 'JCC.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_py
writing C:\Users\baseer\devenv\jcc\jcc\config.py
copying jcc\config.py -> build\lib.win32-2.7\jcc
copying jcc\cpp.py -> build\lib.win32-2.7\jcc
copying jcc\python.py -> build\lib.win32-2.7\jcc
copying jcc\windows.py -> build\lib.win32-2.7\jcc
copying jcc\__init__.py -> build\lib.win32-2.7\jcc
copying jcc\__main__.py -> build\lib.win32-2.7\jcc
copying jcc\sources\functions.cpp -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\JArray.cpp -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\jcc.cpp -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\JCCEnv.cpp -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\JObject.cpp -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\types.cpp -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\functions.h -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\JArray.h -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\JCCEnv.h -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\jccfuncs.h -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\JObject.h -> build\lib.win32-2.7\jcc\sources
copying jcc\sources\macros.h -> build\lib.win32-2.7\jcc\sources
copying jcc\patches\patch.4195 -> build\lib.win32-2.7\jcc\patches
copying jcc\patches\patch.43.0.6c11 -> build\lib.win32-2.7\jcc\patches
copying jcc\patches\patch.43.0.6c7 -> build\lib.win32-2.7\jcc\patches
copying jcc\jcc.lib -> build\lib.win32-2.7\jcc
copying jcc\classes\org\apache\jcc\PythonVM.class -> build\lib.win32-2.7\jcc\classes\org\apache\jcc
copying jcc\classes\org\apache\jcc\PythonException.class -> build\lib.win32-2.7\jcc\classes\org\apache\jcc
running build_ext
building 'jcc' extension
C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -D_jcc_lib -DJCC_VER=2.8 -IC:\glassfish3\jdk/include -IC:\glassfish3\jdk/include/win32 -I_jcc -Ijcc/sources -IC:\Python27\include -IC:\Python27\PC -c jcc/sources/jcc.cpp -o build\temp.win32-2.7\Release\jcc\sources\jcc.o -DPYTHON -D_JNI_IMPLEMENTATION_ -fno-strict-aliasing -Wno-write-strings
C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -D_jcc_lib -DJCC_VER=2.8 -IC:\glassfish3\jdk/include -IC:\glassfish3\jdk/include/win32 -I_jcc -Ijcc/sources -IC:\Python27\include -IC:\Python27\PC -c jcc/sources/JCCEnv.cpp -o build\temp.win32-2.7\Release\jcc\sources\jccenv.o -DPYTHON -D_JNI_IMPLEMENTATION_ -fno-strict-aliasing -Wno-write-strings
writing build\temp.win32-2.7\Release\jcc\sources\jcc.def
C:\MinGW\bin\g++.exe -mno-cygwin -shared -Wl,--out-implib,build\lib.win32-2.7\jcc\jcc.lib -s build\temp.win32-2.7\Release\jcc\sources\jcc.o build\temp.win32-2.7\Release\jcc\sources\jccenv.o build\temp.win32-2.7\Release\jcc\sources\jcc.def -LC:\Python27\libs -LC:\Python27\PCbuild -lpython27 -lmsvcr90 -o build\lib.win32-2.7\jcc.dll -LC:\glassfish3\jdk/lib -ljvm -Wl,-S -Wl,--out-implib,jcc\jcc.lib
Creating library file: jcc\jcc.lib
build\temp.win32-2.7\Release\jcc\sources\jcc.o:jcc.cpp:(.text+0xc30): undefined reference to `JNI_GetDefaultJavaVMInitArgs@4'
build\temp.win32-2.7\Release\jcc\sources\jcc.o:jcc.cpp:(.text+0xed0): undefined reference to `JNI_CreateJavaVM@12'
collect2: ld returned 1 exit status
error: command 'g++' failed with exit status 1
Re: FST and FieldCache?
Hi David,

> but with less memory. As I understand it, FSTs are a highly compressed
> representation of a set of Strings (among other possibilities). The

Yep. Not only, but this is one of the use cases. Will you be at Lucene Revolution next week? I'll be talking about it there.

> representation of a set of Strings (among other possibilities). The
> fieldCache would need to point to an FST entry (an arc?) using something
> small, say an integer. Is there a way to point to an FST entry with an
> integer, and then somehow with relative efficiency construct the String
> from the arcs to get there?

Correct me if my understanding is wrong: you'd like to assign a unique integer to each String and then retrieve it by this integer (something like a Map<Integer, String>)? This would be something called perfect hashing and this can be done on top of an automaton (fairly easily). I assume the data structure is immutable once constructed and does not change too often, right?

Dawid

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
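For concreteness, here is the kind of two-way mapping being asked for, sketched in plain Java over a sorted array. The class and method names are made up for illustration (this is not Lucene's FST API); an FST-based version would expose the same two operations while sharing prefixes and suffixes to cut memory.

import java.util.Arrays;

// Naive baseline for the mapping discussed above: a sorted, de-duplicated
// array of terms. ord() is a perfect hash over the fixed term set and
// lookup() is the reverse direction being asked about.
final class SortedTermMap {
    private final String[] sortedTerms; // sorted, no duplicates, immutable

    SortedTermMap(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    int ord(String term) {              // term -> 0..N-1, negative if absent
        return Arrays.binarySearch(sortedTerms, term);
    }

    String lookup(int ord) {            // ordinal -> term
        return sortedTerms[ord];
    }
}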
[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs
[ https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036028#comment-13036028 ] Simon Willnauer commented on LUCENE-2981: - bq. +1 to slash and burn. +1 go for it! Review and potentially remove unused/unsupported Contribs - Key: LUCENE-2981 URL: https://issues.apache.org/jira/browse/LUCENE-2981 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Fix For: 3.2, 4.0 Attachments: LUCENE-2981.patch Some of our contribs appear to be lacking for development/support or are missing tests. We should review whether they are even pertinent these days and potentially deprecate and remove them. One of the things we did in Mahout when bringing in Colt code was to mark all code that didn't have tests as @deprecated and then we removed the deprecation once tests were added. Those that didn't get tests added over about a 6 mos. period of time were removed. I would suggest taking a hard look at: ant db lucli swing (spatial should be gutted to some extent and moved to modules) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036038#comment-13036038 ] Simon Willnauer commented on SOLR-1942: --- some minor comments: * s/nothit/nothing in // make sure we use the default if nothit is configured * add javadoc to CodecProvider#hasFieldCodec(String) * SchemaCodecProvider should maybe add its name in toString() and not just delegate * Maybe we should note in the CHANGES.TXT that IndexReaderFactory now has a CodecProvider that should be passed to IR#open() otherwise it looks good though! Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Lucene/Solr JIRA
On Wed, May 18, 2011 at 10:53 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : just a few words. I disagree here with you hoss IMO the suggestion to
> : merge JIRA would help to move us closer together and help close the
> : gap between Solr and Lucene. I think we need to start identifying us
> : with what we work on. It feels like we don't do that today and we
> : should work hard to stop that and make hard breaks that might hurt but
>
> I just don't see how you think that would help anything ... we still need to distinguish Jira issues to identify where in the stack they affect. If there is a divide among the developers because of the niches where they tend to work, will that divide magically go away because we partition all issues using the component feature instead of by the Jira project feature? I don't really see how that makes any sense.
>
> Even if we all thought it did, and even if the cost/effort of migrating/converting were totally free, the user bases (who interact with the Solr APIs vs directly using the Lucene-Core/Module APIs) are so distinct that I genuinely think sticking with distinct Jira Projects makes more sense for our users.
>
> : JIRA. I'd go even further and nuke the name entirely and call
> : everything lucene - I know not many folks like the idea and it might
> : take a while to bake in but I think for us (PMC / Committers) and the
>
> Everything already is called Lucene ... the Project is Apache Lucene, the community is Lucene ... the Lucene project currently releases several products, and one of them is called Apache Solr ... if you're suggesting that we should ultimately eliminate the name Solr then we'd still have to decide what we're going to call that end product, the artifact that we ship that provides the abstraction layer that Solr currently provides.
>
> Even if you mean to suggest that we should only have one unified product -- one singular release artifact -- that abstraction layer still needs a name. The name we have now is Solr, it has brand awareness and a user base who understands what it means to say they are Installing Solr or that a new feature is available when Using Solr. Eliminating that name doesn't seem like it would benefit the user community in any way.

What I was saying / trying to say is that we as the community should move closer together. In all our minds, especially in the users' minds, Solr is a Project and Lucene is a Project. If we'd start over I would propose something like Lucene-httpd or something similar. But don't get me wrong, I just went one step further than Shai since I think his idea made sense.

I don't think all that would be a big issue to users - they use the http interface and they don't give a shit if it's called solr or not. For us I think it makes a big difference. In our minds though. I agree with you that solr is a product and lucene is the project but we should enforce this. Like all namespaces say o.a.solr not o.a.lucene.solr so it implies we are two projects which is not true. I am not sure how we should proceed here but to change our minds we must change facts.

Just my opinion.

simon

> -Hoss

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
You cannot get a string out of an automaton by its ordinal without storing additional data. The string is stored there not as a single arc, but as a sequence of them (basically.. err.. as a string), so referencing them is basically writing the string as-is. Space savings here come from sharing arcs between strings.

Though, it's possible to do if you associate an additional number with each node. (I invented some way, shared it with Mike and forgot.. good grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer> that accepts a predefined set of N strings and returns an int in the 0..N-1 interval. And it can't do the reverse lookup, by design - that's a lossy compression for all good perfect hashing algos. So, it's irrelevant here, huh?

On Thu, May 19, 2011 at 08:53, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:
> I've been pondering how to reduce the size of FieldCache entries when there are a large number of Strings. I'd like to facet on such a field with Solr but with less memory. As I understand it, FSTs are a highly compressed representation of a set of Strings (among other possibilities). The fieldCache would need to point to an FST entry (an arc?) using something small, say an integer. Is there a way to point to an FST entry with an integer, and then somehow with relative efficiency construct the String from the arcs to get there?
>
> ~ David Smiley
> -
> Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
> --
> View this message in context: http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2960030.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
> - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
> Though, it's possible to do if you associate an additional number with each node. (I invented some way, shared it with Mike and forgot.. good grief :/)

It doesn't need to be invented -- it's a known technique. On each arc you store the number of strings under that arc; while traversing you accumulate -- this gives you a unique number for each string (perfect hash) and a way to locate a string given its number.

> And it can't do the reverse lookup, by design, that's a lossy compression for all good perfect hashing algos. So, it's irrelevant here, huh?

You can do both the way I described above. Jan Daciuk has details on many more variants of doing that:

Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal Bottom-up Deterministic Tree Automata, Intelligent Information Systems XVI, Proceedings of the International IIS'08 Conference held in Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.), Academic Publishing House Exit, Warszawa 2008.

Dawid

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
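A toy illustration of the counting technique described above, in plain Java. This is not Lucene's FST code and all names are made up: each node records how many terms are reachable through it, and looking up a term by its ordinal is a descent that subtracts counts.

import java.util.Map;
import java.util.TreeMap;

// Toy trie, not an FST: every node stores the number of terms reachable
// through it, which is enough to resolve an ordinal back to its term.
final class CountingTrie {
    static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<Character, Node>();
        int termCount;     // number of terms reachable via this node
        boolean isTerm;    // a term ends exactly at this node
    }

    private final Node root = new Node();

    void add(String term) {
        Node node = root;
        node.termCount++;
        for (char c : term.toCharArray()) {
            Node next = node.arcs.get(c);
            if (next == null) {
                next = new Node();
                node.arcs.put(c, next);
            }
            next.termCount++;
            node = next;
        }
        node.isTerm = true;
    }

    // Resolve the ord-th term (0-based, in lexicographic order).
    String lookup(int ord) {
        StringBuilder sb = new StringBuilder();
        Node node = root;
        while (true) {
            if (node.isTerm) {
                if (ord == 0) return sb.toString();
                ord--;                                 // skip the term ending here
            }
            Node next = null;
            for (Map.Entry<Character, Node> e : node.arcs.entrySet()) {
                if (ord < e.getValue().termCount) {    // the ord-th term lies below this arc
                    sb.append(e.getKey().charValue());
                    next = e.getValue();
                    break;
                }
                ord -= e.getValue().termCount;         // skip this whole subtree
            }
            if (next == null) throw new IndexOutOfBoundsException("ord out of range");
            node = next;
        }
    }
}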
[jira] [Created] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez

I have the same issue described here: http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none

Looks like the highlighting code changed. Using the example doc, with 1.4 I get:

http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true

highlighting:{ UTF8TEST:{ features:[eaiou with <em>circumflexes</em>: êâîôû]}}}

With 3.1, this now looks like:

highlighting:{ UTF8TEST:{ features:[eaiou with <em>circumflexes</em>: &#234;&#226;&#238;&#244;&#251;]}}}

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] olivier soyez updated LUCENE-3119: -- Description: I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: {noformat} #234;#226;#238;#244;#251;]}}} {noformat} was: I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: {noformat} #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] olivier soyez updated LUCENE-3119: -- Priority: Minor (was: Major) Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez Priority: Minor I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] olivier soyez updated LUCENE-3119: -- Description: I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} was: I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: {noformat} #234;#226;#238;#244;#251;]}}} {noformat} Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true
span query matches too many docs when two query terms are the same unless inOrder=true -- Key: LUCENE-3120 URL: https://issues.apache.org/jira/browse/LUCENE-3120 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Doron Cohen Priority: Minor Fix For: 3.2, 4.0

Spinoff of user list discussion - [SpanNearQuery - inOrder parameter|http://markmail.org/message/i4cstlwgjmlcfwlc].

With 3 documents:
* a b x c d
* a b b d
* a b x b y d

Here are a few queries (the number in parenthesis indicates expected #hits):

These ones work *as expected*:
* (1) in-order, slop=0, b, x, b
* (1) in-order, slop=0, b, b
* (2) in-order, slop=1, b, b

These ones match *too many* hits:
* (1) any-order, slop=0, b, x, b
* (1) any-order, slop=1, b, x, b
* (1) any-order, slop=2, b, x, b
* (1) any-order, slop=3, b, x, b

These ones match *too many* hits as well:
* (1) any-order, slop=0, b, b
* (2) any-order, slop=1, b, b

Each of the above passes when using a phrase query (applying the slop, no in-order indication in phrase query). This seems related to a known overlapping spans issue - [non-overlapping Span queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, so we might decide to close this bug after all, but I would like to at least have the junit that exposes the behavior in JIRA.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
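For readers who want to reproduce this without the attached patch, a rough sketch of such a test against the stock 3.x span APIs could look like the following. This is not the actual LUCENE-3120.patch; the class name is invented and the expected counts follow the description above.

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SloppySpanRepeatsSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_31, new WhitespaceAnalyzer(Version.LUCENE_31)));
        for (String text : new String[] {"a b x c d", "a b b d", "a b x b y d"}) {
            Document doc = new Document();
            doc.add(new Field("f", text, Field.Store.NO, Field.Index.ANALYZED));
            w.addDocument(doc);
        }
        w.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        SpanQuery b = new SpanTermQuery(new Term("f", "b"));
        SpanQuery x = new SpanTermQuery(new Term("f", "x"));

        // in-order, slop=0, "b x b": only "a b x b y d" should match
        SpanNearQuery inOrder = new SpanNearQuery(new SpanQuery[] {b, x, b}, 0, true);
        // any-order, slop=0, "b x b": also expected to match 1 doc, but matches too many
        SpanNearQuery anyOrder = new SpanNearQuery(new SpanQuery[] {b, x, b}, 0, false);

        System.out.println("in-order hits:  " + searcher.search(inOrder, 10).totalHits);  // expected 1
        System.out.println("any-order hits: " + searcher.search(anyOrder, 10).totalHits); // expected 1
        searcher.close();
    }
}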
[jira] [Updated] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true
[ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-3120: Attachment: LUCENE-3120.patch Attached test case demonstrating the bug. span query matches too many docs when two query terms are the same unless inOrder=true -- Key: LUCENE-3120 URL: https://issues.apache.org/jira/browse/LUCENE-3120 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3120.patch spinoff of user list discussion - [SpanNearQuery - inOrder parameter|http://markmail.org/message/i4cstlwgjmlcfwlc]. With 3 documents: * a b x c d * a b b d * a b x b y d Here are a few queries (the number in parenthesis indicates expected #hits): These ones work *as expected*: * (1) in-order, slop=0, b, x, b * (1) in-order, slop=0, b, b * (2) in-order, slop=1, b, b These ones match *too many* hits: * (1) any-order, slop=0, b, x, b * (1) any-order, slop=1, b, x, b * (1) any-order, slop=2, b, x, b * (1) any-order, slop=3, b, x, b These ones match *too many* hits as well: * (1) any-order, slop=0, b, b * (2) any-order, slop=1, b, b Each of the above passes when using a phrase query (applying the slop, no in-order indication in phrase query). This seems related to a known overlapping spans issue - [non-overlapping Span queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, so we might decide to close this bug after all, but I would like to at least have the junit that exposes the behavior in JIRA. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true
[ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036080#comment-13036080 ] Greg Tarr commented on LUCENE-3120: --- Thanks for raising this. span query matches too many docs when two query terms are the same unless inOrder=true -- Key: LUCENE-3120 URL: https://issues.apache.org/jira/browse/LUCENE-3120 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3120.patch spinoff of user list discussion - [SpanNearQuery - inOrder parameter|http://markmail.org/message/i4cstlwgjmlcfwlc]. With 3 documents: * a b x c d * a b b d * a b x b y d Here are a few queries (the number in parenthesis indicates expected #hits): These ones work *as expected*: * (1) in-order, slop=0, b, x, b * (1) in-order, slop=0, b, b * (2) in-order, slop=1, b, b These ones match *too many* hits: * (1) any-order, slop=0, b, x, b * (1) any-order, slop=1, b, x, b * (1) any-order, slop=2, b, x, b * (1) any-order, slop=3, b, x, b These ones match *too many* hits as well: * (1) any-order, slop=0, b, b * (2) any-order, slop=1, b, b Each of the above passes when using a phrase query (applying the slop, no in-order indication in phrase query). This seems related to a known overlapping spans issue - [non-overlapping Span queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, so we might decide to close this bug after all, but I would like to at least have the junit that exposes the behavior in JIRA. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
We should add this (lookup by value, when value is guaranteed to monotonically increase as the key increases) to our core FST APIs? It's generically useful in many places ;) I'll open an issue.

EG this would also enable an FST terms index that supports lookup-by-ord, something VariableGapTermsIndex (this is the one that uses FST for the index) does not support today.

David, one thing to remember is trunk has already seen drastic reductions in the RAM required to store DocTerms/Index vs 3.x (something maybe we should backport to 3.x...). The bytes for the terms are now stored as shared byte[] blocks, and the ords/offsets are stored as packed ints, so we no longer have per-String memory pointer overhead. I describe the gains here: http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html -- though, those gains include RAM reduction from the terms index as well.

While FST w/ lookup-by-monotonic-value would work here, I would be worried about the perf hit of that representation vs what DocTerms/Index offers today... we should test to see. Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating field cache (ie, in-memory storage option of using an FST vs using packed ints/byte[]).

Mike

http://blog.mikemccandless.com

On Thu, May 19, 2011 at 4:43 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
>> Though, it's possible to do if you associate an additional number with each node. (I invented some way, shared it with Mike and forgot.. good grief :/)
>
> It doesn't need to be invented -- it's a known technique. On each arc you store the number of strings under that arc; while traversing you accumulate -- this gives you a unique number for each string (perfect hash) and a way to locate a string given its number.
>
>> And it can't do the reverse lookup, by design, that's a lossy compression for all good perfect hashing algos. So, it's irrelevant here, huh?
>
> You can do both the way I described above. Jan Daciuk has details on many more variants of doing that:
>
> Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal Bottom-up Deterministic Tree Automata, Intelligent Information Systems XVI, Proceedings of the International IIS'08 Conference held in Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.), Academic Publishing House Exit, Warszawa 2008.
>
> Dawid
>
> - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
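As a rough illustration of the layout described above (a sketch only, not the actual FieldCache DocTerms/DocTermsIndex code; the real implementation uses paged byte blocks and packed ints rather than the plain arrays shown here):

import java.nio.charset.StandardCharsets;

// Sketch: all term bytes live in one shared pool; per-term offsets and
// per-document ords are plain int[] here where the real code packs them.
final class DocTermsIndexSketch {
    private final byte[] termBytes;   // concatenated UTF-8 bytes of all terms, in ord order
    private final int[] termOffsets;  // termOffsets[ord]..termOffsets[ord+1] delimit term `ord`
    private final int[] docToOrd;     // per-document ord

    DocTermsIndexSketch(byte[] termBytes, int[] termOffsets, int[] docToOrd) {
        this.termBytes = termBytes;
        this.termOffsets = termOffsets;
        this.docToOrd = docToOrd;
    }

    // Resolve a document's term value without any per-String object overhead.
    String getTerm(int docId) {
        int ord = docToOrd[docId];
        int start = termOffsets[ord];
        return new String(termBytes, start, termOffsets[ord + 1] - start, StandardCharsets.UTF_8);
    }
}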
[jira] [Created] (LUCENE-3121) FST should offer lookup-by-output API when output strictly increases
FST should offer lookup-by-output API when output strictly increases Key: LUCENE-3121 URL: https://issues.apache.org/jira/browse/LUCENE-3121 Project: Lucene - Java Issue Type: Improvement Components: core/other Reporter: Michael McCandless Fix For: 4.0 Spinoff from FST and FieldCache java-dev thread http://lucene.markmail.org/thread/swoawlv3fq4dntvl FST is able to associate arbitrary outputs with the sorted input keys, but in the special (and, common) case where the function is strictly monotonic (each output only increases vs prior outputs), such as mapping to term ords or mapping to file offsets in the terms dict, we should offer a lookup-by-output API that efficiently walks the FST and locates input key (exact or floor or ceil) matching that output. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
On Thu, May 19, 2011 at 6:16 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
>> We should add this (lookup by value, when value is guaranteed to monotonically increase as the key increases) to our core FST APIs? It's generically useful in many places ;) I'll open an issue.
>
> The data structure itself should sort of build itself if you create an FST with increasing integers because the shared suffix should be pushed towards the root anyway, so the only thing would be to correct values on all outgoing arcs (they need to contain the count of leaves on the subtree) but then, this may be tricky if arc values are vcoded... I'd have to think how to do this.

I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :)

>> While FST w/ lookup-by-monotonic-value would work here, I would be worried about the perf hit of that representation vs what
>
> There are actually two things: a) performance; you need to descend in the automaton and some bookkeeping to maintain the count of nodes; this adds overhead, b) size; the procedure for storing/calculating perfect hashes I described requires leaf counts on each arc and these are usually large integers. Even vcoded they bloat the resulting data structure. Maybe we should iterate on the issue to get down to the specifics?

I had thought there wouldn't be any backtracking, if the FST had stored the ord as an output...

Mike

http://blog.mikemccandless.com

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036104#comment-13036104 ] Michael McCandless commented on SOLR-2519: --

+1 to naming these fields text_example_XXX. That's a great idea Jan. I'll do that in my next patch...

Improve the defaults for the text field type in default schema.xml Key: SOLR-2519 URL: https://issues.apache.org/jira/browse/SOLR-2519 Project: Solr Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.2, 4.0 Attachments: SOLR-2519.patch

Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5

The text fieldType in schema.xml is unusable for non-whitespace languages, because it has the dangerous auto-phrase feature (of Lucene's QP -- see LUCENE-2458) enabled. Lucene leaves this off by default, as does ElasticSearch (http://www.elasticsearch.org/). Furthermore, the text fieldType uses WhitespaceTokenizer when StandardTokenizer is a better cross-language default.

Until we have language specific field types, I think we should fix the text fieldType to work well for all languages, by:
* Switching from WhitespaceTokenizer to StandardTokenizer
* Turning off auto-phrase

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
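For reference, the Lucene-level switch behind the auto-phrase behavior (from LUCENE-2458) looks roughly like this; Solr exposes the same toggle per fieldType in schema.xml. This is a sketch only, and the exact parsed output depends on the analyzer and field name chosen here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

public class AutoPhraseSketch {
    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser(Version.LUCENE_31, "text",
                new StandardAnalyzer(Version.LUCENE_31));

        // Lucene's default: a term that analyzes to several tokens becomes a BooleanQuery.
        qp.setAutoGeneratePhraseQueries(false);
        System.out.println(qp.parse("wi-fi"));   // typically: text:wi text:fi

        // The "auto-phrase" behavior: the same tokens are silently forced into a PhraseQuery,
        // which is what hurts non-whitespace languages whose tokenizers emit many tokens per word.
        qp.setAutoGeneratePhraseQueries(true);
        System.out.println(qp.parse("wi-fi"));   // typically: text:"wi fi"
    }
}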
[jira] [Commented] (SOLR-2526) Grouping on multiple fields
[ https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036105#comment-13036105 ] Michael McCandless commented on SOLR-2526: -- Martin, that's a great point -- once we've factored out FunctionQuery, it should be easy to make an FQ (does one already exist?) that holds an N-tuple of other FQ values. Grouping on multiple fields --- Key: SOLR-2526 URL: https://issues.apache.org/jira/browse/SOLR-2526 Project: Solr Issue Type: New Feature Components: search Affects Versions: 4.0 Reporter: Arian Karbasi Priority: Minor Grouping on multiple fields and/or ranges should be an option (X,Y) groupings. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
> I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :)

If you put the ord as an output the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but will not work if you need to look up the string from its value. The latter case can be solved if you know which branch to take while descending from root and the shared prefix alone won't give you this information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... the one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example if you have a node like this and seek for the 12-th term:

0 -- 10 -- ...
 +- 10 -- ...
 +- 5 -- ..

you look at the first path, it'd give you terms 1..10, then the next one contains terms 11..20 so you add 10 to an internal counter which is added to further computations, descend and repeat the procedure until you find a leaf node.

Dawid
[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036107#comment-13036107 ] Doron Cohen commented on LUCENE-3068: - Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this is a great tool!) I see no particular slowdown at the last runs. A thought about these benchmarks, it would be helpful if the checked revision would be shown - perhaps as part of the hover text when hovering the mouse on a graph point... The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3122) Cascaded grouping
Cascaded grouping - Key: LUCENE-3122 URL: https://issues.apache.org/jira/browse/LUCENE-3122 Project: Lucene - Java Issue Type: Improvement Components: modules/grouping Reporter: Michael McCandless Fix For: 3.2, 4.0

Similar to SOLR-2526, in that you are grouping on 2 separate fields, but instead of treating those fields as a single grouping by a compound key, this change would let you first group on key1 for the primary groups and then secondarily on key2 within the primary groups. Ie, the result you get back would have groups A, B, C (grouped by key1) but then the documents within group A would be grouped by key2.

I think this will be important for apps whose documents are the product of denormalizing, ie where the Lucene document is really a sub-document of a different identifier field. Borrowing an example from LUCENE-3097, you have doctors but each doctor may have multiple offices (addresses) where they practice, and so you index doctor X address as your lucene documents. In this case, your identifier field (that which counts for facets, and should be grouped for presentation) is doctorid. When you offer users search over this index, you'd likely want to 1) group by distance (ie, 0.1 miles, 0.2 miles, etc., as a function query), but 2) also group by doctorid, ie cascaded grouping.

I suspect this would be easier to implement than it sounds: the per-group collector used by the 2nd pass grouping collector for key1's grouping just needs to be another grouping collector. Spookily, though, that collection would also have to be 2-pass, so it could get tricky since grouping is sort of recursing on itself. Once we have LUCENE-3112, though, that should enable efficient single pass grouping by the identifier (doctorid).

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
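Structurally, the idea amounts to something like the following hand-wavy sketch: the collector handed to each key1 group is itself another grouping collector, this time over key2. All names here are hypothetical (this is not the grouping module's API) and it ignores the two-pass nature of real grouping noted above.

import java.util.HashMap;
import java.util.Map;

// Hypothetical interfaces, for illustration only.
interface PerGroupCollector {
    void collect(int doc, Object groupValue);
}

final class CascadedGroupingCollector implements PerGroupCollector {
    interface InnerCollectorFactory {
        PerGroupCollector newCollector();
    }

    // one inner (key2) grouping collector per outer (key1) group
    private final Map<Object, PerGroupCollector> innerByKey1 = new HashMap<Object, PerGroupCollector>();
    private final InnerCollectorFactory innerFactory;

    CascadedGroupingCollector(InnerCollectorFactory innerFactory) {
        this.innerFactory = innerFactory;
    }

    // key1Value would be e.g. the distance bucket; key2 (e.g. doctorid) is looked up per doc.
    public void collect(int doc, Object key1Value) {
        PerGroupCollector inner = innerByKey1.get(key1Value);
        if (inner == null) {
            inner = innerFactory.newCollector();
            innerByKey1.put(key1Value, inner);
        }
        inner.collect(doc, lookUpKey2(doc));
    }

    private Object lookUpKey2(int doc) {
        // stand-in for a FieldCache / FunctionQuery lookup of the secondary key
        throw new UnsupportedOperationException("hypothetical helper");
    }
}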
[jira] [Issue Comment Edited] (SOLR-2526) Grouping on multiple fields
[ https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036105#comment-13036105 ] Michael McCandless edited comment on SOLR-2526 at 5/19/11 10:47 AM: Martijn, that's a great point -- once we've factored out FunctionQuery, it should be easy to make an FQ (does one already exist?) that holds an N-tuple of other FQ values. was (Author: mikemccand): Martin, that's a great point -- once we've factored out FunctionQuery, it should be easy to make an FQ (does one already exist?) that holds an N-tuple of other FQ values. Grouping on multiple fields --- Key: SOLR-2526 URL: https://issues.apache.org/jira/browse/SOLR-2526 Project: Solr Issue Type: New Feature Components: search Affects Versions: 4.0 Reporter: Arian Karbasi Priority: Minor Grouping on multiple fields and/or ranges should be an option (X,Y) groupings. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036109#comment-13036109 ] Michael McCandless commented on LUCENE-3068: bq. A thought about these benchmarks, it would be helpful if the checked revision would be shown - perhaps as part of the hover text when hovering the mouse on a graph point.. Good idea! I'll try to do this... Note that if you go back to the root page, and click on a given day, it tells you the svn rev and also hg ref (of luceneutil), so that's a [cumbersome] way to get the svn rev. The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036111#comment-13036111 ] Doron Cohen commented on LUCENE-3068: - bq. Note that if you go back to the root page, and click on a given day, it tells you the svn rev and also hg ref (of luceneutil) Great, thanks! So, this commit to trunk in r1124293 falls between these two: - Tue 17/05/2011 Lucene/Solr trunk rev 1104671 - Wed 18/05/2011 Lucene/Solr trunk rev 1124524 ... No measurable degradation, good! The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036124#comment-13036124 ] Jörn Kottmann commented on LUCENE-2899: --- The first release is now out. I guess you will use maven for dependency management, you can find here how to add the released version as a dependency: http://incubator.apache.org/opennlp/maven-dependency.html Add OpenNLP Analysis capabilities as a module - Key: LUCENE-2899 URL: https://issues.apache.org/jira/browse/LUCENE-2899 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Grant Ingersoll Priority: Minor Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does: * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens) * NamedEntity recognition as a TokenFilter We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. I'd propose it go under: modules/analysis/opennlp -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #127: POMs out of sync
Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/127/ No tests ran. Build Log (for compile errors): [...truncated 15564 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036140#comment-13036140 ] Simon Willnauer commented on LUCENE-3068: - bq. Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this is a great tool!) I see no particular slowdown at the last runs. I love it! good that all the work on LuceneUtil pays off! The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036142#comment-13036142 ] Greg Tarr commented on LUCENE-1877: --- Yes, we have multiple machines being able to write to the same index on the SAN. Use NativeFSLockFactory as default for new API (direct ctors FSDir.open) -- Key: LUCENE-1877 URL: https://issues.apache.org/jira/browse/LUCENE-1877 Project: Lucene - Java Issue Type: Improvement Components: general/javadocs Reporter: Mark Miller Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch A user requested we add a note in IndexWriter alerting the availability of NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm exit). Seems reasonable to me - we want users to be able to easily stumble upon this class. The below code looks like a good spot to add a note - could also improve whats there a bit - opening an IndexWriter does not necessarily create a lock file - that would depend on the LockFactory used. {code} pOpening an codeIndexWriter/code creates a lock file for the directory in use. Trying to open another codeIndexWriter/code on the same directory will lead to a {@link LockObtainFailedException}. The {@link LockObtainFailedException} is also thrown if an IndexReader on the same directory is used to delete documents from the index./p{code} Anyone remember why NativeFSLockFactory is not the default over SimpleFSLockFactory? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
>> I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :)
>
> If you put the ord as an output the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but will not work if you need to look up the string from its value. The latter case can be solved if you know which branch to take while descending from root and the shared prefix alone won't give you this information. At least I don't see how it could.
>
> I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... the one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example if you have a node like this and seek for the 12-th term:
>
> 0 -- 10 -- ...
>  +- 10 -- ...
>  +- 5 -- ..
>
> you look at the first path, it'd give you terms 1..10, then the next one contains terms 11..20 so you add 10 to an internal counter which is added to further computations, descend and repeat the procedure until you find a leaf node.
>
> Dawid

There's a possible speedup here. If, instead of storing the count of all downstream leaves, you store the sum of counts for all previous siblings, you can do a binary lookup instead of linear scan on each node. Taking your example:

0 -- 0 -- ...
 +- 10 -- ...   (we know that for 12-th term we should descend along this edge, as it has the biggest tag less than 12)
 +- 15 -- ...

That's what I invented, and yes, it was invented by countless people before :)

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
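In code, the per-node binary lookup described above could look roughly like this. It is a toy structure, not Lucene's FST, and it assumes every child subtree contains at least one term.

import java.util.Arrays;

// Each node stores, for child i, the number of terms under children 0..i-1
// ("the sum of counts for all previous siblings"), so the child holding a
// given ordinal is found by binary search instead of a linear scan.
final class PrefixSumNode {
    char[] labels;              // outgoing arc labels, sorted
    PrefixSumNode[] children;   // children[i] is reached via labels[i]
    int[] precedingCounts;      // precedingCounts[i] = #terms under children[0..i-1]; [0] == 0
    boolean isTerm;             // a term may end at this node, counted before any child

    // Index of the child whose subtree contains the ord-th term below this node
    // (call only after the isTerm case has been handled and ord adjusted).
    int childFor(int ord) {
        int i = Arrays.binarySearch(precedingCounts, ord);
        // exact hit: ord is the first term under children[i]; otherwise take the
        // insertion point minus one, i.e. the largest precedingCounts[] <= ord.
        return i >= 0 ? i : -i - 2;
    }
}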
[jira] [Commented] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036150#comment-13036150 ] Koji Sekiguchi commented on LUCENE-3119: I can reproduce this, but it is due to HtmlEncoder in solrconfig.xml (I've mentioned it in the mail thread), and not code change. Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez Priority: Minor I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
This (storing sums) is, I think, exactly what the FST stores as outputs on the arcs. Ie, it bounds the range of outputs if you were to recurse on that arc. So, from any node, we can unambiguously determine which arc to recurse on, when looking up by value (only if the value is strictly monotonic). It should be straightforward to implement, ie should not require any additional data structure / storage in the FST. It's a lookup-only change, I think. Mike http://blog.mikemccandless.com On Thu, May 19, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote: I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :) If you put the ord as an output the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but will not work if you need to look up the string from its value. The latter case can be solved if you know which branch to take while descending from root and the shared prefix alone won't give you this information. At least I don't see how it could. I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... the one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example if you have a node like this and seek for the 12-th term: 0 -- 10 -- ... +- 10 -- ... +- 5 -- .. you look at the first path, it'd give you terms 1..10, then the next one contains terms 11..20 so you add 10 to an internal counter which is added to further computations, descend and repeat the procedure until you find a leaf node. Dawid There's a possible speedup here. If, instead of storing the count of all downstream leaves, you store the sum of counts for all previous siblings, you can do a binary lookup instead of linear scan on each node. Taking your example: 0 -- 0 -- ... +- 10 -- ... We know that for 12-th term we should descend along this edge, as it has the biggest tag less than 12. +- 15 -- ... That's what I invented, and yes, it was invented by countless people before :) -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
That's what I invented, and yes, it was invented by countless people before :) You know I didn't mean to sound rude, right? I'm really admiring your ability to come up with these solutions by yourself, I'm merely copying other folks' ideas. Anyway, the optimization you're describing is sure possible. Lucene's FST implementation can actually combine both approaches because always expanding nodes is inefficient and those already expanded will allow a binary search (assuming the automaton structure is known to the implementation). Another refinement of this idea creates a detached table (err.. index :) of states to start from inside the automaton, so that you don't have to go through the initial 2-3 states which are more or less always large and even binary search is costly there. Dawid
[jira] [Created] (SOLR-2528) remove HtmlEncoder (or set it to default=false) from example solrconfig.xml
remove HtmlEncoder (or set it to default=false) from example solrconfig.xml --- Key: SOLR-2528 URL: https://issues.apache.org/jira/browse/SOLR-2528 Project: Solr Issue Type: Bug Components: highlighter Affects Versions: 3.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 3.1.1, 3.2, 4.0 After 3.1 released, highlight snippets that include non ascii characters are encoded to character references by HtmlEncoder if it is set in solrconfig.xml. Because solr example config has it, not a few users got confused by the output. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2528) remove HtmlEncoder from example solrconfig.xml (or set it to default=false)
[ https://issues.apache.org/jira/browse/SOLR-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated SOLR-2528: - Summary: remove HtmlEncoder from example solrconfig.xml (or set it to default=false) (was: remove HtmlEncoder (or set it to default=false) from example solrconfig.xml) remove HtmlEncoder from example solrconfig.xml (or set it to default=false) --- Key: SOLR-2528 URL: https://issues.apache.org/jira/browse/SOLR-2528 Project: Solr Issue Type: Bug Components: highlighter Affects Versions: 3.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 3.1.1, 3.2, 4.0 After 3.1 released, highlight snippets that include non ascii characters are encoded to character references by HtmlEncoder if it is set in solrconfig.xml. Because solr example config has it, not a few users got confused by the output. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files
TestIndexWriter.testBackgroundOptimize fails with too many open files - Key: LUCENE-3123 URL: https://issues.apache.org/jira/browse/LUCENE-3123 Project: Lucene - Java Issue Type: Bug Components: core/index Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2 Reporter: Doron Cohen Recreate with this line: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 Might be related to LUCENE-2873 ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2528) remove HtmlEncoder from example solrconfig.xml (or set it to default=false)
[ https://issues.apache.org/jira/browse/SOLR-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated SOLR-2528: - Attachment: SOLR-2528.patch remove HtmlEncoder from example solrconfig.xml (or set it to default=false) --- Key: SOLR-2528 URL: https://issues.apache.org/jira/browse/SOLR-2528 Project: Solr Issue Type: Bug Components: highlighter Affects Versions: 3.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 3.1.1, 3.2, 4.0 Attachments: SOLR-2528.patch After 3.1 released, highlight snippets that include non ascii characters are encoded to character references by HtmlEncoder if it is set in solrconfig.xml. Because solr example config has it, not a few users got confused by the output. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036164#comment-13036164 ] Koji Sekiguchi commented on LUCENE-3119: I opened SOLR-2528. Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez Priority: Minor I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files
[ https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036163#comment-13036163 ] Doron Cohen commented on LUCENE-3123: - This is on Ubuntu btw. Run log: {noformat} NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 The following exceptions were thrown by threads: *** Thread: Lucene Merge Thread #0 *** org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: /tmp/test4907593285402510583tmp/_51_0.sd (Too many open files) at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:507) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:472) Caused by: java.io.FileNotFoundException: /tmp/test4907593285402510583tmp/_51_0.sd (Too many open files) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.init(RandomAccessFile.java:233) at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.init(SimpleFSDirectory.java:69) at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.init(SimpleFSDirectory.java:90) at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:56) at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:337) at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:402) at org.apache.lucene.index.codecs.mockrandom.MockRandomCodec.fieldsProducer(MockRandomCodec.java:236) at org.apache.lucene.index.PerFieldCodecWrapper$FieldsReader.init(PerFieldCodecWrapper.java:113) at org.apache.lucene.index.PerFieldCodecWrapper.fieldsProducer(PerFieldCodecWrapper.java:210) at org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReader.java:131) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:495) at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:635) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3260) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2930) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:379) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:447) NOTE: test params are: codec=RandomCodecProvider: {field=MockRandom}, locale=nl_NL, timezone=Turkey NOTE: all tests run in this JVM: [TestIndexWriter] NOTE: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2,free=26480072,total=33468416 {noformat} TestIndexWriter.testBackgroundOptimize fails with too many open files - Key: LUCENE-3123 URL: https://issues.apache.org/jira/browse/LUCENE-3123 Project: Lucene - Java Issue Type: Bug Components: core/index Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2 Reporter: Doron Cohen Recreate with this line: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 Might be related to LUCENE-2873 ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: That's what I invented, and yes, it was invented by countless people before :) You know I didn't mean to sound rude, right? I'm really admiring your ability to come up with these solutions by yourself, I'm merely copying other folks' ideas. I tried to prevent another reference to mr. Daciuk :) Anyway, the optimization you're describing is sure possible. Lucene's FST implementation can actually combine both approaches because always expanding nodes is inefficient and those already expanded will allow a binary search (assuming the automaton structure is known to the implementation). Another refinement of this idea creates a detached table (err.. index :) of states to start from inside the automaton, so that you don't have to go through the initial 2-3 states which are more or less always large and even binary search is costly there. Dawid But you have to lookup this err..index somehow. And that's either binary or hash lookup. Where's the win? -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
2011/5/19 Michael McCandless luc...@mikemccandless.com: Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating field cache (ie, in-memory storage option of using an FST vs using packed ints/byte[]). or should we actually try to have different fieldcacheimpls? I see all these missions to refactor the thing, which always fail. maybe thats because we have one huge monolithic implementation. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
I tried to prevent another reference to mr. Daciuk :) Why? He's a very nice guy :) But you have to lookup this err..index somehow. And that's either binary or hash lookup. Where's the win? You can do a sparse O(1) index and have a slight gain from these few large initial states. This only makes sense if you perform tons of these lookups, really. Mike's right -- the FST will output a structure that is ready to be used for by-number retrieval of strings (or anything else) as long as the numbers are strictly monotonic (and preferably contiguous). The output will be what you're suggesting, Earwin -- accumulated sums. Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
maybe thats because we have one huge monolithic implementation Doesn't the DocValues branch solve this? Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bear fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are significant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable? On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote: 2011/5/19 Michael McCandless luc...@mikemccandless.com: Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating field cache (ie, in-memory storage option of using an FST vs using packed ints/byte[]). or should we actually try to have different fieldcacheimpls? I see all these missions to refactor the thing, which always fail. maybe thats because we have one huge monolithic implementation. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
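As a rough illustration of the MMap idea, the sketch below reads fixed-width packed values directly out of a memory-mapped file. The on-disk layout (values packed LSB-first in little-endian order) and the class name are assumptions made for the example, not the actual FieldCache/DocValues format, and a real implementation would map large files in chunks the way MMapDirectory does.
{noformat}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical reader of fixed-bit-width packed values from a memory-mapped file. */
public class MMappedPackedInts {
  private final MappedByteBuffer buf;   // the OS pages this in and out on demand
  private final int bitsPerValue;       // assumed <= 57 for the simple read below

  public MMappedPackedInts(File file, int bitsPerValue) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    try {
      FileChannel ch = raf.getChannel();
      this.buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    } finally {
      raf.close();
    }
    this.bitsPerValue = bitsPerValue;
  }

  /** Returns the index-th value without any heap copy of the underlying data. */
  public long get(long index) {
    long bitPos = index * bitsPerValue;
    int bytePos = (int) (bitPos >>> 3);
    int bitOffset = (int) (bitPos & 7);
    long word = 0;
    for (int i = 0; i < 8 && bytePos + i < buf.limit(); i++) {
      word |= (buf.get(bytePos + i) & 0xFFL) << (8 * i);
    }
    return (word >>> bitOffset) & ((1L << bitsPerValue) - 1);
  }
}
{noformat}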
[jira] [Commented] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036173#comment-13036173 ] olivier soyez commented on LUCENE-3119: --- Yes, it is just due to HtmlEncoder in solrconfig.xml. Thank you so much, it's working! Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez Priority: Minor I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] olivier soyez resolved LUCENE-3119. --- Resolution: Not A Problem Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez Priority: Minor I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Closed] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update
[ https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] olivier soyez closed LUCENE-3119. - Special Character Hightlighting issues after 3.1.0 update --- Key: LUCENE-3119 URL: https://issues.apache.org/jira/browse/LUCENE-3119 Project: Lucene - Java Issue Type: Bug Components: modules/highlighter Affects Versions: 3.1 Environment: ubuntu 10.10, java version 1.6.0_02 Reporter: olivier soyez Priority: Minor I have the same issue describe here : http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none Looks like the highlighting code changed. Using the example doc, with 1.4 I get : http://localhost:8983/solr/select?q=features:circumflexeshl=truehl.fl=featureswt=jsonindent=true {noformat} highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: êâîôû]}}} With 3.1, this now looks like : highlighting:{ UTF8TEST:{ features:[eaiou with emcircumflexes/em: #234;#226;#238;#244;#251;]}}} {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3112: --- Attachment: LUCENE-3112.patch New patch, I think it's ready to commit but it could use some healthy reviewing... I fang'd up TestNRThreads to add/update doc blocks and verify the docs in each block remain adjacent, and also added a couple other test cases to make sure we test non-aborting exceptions when adding a doc block. And I put a warning in the jdocs about possible future full re-indexing. Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch, LUCENE-3112.patch I think nested documents (LUCENE-2454) is a very compelling addition to Lucene. It's also a popular (many votes) issue. Beyond supporting nested document querying, which is already an incredible addition since it preserves the relational model on indexing normalized content (eg, DB tables, XML docs), LUCENE-2454 should also enable speedups in grouping implementation when you group by a nested field. For the same reason, it can also enable very fast post-group facet counting impl (LUCENE-3097) when you want to count(distinct(nestedField)), instead of unique documents, as your identifier. I expect many apps that use faceting need this ability (to count(distinct(nestedField)) not distinct(docID)). To support these use cases, I believe the only core change needed is the ability to atomically add or update multiple documents, which you cannot do today since in between add/updateDocument calls a flush (eg due to commit or getReader()) could occur. This new API (addDocuments(Iterable<Document>), updateDocuments(Term delTerm, Iterable<Document>)) would also further guarantee that the documents are assigned sequential docIDs in the order the iterator provided them, and that the docIDs all reside in one segment. Segment merging never splits segments apart, so this invariant would hold even as merges/optimizes take place. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
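A sketch of how application code might use the new block API from this patch; updateDocuments(Term, Iterable<Document>) is taken from the issue description, while the field names and the children-first/parent-last ordering (the convention discussed for LUCENE-2454 block joins) are illustrative assumptions.
{noformat}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class NestedBlockIndexer {
  private final IndexWriter writer;

  public NestedBlockIndexer(IndexWriter writer) {
    this.writer = writer;
  }

  /** Adds one resume (parent) plus its jobs (children) as a single atomic block. */
  public void addResume(String resumeId, List<String> jobTitles) throws IOException {
    List<Document> block = new ArrayList<Document>();
    for (String title : jobTitles) {
      Document child = new Document();
      child.add(new Field("jobTitle", title, Field.Store.YES, Field.Index.ANALYZED));
      block.add(child);                                   // children first
    }
    Document parent = new Document();
    parent.add(new Field("resumeId", resumeId, Field.Store.YES, Field.Index.NOT_ANALYZED));
    block.add(parent);                                    // parent last

    // Replaces any prior block for this resume atomically; per the issue, the
    // documents get adjacent docIDs and always land in the same segment.
    writer.updateDocuments(new Term("resumeId", resumeId), block);
  }
}
{noformat}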
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036177#comment-13036177 ] Robert Muir commented on SOLR-1942: --- Hi Simon, after reviewing the patch I have some concerns about CodecProvider. I think its a little bit confusing how the CodecProvider/CoreCodecProvider hierarchy works today, and a bit dangerous how we delegate over this class. For example, if we add a new method to CodecProvider, we need to be sure we add the 'delegation' here every time or stuff will start acting strange. For this reason, I wonder if CodecProvider should be an interface: the simple implementation we have in lucene is a hashmap, but Solr uses fieldType lookup. This would parallel how SimilarityProvider works. If we want to do this, I think we should open a separate issue... in fact I'm not even sure it should block this issue since in my opinion its a shame you cannot manipulate codecs in Solr right now... but I just wanted to bring it up here. Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
This is more about compressing strings in TermsIndex, I think. And ability to use said TermsIndex directly in some cases that required FieldCache before. (Maybe FC is still needed, but it can be degraded to docId-ord map, storing actual strings in TI). This yields fat space savings when we, eg, need to both lookup on a field and build facets out of it. mmap is cool :) What I want to see is a FST-based TermsDict that is simply mmaped into memory, without building intermediate indexes, like Lucene does now. And docvalues are orthogonal to that, no? On Thu, May 19, 2011 at 17:22, Jason Rutherglen jason.rutherg...@gmail.com wrote: maybe thats because we have one huge monolithic implementation Doesn't the DocValues branch solve this? Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bare fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are signifcant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable? On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote: 2011/5/19 Michael McCandless luc...@mikemccandless.com: Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating field cache (ie, in-memory storage option of using an FST vs using packed ints/byte[]). or should we actually try to have different fieldcacheimpls? I see all these missions to refactor the thing, which always fail. maybe thats because we have one huge monolithic implementation. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
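A hedged sketch of the split Earwin describes, using hypothetical DocToOrd/OrdToTerm interfaces (not real Lucene classes): collection only touches a small docId-to-ord map, and term bytes are pulled from the terms index (by ord) only for the few labels that actually get displayed.
{noformat}
// Illustrative interfaces: the per-document side is just an ord, while term
// bytes stay on the terms-index side and are resolved lazily.
interface DocToOrd {
  int getOrd(int docID);      // e.g. backed by packed ints, possibly mmap'd
}

interface OrdToTerm {
  String getTerm(int ord);    // e.g. seek-by-ord into the terms dict / FST
}

class FacetLabeler {
  /** Counts per ord during collection, resolves labels only for the top N. */
  static void printTopFacets(DocToOrd ords, OrdToTerm terms,
                             int[] hitDocs, int numOrds, int topN) {
    int[] counts = new int[numOrds];
    for (int doc : hitDocs) {
      counts[ords.getOrd(doc)]++;               // no term bytes touched here
    }
    for (int i = 0; i < topN; i++) {
      int best = 0;
      for (int ord = 1; ord < numOrds; ord++) {
        if (counts[ord] > counts[best]) best = ord;
      }
      if (counts[best] <= 0) break;              // nothing left to report
      // Only here do we pay for an ord -> term lookup.
      System.out.println(terms.getTerm(best) + " (" + counts[best] + ")");
      counts[best] = -1;                         // exclude from the next round
    }
  }
}
{noformat}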
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036182#comment-13036182 ] Michael McCandless commented on SOLR-1942: -- I agree the CodecProvider/CoreCodecProvider is a scary potential delegation trap... Robert can you open a new issue? I agree it should not block this one. Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
This is more about compressing strings in TermsIndex, I think. Ah, because they're sorted. I think if the string lookup cost degrades then it's not worth it? That's something that needs to be tested in the MMap case as well, eg, are ByteBuffers somehow slowing down everything by a factor of 10%? On Thu, May 19, 2011 at 6:30 AM, Earwin Burrfoot ear...@gmail.com wrote: This is more about compressing strings in TermsIndex, I think. And ability to use said TermsIndex directly in some cases that required FieldCache before. (Maybe FC is still needed, but it can be degraded to docId-ord map, storing actual strings in TI). This yields fat space savings when we, eg, need to both lookup on a field and build facets out of it. mmap is cool :) What I want to see is a FST-based TermsDict that is simply mmaped into memory, without building intermediate indexes, like Lucene does now. And docvalues are orthogonal to that, no? - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: maybe thats because we have one huge monolithic implementation Doesn't the DocValues branch solve this? Hopefully DocValues will replace FieldCache over time; maybe some day we can deprecate remove FieldCache. But we still have work to do there, I believe; eg we don't have comparators for all types (on the docvalues branch) yet. Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bare fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are signifcant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable? In fact, the packed ints and the byte[] packing of terms data is very much amenable/necessary for using MMap, far moreso than the separate objects we had before. I agree we should make an mmap option, though I would generally recommend against apps using mmap for these caches. We load these caches so that we'll have fast random access to potentially a great many documents during collection of one query (eg for sorting). When you mmap them you let the OS decide when to swap stuff out which mean you pick up potentially high query latency waiting for these pages to swap back in. Various other data structures in Lucene needs this fast random access (norms, del docs, terms index) and that's why we put them in RAM. I do agree for all else (the lrge postings), MMap is great. Of course the OS swaps out process RAM anyway, so... it's kinda moot (unless you've fixed your OS to not do this, which I always do!). I think a more productive area of exploration (to reduce RAM usage) would be to make a StringFieldComparator that doesn't need full access to all terms data, ie, operates per segment yet only does a few ord lookups when merging the results across segments. If few is small enough we can just use us the seek-by-ord from the terms dict to do them. This would be a huge RAM reduction because we could then sort by string fields (eg title field) without needing all term bytes randomly accessible. Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
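A rough sketch of the comparator idea Mike outlines, with a hypothetical SegmentOrds accessor standing in for the real per-segment term dictionary; within a segment only ords are compared, and term bytes are fetched (via seek-by-ord) only when the queue bottom changes or a new segment starts.
{noformat}
// Illustrative only -- not the actual FieldComparator API.
interface SegmentOrds {
  int numOrds();                       // number of distinct terms in this segment
  int ordForDoc(int docID);            // per-segment ord of the sort field
  String termForOrd(int ord);          // seek-by-ord into this segment's terms dict
}

class CrossSegmentStringComparator {
  private String bottomTerm;           // weakest entry currently in the queue
  private int bottomOrd = -1;          // its ord, valid only within currentSegment
  private SegmentOrds currentSegment;

  void setNextReader(SegmentOrds segment) {
    currentSegment = segment;
    // The only cross-segment term access: re-locate the bottom once per segment.
    bottomOrd = (bottomTerm == null) ? Integer.MAX_VALUE : lookupOrd(segment, bottomTerm);
  }

  /** Uses ords only; a real impl must also handle a bottom term absent from this segment. */
  boolean competitive(int docID) {
    return currentSegment.ordForDoc(docID) < bottomOrd;
  }

  void setBottom(int docID) {
    bottomOrd = currentSegment.ordForDoc(docID);
    bottomTerm = currentSegment.termForOrd(bottomOrd);   // one of the "few" lookups
  }

  private static int lookupOrd(SegmentOrds segment, String term) {
    // Binary search over the segment's sorted terms via seek-by-ord; returns the
    // ord of the largest term <= the given one, or -1 if every term is greater.
    int lo = 0, hi = segment.numOrds() - 1, result = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (segment.termForOrd(mid).compareTo(term) <= 0) { result = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    return result;
  }
}
{noformat}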
[jira] [Created] (LUCENE-3124) review CodecProvider/CoreCodecProvider/SchemaCodecProvider hierarchy
review CodecProvider/CoreCodecProvider/SchemaCodecProvider hierarchy Key: LUCENE-3124 URL: https://issues.apache.org/jira/browse/LUCENE-3124 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir As mentioned on SOLR-1942, I think we should revisit the CodecProvider hierarchy. Its a little bit confusing how the class itself isn't really abstract but is really an overridable implementation. One idea would be to make CodecProvider an interface, with Lucene using a simple hashmap-backed impl and Solr using the schema-backed impl. This would be in line with how SimilarityProvider was done. It would also be good to review all the methods in CodecProvider and see if we can minimize the interface... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
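To make the proposal concrete, here is a hedged sketch of what an interface-based CodecProvider could look like, with a map-backed default implementation analogous to how SimilarityProvider is structured; names and methods are illustrative, and the small Codec stand-in exists only to keep the example self-contained.
{noformat}
import java.util.HashMap;
import java.util.Map;

// Stand-in for the real Codec class, just for this sketch.
class Codec {
  final String name;
  Codec(String name) { this.name = name; }
}

// The proposed minimal interface: Lucene could ship a map-backed impl,
// Solr could back the same interface with its schema.
interface CodecProvider {
  Codec lookup(String codecName);
  String getFieldCodec(String fieldName);
}

class MapBackedCodecProvider implements CodecProvider {
  private final Map<String, Codec> codecs = new HashMap<String, Codec>();
  private final Map<String, String> perField = new HashMap<String, String>();
  private final String defaultCodec;

  MapBackedCodecProvider(String defaultCodec) { this.defaultCodec = defaultCodec; }

  void register(Codec codec) { codecs.put(codec.name, codec); }

  void setFieldCodec(String field, String codecName) { perField.put(field, codecName); }

  public Codec lookup(String codecName) {
    Codec codec = codecs.get(codecName);
    if (codec == null) throw new IllegalArgumentException("unknown codec: " + codecName);
    return codec;
  }

  public String getFieldCodec(String field) {
    String name = perField.get(field);
    return name != null ? name : defaultCodec;
  }
}
{noformat}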
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036184#comment-13036184 ] Robert Muir commented on SOLR-1942: --- OK I opened LUCENE-3124 for this Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2529) DIH update trouble with sql field name pk
DIH update trouble with sql field name pk --- Key: SOLR-2529 URL: https://issues.apache.org/jira/browse/SOLR-2529 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 3.1, 3.2 Environment: Debian Lenny, JRE 6 Reporter: Thomas Gambier Priority: Blocker We are unable to use the DIH when the database primary key column is named pk. The reported solr error is : deltaQuery has no column to resolve to declared primary key pk='pk' We have made some investigations and found that the DIH has a mistake when it's looking for the primary key in the row's column list:
{noformat}
private String findMatchingPkColumn(String pk, Map<String, Object> row) {
  if (row.containsKey(pk))
    throw new IllegalArgumentException(String.format(
        "deltaQuery returned a row with null for primary key %s", pk));
  String resolvedPk = null;
  for (String columnName : row.keySet()) {
    if (columnName.endsWith("." + pk) || pk.endsWith("." + columnName)) {
      if (resolvedPk != null)
        throw new IllegalArgumentException(String.format(
            "deltaQuery has more than one column (%s and %s) that might resolve to declared primary key pk='%s'",
            resolvedPk, columnName, pk));
      resolvedPk = columnName;
    }
  }
  if (resolvedPk == null)
    throw new IllegalArgumentException(String.format(
        "deltaQuery has no column to resolve to declared primary key pk='%s'", pk));
  LOG.info(String.format(
      "Resolving deltaQuery column '%s' to match entity's declared pk '%s'", resolvedPk, pk));
  return resolvedPk;
}
{noformat}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.
Not just this version, Lucene.Net 2.9.4 also can read (in theory) the index created in 3.0.3. But I haven't tested it myself. DIGY. -Original Message- From: Alexander Bauer [mailto:a...@familie-bauer.info] Sent: Thursday, May 19, 2011 8:37 AM To: lucene-net-...@lucene.apache.org Subject: Re: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics. Can I use this version with an existing index based on lucene.Java 3.0.3? Alex On 19.05.2011 00:20, Digy (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENENET-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035795#comment-13035795 ] Digy commented on LUCENENET-412: Hi All, Lucene.Net 2.9.4g is almost ready for testing feedback. While injecting generics & making some clean up in code, I tried to be close to lucene 3.0.3 as much as possible. Therefore its position is somewhere between lucene.Java 2.9.4 & 3.0.3 DIGY PS: For those who might want to try this version: It won't probably be a drop-in replacement since there are a few API changes like - StopAnalyzer(List<string> stopWords) - Query.ExtractTerms(ICollection<string>) - TopDocs.*TotalHits*, TopDocs.*ScoreDocs* and some removed methods/classes like - Filter.Bits - JustCompileSearch - Contrib/Similarity.Net Replacing ArrayLists, Hashtables etc. with appropriate Generics. Key: LUCENENET-412 URL: https://issues.apache.org/jira/browse/LUCENENET-412 Project: Lucene.Net Issue Type: Improvement Affects Versions: Lucene.Net 2.9.4 Reporter: Digy Priority: Minor Fix For: Lucene.Net 2.9.4 Attachments: IEquatable for QuerySubclasses.patch, LUCENENET-412.patch, lucene_2.9.4g_exceptions_fix This will move Lucene.Net.2.9.4 closer to lucene.3.0.3 and allow some performance gains. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036189#comment-13036189 ] Simon Willnauer commented on SOLR-1942: --- bq. OK I opened LUCENE-3124 for this +1 thanks! good point! Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
When you mmap them you let the OS decide when to swap stuff out which mean you pick up potentially high query latency waiting for these pages to swap back in Right, however if one is using lets say SSDs, and the query time is less important, then MMap'ing would be fine. Also it prevents deadly OOMs in favor of basic 'slowness' of the query. If there is no performance degradation I think MMap'ing is a great option. A common use case is an index that's far too large for a given server will simply not work today, whereas with MMap'ed field caches the query would complete, just extremely slowly. If the user wishes to improve performance it's easy enough to add more hardware. On Thu, May 19, 2011 at 6:40 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: maybe thats because we have one huge monolithic implementation Doesn't the DocValues branch solve this? Hopefully DocValues will replace FieldCache over time; maybe some day we can deprecate remove FieldCache. But we still have work to do there, I believe; eg we don't have comparators for all types (on the docvalues branch) yet. Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bare fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are signifcant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable? In fact, the packed ints and the byte[] packing of terms data is very much amenable/necessary for using MMap, far moreso than the separate objects we had before. I agree we should make an mmap option, though I would generally recommend against apps using mmap for these caches. We load these caches so that we'll have fast random access to potentially a great many documents during collection of one query (eg for sorting). When you mmap them you let the OS decide when to swap stuff out which mean you pick up potentially high query latency waiting for these pages to swap back in. Various other data structures in Lucene needs this fast random access (norms, del docs, terms index) and that's why we put them in RAM. I do agree for all else (the lrge postings), MMap is great. Of course the OS swaps out process RAM anyway, so... it's kinda moot (unless you've fixed your OS to not do this, which I always do!). I think a more productive area of exploration (to reduce RAM usage) would be to make a StringFieldComparator that doesn't need full access to all terms data, ie, operates per segment yet only does a few ord lookups when merging the results across segments. If few is small enough we can just use us the seek-by-ord from the terms dict to do them. This would be a huge RAM reduction because we could then sort by string fields (eg title field) without needing all term bytes randomly accessible. Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2530) Remove Noggit CharArr from FieldType
Remove Noggit CharArr from FieldType Key: SOLR-2530 URL: https://issues.apache.org/jira/browse/SOLR-2530 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.0 FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that also spreads into ByteUtils. The uses of this method area all convert to String which makes this extra reference and the dependency unnecessary. I refactored it to simply return string and removed ByteUtils entirely. The only leftover from BytesUtils is a constant, i moved that one to Lucenes UnicodeUtils. I will upload a patch in a second -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
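A small sketch of the signature change being described: the CharArr-taking overload goes away and the method simply returns a String decoded from the indexed UTF-8 bytes. The class name here is a stand-in; BytesRef.utf8ToString() is the existing Lucene utility, though the actual patch may decode differently.
{noformat}
import org.apache.lucene.util.BytesRef;

public class FieldTypeSketch {
  // Old shape (noggit dependency):
  //   public void indexedToReadable(BytesRef input, CharArr out)

  /** New shape: convert the indexed (UTF-8) form straight to a readable String. */
  public String indexedToReadable(BytesRef indexedForm) {
    return indexedForm.utf8ToString();
  }
}
{noformat}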
[jira] [Updated] (SOLR-2530) Remove Noggit CharArr from FieldType
[ https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated SOLR-2530: -- Attachment: SOLR-2530.patch here is a patch Remove Noggit CharArr from FieldType Key: SOLR-2530 URL: https://issues.apache.org/jira/browse/SOLR-2530 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Labels: api-change Fix For: 4.0 Attachments: SOLR-2530.patch FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that also spreads into ByteUtils. The uses of this method area all convert to String which makes this extra reference and the dependency unnecessary. I refactored it to simply return string and removed ByteUtils entirely. The only leftover from BytesUtils is a constant, i moved that one to Lucenes UnicodeUtils. I will upload a patch in a second -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] [Created] (SOLR-2529) DIH update trouble with sql field name pk
Could you identify what you think the problem is? Erick On Thu, May 19, 2011 at 9:45 AM, Thomas Gambier (JIRA) j...@apache.org wrote: DIH update trouble with sql field name pk --- Key: SOLR-2529 URL: https://issues.apache.org/jira/browse/SOLR-2529 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 3.1, 3.2 Environment: Debian Lenny, JRE 6 Reporter: Thomas Gambier Priority: Blocker We are unable to use the DIH when database columnName primary key is named pk. The reported solr error is : deltaQuery has no column to resolve to declared primary key pk='pk' We have made some investigations and found that the DIH have a mistake when it's looking for the primary key between row's columns list. private String findMatchingPkColumn(String pk, Map row) { if (row.containsKey(pk)) throw new IllegalArgumentException( String.format(deltaQuery returned a row with null for primary key %s, pk)); String resolvedPk = null; for (String columnName : row.keySet()) { if (columnName.endsWith(. + pk) || pk.endsWith(. + columnName)) { if (resolvedPk != null) throw new IllegalArgumentException( String.format( deltaQuery has more than one column (%s and %s) that might resolve to declared primary key pk='%s', resolvedPk, columnName, pk)); resolvedPk = columnName; } } if (resolvedPk == null) throw new IllegalArgumentException( String.format(deltaQuery has no column to resolve to declared primary key pk='%s', pk)); LOG.info(String.format(Resolving deltaQuery column '%s' to match entity's declared pk '%s', resolvedPk, pk)); return resolvedPk; } -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2530) Remove Noggit CharArr from FieldType
[ https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036219#comment-13036219 ] Yonik Seeley commented on SOLR-2530: There are some efficiency losses here: - A reusable CharArr allows one to avoid extra object creation. See TermsComponent which can update a CharArr and then compare it against a pattern w/o having to create a String object. - We should not replace the previous toString with BytesRef.utf8String... it's much slower, esp for small strings like will be common here. So rather than just removing ByteUtils.UTF8toUTF16, how about moving it to BytesRef and use it in BytesRTef.utf8String? Remove Noggit CharArr from FieldType Key: SOLR-2530 URL: https://issues.apache.org/jira/browse/SOLR-2530 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Labels: api-change Fix For: 4.0 Attachments: SOLR-2530.patch FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that also spreads into ByteUtils. The uses of this method area all convert to String which makes this extra reference and the dependency unnecessary. I refactored it to simply return string and removed ByteUtils entirely. The only leftover from BytesUtils is a constant, i moved that one to Lucenes UnicodeUtils. I will upload a patch in a second -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1942: -- Attachment: SOLR-1942.patch Updated patch with Simon's previous suggestions. A few more things I saw that I'm not sure I like: * the CodecProvider syntax in the test config is cool, but i'm not sure this should be done in SolrCore? I think if you want to have a CP that loads up codecs by classname like this, it should be done in a CodecProviderFactory (you know parsing arguments however it wants). * I think its confusing how the SchemaCodecProvider answers to codec requests in 3 ways, 1. from the 'delegate' in SolrConfig, 2. from the schema, and 3. from the default codecProvider. I think if you try to use this, its easy to get yourself in a situation where solrconfig conflicts with the schema. I also don't think we need to bother with the 'defaultCP', in other words if you specify a custom codec provider, this is the only one that is used. Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2530) Remove Noggit CharArr from FieldType
[ https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036224#comment-13036224 ] Robert Muir commented on SOLR-2530: --- My recommendation: add CharsRef. We already have BytesRef and IntsRef... Remove Noggit CharArr from FieldType Key: SOLR-2530 URL: https://issues.apache.org/jira/browse/SOLR-2530 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Labels: api-change Fix For: 4.0 Attachments: SOLR-2530.patch FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that also spreads into ByteUtils. The uses of this method area all convert to String which makes this extra reference and the dependency unnecessary. I refactored it to simply return string and removed ByteUtils entirely. The only leftover from BytesUtils is a constant, i moved that one to Lucenes UnicodeUtils. I will upload a patch in a second -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
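Since CharsRef comes up here as a suggestion, a hedged sketch of what such a class might look like by analogy with BytesRef/IntsRef, i.e. a reusable, growable slice over a char[] that can be compared without allocating a String each time; this is illustrative, not the class as it was eventually added.
{noformat}
public class CharsRef implements CharSequence, Comparable<CharsRef> {
  public char[] chars;
  public int offset;
  public int length;

  public CharsRef(int capacity) { chars = new char[capacity]; }

  /** Grows the backing array if needed, preserving existing content. */
  public void grow(int minCapacity) {
    if (chars.length < minCapacity) {
      char[] bigger = new char[Math.max(minCapacity, chars.length * 2)];
      System.arraycopy(chars, 0, bigger, 0, chars.length);
      chars = bigger;
    }
  }

  /** Lexicographic comparison of the referenced slices, no String allocation. */
  public int compareTo(CharsRef other) {
    int len = Math.min(length, other.length);
    for (int i = 0; i < len; i++) {
      int diff = chars[offset + i] - other.chars[other.offset + i];
      if (diff != 0) return diff;
    }
    return length - other.length;
  }

  public int length() { return length; }
  public char charAt(int i) { return chars[offset + i]; }
  public CharSequence subSequence(int start, int end) {
    return new String(chars, offset + start, end - start);
  }
  public String toString() { return new String(chars, offset, length); }
}
{noformat}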
Re: FST and FieldCache?
Michael McCandless-2 wrote: I think a more productive area of exploration (to reduce RAM usage) would be to make a StringFieldComparator that doesn't need full access to all terms data, ie, operates per segment yet only does a few ord lookups when merging the results across segments. If few is small enough we can just use us the seek-by-ord from the terms dict to do them. This would be a huge RAM reduction because we could then sort by string fields (eg title field) without needing all term bytes randomly accessible. Mike Yes! I don't want to put all my titles into RAM just to sort documents by them when I know Lucene has indexed the titles in sorted order on disk already. Of course the devil is in the details. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2961687.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2530) Remove Noggit CharArr from FieldType
[ https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036240#comment-13036240 ] Yonik Seeley commented on SOLR-2530: Minor nit: renaming bigTerm to UnicodeUtil.BIG_UTF8_TERM is a bit misleading since it's not UTF8 at all. Remove Noggit CharArr from FieldType Key: SOLR-2530 URL: https://issues.apache.org/jira/browse/SOLR-2530 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Labels: api-change Fix For: 4.0 Attachments: SOLR-2530.patch FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that also spreads into ByteUtils. The uses of this method area all convert to String which makes this extra reference and the dependency unnecessary. I refactored it to simply return string and removed ByteUtils entirely. The only leftover from BytesUtils is a constant, i moved that one to Lucenes UnicodeUtils. I will upload a patch in a second -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036242#comment-13036242 ] Doron Cohen commented on SOLR-2500: --- From Eclipse (XP), passed at 1st attempt, failed at the 2nd! I am not familiar with this part of the code so it would be too much work to track it all the way myself, but I think I can now provide sufficient information for solving it. In Eclipse, after cleaning the project the test passes, and then start failing in all successive runs. So I assume when you run it isolated you also do clean, which covers Eclipse's clean (and more). I tracked the content of the cleaned relevant dir before and after the test - it is (trunk/)bin/solr - there's only one file that differs between the runs - this is bin/solr/shared/solr.xml. Not sure if this is a bug in the test not cleaning after itself or a bug in the code that reads the configuration... I'll attach here the two file so that you can compare them. TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
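One hedged way to address the "test doesn't clean up after itself" symptom Doron describes would be to copy the pristine solr home into a per-run temp directory in setUp and point the harness at the copy, so the checked-in bin/solr/shared/solr.xml is never rewritten; the helper below is only a sketch of that idea, not the fix that was actually applied.
{noformat}
import java.io.*;

public class IsolatedSolrHome {
  /** Recursively copies the pristine solr home so a test can mutate it freely. */
  public static File copyToTemp(File pristineHome) throws IOException {
    File tempHome = File.createTempFile("solrtest", "");
    tempHome.delete();          // reuse the unique name as a directory
    tempHome.mkdirs();
    copyRecursive(pristineHome, tempHome);
    return tempHome;
  }

  private static void copyRecursive(File src, File dst) throws IOException {
    if (src.isDirectory()) {
      dst.mkdirs();
      String[] children = src.list();
      if (children == null) return;
      for (String child : children) {
        copyRecursive(new File(src, child), new File(dst, child));
      }
    } else {
      InputStream in = new FileInputStream(src);
      OutputStream out = new FileOutputStream(dst);
      try {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
      } finally {
        in.close();
        out.close();
      }
    }
  }
}
{noformat}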
[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036243#comment-13036243 ] Robert Muir commented on SOLR-2500: --- {quote} In Eclipse, after cleaning the project the test passes, and then start failing in all successive runs. {quote} FYI This is the behavior I've noticed when running the test from Ant also... a 'clean' seems to workaround the issue... TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated SOLR-2500: -- Attachment: solr-after-1st-run.xml solr-clean.xml solr.xml files from trunk/bin/solr/shared: - clean - with which the test passes. - after-1st-run - with which it fails. TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-2500: -- Attachment: SOLR-2500.patch I guess the real question is: why doesn't the test work if rewritten like this? Bug in TestHarness? Bug in CoreContainer/properties loading functionality itself? TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
Wow, 17 replies to my email overnight! This is clearly an interesting topic to folks. Hi Dawid. Sadly, I won't be at Lucene Revolution next week. That's where all the cool kids will be; I'll be home and be square. I made it to O'Reilly Strata in February (a great conference) and I'll be presenting at Basis's Open Source Search Conference (government customer focused) mid-June. I've used up my conference budget for this fiscal year. Yes, the use-case here is a unique integer reference to a String that can be looked up fairly quickly, whereas the set of all strings is in a compressed data structure that won't change after it's built. A bonus benefit would be that this integer is a sortable substitute for the String. Your observation of this integer being a perfect-hash is astute. I wonder if Lucene could store this FST on-disk for the bytes in a segment instead of what it's doing now? Read-time construction would be super-fast, though for multi-segment indexes, I suppose they'd need to be merged. I expect that this use-case would be particularly useful for cases when you know that the set of strings tends to have a great deal of prefixes in common, such as when EdgeNGramming (applications: query-complete, hierarchical faceting, prefix/tree based geospatial indexing). ~ David Smiley Dawid Weiss wrote: Hi David, but with less memory. As I understand it, FSTs are a highly compressed representation of a set of Strings (among other possibilities). The Yep. Not only, but this is one of the use cases. Will you be at Lucene Revolution next week? I'll be talking about it there. representation of a set of Strings (among other possibilities). The fieldCache would need to point to an FST entry (an arc?) using something small, say an integer. Is there a way to point to an FST entry with an integer, and then somehow with relative efficiency construct the String from the arcs to get there? Correct me if my understanding is wrong: you'd like to assign a unique integer to each String and then retrieve it by this integer (something like a Map<Integer, String>)? This would be something called perfect hashing and this can be done on top of an automaton (fairly easily). I assume the data structure is immutable once constructed and does not change too often, right? Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2961954.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
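[Editor's note] For readers who want to experiment with the perfect-hash idea discussed above, here is a rough sketch of mapping sorted terms to ordinals with Lucene's FST (package org.apache.lucene.util.fst). It assumes the trunk-era API; constructor and method signatures (e.g. PositiveIntOutputs.getSingleton) have shifted between versions, so treat the exact calls as assumptions rather than a recipe.

{code}
// Sketch only: maps sorted terms -> ordinals with an FST.
// Assumes the org.apache.lucene.util.fst API of the trunk of the time;
// exact signatures vary by version.
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class TermOrdFstSketch {
  public static void main(String[] args) throws Exception {
    String[] sortedTerms = { "apache", "lucene", "solr" }; // must be sorted
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    long ord = 0;
    for (String term : sortedTerms) {
      builder.add(Util.toIntsRef(new BytesRef(term), scratch), ord++);
    }
    FST<Long> fst = builder.finish();
    // term -> ord lookup; the reverse (ord -> term) is possible too since
    // the outputs are monotonically increasing (see Util.getByOutput).
    Long found = Util.get(fst, new BytesRef("lucene"));
    System.out.println("ord(lucene) = " + found);
  }
}
{code}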
Re: FST and FieldCache?
On Thu, May 19, 2011 at 10:09 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: When you mmap them you let the OS decide when to swap stuff out which means you pick up potentially high query latency waiting for these pages to swap back in Right, however if one is using let's say SSDs, and the query time is less important, then MMap'ing would be fine. Also it prevents deadly OOMs in favor of basic 'slowness' of the query. If there is no performance degradation I think MMap'ing is a great option. A common use case is an index that's far too large for a given server and will simply not work today, whereas with MMap'ed field caches the query would complete, just extremely slowly. If the user wishes to improve performance it's easy enough to add more hardware. Well, be careful: if you just don't have enough memory to accommodate all the RAM data structures Lucene needs... you're gonna be in trouble with mmap too. True, you won't hit OOMEs anymore, but instead you'll be in a swap fest and your app is nearly unusable. SSDs, while orders of magnitude faster than spinning magnets, are still orders of magnitude slower than RAM. But, yes, they obviously help substantially. It's a one-way door... you'll never go back once you've switched to SSDs. And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides. I wish I could have the opposite of mmap from Java -- the ability to pin the pages that hold important data structures. Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
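[Editor's note] The trade-off being discussed is, at the API level, just a choice of Directory implementation. A minimal sketch against the Lucene 3.x API follows; the index path is a placeholder and the constructors have moved around in later versions.

{code}
// Minimal sketch: open the same index memory-mapped vs. plain buffered I/O.
// Lucene 3.x style API; the path is a placeholder.
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class DirChoiceSketch {
  public static void main(String[] args) throws Exception {
    File path = new File("/path/to/index");

    // The OS page cache decides what stays resident: no OOME, but possible swap storms.
    Directory mmapped = new MMapDirectory(path);

    // Plain file I/O; heap-resident structures (e.g. FieldCache) are unaffected either way.
    Directory buffered = new NIOFSDirectory(path);

    IndexReader reader = IndexReader.open(mmapped, true); // read-only reader
    System.out.println("maxDoc=" + reader.maxDoc());
    reader.close();
    mmapped.close();
    buffered.close();
  }
}
{code}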
[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036270#comment-13036270 ] Michael McCandless commented on SOLR-2524: -- {quote} bq. Was this a simple TermQuery No a MatchDocAllQuery (:) {quote} Ahh OK then that makes sense -- MatchAllDocsQuery is a mighty fast query to execute ;) So the work done to cache it is going to be slower. Adding grouping to Solr 3x -- Key: SOLR-2524 URL: https://issues.apache.org/jira/browse/SOLR-2524 Project: Solr Issue Type: New Feature Affects Versions: 3.2 Reporter: Martijn van Groningen Assignee: Michael McCandless Attachments: SOLR-2524.patch Grouping was recently added to Lucene 3x. See LUCENE-1421 for more information. I think it would be nice if we expose this functionality also to the Solr users that are bound to a 3.x version. The grouping feature added to Lucene is currently a subset of the functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping by function / query. The work involved getting the grouping contrib to work on Solr 3x is acceptable. I have it more or less running here. It supports the response format and request parameters (except: group.query and group.func) described in the FieldCollapse page on the Solr wiki. I think it would be great if this is included in the Solr 3.2 release. Many people are using grouping as a patch now and this would help them a lot. Any thoughts? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
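[Editor's note] For context, a grouped request of the kind this issue backports can be issued from SolrJ roughly as sketched below. Parameter names follow the FieldCollapse wiki page referenced in the issue; the URL and field name are placeholders, and response parsing is omitted because the 3.x SolrJ response API for grouping was still settling.

{code}
// Sketch of sending a grouped query via SolrJ; URL and field name are placeholders.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class GroupingQuerySketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");        // the MatchAllDocsQuery case discussed above
    q.set("group", true);
    q.set("group.field", "manu_exact");        // hypothetical field
    q.set("group.limit", 3);                   // top docs per group
    System.out.println(server.query(q));       // raw response; parsing omitted
  }
}
{code}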
Re: Need help building JCC on windows
Hi, Baseer. Not sure what the issue is with your build, but here's a bit of bash script which I use to build JCC with mingw on Windows:

echo -- jcc --
export PATH=$PATH:${javahome}/jre/bin/client
echo PATH is $PATH
cd ../pylucene-3.0.*/jcc
# note that this patch still works for 3.0.1/3.0.2
patch -p0 < ${patchesdir}/jcc-2.9-mingw-PATCH
export JCC_ARGSEP=";"
export JCC_JDK=$WINSTYLEJAVAHOME
export JCC_CFLAGS="-fno-strict-aliasing;-Wno-write-strings"
export JCC_LFLAGS="-L${WINSTYLEJAVAHOME}\\lib;-ljvm"
export JCC_INCLUDES="${WINSTYLEJAVAHOME}\\include;${WINSTYLEJAVAHOME}\\include\\win32"
export JCC_JAVAC="${WINSTYLEJAVAHOME}\\bin\\javac.exe"
${python} setup.py build --compiler=mingw32 install --single-version-externally-managed --root /c/ --prefix=${distdir}
if [ -f jcc/jcc.lib ]; then
  cp -p jcc/jcc.lib ${sitepackages}/jcc/jcc.lib
fi
# for 3.0.2 compiled with MinGW GCC 4.x and --shared, we also need two
# GCC libraries
if [ -f /mingw/bin/libstdc++-6.dll ]; then
  install -m 555 /mingw/bin/libstdc++-6.dll ${distdir}/bin/
  echo copied libstdc++-6.dll
fi
if [ -f /mingw/bin/libgcc_s_dw2-1.dll ]; then
  install -m 555 /mingw/bin/libgcc_s_dw2-1.dll ${distdir}/bin/
  echo copied libgcc_s_dw2-1.dll
fi

The patch that I apply is this:

*** setup.py  2009-10-28 15:24:16.0 -0700
--- setup.py  2010-03-29 22:08:56.0 -0700
***************
*** 262,268 ****
  elif platform == 'win32':
      jcclib = 'jcc%s.lib' %(debug and '_d' or '')
      kwds["extra_link_args"] = \
!         lflags + ["/IMPLIB:%s" %(os.path.join('jcc', jcclib))]
      package_data.append(jcclib)
  else:
      kwds["extra_link_args"] = lflags
--- 262,268 ----
  elif platform == 'win32':
      jcclib = 'jcc%s.lib' %(debug and '_d' or '')
      kwds["extra_link_args"] = \
!         lflags + ["-Wl,--out-implib,%s" %(os.path.join('jcc', jcclib))]
      package_data.append(jcclib)
  else:
      kwds["extra_link_args"] = lflags

It makes sure to build the jcc.lib file so that I can use it in shared mode. Bill
Re: FST and FieldCache?
And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM? There's also RAM based SSDs whose performance could be comparable with well, RAM. Also, with our heap based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically? On Thu, May 19, 2011 at 8:59 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, May 19, 2011 at 10:09 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: When you mmap them you let the OS decide when to swap stuff out which mean you pick up potentially high query latency waiting for these pages to swap back in Right, however if one is using lets say SSDs, and the query time is less important, then MMap'ing would be fine. Also it prevents deadly OOMs in favor of basic 'slowness' of the query. If there is no performance degradation I think MMap'ing is a great option. A common use case is an index that's far too large for a given server will simply not work today, whereas with MMap'ed field caches the query would complete, just extremely slowly. If the user wishes to improve performance it's easy enough to add more hardware. Well, be careful: if you just don't have enough memory to accomodate all the RAM data structures Lucene needs... you're gonna be in trouble with mmap too. True, you won't hit OOMEs anymore, but instead you'll be in a swap fest and your app is nearly unusable. SSDs, while orders of magnitude faster than spinning magnets, are still orders of magnitude slower than RAM. But, yes, they obviously help substantially. It's a one-way door... you'll never go back once you've switched to SSDs. And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides. I wish I could have the opposite of mmap from Java -- the ability to pin the pages that hold important data structures. Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3108) Land DocValues on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036284#comment-13036284 ] Michael McCandless commented on LUCENE-3108: {quote} bq. How come codecID changed from String to int on the branch? due to DocValues I need to compare the ID to certain fields to see for what field I stored and need to open docValues. I always had to parse the given string which is kind of odd. I think its more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing IMO. {quote} OK that sounds great. {quote} bq. Can SortField somehow detect whether the needed field was stored in FC vs DV This is tricky though. You can have a DV field that is indexed too so its hard to tell if we can reliably do it. If we can't make it reliable I think we should not do it at all. {quote} It is tricky... but, eg, when someone does SortField(title, SortField.STRING), which cache (DV or FC) should we populate? {quote} bq. Should we rename oal.index.values.Type - .ValueType? agreed. I also think we should rename Source but I don't have a good name yet. Any idea? {quote} ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to it as DocValues.Source, then it's clear? {quote} bq. Since we dynamically reserve a value to mean unset, does that mean there are some datasets we cannot index? Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalize the range to be 0 based since PackedInts doesn't allow negative values. so the range we can store is (2^63) -1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this which is needed I think but to get around this we need to have a different impl I think or do I miss something? {quote} OK, but I think if we make a straight longs impl (ie no packed ints at all) then we can handle all long values? But in that case we'd require the app to pick a sentinel to mean unset? Land DocValues on trunk --- Key: LUCENE-3108 URL: https://issues.apache.org/jira/browse/LUCENE-3108 Project: Lucene - Java Issue Type: Task Components: core/index, core/search, core/store Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3108.patch Its time to move another feature from branch to trunk. I want to start this process now while still a couple of issues remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions) but I think those are not worth separate issues so we can resolve them as we go. The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk. Here is a quick feature overview of what has been implemented: * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size each in sorted, straight and deref variations) * Integration into Flex-API, Codec provides a PerDocConsumer-DocValuesConsumer (write) / PerDocValues-DocValues (read) * By-Default enabled in all codecs except of PreFlex * Follows other flex-API patterns like non-segment reader throw UOE forcing MultiPerDocValues if on DirReader etc. * Integration into IndexWriter, FieldInfos etc. 
* Random-testing enabled via RandomIW - injecting random DocValues into documents * Basic checks in CheckIndex (which runs after each test) * FieldComparator for int and float variants (Sorting, currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually) * Extended TestSort for DocValues * RAM-Resident random access API plus on-disk DocValuesEnum (currently only sequential access) - Source.java / DocValuesEnum.java * Extensible Cache implementation for RAM-Resident DocValues (by-default loaded into RAM only once and freed once IR is closed) - SourceCache.java PS: Currently the RAM resident API is named Source (Source.java) which seems too generic. I think we should rename it into RamDocValues or something like that, suggestion welcome! Any comments, questions (rants :)) are very much appreciated. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
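[Editor's note] To make the "sentinel for unset" trade-off from the comment above concrete, here is a tiny, library-free illustration (plain arrays, not the actual DocValues code): the application reserves one long it promises never to store, which is exactly why a straight long[] implementation can cover all but one of the possible values. The choice of Long.MIN_VALUE as the sentinel is an assumption for the example only.

{code}
// Illustration only - not the DocValues implementation itself.
// The app reserves one long as "unset"; every other value remains storable.
public class SentinelSketch {
  static final long UNSET = Long.MIN_VALUE; // app-chosen sentinel (assumption)

  public static void main(String[] args) {
    long[] perDocValue = new long[5];
    java.util.Arrays.fill(perDocValue, UNSET);
    perDocValue[0] = 42L;
    perDocValue[3] = Long.MAX_VALUE; // fine: only UNSET itself is off-limits

    for (int doc = 0; doc < perDocValue.length; doc++) {
      if (perDocValue[doc] == UNSET) {
        System.out.println("doc " + doc + ": unset");
      } else {
        System.out.println("doc " + doc + ": " + perDocValue[doc]);
      }
    }
  }
}
{code}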
[JENKINS] Lucene-Solr-tests-only-docvalues-branch - Build # 1145 - Failure
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-docvalues-branch/1145/ No tests ran. Build Log (for compile errors): [...truncated 55 lines...] clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/build clean: clean: [echo] Building analyzers-common... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/common [echo] Building analyzers-icu... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/icu [echo] Building analyzers-phonetic... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/phonetic [echo] Building analyzers-smartcn... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/smartcn [echo] Building analyzers-stempel... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/stempel [echo] Building benchmark... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/benchmark/build [echo] Building grouping... clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/grouping/build clean-contrib: clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/analysis-extras/build [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/analysis-extras/lucene-libs clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/clustering/build clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/dataimporthandler/target clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/extraction/build clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/uima/build clean: [delete] Deleting directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/build BUILD SUCCESSFUL Total time: 3 seconds + cd /home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene + JAVA_HOME=/home/hudson/tools/java/latest1.5 /home/hudson/tools/ant/latest1.7/bin/ant compile compile-test build-contrib Buildfile: build.xml jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: [echo] [echo] Clover not found. Code coverage reports disabled. 
[echo] clover: common.compile-core: [mkdir] Created dir: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/build/classes/java [javac] Compiling 536 source files to /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/build/classes/java [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/util/Version.java:80: warning: [dep-ann] deprecated name isnt annotated with @Deprecated [javac] public boolean onOrAfter(Version other) { [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/index/PerFieldCodecWrapper.java:309: cannot find symbol [javac] symbol : constructor IOException(java.lang.Exception) [javac] location: class java.io.IOException [javac] err = new IOException(ioe); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/queryParser/CharStream.java:34: warning: [dep-ann] deprecated name isnt annotated with @Deprecated [javac] int getColumn(); [javac] ^ [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/queryParser/CharStream.java:41: warning: [dep-ann] deprecated name isnt annotated with @Deprecated [javac] int getLine(); [javac] ^ [javac] Note: Some input files use or override a deprecated API. [javac]
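[Editor's note] The compile error above ("cannot find symbol: constructor IOException(java.lang.Exception)") is what you get when code written against Java 6's IOException(Throwable) constructor is compiled with the Java 1.5 JDK this job uses (JAVA_HOME=latest1.5 in the log). Below is a sketch of the usual 1.5-compatible workaround; it is illustrative only, not necessarily the fix that was applied on the branch.

{code}
// Java 1.5-compatible way to wrap a cause, since IOException(Throwable)
// only exists from Java 6 onward. Illustrative; not the branch's actual fix.
import java.io.IOException;

public class WrapCauseSketch {
  static IOException wrap(Exception ioe) {
    IOException err = new IOException(ioe.toString());
    err.initCause(ioe); // initCause() is available on Throwable since Java 1.4
    return err;
  }

  public static void main(String[] args) {
    IOException wrapped = wrap(new Exception("boom"));
    System.out.println(wrapped.getCause());
  }
}
{code}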
[jira] [Created] (SOLR-2531) remove some per-term waste in SimpleFacets
remove some per-term waste in SimpleFacets -- Key: SOLR-2531 URL: https://issues.apache.org/jira/browse/SOLR-2531 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-2531.patch While looking at SOLR-2530, seems like in the 'use filter cache' case of SimpleFacets we: 1. convert the bytes from utf8 -> utf16 2. create a string from the utf16 3. create a Term object from the string doesn't seem like any of this is necessary, as the Term is unused... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
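[Editor's note] To illustrate the kind of per-term work the issue describes (this is a paraphrase, not the actual SimpleFacets source): the term bytes already identify the term, so the UTF-16 String and the Term built from it are pure overhead when nothing downstream uses them.

{code}
// Paraphrase of the wasteful per-term pattern, not the real SimpleFacets code.
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;

public class PerTermWasteSketch {
  static void perTerm(String field, BytesRef termBytes) {
    String text = termBytes.utf8ToString(); // steps 1+2: utf8 -> utf16 -> String
    Term t = new Term(field, text);         // step 3: Term object
    // ... t is never used afterwards, so all three steps can be dropped
  }
}
{code}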
[jira] [Updated] (SOLR-2531) remove some per-term waste in SimpleFacets
[ https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-2531: -- Attachment: SOLR-2531.patch remove some per-term waste in SimpleFacets -- Key: SOLR-2531 URL: https://issues.apache.org/jira/browse/SOLR-2531 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-2531.patch While looking at SOLR-2530, Seems like in the 'use filter cache' case of SimpleFacets we: 1. convert the bytes from utf8-utf16 2. create a string from the utf16 3. create a Term object from the string doesn't seem like any of this is necessary, as the Term is unused... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM? I don't know of a straight up comparison... There's also RAM based SSDs whose performance could be comparable with well, RAM. True, though it's through layers of abstraction designed originally for serving files off of spinning magnets :) Also, with our heap based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically? True, but really if the app knows it won't need that FC entry for a long time (ie, long enough to make it worth unloading/reloading) then it should really unload it. MMap would still have to write all those pages to disk... DocValues actually makes this a lot cheaper because loading DocValues is much (like ~100 X from Simon's testing) faster than populating FieldCache since FieldCache must do all the uninverting. Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036288#comment-13036288 ] Doron Cohen commented on SOLR-2500: --- FWIW, also the first clean run would fail if test's tearDown() is modified like this: {noformat} -persistedFile.delete(); +assertTrue("could not delete "+persistedFile, persistedFile.delete()); {noformat} For some reason it fails to remove that file - in both Linux and Windows. TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3108) Land DocValues on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036290#comment-13036290 ] Yonik Seeley commented on LUCENE-3108: -- bq. ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to it as DocValues.Source, then it's clear? Both ValueSource and DocValues have long been used by function queries. Land DocValues on trunk --- Key: LUCENE-3108 URL: https://issues.apache.org/jira/browse/LUCENE-3108 Project: Lucene - Java Issue Type: Task Components: core/index, core/search, core/store Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3108.patch Its time to move another feature from branch to trunk. I want to start this process now while still a couple of issues remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions) but I think those are not worth separate issues so we can resolve them as we go. The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk. Here is a quick feature overview of what has been implemented: * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size each in sorted, straight and deref variations) * Integration into Flex-API, Codec provides a PerDocConsumer-DocValuesConsumer (write) / PerDocValues-DocValues (read) * By-Default enabled in all codecs except of PreFlex * Follows other flex-API patterns like non-segment reader throw UOE forcing MultiPerDocValues if on DirReader etc. * Integration into IndexWriter, FieldInfos etc. * Random-testing enabled via RandomIW - injecting random DocValues into documents * Basic checks in CheckIndex (which runs after each test) * FieldComparator for int and float variants (Sorting, currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually) * Extended TestSort for DocValues * RAM-Resident random access API plus on-disk DocValuesEnum (currently only sequential access) - Source.java / DocValuesEnum.java * Extensible Cache implementation for RAM-Resident DocValues (by-default loaded into RAM only once and freed once IR is closed) - SourceCache.java PS: Currently the RAM resident API is named Source (Source.java) which seems too generic. I think we should rename it into RamDocValues or something like that, suggestion welcome! Any comments, questions (rants :)) are very much appreciated. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036289#comment-13036289 ] Michael McCandless commented on LUCENE-1877: OK. I would strongly recommend using the lock stress test (LockStressTest/LockVerifyServer) in Lucene to verify whichever locking you're trying is in fact working properly. Use NativeFSLockFactory as default for new API (direct ctors FSDir.open) -- Key: LUCENE-1877 URL: https://issues.apache.org/jira/browse/LUCENE-1877 Project: Lucene - Java Issue Type: Improvement Components: general/javadocs Reporter: Mark Miller Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch A user requested we add a note in IndexWriter alerting the availability of NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm exit). Seems reasonable to me - we want users to be able to easily stumble upon this class. The below code looks like a good spot to add a note - could also improve what's there a bit - opening an IndexWriter does not necessarily create a lock file - that would depend on the LockFactory used. {code} <p>Opening an <code>IndexWriter</code> creates a lock file for the directory in use. Trying to open another <code>IndexWriter</code> on the same directory will lead to a {@link LockObtainFailedException}. The {@link LockObtainFailedException} is also thrown if an IndexReader on the same directory is used to delete documents from the index.</p>{code} Anyone remember why NativeFSLockFactory is not the default over SimpleFSLockFactory? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
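[Editor's note] For anyone landing on this issue looking for the how-to rather than the default-change discussion: opting into native locks is a one-liner when opening the directory. Rough sketch against the 2.9/3.x API; the index path is a placeholder and signatures have moved in later versions.

{code}
// Sketch: explicitly choose NativeFSLockFactory when opening a directory.
// Lucene 2.9/3.x style; the index path is a placeholder.
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NativeFSLockFactory;

public class NativeLockSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index"),
                                     new NativeFSLockFactory());
    // ... open an IndexWriter on dir as usual; the OS releases the lock
    // if the JVM dies, instead of leaving a stale write.lock behind.
    dir.close();
  }
}
{code}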
[jira] [Resolved] (SOLR-1964) Double-check and fix Maven POM dependencies
[ https://issues.apache.org/jira/browse/SOLR-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe resolved SOLR-1964. --- Resolution: Duplicate Fix Version/s: 3.1 3.2 See LUCENE-2657. Double-check and fix Maven POM dependencies --- Key: SOLR-1964 URL: https://issues.apache.org/jira/browse/SOLR-1964 Project: Solr Issue Type: Bug Components: Build Reporter: Erik Hatcher Priority: Minor Fix For: 3.2, 4.0, 3.1 To include the velocity deps in solr-core-pom.xml.template, something like this: dependency groupIdvelocity/groupId artifactIdvelocity/artifactId version1.6.1/version /dependency -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2531) remove some per-term waste in SimpleFacets
[ https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036294#comment-13036294 ] Yonik Seeley commented on SOLR-2531: Yep - looks like dead code. remove some per-term waste in SimpleFacets -- Key: SOLR-2531 URL: https://issues.apache.org/jira/browse/SOLR-2531 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-2531.patch While looking at SOLR-2530, Seems like in the 'use filter cache' case of SimpleFacets we: 1. convert the bytes from utf8-utf16 2. create a string from the utf16 3. create a Term object from the string doesn't seem like any of this is necessary, as the Term is unused... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2531) remove some per-term waste in SimpleFacets
[ https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036297#comment-13036297 ] Robert Muir commented on SOLR-2531: --- do the tests cover this = minDF case well? If so, I'll commit. remove some per-term waste in SimpleFacets -- Key: SOLR-2531 URL: https://issues.apache.org/jira/browse/SOLR-2531 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-2531.patch While looking at SOLR-2530, Seems like in the 'use filter cache' case of SimpleFacets we: 1. convert the bytes from utf8-utf16 2. create a string from the utf16 3. create a Term object from the string doesn't seem like any of this is necessary, as the Term is unused... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2531) remove some per-term waste in SimpleFacets
[ https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036299#comment-13036299 ] Yonik Seeley commented on SOLR-2531: Yep - the minDF (to use the filter cache) defaults to 0. remove some per-term waste in SimpleFacets -- Key: SOLR-2531 URL: https://issues.apache.org/jira/browse/SOLR-2531 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-2531.patch While looking at SOLR-2530, Seems like in the 'use filter cache' case of SimpleFacets we: 1. convert the bytes from utf8-utf16 2. create a string from the utf16 3. create a Term object from the string doesn't seem like any of this is necessary, as the Term is unused... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036300#comment-13036300 ] Doron Cohen commented on SOLR-2500: --- Oops just noticed I was testing all this time TestSolrProperties and not TestSolrCoreProperties, and, because the error message was the same as in the issue description *No such core: core0* I was sure that this is the same test... Now this is confusing... Hmmm.. the original exception reported above is [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) So perhaps I was working on the correct bug after all and just the JIRA issue title is inaccurate? Or I need to call it a day... :) Anyhow, TestSolrProperties consistently behaves as I described here, while TestSolrCoreProperties consistently passes (when ran in standalone mode). TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files
[ https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036307#comment-13036307 ] Michael McCandless commented on LUCENE-3123: Does that repro line reproduce the failure for you Doron? It's odd because that test doesn't make that many fields... oh I see it makes a 100 segment index. I'll drop that to 50... The nightly build also hits too-many-open-files every so often, I suspect because our random-per-field-codec is making too many codecs... I wonder if we should throttle it? Ie if it accumulates too many codecs, to start sharing them b/w fields? TestIndexWriter.testBackgroundOptimize fails with too many open files - Key: LUCENE-3123 URL: https://issues.apache.org/jira/browse/LUCENE-3123 Project: Lucene - Java Issue Type: Bug Components: core/index Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2 Reporter: Doron Cohen Recreate with this line: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 Might be related to LUCENE-2873 ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files
[ https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036308#comment-13036308 ] Michael McCandless commented on LUCENE-3123: I dropped it from 100 to 50 segs. Can you test if that works in your env Doron? TestIndexWriter.testBackgroundOptimize fails with too many open files - Key: LUCENE-3123 URL: https://issues.apache.org/jira/browse/LUCENE-3123 Project: Lucene - Java Issue Type: Bug Components: core/index Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2 Reporter: Doron Cohen Recreate with this line: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 Might be related to LUCENE-2873 ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2531) remove some per-term waste in SimpleFacets
[ https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-2531. --- Resolution: Fixed Committed revision 1125011. Thanks for reviewing Yonik. remove some per-term waste in SimpleFacets -- Key: SOLR-2531 URL: https://issues.apache.org/jira/browse/SOLR-2531 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-2531.patch While looking at SOLR-2530, Seems like in the 'use filter cache' case of SimpleFacets we: 1. convert the bytes from utf8-utf16 2. create a string from the utf16 3. create a Term object from the string doesn't seem like any of this is necessary, as the Term is unused... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036318#comment-13036318 ] Michael McCandless commented on SOLR-2500: -- For me, it's TestSolrProperties that reliably fails if it's been run before. Ie, it passes on first run after ant clean but then fails thereafter. TestSolrCoreProperties seems to run fine. (Fedora 13). TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
: I think we should focus on everything that's *infrastructure* in 4.0, so : that we can develop additional features in subsequent 4.x releases. If we : end up releasing 4.0 just to discover many things will need to wait to 5.0, : it'll be a big loss. the catch with that approach (i'm speaking generally here, not with any of these particular lucene examples in mind) is that it's hard to know that the infrastructure really makes sense until you've built a bunch of stuff on it -- i think Josh Bloch has a paper where he says that you shouldn't publish an API abstraction until you've built at least 3 *real* (ie: not just toy or example) implementations of that API. it would be really easy to say the infrastructure for X, Y, and Z is all in 4.0, features that leverage this infra will start coming in 4.1 and then discover on the way to 4.1 that we botched the APIs. what does this mean concretely for the specific big ticket changes that we've got on trunk? ... i dunno, just my word of caution. : we just started the discussion about Lucene 3.2 and releasing more : often. Yet, I think we should also start planning for Lucene 4.0 soon. : We have tons of stuff in trunk that people want to have and we can't : just keep on talking about it - we need to push this out to our users. I agree, but i think the other approach we should take is to be more agressive about reviewing things that would be good candidates for backporting. If we feel like some feature has a well defined API on trunk, and it's got good tests, and people have been using it and filing bugs and helping to make it better then we should consider it a candidate for backporting -- if the merge itself looks like it would be a huge pain in hte ass we don't *have* to backport, but we should at least look. That may not help for any of the big ticket infra changes discussed in this thread (where we know it really needs to wait for a major release) but it would definitely help with the get features out to users faster issue. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036319#comment-13036319 ] Robert Muir commented on SOLR-2500: --- OK, i think you might be right... TestSolrProperties is the one that just failed for me. I'll look into this test now (though I'm still confused about TestSolrCoreProperties but i'll let that be) TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files
[ https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036322#comment-13036322 ] Doron Cohen commented on LUCENE-3123: - Yes, thanks, now it passes (trunk) - with this seed as well quite a few times without specifying a seed. I'll now verify on 3x. TestIndexWriter.testBackgroundOptimize fails with too many open files - Key: LUCENE-3123 URL: https://issues.apache.org/jira/browse/LUCENE-3123 Project: Lucene - Java Issue Type: Bug Components: core/index Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2 Reporter: Doron Cohen Recreate with this line: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 Might be related to LUCENE-2873 ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
On Thu, May 19, 2011 at 20:43, Michael McCandless luc...@mikemccandless.com wrote: On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM? I don't know of a straight up comparison... I did compare MMapDir vs RAMDir variant a couple of years ago. Searches slowed down a teeny-weeny little bit. GC times went down noticeably. For me it was a big win. Whatever Mike might say, mmap is great for latency-conscious applications : ) If someone tries to create artificial benchmark for byte[] VS ByteBuffer, I'd recommend going through Lucene's abstraction layer. If you simply read/write in a loop, JIT will optimize away boundary checks for byte[] in some cases. This didn't ever happen to *Buffer family for me. There's also RAM based SSDs whose performance could be comparable with well, RAM. True, though it's through layers of abstraction designed originally for serving files off of spinning magnets :) Also, with our heap based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically? True, but really if the app knows it won't need that FC entry for a long time (ie, long enough to make it worth unloading/reloading) then it should really unload it. MMap would still have to write all those pages to disk... DocValues actually makes this a lot cheaper because loading DocValues is much (like ~100 X from Simon's testing) faster than populating FieldCache since FieldCache must do all the uninverting. Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
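[Editor's note] The caveat above about JIT bounds-check elimination is easy to trip over; the toy loop below shows the kind of "too simple" benchmark being warned against. It is a sketch only: a fair comparison would go through Lucene's IndexInput abstraction and use proper JVM warmup, neither of which is done here.

{code}
// Deliberately naive micro-benchmark of byte[] vs. ByteBuffer reads.
// As noted above, results from a loop like this are easy to misread:
// the JIT can strip bounds checks from the byte[] loop but rarely for ByteBuffer.
import java.nio.ByteBuffer;

public class NaiveReadBench {
  public static void main(String[] args) {
    byte[] arr = new byte[1 << 20];
    ByteBuffer buf = ByteBuffer.allocateDirect(arr.length);
    buf.put(arr);

    long sum = 0;
    long t0 = System.nanoTime();
    for (int pass = 0; pass < 100; pass++)
      for (int i = 0; i < arr.length; i++) sum += arr[i];
    long t1 = System.nanoTime();
    for (int pass = 0; pass < 100; pass++)
      for (int i = 0; i < buf.capacity(); i++) sum += buf.get(i);
    long t2 = System.nanoTime();

    System.out.println("byte[]:     " + (t1 - t0) / 1000000 + " ms");
    System.out.println("ByteBuffer: " + (t2 - t1) / 1000000 + " ms (sum=" + sum + ")");
  }
}
{code}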
[jira] [Assigned] (SOLR-1143) Return partial results when a connection to a shard is refused
[ https://issues.apache.org/jira/browse/SOLR-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-1143: - Assignee: (was: Grant Ingersoll) Return partial results when a connection to a shard is refused -- Key: SOLR-1143 URL: https://issues.apache.org/jira/browse/SOLR-1143 Project: Solr Issue Type: Improvement Components: search Reporter: Nicolas Dessaigne Fix For: 3.2 Attachments: SOLR-1143-2.patch, SOLR-1143-3.patch, SOLR-1143.patch If any shard is down in a distributed search, a ConnectException is thrown. Here's a little patch that changes this behaviour: if we can't connect to a shard (ConnectException), we get partial results from the active shards. As for the TimeOut parameter (https://issues.apache.org/jira/browse/SOLR-502), we set the parameter partialResults to true. This patch also addresses a problem expressed in the mailing list about a year ago (http://www.nabble.com/partialResults,-distributed-search---SOLR-502-td19002610.html) We have a use case that needs this behaviour and we would like to know your thoughts about such a behaviour? Should it be the default behaviour for distributed search? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files
[ https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036331#comment-13036331 ] Doron Cohen commented on LUCENE-3123: - In fact, in 3x this is not reproducible with the same seed (expected as Robert once explained) and I was not able to reproduce it with no seed, tried with -Dtest.iter=100 as well (though I am not sure, would a new seed be created in each iteration? Need to verify this...) Anyhow in 3x the test passes also after svn up with this fix. So I think this can be resolved... TestIndexWriter.testBackgroundOptimize fails with too many open files - Key: LUCENE-3123 URL: https://issues.apache.org/jira/browse/LUCENE-3123 Project: Lucene - Java Issue Type: Bug Components: core/index Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 (32-bit)/cpus=1,threads=2 Reporter: Doron Cohen Recreate with this line: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize -Dtests.seed=-3981504507637360146:51354004663342240 Might be related to LUCENE-2873 ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0
[ https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-2500: -- Attachment: SOLR-2500.patch The attached patch is a workaround for the issue for now, but we should fix the test to be cleaner as I don't like whats going on here. Whats happening is the test changes its solr.xml configuration file, which is in build/tests/solr/shared/solr.xml. The next time you run the tests, it wont copy over this file because it has a newer time. In my opinion the test should really make its own private home so it won't meddle with other tests or have problems like this (we can fix the test to do this), but this is a simple intermediate fix if you guys don't mind testing it. TestSolrCoreProperties sometimes fails with no such core: core0 - Key: SOLR-2500 URL: https://issues.apache.org/jira/browse/SOLR-2500 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Attachments: SOLR-2500.patch, SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml [junit] Testsuite: org.apache.solr.client.solrj.embedded.TestSolrProperties [junit] Testcase: testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): Caused an ERROR [junit] No such core: core0 [junit] org.apache.solr.common.SolrException: No such core: core0 [junit] at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118) [junit] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) [junit] at org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
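[Editor's note] A sketch of the "private home" idea mentioned in the comment above (illustrative only, not the committed fix): copy the pristine solr.xml into a per-run temp directory in setUp so the persisted rewrite never touches the shared copy. The helper name and paths are made up.

{code}
// Illustrative setUp helper: give the test its own throwaway solr home so a
// persisted solr.xml can never leak into the next run. Names/paths are made up.
import java.io.*;

public class PrivateSolrHomeSketch {
  static File makePrivateHome(File pristineSolrXml) throws IOException {
    File home = new File(System.getProperty("java.io.tmpdir"),
                         "solrtest-" + System.nanoTime());
    if (!home.mkdirs()) throw new IOException("could not create " + home);
    copy(pristineSolrXml, new File(home, "solr.xml"));
    return home;
  }

  static void copy(File src, File dst) throws IOException {
    InputStream in = new FileInputStream(src);
    OutputStream out = new FileOutputStream(dst);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
    } finally {
      in.close();
      out.close();
    }
  }
}
{code}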
[jira] [Resolved] (SOLR-2371) Add a min() function query, upgrade max() function query to take two value sources
[ https://issues.apache.org/jira/browse/SOLR-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-2371. --- Resolution: Fixed Add a min() function query, upgrade max() function query to take two value sources -- Key: SOLR-2371 URL: https://issues.apache.org/jira/browse/SOLR-2371 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Priority: Trivial Fix For: 3.2, 4.0 Attachments: SOLR-2371.patch There doesn't appear to be a min() function. Also, max() only allows a value source and a constant b/c it is from before we had more flexible parsing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org