RE: Need help building JCC on windows

2011-05-19 Thread Thomas Koch
Dear Baseer,
I've never tried with MinGW32, but I succeeded in a build with Microsoft
Visual Studio (Express 9.0) on Win7/32bit. Eggs are available for download
at
http://code.google.com/p/pylucene-win32-binary/downloads/list

If you're only looking for JCC - version 2.6 is available for download. If
you need a more recent version I can try to build and upload. 

Of course that doesn't help you with the Mingw issue. Maybe someone else has
more experience with that one...

Regards,
Thomas
--
OrbiTeam Software GmbH & Co. KG
Endenicher Allee 35
53121 Bonn 
Germany
http://www.orbiteam.de



 -----Original Message-----
 From: Baseer Khan [mailto:bas...@yahoo.com]
 Sent: Thursday, May 19, 2011 6:26 AM
 To: pylucene-dev@lucene.apache.org
 Subject: Need help building JCC on windows
 
 I can't seem to resolve the undefined reference issue while building JCC
 on Windows 7 using the MinGW32 compiler.
 
 Any help?
 
 C:\glassfish3\jdk
 running install
 running bdist_egg
 running egg_info
 writing JCC.egg-info\PKG-INFO
 writing top-level names to JCC.egg-info\top_level.txt
 writing dependency_links to JCC.egg-info\dependency_links.txt
 reading manifest template 'MANIFEST.in'
 writing manifest file 'JCC.egg-info\SOURCES.txt'
 installing library code to build\bdist.win32\egg
 running install_lib
 running build_py
 writing C:\Users\baseer\devenv\jcc\jcc\config.py
 copying jcc\config.py -> build\lib.win32-2.7\jcc
 copying jcc\cpp.py -> build\lib.win32-2.7\jcc
 copying jcc\python.py -> build\lib.win32-2.7\jcc
 copying jcc\windows.py -> build\lib.win32-2.7\jcc
 copying jcc\__init__.py -> build\lib.win32-2.7\jcc
 copying jcc\__main__.py -> build\lib.win32-2.7\jcc
 copying jcc\sources\functions.cpp -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\JArray.cpp -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\jcc.cpp -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\JCCEnv.cpp -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\JObject.cpp -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\types.cpp -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\functions.h -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\JArray.h -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\JCCEnv.h -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\jccfuncs.h -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\JObject.h -> build\lib.win32-2.7\jcc\sources
 copying jcc\sources\macros.h -> build\lib.win32-2.7\jcc\sources
 copying jcc\patches\patch.4195 -> build\lib.win32-2.7\jcc\patches
 copying jcc\patches\patch.43.0.6c11 -> build\lib.win32-2.7\jcc\patches
 copying jcc\patches\patch.43.0.6c7 -> build\lib.win32-2.7\jcc\patches
 copying jcc\jcc.lib -> build\lib.win32-2.7\jcc
 copying jcc\classes\org\apache\jcc\PythonVM.class -> build\lib.win32-2.7\jcc\classes\org\apache\jcc
 copying jcc\classes\org\apache\jcc\PythonException.class -> build\lib.win32-2.7\jcc\classes\org\apache\jcc
 running build_ext
 building 'jcc' extension
 C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -D_jcc_lib -DJCC_VER=2.8 -IC:\glassfish3\jdk/include -IC:\glassfish3\jdk/include/win32 -I_jcc -Ijcc/sources -IC:\Python27\include -IC:\Python27\PC -c jcc/sources/jcc.cpp -o build\temp.win32-2.7\Release\jcc\sources\jcc.o -DPYTHON -D_JNI_IMPLEMENTATION_ -fno-strict-aliasing -Wno-write-strings
 C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -D_jcc_lib -DJCC_VER=2.8 -IC:\glassfish3\jdk/include -IC:\glassfish3\jdk/include/win32 -I_jcc -Ijcc/sources -IC:\Python27\include -IC:\Python27\PC -c jcc/sources/JCCEnv.cpp -o build\temp.win32-2.7\Release\jcc\sources\jccenv.o -DPYTHON -D_JNI_IMPLEMENTATION_ -fno-strict-aliasing -Wno-write-strings
 writing build\temp.win32-2.7\Release\jcc\sources\jcc.def
 C:\MinGW\bin\g++.exe -mno-cygwin -shared -Wl,--out-implib,build\lib.win32-2.7\jcc\jcc.lib -s build\temp.win32-2.7\Release\jcc\sources\jcc.o build\temp.win32-2.7\Release\jcc\sources\jccenv.o build\temp.win32-2.7\Release\jcc\sources\jcc.def -LC:\Python27\libs -LC:\Python27\PCbuild -lpython27 -lmsvcr90 -o build\lib.win32-2.7\jcc.dll -LC:\glassfish3\jdk/lib -ljvm -Wl,-S -Wl,--out-implib,jcc\jcc.lib
 Creating library file: jcc\jcc.lib
 build\temp.win32-2.7\Release\jcc\sources\jcc.o:jcc.cpp:(.text+0xc30): undefined reference to `JNI_GetDefaultJavaVMInitArgs@4'
 build\temp.win32-2.7\Release\jcc\sources\jcc.o:jcc.cpp:(.text+0xed0): undefined reference to `JNI_CreateJavaVM@12'
 collect2: ld returned 1 exit status
 error: command 'g++' failed with exit status 1
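An editorial aside on the error itself: the `@4`/`@12` suffixes are 32-bit stdcall name decorations, where the number is the total byte size of the function's arguments. This failure pattern commonly means the MinGW linker is searching for the decorated names while jvm.dll exports plain, undecorated ones. A toy sketch of the decoration rule (illustrative only; the function here is mine, not part of JCC or MinGW):

```python
# Toy illustration of the missing symbols (not part of the JCC build):
# on 32-bit Windows, __stdcall functions get an "@N" suffix where N is
# the total byte size of their arguments, so a linker searching for the
# decorated name will not match an undecorated export of the same function.

def stdcall_decoration(name, arg_byte_sizes):
    """Decorated symbol name for a 32-bit __stdcall function."""
    return "%s@%d" % (name, sum(arg_byte_sizes))

# JNI_GetDefaultJavaVMInitArgs(void*): one 4-byte pointer
print(stdcall_decoration("JNI_GetDefaultJavaVMInitArgs", [4]))   # ...@4
# JNI_CreateJavaVM(JavaVM**, void**, void*): three 4-byte pointers
print(stdcall_decoration("JNI_CreateJavaVM", [4, 4, 4]))         # ...@12
```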




Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
Hi David,

 but with less memory.  As I understand it, FSTs are a highly compressed
 representation of a set of Strings (among other possibilities).  The

Yep. That's not their only use, but it is one of the use cases. Will you
be at Lucene Revolution next week? I'll be talking about it there.

 representation of a set of Strings (among other possibilities).  The
 fieldCache would need to point to an FST entry (an arc?) using something
 small, say an integer.  Is there a way to point to an FST entry with an
 integer, and then somehow with relative efficiency construct the String from
 the arcs to get there?

Correct me if my understanding is wrong: you'd like to assign a unique
integer to each String and then retrieve the String by that integer
(something like a Map<Integer, String>)? This would be something called
perfect hashing, and it can be done on top of an automaton (fairly
easily). I assume the data structure is immutable once constructed and
does not change too often, right?

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs

2011-05-19 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036028#comment-13036028
 ] 

Simon Willnauer commented on LUCENE-2981:
-

bq. +1 to slash and burn.

+1 go for it!

 Review and potentially remove unused/unsupported Contribs
 -

 Key: LUCENE-2981
 URL: https://issues.apache.org/jira/browse/LUCENE-2981
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2981.patch


 Some of our contribs appear to be lacking for development/support or are 
 missing tests.  We should review whether they are even pertinent these days 
 and potentially deprecate and remove them.
 One of the things we did in Mahout when bringing in Colt code was to mark all 
 code that didn't have tests as @deprecated and then we removed the 
 deprecation once tests were added.  Those that didn't get tests added over 
 about a 6 mos. period of time were removed.
 I would suggest taking a hard look at:
 ant
 db
 lucli
 swing
 (spatial should be gutted to some extent and moved to modules)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-19 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036038#comment-13036038
 ] 

Simon Willnauer commented on SOLR-1942:
---

some minor comments:

 * s/nothit/nothing/ in  // make sure we use the default if nothit is configured
 * add javadoc to CodecProvider#hasFieldCodec(String)
 * SchemaCodecProvider should maybe add its name in toString() and not just 
delegate
 * Maybe we should note in the CHANGES.TXT that IndexReaderFactory now has a 
CodecProvider that should be passed to IR#open()

otherwise it looks good though!


 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene/Solr JIRA

2011-05-19 Thread Simon Willnauer
On Wed, May 18, 2011 at 10:53 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : just a few words. I disagree here with you hoss IMO the suggestion to
 : merge JIRA would help to move us closer together and help close the
 : gap between Solr and Lucene. I think we need to start identifying us
 : with what we work on. It feels like we don't do that today and we
 : should work hard to stop that and make hard breaks that might hurt but

 I just don't see how you think that would help anything ... we still need
 to distinguish Jira issues to identify what part of the stack they affect.

 If there is a divide among the developers because of the niches where
 they tend to work, will that divide magically go away because we partition
 all issues using the component feature instead of by the Jira
 project feature?

 I don't really see how that makes any sense.

 Even if we all thought it did, and even if the cost/effort of
 migrating/converting were totally free, the user bases (who interact with
 the Solr APIs vs directly using the Lucene-Core/Module APIs) are so
 distinct that I genuinely think sticking with distinct Jira Projects
 makes more sense for our users.

 : JIRA. I'd go even further and nuke the name entirely and call
 : everything lucene - I know not many folks like the idea and it might
 : take a while to bake in but I think for us (PMC / Committers) and the

 Everything already is called Lucene ... the Project is Apache Lucene,
 the community is Lucene ... the Lucene project currently releases
 several products, and one of them is called Apache Solr ... if you're
 suggesting that we should ultimately eliminate the name Solr then we'd
 still have to decide what we're going to call that end product, the
 artifact that we ship that provides the abstraction layer that Solr
 currently provides.

 Even if you mean to suggest that we should only have one unified product
 -- one singular release artifact -- that abstraction layer still needs a
 name.  The name we have now is Solr; it has brand awareness and a user
 base who understands what it means to say they are "Installing Solr" or
 that a new feature is available when "Using Solr".

 Eliminating that name doesn't seem like it would benefit the user
 community in any way.

What I was saying / trying to say is that we as a community should
move closer together.
In all our minds, and especially in the users' minds, Solr is a project
and Lucene is a project. If we were starting over I would propose
something like Lucene-httpd or similar. But don't get me wrong, I just
went one step further than Shai since I think his idea made sense. I
don't think any of that would be a big issue for users; they use the
HTTP interface and they don't give a shit if it's called Solr or not.

For us I think it makes a big difference, in our minds though. I agree
with you that Solr is a product and Lucene is the project, but we
should enforce this. Right now all the namespaces say o.a.solr, not
o.a.lucene.solr, which implies we are two projects, and that is not
true. I am not sure how we should proceed here, but to change our minds
we must change facts. Just my opinion.

simon



 -Hoss


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
You cannot get a string out of an automaton by its ordinal without
storing additional data.
The string is stored there not as a single arc, but as a sequence of
them (basically.. err.. as a string),
so referencing them amounts to writing the string as-is. Space
savings here come from sharing arcs between strings.

Though, it's possible to do if you associate an additional number with
each node. (I invented some way, shared it with Mike and forgot.. good
grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer>
that accepts a predefined set of N strings and returns an int in the
0..N-1 interval.
And it can't do the reverse lookup, by design; that's a lossy
compression for all good perfect hashing algos.
So, it's irrelevant here, huh?

On Thu, May 19, 2011 at 08:53, David Smiley (@MITRE.org)
dsmi...@mitre.org wrote:
 I've been pondering how to reduce the size of FieldCache entries when there
 are a large number of Strings. I'd like to facet on such a field with Solr
 but with less memory.  As I understand it, FSTs are a highly compressed
 representation of a set of Strings (among other possibilities).  The
 fieldCache would need to point to an FST entry (an arc?) using something
 small, say an integer.  Is there a way to point to an FST entry with an
 integer, and then somehow with relative efficiency construct the String from
 the arcs to get there?

 ~ David Smiley

 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2960030.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 Though, it's possible to do if you associate an additional number with
 each node. (I invented some way, shared it with Mike and forgot.. good
 grief :/)

It doesn't need to be invented -- it's a known technique. On each arc
you store the number of strings under that arc; while traversing you
accumulate -- this gives you a unique number for each string (perfect
hash) and a way to locate a string given its number.
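The arc-counting idea above can be sketched in a few lines of Python, using a plain trie rather than a minimized automaton (all names here are mine and purely illustrative; the technique carries over to a DFA/FST unchanged):

```python
# Sketch of the counting technique described above: each arc stores how
# many strings lie under it; accumulating those counts while traversing
# yields a perfect hash (string -> ordinal) AND its reverse lookup.

class Node:
    def __init__(self):
        self.arcs = {}      # label -> [child, count of strings under arc]
        self.final = False

def build(sorted_strings):
    root = Node()
    for s in sorted_strings:
        node = root
        for ch in s:
            if ch not in node.arcs:
                node.arcs[ch] = [Node(), 0]
            node.arcs[ch][1] += 1          # one more string under this arc
            node = node.arcs[ch][0]
        node.final = True
    return root

def ord_of(root, s):
    """Unique ordinal of s within the sorted input set (perfect hash)."""
    node, ordinal = root, 0
    for ch in s:
        if ch not in node.arcs:
            raise KeyError(s)
        if node.final:
            ordinal += 1                   # the string ending here sorts first
        for label, (child, n) in node.arcs.items():
            if label < ch:
                ordinal += n               # strings under smaller arcs sort first
        node = node.arcs[ch][0]
    if not node.final:
        raise KeyError(s)
    return ordinal

def string_at(root, ordinal):
    """Reverse lookup: reconstruct the string with the given ordinal."""
    node, out = root, []
    while True:
        if node.final:
            if ordinal == 0:
                return "".join(out)
            ordinal -= 1
        for label in sorted(node.arcs):
            child, n = node.arcs[label]
            if ordinal < n:
                out.append(label)
                node = child
                break
            ordinal -= n
        else:
            raise IndexError("ordinal out of range")

words = ["a", "ab", "b", "ba"]
root = build(words)
print([ord_of(root, w) for w in words])    # [0, 1, 2, 3]
print(string_at(root, 3))                  # ba
```

In a minimized automaton the per-arc counts are what make shared suffixes usable for both directions of the lookup; the accumulation logic itself is the same as in this trie version.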

 And it can't do the reverse lookup, by design, that's a lossy
 compression for all good perfect hashing algos.
 So, it's irrelevant here, huh?

You can do both the way I described above. Jan Daciuk has details on
many more variants of doing that:

Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal
Bottom-up Deterministic Tree Automata, Intelligent Information Systems
XVI, Proceedings of the International IIS'08 Conference held in
Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam
Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.),
Academic Publishing House Exit, Warszawa 2008.

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)
Special Character & Hightlighting issues after 3.1.0 update
---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez


I have the same issue described here:
http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none

Looks like the highlighting code changed. Using the example doc, with 1.4 I get:

http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}

With 3.1, this now looks like:

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>:
&#234;&#226;&#238;&#244;&#251;"]}}}
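An editorial note on the symptom: the 3.1 output is the same text with the accented characters escaped as decimal numeric character references (ê is U+00EA = 234, and so on). In Python the same mapping is produced by the `xmlcharrefreplace` error handler:

```python
# The escaped 3.1 output is just decimal character references for the
# same accented characters (ê = U+00EA = 234, etc.).
text = "êâîôû"
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # &#234;&#226;&#238;&#244;&#251;
```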



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

olivier soyez updated LUCENE-3119:
--

Description: 
I have the same issue described here:
http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none

Looks like the highlighting code changed. Using the example doc, with 1.4 I get:

http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}

With 3.1, this now looks like:

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>:
{noformat}
&#234;&#226;&#238;&#244;&#251;"]}}}
{noformat}



  was:
I have the same issue described here:
http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none

Looks like the highlighting code changed. Using the example doc, with 1.4 I get:

http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}

With 3.1, this now looks like:

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>:
&#234;&#226;&#238;&#244;&#251;"]}}}




 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez

 I have the same issue described here:
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 "highlighting":{
 "UTF8TEST":{
 "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
 "UTF8TEST":{
 "features":["eaiou with <em>circumflexes</em>:
 {noformat}
 &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

olivier soyez updated LUCENE-3119:
--

Priority: Minor  (was: Major)

 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez
Priority: Minor

 I have the same issue described here:
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 "highlighting":{
 "UTF8TEST":{
 "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
 "UTF8TEST":{
 "features":["eaiou with <em>circumflexes</em>: 
 &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

olivier soyez updated LUCENE-3119:
--

Description: 
I have the same issue described here:
http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none

Looks like the highlighting code changed. Using the example doc, with 1.4 I get:

http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true

{noformat}
"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}

With 3.1, this now looks like:

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>: 
&#234;&#226;&#238;&#244;&#251;"]}}}
{noformat}



  was:
I have the same issue described here:
http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none

Looks like the highlighting code changed. Using the example doc, with 1.4 I get:

http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}

With 3.1, this now looks like:

"highlighting":{
"UTF8TEST":{
"features":["eaiou with <em>circumflexes</em>:
{noformat}
&#234;&#226;&#238;&#244;&#251;"]}}}
{noformat}




 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez

 I have the same issue described here:
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 "highlighting":{
 "UTF8TEST":{
 "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
 "UTF8TEST":{
 "features":["eaiou with <em>circumflexes</em>: 
 &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

2011-05-19 Thread Doron Cohen (JIRA)
span query matches too many docs when two query terms are the same unless 
inOrder=true
--

 Key: LUCENE-3120
 URL: https://issues.apache.org/jira/browse/LUCENE-3120
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0


spinoff of user list discussion - [SpanNearQuery - inOrder 
parameter|http://markmail.org/message/i4cstlwgjmlcfwlc].

With 3 documents:
* "a b x c d"
* "a b b d"
* "a b x b y d"

Here are a few queries (the number in parentheses indicates the expected #hits):

These ones work *as expected*:
* (1)  in-order, slop=0, "b", "x", "b"
* (1)  in-order, slop=0, "b", "b"
* (2)  in-order, slop=1, "b", "b"

These ones match *too many* hits:
* (1)  any-order, slop=0, "b", "x", "b"
* (1)  any-order, slop=1, "b", "x", "b"
* (1)  any-order, slop=2, "b", "x", "b"
* (1)  any-order, slop=3, "b", "x", "b"

These ones match *too many* hits as well:
* (1)  any-order, slop=0, "b", "b"
* (2)  any-order, slop=1, "b", "b"

Each of the above passes when using a phrase query (applying the slop; there 
is no in-order indication in a phrase query).

This seems related to a known overlapping spans issue - [non-overlapping Span 
queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, 
so we might decide to close this bug after all, but I would like to at least 
have the junit that exposes the behavior in JIRA.
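For illustration, the expected counts above can be reproduced with a small brute-force checker (hypothetical code, not Lucene's span scorer; it reads slop as the number of extra position gaps allowed, and, crucially for this bug, requires each query term to match a distinct position, so the two "b" terms cannot reuse the same token):

```python
# Hypothetical brute-force span-near checker: enumerate assignments of
# query terms to *distinct* token positions, then test window width
# (slop) and, optionally, order.

def matches(doc_tokens, terms, slop, in_order):
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                 for t in set(terms)}

    def assignments(i, used):
        # one distinct position per query term occurrence
        if i == len(terms):
            yield []
            return
        for p in positions.get(terms[i], []):
            if p not in used:
                for rest in assignments(i + 1, used | {p}):
                    yield [p] + rest

    for chosen in assignments(0, frozenset()):
        if in_order and chosen != sorted(chosen):
            continue
        if (max(chosen) - min(chosen)) - (len(terms) - 1) <= slop:
            return True
    return False

docs = ["a b x c d", "a b b d", "a b x b y d"]

def hits(terms, slop, in_order):
    return sum(matches(d.split(), terms, slop, in_order) for d in docs)

print(hits(["b", "x", "b"], 0, True))   # 1 (in-order, slop=0)
print(hits(["b", "b"], 1, True))        # 2 (in-order, slop=1)
print(hits(["b", "x", "b"], 3, False))  # 1 (any-order, slop=3)
```

Under the distinct-positions reading, the any-order cases produce exactly the expected counts listed in the issue.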

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

2011-05-19 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3120:


Attachment: LUCENE-3120.patch

Attached test case demonstrating the bug.

 span query matches too many docs when two query terms are the same unless 
 inOrder=true
 --

 Key: LUCENE-3120
 URL: https://issues.apache.org/jira/browse/LUCENE-3120
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3120.patch


 spinoff of user list discussion - [SpanNearQuery - inOrder 
 parameter|http://markmail.org/message/i4cstlwgjmlcfwlc].
 With 3 documents:
 *  a b x c d
 *  a b b d
 *  a b x b y d
 Here are a few queries (the number in parentheses indicates the expected #hits):
 These ones work *as expected*:
 * (1)  in-order, slop=0, b, x, b
 * (1)  in-order, slop=0, b, b
 * (2)  in-order, slop=1, b, b
 These ones match *too many* hits:
 * (1)  any-order, slop=0, b, x, b
 * (1)  any-order, slop=1, b, x, b
 * (1)  any-order, slop=2, b, x, b
 * (1)  any-order, slop=3, b, x, b
 These ones match *too many* hits as well:
 * (1)  any-order, slop=0, b, b
 * (2)  any-order, slop=1, b, b
 Each of the above passes when using a phrase query (applying the slop, no 
 in-order indication in phrase query).
 This seems related to a known overlapping spans issue - [non-overlapping Span 
 queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, 
 so we might decide to close this bug after all, but I would like to at least 
 have the junit that exposes the behavior in JIRA.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

2011-05-19 Thread Greg Tarr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036080#comment-13036080
 ] 

Greg Tarr commented on LUCENE-3120:
---

Thanks for raising this.

 span query matches too many docs when two query terms are the same unless 
 inOrder=true
 --

 Key: LUCENE-3120
 URL: https://issues.apache.org/jira/browse/LUCENE-3120
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3120.patch


 spinoff of user list discussion - [SpanNearQuery - inOrder 
 parameter|http://markmail.org/message/i4cstlwgjmlcfwlc].
 With 3 documents:
 *  a b x c d
 *  a b b d
 *  a b x b y d
 Here are a few queries (the number in parentheses indicates the expected #hits):
 These ones work *as expected*:
 * (1)  in-order, slop=0, b, x, b
 * (1)  in-order, slop=0, b, b
 * (2)  in-order, slop=1, b, b
 These ones match *too many* hits:
 * (1)  any-order, slop=0, b, x, b
 * (1)  any-order, slop=1, b, x, b
 * (1)  any-order, slop=2, b, x, b
 * (1)  any-order, slop=3, b, x, b
 These ones match *too many* hits as well:
 * (1)  any-order, slop=0, b, b
 * (2)  any-order, slop=1, b, b
 Each of the above passes when using a phrase query (applying the slop, no 
 in-order indication in phrase query).
 This seems related to a known overlapping spans issue - [non-overlapping Span 
 queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, 
 so we might decide to close this bug after all, but I would like to at least 
 have the junit that exposes the behavior in JIRA.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
We should add this (lookup by value, when value is guaranteed to
monotonically increase as the key increases) to our core FST APIs?
It's generically useful in many places ;)  I'll open an issue.

EG this would also enable an FST terms index that supports
lookup-by-ord, something VariableGapTermsIndex (this is the one that
uses FST for the index) does not support today.

David, one thing to remember is trunk has already seen drastic
reductions in the RAM required to store DocTerms/Index vs 3.x
(something maybe we should backport to 3.x...).  The bytes for the
terms are now stored as shared byte[] blocks, and the ords/offsets are
stored as packed ints, so we no longer have per-String memory/pointer
overhead.  I describe the gains here:
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
-- though, those gains include RAM reduction from the terms index as
well.

While FST w/ lookup-by-monotonic-value would work here, I would be
worried about the perf hit of that representation vs what
DocTerms/Index offers today... we should test to see.  Of course, for
certain apps that perf hit is justified, so probably we should make
this an option when populating the field cache (ie, an in-memory
storage option of using an FST vs using packed ints/byte[]).
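The shared-byte[]-plus-packed-ints layout Mike describes can be sketched as follows (hypothetical and heavily simplified; all names are mine): term bytes live in one shared buffer, and the per-term offsets are stored with a fixed minimal bit width instead of one pointer-sized slot each.

```python
# Sketch: store all term bytes in one shared buffer, and pack the
# per-term offsets at a minimal fixed bit width (no per-term object or
# pointer overhead).
import math

def pack(values, bits):
    """Pack non-negative ints into one big int, `bits` bits per value."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits)
    return packed

def unpack(packed, index, bits):
    return (packed >> (index * bits)) & ((1 << bits) - 1)

terms = [b"apple", b"banana", b"cherry"]
buf = b"".join(terms)                  # shared byte block
offsets = [0]
for t in terms[:-1]:
    offsets.append(offsets[-1] + len(t))

bits = max(1, math.ceil(math.log2(len(buf) + 1)))   # minimal bits per offset
packed = pack(offsets, bits)

# recover term #1 from the shared buffer
start, end = unpack(packed, 1, bits), unpack(packed, 2, bits)
print(buf[start:end])  # b'banana'
```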

Mike

http://blog.mikemccandless.com

On Thu, May 19, 2011 at 4:43 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:
 Though, it's possible to do if you associate an additional number with
 each node. (I invented some way, shared it with Mike and forgot.. good
 grief :/)

 It doesn't need to be invented -- it's a known technique. On each arc
 you store the number of strings under that arc; while traversing you
 accumulate -- this gives you a unique number for each string (perfect
 hash) and a way to locate a string given its number.

 And it can't do the reverse lookup, by design, that's a lossy
 compression for all good perfect hashing algos.
 So, it's irrelevant here, huh?

 You can do both the way I described above. Jan Daciuk has details on
 many more variants of doing that:

 Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal
 Bottom-up Deterministic Tree Automata, Intelligent Information Systems
 XVI, Proceedings of the International IIS'08 Conference held in
 Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam
 Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.),
 Academic Publishing House Exit, Warszawa 2008.

 Dawid

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3121) FST should offer lookup-by-output API when output strictly increases

2011-05-19 Thread Michael McCandless (JIRA)
FST should offer lookup-by-output API when output strictly increases


 Key: LUCENE-3121
 URL: https://issues.apache.org/jira/browse/LUCENE-3121
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Reporter: Michael McCandless
 Fix For: 4.0


Spinoff from FST and FieldCache java-dev thread 
http://lucene.markmail.org/thread/swoawlv3fq4dntvl

FST is able to associate arbitrary outputs with the sorted input keys, but in 
the special (and, common) case where the function is strictly monotonic (each 
output only increases vs prior outputs), such as mapping to term ords or 
mapping to file offsets in the terms dict, we should offer a lookup-by-output 
API that efficiently walks the FST and locates input key (exact or floor or 
ceil) matching that output.
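For illustration, the exact/floor lookup semantics the API would need can be sketched against a sorted parallel array standing in for the FST (class and method names here are hypothetical, not a proposed Lucene API; a real implementation would descend FST arcs instead of binary-searching arrays):

```java
import java.util.*;

// Stand-in illustration of lookup-by-output semantics: keys are sorted and
// outputs strictly increase with key order, so a binary search over outputs
// recovers the key for a given output (exact or floor match).
class MonotonicLookup {
    private final String[] keys;
    private final long[] outputs;   // strictly increasing

    MonotonicLookup(String[] keys, long[] outputs) {
        this.keys = keys;
        this.outputs = outputs;
    }

    // Returns the key whose output is the largest value <= target
    // (floor semantics), or null if target precedes all outputs.
    String floorByOutput(long target) {
        int idx = Arrays.binarySearch(outputs, target);
        if (idx < 0) idx = -idx - 2;   // -(insertion point) - 1  ->  floor index
        return idx < 0 ? null : keys[idx];
    }
}
```

Because the outputs are strictly monotonic, the floor answer is unambiguous; ceil lookup is the symmetric case.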


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 6:16 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:
 We should add this (lookup by value, when value is guaranteed to
 monotonically increase as the key increases) to our core FST APIs?
 It's generically useful in many places ;)  I'll open an issue.

 The data structure itself should sort of build itself if you create
 an FST with increasing integers because the shared suffix should be
 pushed towards the root anyway, so the only thing would be to correct
 values on all outgoing arcs (they need to contain the count of leaves
 on the subtree) but then, this may be tricky if arc values are
 vcoded... I'd have to think how to do this.

I think, if we add ord as an output to the FST, then it builds
everything we need?  Ie no further data structures should be needed?
Maybe I'm confused :)

 While FST w/ lookup-by-monotonic-value would work here, I would be
 worried about the perf hit of that representation vs what

 There are actually two things:

 a) performance; you need to descend in the automaton and some
 bookkeeping to maintain the count of nodes; this adds overhead,

 b) size; the procedure for storing/ calculating perfect hashes I
 described requires leaf counts on each arc and these are usually large
 integers. Even vcoded they bloat the resulting data structure.

Maybe we should iterate on the issue to get down to the specifics?  I
had thought there wouldn't be any backtracking, if the FST had stored
the ord as an output...

Mike

http://blog.mikemccandless.com




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036104#comment-13036104
 ] 

Michael McCandless commented on SOLR-2519:
--

+1 to naming these fields text_example_XXX.  That's a great idea Jan.  I'll do 
that in my next patch...

 Improve the defaults for the text field type in default schema.xml
 

 Key: SOLR-2519
 URL: https://issues.apache.org/jira/browse/SOLR-2519
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: SOLR-2519.patch


 Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
 The text fieldType in schema.xml is unusable for non-whitespace
 languages, because it has the dangerous auto-phrase feature (of
 Lucene's QP -- see LUCENE-2458) enabled.
 Lucene leaves this off by default, as does ElasticSearch
 (http://www.elasticsearch.org/).
 Furthermore, the text fieldType uses WhitespaceTokenizer when
 StandardTokenizer is a better cross-language default.
 Until we have language specific field types, I think we should fix
 the text fieldType to work well for all languages, by:
   * Switching from WhitespaceTokenizer to StandardTokenizer
   * Turning off auto-phrase




[jira] [Commented] (SOLR-2526) Grouping on multiple fields

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036105#comment-13036105
 ] 

Michael McCandless commented on SOLR-2526:
--

Martin, that's a great point -- once we've factored out FunctionQuery, it 
should be easy to make an FQ (does one already exist?) that holds an N-tuple of 
other FQ values.

 Grouping on multiple fields
 ---

 Key: SOLR-2526
 URL: https://issues.apache.org/jira/browse/SOLR-2526
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 4.0
Reporter: Arian Karbasi
Priority: Minor

 Grouping on multiple fields and/or ranges should be an option (X,Y) 
 groupings.   




Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 I think, if we add ord as an output to the FST, then it builds
 everything we need?  Ie no further data structures should be needed?
 Maybe I'm confused :)

If you put the ord as an output the common part will be shifted towards the
front of the tree. This will work if you want to look up a given value
assigned to some string, but will not work if you need to look up the string
from its value. The latter case can be solved if you know which branch to
take while descending from root and the shared prefix alone won't give you
this information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk
(and other authors), but maybe some progress has been made there, I don't
know... the one I know is really conceptually simple -- since each arc
encodes the number of leaves (or input sequences) in the automaton, you know
which path must lead you to your string. For example if you have a node like
this and seek for the 12-th term:

0 -- 10 -- ...
  +- 10 -- ...
  +- 5 -- ..

you look at the first path, it'd give you terms 1..10, then the next one
contains terms 11..20 so you add 10 to an internal counter which is added to
further computations, descend and repeat the procedure until you find a leaf
node.
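For illustration, that descent can be sketched on a toy trie that keeps a per-subtree leaf count (a hypothetical `CountingTrie`, not Lucene's FST); the same counts give both string-to-ordinal and ordinal-to-string lookup, i.e. the perfect hash in both directions:

```java
import java.util.*;

// Toy trie where each node records how many strings live in its subtree.
// Descending while accumulating counts of earlier siblings yields a unique
// ordinal per string, and the counts also steer ordinal-to-string lookup.
class CountingTrie {
    private final TreeMap<Character, CountingTrie> children = new TreeMap<>();
    private boolean terminal;   // a string ends at this node
    private int count;          // number of strings in this subtree

    void add(String s) { add(s, 0); }

    private void add(String s, int i) {
        count++;
        if (i == s.length()) { terminal = true; return; }
        children.computeIfAbsent(s.charAt(i), c -> new CountingTrie()).add(s, i + 1);
    }

    // string -> ordinal: sum counts of smaller siblings while descending
    int ordinal(String s) {
        CountingTrie node = this;
        int ord = 0;
        for (int i = 0; i < s.length(); i++) {
            if (node.terminal) ord++;   // a shorter string sorts first
            for (Map.Entry<Character, CountingTrie> e : node.children.entrySet()) {
                if (e.getKey() < s.charAt(i)) ord += e.getValue().count;
                else break;
            }
            node = node.children.get(s.charAt(i));
        }
        return ord;
    }

    // ordinal -> string: pick the child whose count range covers ord
    String byOrdinal(int ord) {
        StringBuilder sb = new StringBuilder();
        CountingTrie node = this;
        while (true) {
            if (node.terminal) {
                if (ord == 0) return sb.toString();
                ord--;
            }
            for (Map.Entry<Character, CountingTrie> e : node.children.entrySet()) {
                if (ord < e.getValue().count) {
                    sb.append(e.getKey());
                    node = e.getValue();
                    break;
                }
                ord -= e.getValue().count;
            }
        }
    }
}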

Dawid


[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036107#comment-13036107
 ] 

Doron Cohen commented on LUCENE-3068:
-

Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html 
(Mike this is a great tool!) I see no particular slowdown at the last runs.

A thought about these benchmarks, it would be helpful if the checked revision 
would be shown - perhaps as part of the hover text when hovering the mouse on a 
graph point...

 The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at 
 same position
 --

 Key: LUCENE-3068
 URL: https://issues.apache.org/jira/browse/LUCENE-3068
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.0.3, 3.1, 4.0
Reporter: Michael McCandless
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, 
 LUCENE-3068.patch


 In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was
 matching docs that it shouldn't; but I think those changes caused it
 to fail to match docs that it should, specifically when the doc itself
 has tokens at the same position.




[jira] [Created] (LUCENE-3122) Cascaded grouping

2011-05-19 Thread Michael McCandless (JIRA)
Cascaded grouping
-

 Key: LUCENE-3122
 URL: https://issues.apache.org/jira/browse/LUCENE-3122
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/grouping
Reporter: Michael McCandless
 Fix For: 3.2, 4.0


Similar to SOLR-2526, in that you are grouping on 2 separate fields, but 
instead of treating those fields as a single grouping by a compound key, this 
change would let you first group on key1 for the primary groups and then 
secondarily on key2 within the primary groups.

Ie, the result you get back would have groups A, B, C (grouped by key1) but 
then the documents within group A would be grouped by key 2.

I think this will be important for apps whose documents are the product of 
denormalizing, ie where the Lucene document is really a sub-document of a 
different identifier field.  Borrowing an example from LUCENE-3097, you have 
doctors but each doctor may have multiple offices (addresses) where they 
practice and so you index doctor X address as your lucene documents.  In this 
case, your identifier field (that which counts for facets, and should be 
grouped for presentation) is doctorid.  When you offer users search over this 
index, you'd likely want to 1) group by distance (ie, < 0.1 miles, < 0.2 miles, 
etc., as a function query), but 2) also group by doctorid, ie cascaded grouping.

I suspect this would be easier to implement than it sounds: the per-group 
collector used by the 2nd pass grouping collector for key1's grouping just 
needs to be another grouping collector.  Spookily, though, that collection 
would also have to be 2-pass, so it could get tricky since grouping is sort of 
recursing on itself.  Once we have LUCENE-3112, though, that should enable 
efficient single pass grouping by the identifier (doctorid).
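Ignoring collectors and the 2-pass issue entirely, the intended result shape (primary groups by key1, sub-groups by key2) would be something like this sketch, where `Doc` and its fields are hypothetical stand-ins for key1/key2:

```java
import java.util.*;

// Minimal sketch of cascaded grouping: documents are first grouped by key1
// (a distance bucket), then the documents inside each primary group are
// grouped again by key2 (the doctor id).
class CascadedGrouping {
    static class Doc {
        final String distanceBucket, doctorId, name;
        Doc(String distanceBucket, String doctorId, String name) {
            this.distanceBucket = distanceBucket;
            this.doctorId = doctorId;
            this.name = name;
        }
    }

    // LinkedHashMap keeps first-seen (i.e. collection) order of groups.
    static Map<String, Map<String, List<Doc>>> group(List<Doc> docs) {
        Map<String, Map<String, List<Doc>>> result = new LinkedHashMap<>();
        for (Doc d : docs) {
            result.computeIfAbsent(d.distanceBucket, k -> new LinkedHashMap<>())
                  .computeIfAbsent(d.doctorId, k -> new ArrayList<>())
                  .add(d);
        }
        return result;
    }
}
```

The Lucene version would of course build these groups with (possibly two-pass) collectors over the top-N hits rather than over a materialized list.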




[jira] [Issue Comment Edited] (SOLR-2526) Grouping on multiple fields

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036105#comment-13036105
 ] 

Michael McCandless edited comment on SOLR-2526 at 5/19/11 10:47 AM:


Martijn, that's a great point -- once we've factored out FunctionQuery, it 
should be easy to make an FQ (does one already exist?) that holds an N-tuple of 
other FQ values.

  was (Author: mikemccand):
Martin, that's a great point -- once we've factored out FunctionQuery, it 
should be easy to make an FQ (does one already exist?) that holds an N-tuple of 
other FQ values.
  
 Grouping on multiple fields
 ---

 Key: SOLR-2526
 URL: https://issues.apache.org/jira/browse/SOLR-2526
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 4.0
Reporter: Arian Karbasi
Priority: Minor

 Grouping on multiple fields and/or ranges should be an option (X,Y) 
 groupings.   




[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036109#comment-13036109
 ] 

Michael McCandless commented on LUCENE-3068:


bq. A thought about these benchmarks, it would be helpful if the checked 
revision would be shown - perhaps as part of the hover text when hovering the 
mouse on a graph point..

Good idea!  I'll try to do this...

Note that if you go back to the root page, and click on a given day, it tells 
you the svn rev and also hg ref (of luceneutil), so that's a [cumbersome] way 
to get the svn rev.

 The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at 
 same position
 --

 Key: LUCENE-3068
 URL: https://issues.apache.org/jira/browse/LUCENE-3068
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.0.3, 3.1, 4.0
Reporter: Michael McCandless
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, 
 LUCENE-3068.patch


 In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was
 matching docs that it shouldn't; but I think those changes caused it
 to fail to match docs that it should, specifically when the doc itself
 has tokens at the same position.




[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036111#comment-13036111
 ] 

Doron Cohen commented on LUCENE-3068:
-

bq. Note that if you go back to the root page, and click on a given day, it 
tells you the svn rev and also hg ref (of luceneutil)

Great, thanks!

So, this commit to trunk in r1124293 falls between these two:

- Tue 17/05/2011 Lucene/Solr trunk rev 1104671
- Wed 18/05/2011 Lucene/Solr trunk rev 1124524

... No measurable degradation, good!

 The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at 
 same position
 --

 Key: LUCENE-3068
 URL: https://issues.apache.org/jira/browse/LUCENE-3068
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.0.3, 3.1, 4.0
Reporter: Michael McCandless
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, 
 LUCENE-3068.patch


 In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was
 matching docs that it shouldn't; but I think those changes caused it
 to fail to match docs that it should, specifically when the doc itself
 has tokens at the same position.




[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2011-05-19 Thread Jörn Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036124#comment-13036124
 ] 

Jörn Kottmann commented on LUCENE-2899:
---

The first release is now out. I guess you will use maven for dependency 
management, you can find here how to add the released version as a dependency:
http://incubator.apache.org/opennlp/maven-dependency.html

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Priority: Minor

 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #127: POMs out of sync

2011-05-19 Thread Apache Jenkins Server
Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/127/

No tests ran.

Build Log (for compile errors):
[...truncated 15564 lines...]






[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position

2011-05-19 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036140#comment-13036140
 ] 

Simon Willnauer commented on LUCENE-3068:
-

bq. Looking at 
http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this 
is a great tool!) I see no particular slowdown at the last runs.
I love it! good that all the work on LuceneUtil pays off!

 The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at 
 same position
 --

 Key: LUCENE-3068
 URL: https://issues.apache.org/jira/browse/LUCENE-3068
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.0.3, 3.1, 4.0
Reporter: Michael McCandless
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, 
 LUCENE-3068.patch


 In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was
 matching docs that it shouldn't; but I think those changes caused it
 to fail to match docs that it should, specifically when the doc itself
 has tokens at the same position.




[jira] [Commented] (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)

2011-05-19 Thread Greg Tarr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036142#comment-13036142
 ] 

Greg Tarr commented on LUCENE-1877:
---

Yes, we have multiple machines being able to write to the same index on the 
SAN. 

 Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
 --

 Key: LUCENE-1877
 URL: https://issues.apache.org/jira/browse/LUCENE-1877
 Project: Lucene - Java
  Issue Type: Improvement
  Components: general/javadocs
Reporter: Mark Miller
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, 
 LUCENE-1877.patch


 A user requested we add a note in IndexWriter alerting the availability of 
 NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm 
 exit). Seems reasonable to me - we want users to be able to easily stumble 
 upon this class. The below code looks like a good spot to add a note - could 
 also improve whats there a bit - opening an IndexWriter does not necessarily 
 create a lock file - that would depend on the LockFactory used.
 {code}  <p>Opening an <code>IndexWriter</code> creates a lock file for the 
 directory in use. Trying to open
   another <code>IndexWriter</code> on the same directory will lead to a
   {@link LockObtainFailedException}. The {@link LockObtainFailedException}
   is also thrown if an IndexReader on the same directory is used to delete 
 documents
   from the index.</p>{code}
 Anyone remember why NativeFSLockFactory is not the default over 
 SimpleFSLockFactory?




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
 I think, if we add ord as an output to the FST, then it builds
 everything we need?  Ie no further data structures should be needed?
 Maybe I'm confused :)

 If you put the ord as an output the common part will be shifted towards the
 front of the tree. This will work if you want to look up a given value
 assigned to some string, but will not work if you need to look up the string
 from its value. The latter case can be solved if you know which branch to
 take while descending from root and the shared prefix alone won't give you
 this information. At least I don't see how it could.

 I am familiar with the basic prefix hashing procedure suggested by Daciuk
 (and other authors), but maybe some progress has been made there, I don't
 know... the one I know is really conceptually simple -- since each arc
 encodes the number of leaves (or input sequences) in the automaton, you know
 which path must lead you to your string. For example if you have a node like
 this and seek for the 12-th term:

 0 -- 10 -- ...
   +- 10 -- ...
   +- 5 -- ..
 you look at the first path, it'd give you terms 1..10, then the next one
 contains terms 11..20 so you add 10 to an internal counter which is added to
 further computations, descend and repeat the procedure until you find a leaf
 node.

 Dawid

There's a possible speedup here. If, instead of storing the count of
all downstream leaves, you store the sum of counts for all previous
siblings, you can do a binary lookup instead of linear scan on each
node.
Taking your example:

0 -- 0 -- ...
  +- 10 -- ... We know that for 12-th term we should descend along
this edge, as it has the biggest tag less than 12.
  +- 15 -- ...

That's what I invented, and yes, it was invented by countless people before :)
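For illustration only, the arc choice with such prefix sums reduces to a floor search (a hypothetical helper, detached from any FST implementation):

```java
import java.util.*;

// Each outgoing arc stores the running sum of leaf counts of its earlier
// siblings, so choosing the arc containing the k-th term (0-based) is a
// binary search for the largest prefix sum <= k, not a linear scan.
class ArcSelect {
    // prefixSums[i] = number of terms under arcs 0..i-1; strictly
    // increasing, starting at 0. Returns the arc index containing ordinal k.
    static int selectArc(long[] prefixSums, long k) {
        int idx = Arrays.binarySearch(prefixSums, k);
        return idx >= 0 ? idx : -idx - 2;   // floor: largest entry <= k
    }
}
```

With the example above (arcs holding 10, 10, and 5 leaves, i.e. prefix sums 0, 10, 20), the 12th term (ordinal 11) falls on the second arc.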

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




[jira] [Commented] (LUCENE-3119) Special Character & Hightlighting issues after 3.1.0 update

2011-05-19 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036150#comment-13036150
 ] 

Koji Sekiguchi commented on LUCENE-3119:


I can reproduce this, but it is due to HtmlEncoder in solrconfig.xml (I've 
mentioned it in the mail thread), and not code change.

 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez
Priority: Minor

 I have the same issue described here: 
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I 
 get :
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 highlighting:{
 UTF8TEST:{
 features:[eaiou with <em>circumflexes</em>: êâîôû]}}}
 With 3.1, this now looks like :
 highlighting:{
 UTF8TEST:{
 features:[eaiou with <em>circumflexes</em>: 
 &#234;&#226;&#238;&#244;&#251;]}}}
 {noformat}




Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
This (storing sums) is, I think, exactly what the FST stores as
outputs on the arcs.  Ie, it bounds the range of outputs if you were
to recurse on that arc.

So, from any node, we can unambiguously determine which arc to recurse
on, when looking up by value (only if the value is strictly
monotonic).

It should be straightforward to implement, ie should not require any
additional data structure / storage in the FST.  It's a lookup-only
change, I think.

Mike

http://blog.mikemccandless.com

On Thu, May 19, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote:
 I think, if we add ord as an output to the FST, then it builds
 everything we need?  Ie no further data structures should be needed?
 Maybe I'm confused :)

 If you put the ord as an output the common part will be shifted towards the
 front of the tree. This will work if you want to look up a given value
 assigned to some string, but will not work if you need to look up the string
 from its value. The latter case can be solved if you know which branch to
 take while descending from root and the shared prefix alone won't give you
 this information. At least I don't see how it could.

 I am familiar with the basic prefix hashing procedure suggested by Daciuk
 (and other authors), but maybe some progress has been made there, I don't
 know... the one I know is really conceptually simple -- since each arc
 encodes the number of leaves (or input sequences) in the automaton, you know
 which path must lead you to your string. For example if you have a node like
 this and seek for the 12-th term:

 0 -- 10 -- ...
   +- 10 -- ...
   +- 5 -- ..
 you look at the first path, it'd give you terms 1..10, then the next one
 contains terms 11..20 so you add 10 to an internal counter which is added to
 further computations, descend and repeat the procedure until you find a leaf
 node.

 Dawid

 There's a possible speedup here. If, instead of storing the count of
 all downstream leaves, you store the sum of counts for all previous
 siblings, you can do a binary lookup instead of linear scan on each
 node.
 Taking your example:

 0 -- 0 -- ...
  +- 10 -- ... We know that for 12-th term we should descend along
 this edge, as it has the biggest tag less than 12.
  +- 15 -- ...

 That's what I invented, and yes, it was invented by countless people before :)

 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785




Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 That's what I invented, and yes, it was invented by countless people before
 :)


You know I didn't mean to sound rude, right? I'm really admiring your
ability to come up with these solutions by yourself, I'm merely copying
other folks' ideas.

Anyway, the optimization you're describing is sure possible. Lucene's FST
implementation can actually combine both approaches because always expanding
nodes is inefficient and those already expanded will allow a binary search
(assuming the automaton structure is known to the implementation).

Another refinement of this idea creates a detached table (err.. index :) of
states to start from inside the automaton, so that you don't have to go
through the initial 2-3 states which are more or less always large and even
binary search is costly there.

Dawid


[jira] [Created] (SOLR-2528) remove HtmlEncoder (or set it to default=false) from example solrconfig.xml

2011-05-19 Thread Koji Sekiguchi (JIRA)
remove HtmlEncoder (or set it to default=false) from example solrconfig.xml
---

 Key: SOLR-2528
 URL: https://issues.apache.org/jira/browse/SOLR-2528
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.1.1, 3.2, 4.0


After 3.1 released, highlight snippets that include non ascii characters are 
encoded to character references by HtmlEncoder if it is set in solrconfig.xml. 
Because solr example config has it, not a few users got confused by the output.




[jira] [Updated] (SOLR-2528) remove HtmlEncoder from example solrconfig.xml (or set it to default=false)

2011-05-19 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2528:
-

Summary: remove HtmlEncoder from example solrconfig.xml (or set it to 
default=false)  (was: remove HtmlEncoder (or set it to default=false) from 
example solrconfig.xml)

 remove HtmlEncoder from example solrconfig.xml (or set it to default=false)
 ---

 Key: SOLR-2528
 URL: https://issues.apache.org/jira/browse/SOLR-2528
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.1.1, 3.2, 4.0


 After 3.1 released, highlight snippets that include non ascii characters are 
 encoded to character references by HtmlEncoder if it is set in 
 solrconfig.xml. Because solr example config has it, not a few users got 
 confused by the output.




[jira] [Created] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files

2011-05-19 Thread Doron Cohen (JIRA)
TestIndexWriter.testBackgroundOptimize fails with too many open files
-

 Key: LUCENE-3123
 URL: https://issues.apache.org/jira/browse/LUCENE-3123
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
 Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 
1.6.0_20 (32-bit)/cpus=1,threads=2
Reporter: Doron Cohen


Recreate with this line:

ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize 
-Dtests.seed=-3981504507637360146:51354004663342240

Might be related to LUCENE-2873 ?




[jira] [Updated] (SOLR-2528) remove HtmlEncoder from example solrconfig.xml (or set it to default=false)

2011-05-19 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2528:
-

Attachment: SOLR-2528.patch

 remove HtmlEncoder from example solrconfig.xml (or set it to default=false)
 ---

 Key: SOLR-2528
 URL: https://issues.apache.org/jira/browse/SOLR-2528
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.1.1, 3.2, 4.0

 Attachments: SOLR-2528.patch


 After 3.1 was released, highlight snippets that include non-ASCII characters 
 are encoded to character references by HtmlEncoder if it is set in 
 solrconfig.xml. Because the Solr example config sets it, quite a few users 
 were confused by the output.




[jira] [Commented] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036164#comment-13036164
 ] 

Koji Sekiguchi commented on LUCENE-3119:


I opened SOLR-2528.

 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez
Priority: Minor

 I have the same issue described here: 
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I 
 get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}




[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036163#comment-13036163
 ] 

Doron Cohen commented on LUCENE-3123:
-

This is on Ubuntu btw.

Run log:
{noformat}
NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter 
-Dtestmethod=testBackgroundOptimize 
-Dtests.seed=-3981504507637360146:51354004663342240
NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter 
-Dtestmethod=testBackgroundOptimize 
-Dtests.seed=-3981504507637360146:51354004663342240
The following exceptions were thrown by threads:
*** Thread: Lucene Merge Thread #0 ***
org.apache.lucene.index.MergePolicy$MergeException: 
java.io.FileNotFoundException: /tmp/test4907593285402510583tmp/_51_0.sd (Too 
many open files)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:507)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:472)
Caused by: java.io.FileNotFoundException: 
/tmp/test4907593285402510583tmp/_51_0.sd (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:69)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:90)
at 
org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:56)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:337)
at 
org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:402)
at 
org.apache.lucene.index.codecs.mockrandom.MockRandomCodec.fieldsProducer(MockRandomCodec.java:236)
at 
org.apache.lucene.index.PerFieldCodecWrapper$FieldsReader.<init>(PerFieldCodecWrapper.java:113)
at 
org.apache.lucene.index.PerFieldCodecWrapper.fieldsProducer(PerFieldCodecWrapper.java:210)
at 
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:131)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:495)
at 
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:635)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3260)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2930)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:379)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:447)
NOTE: test params are: codec=RandomCodecProvider: {field=MockRandom}, 
locale=nl_NL, timezone=Turkey
NOTE: all tests run in this JVM:
[TestIndexWriter]
NOTE: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 1.6.0_20 
(32-bit)/cpus=1,threads=2,free=26480072,total=33468416
{noformat}

 TestIndexWriter.testBackgroundOptimize fails with too many open files
 -

 Key: LUCENE-3123
 URL: https://issues.apache.org/jira/browse/LUCENE-3123
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
 Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 
 1.6.0_20 (32-bit)/cpus=1,threads=2
Reporter: Doron Cohen

 Recreate with this line:
 ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize 
 -Dtests.seed=-3981504507637360146:51354004663342240
 Might be related to LUCENE-2873 ?




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 That's what I invented, and yes, it was invented by countless people
 before :)
 You know I didn't mean to sound rude, right? I'm really admiring your
 ability to come up with these solutions by yourself, I'm merely copying
 other folks' ideas.
I tried to prevent another reference to Mr. Daciuk :)

 Anyway, the optimization you're describing is sure possible. Lucene's FST
 implementation can actually combine both approaches because always expanding
 nodes is inefficient and those already expanded will allow a binary search
 (assuming the automaton structure is known to the implementation).
 Another refinement of this idea creates a detached table (err.. index :) of
 states to start from inside the automaton, so that you don't have to go
 through the initial 2-3 states which are more or less always large and even
 binary search is costly there.
 Dawid

But you have to look up this err.. index somehow. And that's either
a binary or a hash lookup. Where's the win?


-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: FST and FieldCache?

2011-05-19 Thread Robert Muir
2011/5/19 Michael McCandless luc...@mikemccandless.com:

 Of course, for
 certain apps that perf hit is justified, so probably we should make
 this an option when populating field cache (ie, in-memory storage
 option of using an FST vs using packed ints/byte[]).


or should we actually try to have different fieldcacheimpls?

I see all these missions to refactor the thing, which always fail.

maybe that's because we have one huge monolithic implementation.
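The "packed ints/byte[]" in-memory layout Mike alludes to above can be sketched roughly as follows. This is a toy illustration only, not Lucene's actual FieldCache code, and every name in it is invented: each unique term is stored once in a shared byte[] pool delimited by an offsets array, and each document keeps only a small ordinal into that pool (in Lucene the ords would themselves be packed ints).

```java
import java.nio.charset.StandardCharsets;

// Toy sketch of an ord-based field cache: every unique term is stored once
// in a shared byte[] pool, and each document holds only a small ordinal.
class PackedTermsSketch {
    final byte[] pool;      // concatenated UTF-8 bytes of all unique terms
    final int[] offsets;    // offsets[ord] .. offsets[ord+1] delimit term #ord
    final int[] docToOrd;   // per-document ordinal (packed ints in real life)

    PackedTermsSketch(String[] uniqueSortedTerms, int[] docToOrd) {
        this.docToOrd = docToOrd;
        offsets = new int[uniqueSortedTerms.length + 1];
        int size = 0;
        for (int i = 0; i < uniqueSortedTerms.length; i++) {
            offsets[i] = size;
            size += uniqueSortedTerms[i].getBytes(StandardCharsets.UTF_8).length;
        }
        offsets[uniqueSortedTerms.length] = size;
        pool = new byte[size];
        for (int i = 0; i < uniqueSortedTerms.length; i++) {
            byte[] b = uniqueSortedTerms[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, pool, offsets[i], b.length);
        }
    }

    // Resolve the sort key for a document: one array read plus a pool slice.
    String term(int doc) {
        int ord = docToOrd[doc];
        return new String(pool, offsets[ord], offsets[ord + 1] - offsets[ord],
                          StandardCharsets.UTF_8);
    }
}
```

Because the whole structure is two flat int arrays plus one byte[], it is also the kind of layout that could in principle be dumped to disk and memory-mapped, which is relevant to the MMap discussion later in the thread.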




Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 I tried to prevent another reference to mr. Daciuk :)

Why? He's a very nice guy :)

 But you have to lookup this err..index somehow. And that's either
 binary or hash lookup. Where's the win?

You can do a sparse O(1) index and have a slight gain from these few
large initial states. This only makes sense if you perform tons of
these lookups, really.

Mike's right -- the FST will output a structure that is ready to be
used for by-number retrieval of strings (or anything else) as long as
the numbers are strictly monotonous (and preferably continuous). The
output will be what you're suggesting, Earwin -- accumulated sums.

Dawid
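The "accumulated sums" idea Dawid describes can be illustrated with a toy trie (this is an invented sketch, not Lucene's FST code): if each node records how many terms live in its subtree, the N-th term in sorted order is found by walking down and subtracting subtree counts, with no per-term table at all.

```java
import java.util.*;

// Toy sketch of "accumulated sums": each trie node knows how many terms live
// below it, so the N-th term (in sorted order) is found by walking the trie
// and subtracting subtree counts -- no per-term lookup table needed.
class OrdTrieSketch {
    final TreeMap<Character, OrdTrieSketch> children = new TreeMap<>();
    boolean isTerm;
    int count;  // number of terms in this subtree (the "accumulated sum")

    void add(String s) {
        count++;
        if (s.isEmpty()) { isTerm = true; return; }
        children.computeIfAbsent(s.charAt(0), c -> new OrdTrieSketch())
                .add(s.substring(1));
    }

    // Retrieve the ord-th term in sorted order (0-based).
    String byOrd(int ord) {
        if (isTerm) {
            if (ord == 0) return "";
            ord--;
        }
        for (Map.Entry<Character, OrdTrieSketch> e : children.entrySet()) {
            if (ord < e.getValue().count)
                return e.getKey() + e.getValue().byOrd(ord);
            ord -= e.getValue().count;
        }
        throw new NoSuchElementException();
    }
}
```

In an FST the same effect comes from arc outputs that are sums over the subtree, which is why by-ord retrieval works as long as the assigned numbers are strictly increasing.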




Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 maybe thats because we have one huge monolithic implementation

Doesn't the DocValues branch solve this?

Also, instead of trying to implement clever ways of compressing
strings in the field cache, which probably won't bear fruit, I'd
prefer to look at [eventually] MMap'ing (using DV) the field caches to
avoid the loading and heap costs, which are significant.  I'm not sure
if we can easily MMap packed ints and the shared byte[], though it
seems fairly doable?

On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote:
 2011/5/19 Michael McCandless luc...@mikemccandless.com:

 Of course, for
 certain apps that perf hit is justified, so probably we should make
 this an option when populating field cache (ie, in-memory storage
 option of using an FST vs using packed ints/byte[]).


 or should we actually try to have different fieldcacheimpls?

 I see all these missions to refactor the thing, which always fail.

 maybe thats because we have one huge monolithic implementation.




[jira] [Commented] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036173#comment-13036173
 ] 

olivier soyez commented on LUCENE-3119:
---

Yes, it is just due to HtmlEncoder in solrconfig.xml. Thank you so much, it's 
working!

 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez
Priority: Minor

 I have the same issue described here: 
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I 
 get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}




[jira] [Resolved] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

olivier soyez resolved LUCENE-3119.
---

Resolution: Not A Problem

 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez
Priority: Minor

 I have the same issue described here: 
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I 
 get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}




[jira] [Closed] (LUCENE-3119) Special Character Hightlighting issues after 3.1.0 update

2011-05-19 Thread olivier soyez (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

olivier soyez closed LUCENE-3119.
-


 Special Character & Hightlighting issues after 3.1.0 update
 ---

 Key: LUCENE-3119
 URL: https://issues.apache.org/jira/browse/LUCENE-3119
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 3.1
 Environment: ubuntu 10.10, java version 1.6.0_02
Reporter: olivier soyez
Priority: Minor

 I have the same issue described here: 
 http://lucene.472066.n3.nabble.com/Special-Character-amp-Hightlighting-issues-after-3-1-0-update-tc2820405.html#none
 Looks like the highlighting code changed. Using the example doc, with 1.4 I 
 get:
 http://localhost:8983/solr/select?q=features:circumflexes&hl=true&hl.fl=features&wt=json&indent=true
 {noformat}
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: êâîôû"]}}}
 With 3.1, this now looks like:
 "highlighting":{
  "UTF8TEST":{
   "features":["eaiou with <em>circumflexes</em>: &#234;&#226;&#238;&#244;&#251;"]}}}
 {noformat}




[jira] [Updated] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents

2011-05-19 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3112:
---

Attachment: LUCENE-3112.patch

New patch; I think it's ready to commit, but it could use some healthy 
reviewing...

I fang'd up TestNRTThreads to add/update doc blocks and verify the docs in each 
block remain adjacent, and also added a couple of other test cases to make sure 
we test non-aborting exceptions when adding a doc block.

And I put a warning in the javadocs about possible future full re-indexing.

 Add IW.add/updateDocuments to support nested documents
 --

 Key: LUCENE-3112
 URL: https://issues.apache.org/jira/browse/LUCENE-3112
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3112.patch, LUCENE-3112.patch


 I think nested documents (LUCENE-2454) is a very compelling addition
 to Lucene.  It's also a popular (many votes) issue.
 Beyond supporting nested document querying, which is already an
 incredible addition since it preserves the relational model on
 indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
 should also enable speedups in grouping implementation when you group
 by a nested field.
 For the same reason, it can also enable very fast post-group facet
 counting impl (LUCENE-3097) when you want to
 count(distinct(nestedField)), instead of unique documents, as your
 identifier.  I expect many apps that use faceting need this ability
 (to count(distinct(nestedField)) not distinct(docID)).
 To support these use cases, I believe the only core change needed is
 the ability to atomically add or update multiple documents, which you
 cannot do today since in between add/updateDocument calls a flush (eg
 due to commit or getReader()) could occur.
 This new API (addDocuments(Iterable<Document>), updateDocuments(Term
 delTerm, Iterable<Document>)) would also further guarantee that the
 documents are assigned sequential docIDs in the order the iterator
 provided them, and that the docIDs all reside in one segment.
 Segment merging never splits segments apart, so this invariant would
 hold even as merges/optimizes take place.




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036177#comment-13036177
 ] 

Robert Muir commented on SOLR-1942:
---

Hi Simon, 

after reviewing the patch I have some concerns about CodecProvider. I think it's 
a little bit confusing how the CodecProvider/CoreCodecProvider hierarchy works 
today, and a bit dangerous how we delegate over this class.

For example, if we add a new method to CodecProvider, we need to be sure we add 
the 'delegation' here every time or stuff will start acting strange.

For this reason, I wonder if CodecProvider should be an interface: the simple 
implementation we have in lucene is a hashmap, but Solr uses fieldType lookup. 
This would parallel how SimilarityProvider works.

If we want to do this, I think we should open a separate issue... in fact I'm 
not even sure it should block this issue, since in my opinion it's a shame you 
cannot manipulate codecs in Solr right now... but I just wanted to bring it up 
here.


 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
This is more about compressing strings in the TermsIndex, I think.
And the ability to use said TermsIndex directly in some cases that
required FieldCache before. (Maybe FC is still needed, but it can be
degraded to a docId->ord map, storing the actual strings in the TI.)
This yields fat space savings when we, eg, need to both look up on a
field and build facets out of it.

mmap is cool :)  What I want to see is an FST-based TermsDict that is
simply mmapped into memory, without building intermediate indexes, like
Lucene does now.
And docvalues are orthogonal to that, no?

On Thu, May 19, 2011 at 17:22, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 maybe thats because we have one huge monolithic implementation

 Doesn't the DocValues branch solve this?

 Also, instead of trying to implement clever ways of compressing
 strings in the field cache, which probably won't bare fruit, I'd
 prefer to look at [eventually] MMap'ing (using DV) the field caches to
 avoid the loading and heap costs, which are signifcant.  I'm not sure
 if we can easily MMap packed ints and the shared byte[], though it
 seems fairly doable?

 On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote:
 2011/5/19 Michael McCandless luc...@mikemccandless.com:

 Of course, for
 certain apps that perf hit is justified, so probably we should make
 this an option when populating field cache (ie, in-memory storage
 option of using an FST vs using packed ints/byte[]).


 or should we actually try to have different fieldcacheimpls?

 I see all these missions to refactor the thing, which always fail.

 maybe thats because we have one huge monolithic implementation.






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036182#comment-13036182
 ] 

Michael McCandless commented on SOLR-1942:
--

I agree the CodecProvider/CoreCodecProvider is a scary potential delegation 
trap... Robert, can you open a new issue?  I agree it should not block this one.

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 This is more about compressing strings in TermsIndex, I think.

Ah, because they're sorted.  I think if the string lookup cost
degrades then it's not worth it?  That's something that needs to be
tested in the MMap case as well, eg, are ByteBuffers somehow slowing
everything down by, say, 10%?

On Thu, May 19, 2011 at 6:30 AM, Earwin Burrfoot ear...@gmail.com wrote:
 This is more about compressing strings in TermsIndex, I think.
 And ability to use said TermsIndex directly in some cases that
 required FieldCache before. (Maybe FC is still needed, but it can be
 degraded to docId-ord map, storing actual strings in TI).
 This yields fat space savings when we, eg,  need to both lookup on a
 field and build facets out of it.

 mmap is cool :)  What I want to see is a FST-based TermsDict that is
 simply mmaped into memory, without building intermediate indexes, like
 Lucene does now.
 And docvalues are orthogonal to that, no?





Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:

 maybe thats because we have one huge monolithic implementation

 Doesn't the DocValues branch solve this?

Hopefully DocValues will replace FieldCache over time; maybe some day
we can deprecate  remove FieldCache.

But we still have work to do there, I believe; eg we don't have
comparators for all types (on the docvalues branch) yet.

 Also, instead of trying to implement clever ways of compressing
 strings in the field cache, which probably won't bare fruit, I'd
 prefer to look at [eventually] MMap'ing (using DV) the field caches to
 avoid the loading and heap costs, which are signifcant.  I'm not sure
 if we can easily MMap packed ints and the shared byte[], though it
 seems fairly doable?

In fact, the packed ints and the byte[] packing of terms data are very
much amenable/necessary for using MMap, far more so than the separate
objects we had before.

I agree we should make an mmap option, though I would generally
recommend against apps using mmap for these caches.  We load these
caches so that we'll have fast random access to potentially a great
many documents during collection of one query (eg for sorting).  When
you mmap them you let the OS decide when to swap stuff out, which means
you pick up potentially high query latency waiting for these pages to
swap back in.  Various other data structures in Lucene need this fast
random access (norms, del docs, terms index) and that's why we put
them in RAM.  I do agree that for all else (the large postings), MMap is
great.

Of course the OS swaps out process RAM anyway, so... it's kinda moot
(unless you've fixed your OS to not do this, which I always do!).

I think a more productive area of exploration (to reduce RAM usage)
would be to make a StringFieldComparator that doesn't need full access
to all terms data, ie, operates per segment yet only does a few ord
lookups when merging the results across segments.  If that number is
small enough we can just use the seek-by-ord from the terms dict to do
them.  This would be a huge RAM reduction because we could then sort
by string fields (eg title field) without needing all term bytes
randomly accessible.

Mike

http://blog.mikemccandless.com
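Mike's per-segment idea above can be sketched with a toy merge (an invented illustration, not Lucene code, and all names here are made up): within a segment docs are already ordered by term ord, so term bytes only need to be resolved, via a stand-in for seek-by-ord, when the merge compares the current heads of different segments.

```java
import java.util.*;

// Toy sketch of per-segment ord sorting: each segment's docs arrive already
// sorted by ord (cheap int compares, assumed done up front), and term bytes
// are resolved only when the merge compares heads across segments.
class SegmentMergeSketch {
    // Stands in for the terms dict's seek-by-ord lookup.
    static String seekByOrd(List<String> segmentTerms, int ord) {
        return segmentTerms.get(ord);
    }

    // segments: each inner list is one segment's sorted terms; every entry
    // stands for one doc whose sort key is that term. Heap entries are
    // {segment, ord}; terms are resolved lazily inside the comparator.
    static List<String> mergeTop(List<List<String>> segments, int topN) {
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> seekByOrd(segments.get(e[0]), e[1])));
        for (int s = 0; s < segments.size(); s++)
            if (!segments.get(s).isEmpty()) pq.add(new int[]{s, 0});
        List<String> merged = new ArrayList<>();
        while (merged.size() < topN && !pq.isEmpty()) {
            int[] head = pq.poll();
            merged.add(seekByOrd(segments.get(head[0]), head[1]));
            if (head[1] + 1 < segments.get(head[0]).size())
                pq.add(new int[]{head[0], head[1] + 1});
        }
        return merged;
    }
}
```

The point of the sketch: only the priority-queue comparisons touch term bytes, so the number of ord-to-term lookups scales with the merge, not with the total number of terms held in RAM.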




[jira] [Created] (LUCENE-3124) review CodecProvider/CoreCodecProvider/SchemaCodecProvider hierarchy

2011-05-19 Thread Robert Muir (JIRA)
review CodecProvider/CoreCodecProvider/SchemaCodecProvider hierarchy


 Key: LUCENE-3124
 URL: https://issues.apache.org/jira/browse/LUCENE-3124
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir


As mentioned on SOLR-1942, I think we should revisit the CodecProvider 
hierarchy.

It's a little bit confusing how the class itself isn't really abstract but is 
really an overridable implementation.

One idea would be to make CodecProvider an interface, with Lucene using a 
simple hashmap-backed impl and Solr using the schema-backed impl. This would be 
in line with how SimilarityProvider was done.

It would also be good to review all the methods in CodecProvider and see if we 
can minimize the interface...





[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036184#comment-13036184
 ] 

Robert Muir commented on SOLR-1942:
---

OK I opened LUCENE-3124 for this

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




[jira] [Created] (SOLR-2529) DIH update trouble with sql field name pk

2011-05-19 Thread Thomas Gambier (JIRA)
DIH update trouble with sql field name pk
---

 Key: SOLR-2529
 URL: https://issues.apache.org/jira/browse/SOLR-2529
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.2
 Environment: Debian Lenny, JRE 6
Reporter: Thomas Gambier
Priority: Blocker


We are unable to use the DIH when the database's primary key column is named "pk".

The reported Solr error is:
deltaQuery has no column to resolve to declared primary key pk='pk'

We have made some investigations and found that the DIH makes a mistake when 
looking for the primary key among the row's columns:

private String findMatchingPkColumn(String pk, Map<String, Object> row) {
  if (row.containsKey(pk))
    throw new IllegalArgumentException(String.format(
        "deltaQuery returned a row with null for primary key %s", pk));
  String resolvedPk = null;
  for (String columnName : row.keySet()) {
    if (columnName.endsWith("." + pk) || pk.endsWith("." + columnName)) {
      if (resolvedPk != null)
        throw new IllegalArgumentException(String.format(
            "deltaQuery has more than one column (%s and %s) that might resolve "
                + "to declared primary key pk='%s'", resolvedPk, columnName, pk));
      resolvedPk = columnName;
    }
  }
  if (resolvedPk == null)
    throw new IllegalArgumentException(String.format(
        "deltaQuery has no column to resolve to declared primary key pk='%s'", pk));
  LOG.info(String.format("Resolving deltaQuery column '%s' to match entity's "
      + "declared pk '%s'", resolvedPk, pk));
  return resolvedPk;
}
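One plausible reading of the failure (an assumption on my part, not confirmed in the report): both endsWith tests in the loop above require a "." separator, so the loop only ever matches prefixed variants like "items.pk". A bare column whose spelling differs from the declared pk, for example upper-cased to "PK" by the JDBC driver, can never match. A minimal re-creation of just that loop, with invented names:

```java
import java.util.*;

// Minimal re-creation (names invented) of the matching loop from
// findMatchingPkColumn: a "pk" declared key matches "items.pk" but not a
// bare column spelled differently, because both endsWith tests need a ".".
class PkMatchSketch {
    static String resolve(String pk, Collection<String> columnNames) {
        String resolvedPk = null;
        for (String columnName : columnNames) {
            if (columnName.endsWith("." + pk) || pk.endsWith("." + columnName)) {
                resolvedPk = columnName;
            }
        }
        return resolvedPk;  // null reproduces "has no column to resolve ..."
    }
}
```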





RE: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.

2011-05-19 Thread Digy
Not just this version: Lucene.Net 2.9.4 can also read (in theory) an index 
created with 3.0.3, but I haven't tested it myself.
DIGY.

-Original Message-
From: Alexander Bauer [mailto:a...@familie-bauer.info] 
Sent: Thursday, May 19, 2011 8:37 AM
To: lucene-net-...@lucene.apache.org
Subject: Re: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing 
ArrayLists, Hashtables etc. with appropriate Generics.


Can I use this version with an existing index based on Lucene.Java 3.0.3?

Alex


On 19.05.2011 00:20, Digy (JIRA) wrote:
  [ 
 https://issues.apache.org/jira/browse/LUCENENET-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035795#comment-13035795
  ]

 Digy commented on LUCENENET-412:
 

 Hi All,

 Lucene.Net 2.9.4g is almost ready for testing & feedback.

 While injecting generics & making some cleanup in the code, I tried to stay 
 as close to Lucene.Java 3.0.3 as possible.
 Therefore its position is somewhere between Lucene.Java 2.9.4 & 3.0.3.

 DIGY


 PS: For those who might want to try this version:
 It probably won't be a drop-in replacement since there are a few API changes 
 like
 - StopAnalyzer(List<string> stopWords)
 - Query.ExtractTerms(ICollection<string>)
 - TopDocs.*TotalHits*, TopDocs.*ScoreDocs*
 and some removed methods/classes like
 - Filter.Bits
 - JustCompileSearch
 - Contrib/Similarity.Net




 Replacing ArrayLists, Hashtables etc. with appropriate Generics.
 

  Key: LUCENENET-412
  URL: https://issues.apache.org/jira/browse/LUCENENET-412
  Project: Lucene.Net
   Issue Type: Improvement
 Affects Versions: Lucene.Net 2.9.4
 Reporter: Digy
 Priority: Minor
  Fix For: Lucene.Net 2.9.4

  Attachments: IEquatable for QuerySubclasses.patch, 
 LUCENENET-412.patch, lucene_2.9.4g_exceptions_fix


 This will move Lucene.Net.2.9.4 closer to lucene.3.0.3 and allow some 
 performance gains.
 --
 This message is automatically generated by JIRA.
 For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-19 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036189#comment-13036189
 ] 

Simon Willnauer commented on SOLR-1942:
---

bq. OK I opened LUCENE-3124 for this

+1 thanks! good point!

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 When
 you mmap them you let the OS decide when to swap stuff out, which means
 you pick up potentially high query latency waiting for these pages to
 swap back in

Right, however if one is using, let's say, SSDs, and query time is
less important, then MMap'ing would be fine.  Also it prevents deadly
OOMs in favor of basic 'slowness' of the query.  If there is no
performance degradation I think MMap'ing is a great option.  A common
use case, an index that's far too large for a given server, will
simply not work today, whereas with MMap'ed field caches the query
would complete, just extremely slowly.  If the user wishes to improve
performance it's easy enough to add more hardware.
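
The trade-off being argued here (page values in on demand instead of loading them onto the heap, accepting slowness instead of an OOM) can be illustrated with a minimal Python sketch. The file name and the fixed-width on-disk layout are invented for the example; this is not how Lucene stores field caches.

```python
import mmap
import os
import struct
import tempfile

# Write a toy "field cache": one 8-byte big-endian long per document.
# (File name and layout are invented for this sketch.)
values = [7, 42, 13, 99]
path = os.path.join(tempfile.mkdtemp(), "fieldcache.bin")
with open(path, "wb") as f:
    for v in values:
        f.write(struct.pack(">q", v))

# Instead of reading everything into RAM up front, map the file and let
# the OS page values in on demand.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def value_for_doc(doc_id):
        # Random access by offset; the OS decides what stays resident.
        off = doc_id * 8
        return struct.unpack(">q", mm[off:off + 8])[0]

    loaded = [value_for_doc(d) for d in range(len(values))]
    mm.close()

print(loaded)  # [7, 42, 13, 99]
```

The cost, as Mike points out below, is that a page fault in the middle of collecting a query shows up directly as query latency.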

On Thu, May 19, 2011 at 6:40 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:

 maybe thats because we have one huge monolithic implementation

 Doesn't the DocValues branch solve this?

 Hopefully DocValues will replace FieldCache over time; maybe some day
 we can deprecate  remove FieldCache.

 But we still have work to do there, I believe; eg we don't have
 comparators for all types (on the docvalues branch) yet.

 Also, instead of trying to implement clever ways of compressing
 strings in the field cache, which probably won't bear fruit, I'd
 prefer to look at [eventually] MMap'ing (using DV) the field caches to
 avoid the loading and heap costs, which are significant.  I'm not sure
 if we can easily MMap packed ints and the shared byte[], though it
 seems fairly doable?

 In fact, the packed ints and the byte[] packing of terms data is very
 much amenable/necessary for using MMap, far moreso than the separate
 objects we had before.

 I agree we should make an mmap option, though I would generally
 recommend against apps using mmap for these caches.  We load these
 caches so that we'll have fast random access to potentially a great
 many documents during collection of one query (eg for sorting).  When
 you mmap them you let the OS decide when to swap stuff out, which means
 you pick up potentially high query latency waiting for these pages to
 swap back in.  Various other data structures in Lucene need this fast
 random access (norms, del docs, terms index) and that's why we put
 them in RAM.  I do agree for all else (the large postings), MMap is
 great.

 Of course the OS swaps out process RAM anyway, so... it's kinda moot
 (unless you've fixed your OS to not do this, which I always do!).

 I think a more productive area of exploration (to reduce RAM usage)
 would be to make a StringFieldComparator that doesn't need full access
 to all terms data, ie, operates per segment yet only does a few ord
 lookups when merging the results across segments.  If "few" is small
 enough we can just use the seek-by-ord from the terms dict to do
 them.  This would be a huge RAM reduction because we could then sort
 by string fields (eg title field) without needing all term bytes
 randomly accessible.
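
The idea above, compare by per-segment ord and resolve only a handful of terms when merging, can be sketched as a pure-Python toy. A sorted list stands in for each segment's term dictionary, and indexing into it stands in for a seek-by-ord; everything here is invented for illustration.

```python
# Toy "segments": each doc carries an ord into its own segment's sorted
# term dictionary.  Within a segment, sorting by ord needs no term bytes.
segments = [
    {"terms": ["apple", "cherry", "plum"], "docs": [(0, 2), (1, 0)]},  # (doc, ord)
    {"terms": ["banana", "cherry"],        "docs": [(2, 0), (3, 1)]},
]

def top_docs(seg, n):
    # Per-segment phase: order docs by ord only -- cheap integer compares.
    return sorted(seg["docs"], key=lambda d: d[1])[:n]

def merge(segs, n):
    # Merge phase: resolve only the few candidate ords to term bytes
    # (a seek-by-ord in a real terms dict) and merge lexicographically.
    candidates = []
    for seg in segs:
        for doc, ordinal in top_docs(seg, n):
            candidates.append((seg["terms"][ordinal], doc))
    return [doc for _, doc in sorted(candidates)[:n]]

print(merge(segments, 2))  # [1, 2] -> docs whose terms are "apple", "banana"
```

Only `n` terms per segment are ever materialized, which is the RAM reduction being described: the full term bytes never need to be randomly accessible.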

 Mike

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2530) Remove Noggit CharArr from FieldType

2011-05-19 Thread Simon Willnauer (JIRA)
Remove Noggit CharArr from FieldType


 Key: SOLR-2530
 URL: https://issues.apache.org/jira/browse/SOLR-2530
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0


FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that 
also spreads into ByteUtils. The uses of this method all convert to String, 
which makes this extra reference and the dependency unnecessary. I refactored 
it to simply return String and removed ByteUtils entirely. The only leftover 
from ByteUtils is a constant; I moved that one to Lucene's UnicodeUtil. I will 
upload a patch in a second.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2530) Remove Noggit CharArr from FieldType

2011-05-19 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated SOLR-2530:
--

Attachment: SOLR-2530.patch

here is a patch

 Remove Noggit CharArr from FieldType
 

 Key: SOLR-2530
 URL: https://issues.apache.org/jira/browse/SOLR-2530
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
  Labels: api-change
 Fix For: 4.0

 Attachments: SOLR-2530.patch


 FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that 
 also spreads into ByteUtils. The uses of this method area all convert to 
 String which makes this extra reference and the dependency unnecessary. I 
 refactored it to simply return string and removed ByteUtils entirely. The 
 only leftover from BytesUtils is a constant, i moved that one to Lucenes 
 UnicodeUtils. I will upload a patch in a second

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] [Created] (SOLR-2529) DIH update trouble with sql field name pk

2011-05-19 Thread Erick Erickson
Could you identify what you think the problem is?

Erick

On Thu, May 19, 2011 at 9:45 AM, Thomas Gambier (JIRA) j...@apache.org wrote:
 DIH update trouble with sql field name pk
 ---

                 Key: SOLR-2529
                 URL: https://issues.apache.org/jira/browse/SOLR-2529
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
    Affects Versions: 3.1, 3.2
         Environment: Debian Lenny, JRE 6
            Reporter: Thomas Gambier
            Priority: Blocker


 We are unable to use the DIH when the database primary key column is named 
 pk.

 The reported solr error is :
 deltaQuery has no column to resolve to declared primary key pk='pk'

 We have made some investigations and found that the DIH has a mistake when 
 it's looking for the primary key among the row's columns.

 private String findMatchingPkColumn(String pk, Map<String, Object> row) {
   if (row.containsKey(pk))
     throw new IllegalArgumentException(String.format(
         "deltaQuery returned a row with null for primary key %s", pk));
   String resolvedPk = null;
   for (String columnName : row.keySet()) {
     if (columnName.endsWith("." + pk) || pk.endsWith("." + columnName)) {
       if (resolvedPk != null)
         throw new IllegalArgumentException(String.format(
             "deltaQuery has more than one column (%s and %s) that might resolve "
                 + "to declared primary key pk='%s'", resolvedPk, columnName, pk));
       resolvedPk = columnName;
     }
   }
   if (resolvedPk == null)
     throw new IllegalArgumentException(String.format(
         "deltaQuery has no column to resolve to declared primary key pk='%s'", pk));
   LOG.info(String.format("Resolving deltaQuery column '%s' to match entity's "
       + "declared pk '%s'", resolvedPk, pk));
   return resolvedPk;
 }
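
 A standalone Python sketch of the resolution rule quoted above makes the reported failure easy to see. This is a hypothetical re-implementation for illustration only, not the DIH source: a column matches the declared pk only through a dotted prefix/suffix relationship, never by exact equality.

```python
def find_matching_pk_column(pk, columns):
    """Mimic the quoted findMatchingPkColumn matching rule: a column
    resolves only via 'x.pk' / 'pk.x' style names."""
    matches = [c for c in columns
               if c.endswith("." + pk) or pk.endswith("." + c)]
    if len(matches) > 1:
        raise ValueError("more than one column might resolve to pk=%r" % pk)
    if not matches:
        raise ValueError("no column to resolve to declared primary key pk=%r" % pk)
    return matches[0]

# A qualified column resolves fine:
print(find_matching_pk_column("pk", ["item.pk", "name"]))  # item.pk

# But a column named exactly "pk" never matches either endswith test,
# which reproduces the reported "no column to resolve" error:
try:
    find_matching_pk_column("pk", ["pk", "name"])
except ValueError as e:
    print("failed:", e)
```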


 --
 This message is automatically generated by JIRA.
 For more information on JIRA, see: http://www.atlassian.com/software/jira

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2530) Remove Noggit CharArr from FieldType

2011-05-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036219#comment-13036219
 ] 

Yonik Seeley commented on SOLR-2530:


There are some efficiency losses here:
- A reusable CharArr allows one to avoid extra object creation.  See 
TermsComponent which can update a CharArr and then compare it against a pattern 
w/o having to create a String object.
- We should not replace the previous toString with BytesRef.utf8String... it's 
much slower, esp for small strings like those that will be common here.

So rather than just removing ByteUtils.UTF8toUTF16, how about moving it to 
BytesRef and using it in BytesRef.utf8String?

 Remove Noggit CharArr from FieldType
 

 Key: SOLR-2530
 URL: https://issues.apache.org/jira/browse/SOLR-2530
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
  Labels: api-change
 Fix For: 4.0

 Attachments: SOLR-2530.patch


 FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that 
 also spreads into ByteUtils. The uses of this method area all convert to 
 String which makes this extra reference and the dependency unnecessary. I 
 refactored it to simply return string and removed ByteUtils entirely. The 
 only leftover from BytesUtils is a constant, i moved that one to Lucenes 
 UnicodeUtils. I will upload a patch in a second

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-1942) Ability to select codec per field

2011-05-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1942:
--

Attachment: SOLR-1942.patch

Updated patch with Simon's previous suggestions.

A few more things I saw that I'm not sure I like:
* the CodecProvider syntax in the test config is cool, but I'm not sure this 
should be done in SolrCore? I think if you want to have a CP that loads up 
codecs by classname like this, it should be done in a CodecProviderFactory (you 
know, parsing arguments however it wants).
* I think it's confusing how the SchemaCodecProvider answers codec requests 
in 3 ways: 1. from the 'delegate' in SolrConfig, 2. from the schema, and 3. 
from the default codecProvider. I think if you try to use this, it's easy to get 
yourself into a situation where solrconfig conflicts with the schema. I also 
don't think we need to bother with the 'defaultCP'; in other words, if you 
specify a custom codec provider, that is the only one that is used.
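
The per-field dispatch itself is simple; the debate above is about where the field-to-codec mapping should live. A toy sketch of the wrapper idea (invented names and codec strings, not the real PerFieldCodecWrapper or CodecProvider API):

```python
class PerFieldCodecWrapper:
    """Route each field to its own codec, falling back to a default."""

    def __init__(self, default_codec, per_field):
        self.default = default_codec
        self.per_field = dict(per_field)

    def codec_for(self, field):
        # Single source of truth: one mapping plus one default, avoiding
        # the three-way answer (solrconfig vs schema vs defaultCP)
        # criticized above.
        return self.per_field.get(field, self.default)

w = PerFieldCodecWrapper("Standard", {"id": "Pulsing"})
print(w.codec_for("id"))     # Pulsing
print(w.codec_for("title"))  # Standard (fallback)
```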

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2530) Remove Noggit CharArr from FieldType

2011-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036224#comment-13036224
 ] 

Robert Muir commented on SOLR-2530:
---

My recommendation: add CharsRef. We already have BytesRef and IntsRef...

 Remove Noggit CharArr from FieldType
 

 Key: SOLR-2530
 URL: https://issues.apache.org/jira/browse/SOLR-2530
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
  Labels: api-change
 Fix For: 4.0

 Attachments: SOLR-2530.patch


 FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that 
 also spreads into ByteUtils. The uses of this method area all convert to 
 String which makes this extra reference and the dependency unnecessary. I 
 refactored it to simply return string and removed ByteUtils entirely. The 
 only leftover from BytesUtils is a constant, i moved that one to Lucenes 
 UnicodeUtils. I will upload a patch in a second

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread David Smiley (@MITRE.org)

Michael McCandless-2 wrote:
 
 I think a more productive area of exploration (to reduce RAM usage)
 would be to make a StringFieldComparator that doesn't need full access
 to all terms data, ie, operates per segment yet only does a few ord
 lookups when merging the results across segments.  If "few" is small
 enough we can just use the seek-by-ord from the terms dict to do
 them.  This would be a huge RAM reduction because we could then sort
 by string fields (eg title field) without needing all term bytes
 randomly accessible.
 
 Mike
 

Yes!  I don't want to put all my titles into RAM just to sort documents by
them when I know Lucene has indexed the titles in sorted order on disk
already.  Of course the devil is in the details.

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2961687.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2530) Remove Noggit CharArr from FieldType

2011-05-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036240#comment-13036240
 ] 

Yonik Seeley commented on SOLR-2530:


Minor nit: renaming bigTerm to UnicodeUtil.BIG_UTF8_TERM is a bit misleading 
since it's not UTF8 at all.

 Remove Noggit CharArr from FieldType
 

 Key: SOLR-2530
 URL: https://issues.apache.org/jira/browse/SOLR-2530
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
  Labels: api-change
 Fix For: 4.0

 Attachments: SOLR-2530.patch


 FieldType#indexedToReadable(BytesRef, CharArr) uses a noggit dependency that 
 also spreads into ByteUtils. The uses of this method area all convert to 
 String which makes this extra reference and the dependency unnecessary. I 
 refactored it to simply return string and removed ByteUtils entirely. The 
 only leftover from BytesUtils is a constant, i moved that one to Lucenes 
 UnicodeUtils. I will upload a patch in a second

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036242#comment-13036242
 ] 

Doron Cohen commented on SOLR-2500:
---

From Eclipse (XP), passed at 1st attempt, failed at the 2nd!

I am not familiar with this part of the code so it would be too much work to 
track it all the way myself, but I think I can now provide sufficient 
information for solving it.

In Eclipse, after cleaning the project the test passes, and then it starts 
failing in all successive runs. 
So I assume when you run it isolated you also do clean, which covers Eclipse's 
clean (and more). 

I tracked the content of the cleaned relevant dir before and after the test - 
it is (trunk/)bin/solr - there's only one file that differs between the runs - 
this is bin/solr/shared/solr.xml.

Not sure if this is a bug in the test not cleaning after itself or a bug in the 
code that reads the configuration...

I'll attach the two files here so that you can compare them.


 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir

 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036243#comment-13036243
 ] 

Robert Muir commented on SOLR-2500:
---

{quote}
In Eclipse, after cleaning the project the test passes, and then it starts 
failing in all successive runs. 
{quote}

FYI, this is the behavior I've noticed when running the test from Ant also... a 
'clean' seems to work around the issue...

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir

 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated SOLR-2500:
--

Attachment: solr-after-1st-run.xml
solr-clean.xml

solr.xml files from trunk/bin/solr/shared:
- clean - with which the test passes.
- after-1st-run - with which it fails.

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-2500:
--

Attachment: SOLR-2500.patch

I guess the real question is: why doesn't the test work if rewritten like this?

Bug in TestHarness?
Bug in CoreContainer/properties loading functionality itself?

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread David Smiley (@MITRE.org)
Wow, 17 replies to my email overnight! This is clearly an interesting topic
to folks.

Hi Dawid.
Sadly, I won't be at Lucene Revolution next week. That's where all the cool
kids will be; I'll be home and be square. I made it to O'Reilly Strata in
February (a great conference) and I'll be presenting at Basis's Open Source
Search Conference (government customer focused) mid-June.  I've used up my
conference budget for this fiscal year.

Yes, the use-case here is a unique integer reference to a String that can be
looked up fairly quickly, whereas the set of all strings lives in a compressed
data structure that won't change after it's built. A bonus benefit would be
that this integer is a sortable substitute for the String.  Your observation
that this integer is a perfect hash is astute.

I wonder if Lucene could store this FST on-disk for the bytes in a segment
instead of what it's doing now? Read-time construction would be super-fast,
though for multi-segment indexes, I suppose they'd need to be merged.

I expect that this use-case would be particularly useful for cases when you
know that the set of strings tends to have a great deal of prefixes in
common, such as when EdgeNGramming (applications: query-complete,
hierarchical faceting, prefix/tree based geospatial indexing).
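
For intuition, the ord-to-string mapping being discussed behaves like this toy version built over a plain sorted list. An FST provides the same bidirectional mapping while sharing common prefixes, so it uses far less memory; this sketch deliberately ignores the compression and uses an invented term set.

```python
import bisect

# Immutable, sorted set of terms: the position of each term is its
# "ord" -- a unique, sort-order-preserving integer (a perfect hash).
terms = sorted(["car", "card", "care", "cat", "dog"])

def ord_of(term):
    # Binary search stands in for walking the FST's arcs to an ordinal.
    i = bisect.bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        raise KeyError(term)
    return i

def term_of(ordinal):
    # Reverse lookup: reconstruct the string from its ordinal.
    return terms[ordinal]

print(ord_of("care"))  # 2
print(term_of(2))      # care
# Ords are a sortable substitute for the strings they stand for:
print(ord_of("card") < ord_of("cat"))  # True
```

The last line is the "bonus benefit" above: comparing two ords gives the same answer as comparing the two strings, so sorting can work on small integers.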

~ David Smiley


Dawid Weiss wrote:
 
 Hi David,
 
 but with less memory.  As I understand it, FSTs are a highly compressed
 representation of a set of Strings (among other possibilities).  The
 
 Yep. Not only, but this is one of the use cases. Will you be at Lucene
 Revolution next week? I'll be talking about it there.
 
 representation of a set of Strings (among other possibilities).  The
 fieldCache would need to point to an FST entry (an arc?) using
 something
 small, say an integer.  Is there a way to point to an FST entry with an
 integer, and then somehow with relative efficiency construct the String
 from
 the arcs to get there?
 
 Correct me if my understanding is wrong: you'd like to assign a unique
 integer to each String and then retrieve it by this integer (something
 like a
 Map<Integer, String>)? This would be something called perfect
 hashing
 and this can be done on top of an automaton (fairly easily). I assume
 the data structure is immutable once constructed and does not change
 too often, right?
 
 Dawid
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2961954.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 10:09 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 When
 you mmap them you let the OS decide when to swap stuff out which mean
 you pick up potentially high query latency waiting for these pages to
 swap back in

 Right, however if one is using lets say SSDs, and the query time is
 less important, then MMap'ing would be fine.  Also it prevents deadly
 OOMs in favor of basic 'slowness' of the query.  If there is no
 performance degradation I think MMap'ing is a great option.  A common
 use case is an index that's far too large for a given server will
 simply not work today, whereas with MMap'ed field caches the query
 would complete, just extremely slowly.  If the user wishes to improve
 performance it's easy enough to add more hardware.

Well, be careful: if you just don't have enough memory to accommodate
all the RAM data structures Lucene needs... you're gonna be in trouble
with mmap too.  True, you won't hit OOMEs anymore, but instead you'll
be in a swap fest and your app is nearly unusable.

SSDs, while orders of magnitude faster than spinning magnets, are
still orders of magnitude slower than RAM.

But, yes, they obviously help substantially.  It's a one-way door...
you'll never go back once you've switched to SSDs.

And I do agree there are times when mmap is appropriate, eg if query
latency is unimportant to you, but it's not a panacea and it comes
with serious downsides.

I wish I could have the opposite of mmap from Java -- the ability to
pin the pages that hold important data structures.

Mike

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036270#comment-13036270
 ] 

Michael McCandless commented on SOLR-2524:
--

{quote}
bq. Was this a simple TermQuery

No a MatchDocAllQuery (:)
{quote}

Ahh OK, then that makes sense -- MatchAllDocsQuery is a mighty fast query to 
execute ;)  So the work done to cache it is going to be slower.

 Adding grouping to Solr 3x
 --

 Key: SOLR-2524
 URL: https://issues.apache.org/jira/browse/SOLR-2524
 Project: Solr
  Issue Type: New Feature
Affects Versions: 3.2
Reporter: Martijn van Groningen
Assignee: Michael McCandless
 Attachments: SOLR-2524.patch


 Grouping was recently added to Lucene 3x. See LUCENE-1421 for more 
 information.
 I think it would be nice if we expose this functionality also to the Solr 
 users that are bound to a 3.x version.
 The grouping feature added to Lucene is currently a subset of the 
 functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping 
 by function / query.
 The work involved in getting the grouping contrib to work on Solr 3x is 
 acceptable. I have it more or less running here. It supports the response 
 format and request parameters (except group.query and group.func) described 
 in the FieldCollapse page on the Solr wiki.
 I think it would be great if this is included in the Solr 3.2 release. Many 
 people are using grouping as patch now and this would help them a lot. Any 
 thoughts?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Need help building JCC on windows

2011-05-19 Thread Bill Janssen
Hi, Baseer.  Not sure what's the issue with your build, but here's a bit
of bash script which I use to build JCC with mingw on Windows:

echo -- jcc --
export PATH=$PATH:${javahome}/jre/bin/client
echo PATH is $PATH
cd ../pylucene-3.0.*/jcc
# note that this patch still works for 3.0.1/3.0.2
patch -p0 < ${patchesdir}/jcc-2.9-mingw-PATCH
export JCC_ARGSEP=";"
export JCC_JDK="$WINSTYLEJAVAHOME"
export JCC_CFLAGS="-fno-strict-aliasing;-Wno-write-strings"
export JCC_LFLAGS="-L${WINSTYLEJAVAHOME}\\lib;-ljvm"
export JCC_INCLUDES="${WINSTYLEJAVAHOME}\\include;${WINSTYLEJAVAHOME}\\include\\win32"
export JCC_JAVAC="${WINSTYLEJAVAHOME}\\bin\\javac.exe"
${python} setup.py build --compiler=mingw32 install \
    --single-version-externally-managed --root /c/ --prefix=${distdir}
if [ -f jcc/jcc.lib ]; then
  cp -p jcc/jcc.lib ${sitepackages}/jcc/jcc.lib
fi
# for 3.0.2 compiled with MinGW GCC 4.x and --shared, we also need two
# GCC libraries
if [ -f /mingw/bin/libstdc++-6.dll ]; then
  install -m 555 /mingw/bin/libstdc++-6.dll ${distdir}/bin/
  echo copied libstdc++-6.dll
fi
if [ -f /mingw/bin/libgcc_s_dw2-1.dll ]; then
  install -m 555 /mingw/bin/libgcc_s_dw2-1.dll ${distdir}/bin/
  echo copied libgcc_s_dw2-1.dll
fi


The patch that I apply is this:

*** setup.py    2009-10-28 15:24:16.0 -0700
--- setup.py    2010-03-29 22:08:56.0 -0700
***************
*** 262,268 ****
          elif platform == 'win32':
              jcclib = 'jcc%s.lib' %(debug and '_d' or '')
              kwds["extra_link_args"] = \
!                 lflags + ["/IMPLIB:%s" %(os.path.join('jcc', jcclib))]
              package_data.append(jcclib)
          else:
              kwds["extra_link_args"] = lflags
--- 262,268 ----
          elif platform == 'win32':
              jcclib = 'jcc%s.lib' %(debug and '_d' or '')
              kwds["extra_link_args"] = \
!                 lflags + ["-Wl,--out-implib,%s" %(os.path.join('jcc', jcclib))]
              package_data.append(jcclib)
          else:
              kwds["extra_link_args"] = lflags

It makes sure to build the jcc.lib file so that I can use it in shared mode.

Bill


Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

There's also RAM based SSDs whose performance could be comparable with,
well, RAM.  Also, with our heap based field caches, the first sorted
search requires that they be loaded into RAM.  Then we don't unload
them until the reader is closed?  With MMap the unloading would happen
automatically?
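The automatic-unloading question can be illustrated with plain java.nio (a minimal sketch, not Lucene's actual MMapDirectory; class and method names here are made up): a mapped buffer is backed by the OS page cache, so pages are loaded lazily on first access and evicted by the OS under memory pressure, with no explicit load/unload calls in application code.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    // Maps a file read-only; the OS pages content in on first access and
    // may page it out again later -- the application never "loads" it.
    static byte[] readMapped(Path file) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("mmap", ".bin");
        Files.write(tmp, "hello field cache".getBytes(StandardCharsets.UTF_8));
        byte[] back = readMapped(tmp);
        assert new String(back, StandardCharsets.UTF_8).equals("hello field cache");
        Files.delete(tmp);
    }
}
```

Heap-based FieldCache arrays, by contrast, stay resident until the reader is closed, which is exactly the asymmetry discussed in this thread.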

On Thu, May 19, 2011 at 8:59 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, May 19, 2011 at 10:09 AM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 When
 you mmap them you let the OS decide when to swap stuff out which mean
 you pick up potentially high query latency waiting for these pages to
 swap back in

 Right, however if one is using lets say SSDs, and the query time is
 less important, then MMap'ing would be fine.  Also it prevents deadly
 OOMs in favor of basic 'slowness' of the query.  If there is no
 performance degradation I think MMap'ing is a great option.  A common
 use case is an index that's far too large for a given server will
 simply not work today, whereas with MMap'ed field caches the query
 would complete, just extremely slowly.  If the user wishes to improve
 performance it's easy enough to add more hardware.

 Well, be careful: if you just don't have enough memory to accommodate
 all the RAM data structures Lucene needs... you're gonna be in trouble
 with mmap too.  True, you won't hit OOMEs anymore, but instead you'll
 be in a swap fest and your app is nearly unusable.

 SSDs, while orders of magnitude faster than spinning magnets, are
 still orders of magnitude slower than RAM.

 But, yes, they obviously help substantially.  It's a one-way door...
 you'll never go back once you've switched to SSDs.

 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides.

 I wish I could have the opposite of mmap from Java -- the ability to
 pin the pages that hold important data structures.

 Mike

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036284#comment-13036284
 ] 

Michael McCandless commented on LUCENE-3108:


{quote}
bq. How come codecID changed from String to int on the branch?

due to DocValues I need to compare the ID to certain fields to see for
which field I stored DocValues and need to open them. I always had to parse
the given string, which is kind of odd. I think it's more natural to
have the same datatype on FieldInfo, SegmentCodecs and eventually in
the Codec#files() method. Making a string out of it is way simpler /
less risky than parsing, IMO.
{quote}

OK that sounds great.

{quote}
bq. Can SortField somehow detect whether the needed field was stored in FC vs DV

This is tricky though. You can have a DV field that is indexed too, so it's hard 
to tell if we can reliably do it. If we can't make it reliable I think we 
should not do it at all.
{quote}

It is tricky... but, eg, when someone does SortField("title",
SortField.STRING), which cache (DV or FC) should we populate?

{quote}
bq. Should we rename oal.index.values.Type -> .ValueType?

agreed. I also think we should rename Source but I don't have a good name yet. 
Any idea?
{quote}

ValueSource?  (conflicts w/ FQs though) Though, maybe we can just
refer to it as DocValues.Source, then it's clear?

{quote}
bq. Since we dynamically reserve a value to mean unset, does that mean there 
are some datasets we cannot index?

Again, tricky! The quick answer is yes, but we can't do that anyway since I 
have not normalized the range to be 0-based, since PackedInts doesn't allow 
negative values. So the range we can store is (2^63)-1; essentially with the 
current impl we can store (2^63)-2 values, and the max value is Long#MAX_VALUE-1. 
Currently there is no assert for this, which is needed I think, but to get 
around this we need a different impl, or do I miss something?
{quote}

OK, but I think if we make a straight longs impl (ie no packed ints at all) 
then we can handle all long values?  But in that case we'd require the app to 
pick a sentinel to mean unset?
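The sentinel idea can be sketched in plain Java (hypothetical code, not the branch's actual API; the class name and the choice of Long.MIN_VALUE below are illustrative): the app reserves one long value to mean "unset", and a straight long[] can then store every other value.

```java
import java.util.Arrays;

public class LongDocValuesSketch {
    // Hypothetical straight-longs store: one app-chosen sentinel marks
    // documents with no value; every other long is representable.
    private final long[] values;
    private final long missing;

    LongDocValuesSketch(int maxDoc, long missingSentinel) {
        this.missing = missingSentinel;
        this.values = new long[maxDoc];
        Arrays.fill(values, missingSentinel);  // all docs start "unset"
    }

    void set(int doc, long value) {
        if (value == missing) {
            // The one value the app gave up by picking this sentinel.
            throw new IllegalArgumentException("value collides with sentinel");
        }
        values[doc] = value;
    }

    boolean hasValue(int doc) { return values[doc] != missing; }

    long get(int doc) { return values[doc]; }
}
```

The cost is exactly the one discussed above: whatever sentinel the app picks becomes unindexable, so the representable range is all longs minus one.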


 Land DocValues on trunk
 ---

 Key: LUCENE-3108
 URL: https://issues.apache.org/jira/browse/LUCENE-3108
 Project: Lucene - Java
  Issue Type: Task
  Components: core/index, core/search, core/store
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-3108.patch


 It's time to move another feature from branch to trunk. I want to start this 
 process now while still a couple of issues remain on the branch. Currently I 
 am down to a single nocommit (javadocs on DocValues.java) and a couple of 
 testing TODOs (explicit multithreaded tests and unoptimized with deletions) 
 but I think those are not worth separate issues so we can resolve them as we 
 go. 
 The already created issues (LUCENE-3075 and LUCENE-3074) should not block 
 this process here IMO, we can fix them once we are on trunk. 
 Here is a quick feature overview of what has been implemented:
  * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, 
 Bytes (fixed / variable size each in sorted, straight and deref variations)
  * Integration into Flex-API, Codec provides a 
 PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read) 
  * Enabled by default in all codecs except PreFlex
  * Follows other flex-API patterns like non-segment reader throw UOE forcing 
 MultiPerDocValues if on DirReader etc.
  * Integration into IndexWriter, FieldInfos etc.
  * Random-testing enabled via RandomIW - injecting random DocValues into 
 documents
  * Basic checks in CheckIndex (which runs after each test)
  * FieldComparator for int and float variants (Sorting, currently directly 
 integrated into SortField, this might go into a separate DocValuesSortField 
 eventually)
  * Extended TestSort for DocValues
  * RAM-Resident random access API plus on-disk DocValuesEnum (currently only 
 sequential access) - Source.java / DocValuesEnum.java
  * Extensible Cache implementation for RAM-Resident DocValues (by-default 
 loaded into RAM only once and freed once IR is closed) - SourceCache.java
  
 PS: Currently the RAM resident API is named Source (Source.java) which seems 
 too generic. I think we should rename it into RamDocValues or something like 
 that, suggestion welcome!   
 Any comments, questions (rants :)) are very much appreciated.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-docvalues-branch - Build # 1145 - Failure

2011-05-19 Thread Apache Jenkins Server
Build: 
https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-docvalues-branch/1145/

No tests ran.

Build Log (for compile errors):
[...truncated 55 lines...]
clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/build

clean:

clean:
 [echo] Building analyzers-common...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/common
 [echo] Building analyzers-icu...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/icu
 [echo] Building analyzers-phonetic...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/phonetic
 [echo] Building analyzers-smartcn...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/smartcn
 [echo] Building analyzers-stempel...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/analysis/build/stempel
 [echo] Building benchmark...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/benchmark/build
 [echo] Building grouping...

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/modules/grouping/build

clean-contrib:

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/analysis-extras/build
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/analysis-extras/lucene-libs

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/clustering/build

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/dataimporthandler/target

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/extraction/build

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/contrib/uima/build

clean:
   [delete] Deleting directory 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/solr/build

BUILD SUCCESSFUL
Total time: 3 seconds
+ cd 
/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene
+ JAVA_HOME=/home/hudson/tools/java/latest1.5 
/home/hudson/tools/ant/latest1.7/bin/ant compile compile-test build-contrib
Buildfile: build.xml

jflex-uptodate-check:

jflex-notice:

javacc-uptodate-check:

javacc-notice:

init:

clover.setup:

clover.info:
 [echo] 
 [echo]   Clover not found. Code coverage reports disabled.
 [echo] 

clover:

common.compile-core:
[mkdir] Created dir: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/build/classes/java
[javac] Compiling 536 source files to 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/build/classes/java
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/util/Version.java:80:
 warning: [dep-ann] deprecated name isn't annotated with @Deprecated
[javac]   public boolean onOrAfter(Version other) {
[javac]  ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/index/PerFieldCodecWrapper.java:309:
 cannot find symbol
[javac] symbol  : constructor IOException(java.lang.Exception)
[javac] location: class java.io.IOException
[javac] err = new IOException(ioe);
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/queryParser/CharStream.java:34:
 warning: [dep-ann] deprecated name isn't annotated with @Deprecated
[javac]   int getColumn();
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-docvalues-branch/checkout/lucene/src/java/org/apache/lucene/queryParser/CharStream.java:41:
 warning: [dep-ann] deprecated name isn't annotated with @Deprecated
[javac]   int getLine();
[javac]   ^
[javac] Note: Some input files use or override a deprecated API.
[javac] 

[jira] [Created] (SOLR-2531) remove some per-term waste in SimpleFacets

2011-05-19 Thread Robert Muir (JIRA)
remove some per-term waste in SimpleFacets
--

 Key: SOLR-2531
 URL: https://issues.apache.org/jira/browse/SOLR-2531
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-2531.patch

While looking at SOLR-2530,

Seems like in the 'use filter cache' case of SimpleFacets we:
1. convert the bytes from utf8->utf16
2. create a string from the utf16
3. create a Term object from the string

doesn't seem like any of this is necessary, as the Term is unused...
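The three wasted steps can be sketched in plain Java (illustrative only, not Solr's actual SimpleFacets code; `FakeTerm` is a stand-in for Lucene's Term class):

```java
import java.nio.charset.StandardCharsets;

public class PerTermWasteSketch {
    // Illustrative stand-in for Lucene's Term (field + text), not the real class.
    static class FakeTerm {
        final String field;
        final String text;
        FakeTerm(String field, String text) { this.field = field; this.text = text; }
    }

    // The pattern the issue removes: per term, decode UTF-8 to UTF-16,
    // allocate a String, allocate a Term -- and then never read any of it.
    static FakeTerm wastefulPerTerm(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8); // utf8 -> utf16 + String
        return new FakeTerm("category", text);                  // Term object, unused
    }
}
```

If the loop only ever needs the raw term bytes, all three allocations per term can simply be dropped, which is the point of the attached patch.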

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2531) remove some per-term waste in SimpleFacets

2011-05-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-2531:
--

Attachment: SOLR-2531.patch

 remove some per-term waste in SimpleFacets
 --

 Key: SOLR-2531
 URL: https://issues.apache.org/jira/browse/SOLR-2531
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-2531.patch


 While looking at SOLR-2530,
 Seems like in the 'use filter cache' case of SimpleFacets we:
 1. convert the bytes from utf8->utf16
 2. create a string from the utf16
 3. create a Term object from the string
 doesn't seem like any of this is necessary, as the Term is unused...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

 Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

I don't know of a straight up comparison...
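The comparison being asked about would have roughly this shape (a sketch only; class and method names are hypothetical, and real numbers depend heavily on the JIT, endianness handling, and access pattern):

```java
import java.nio.ByteBuffer;

public class BufferVsArraySketch {
    // Sum 8-byte big-endian longs decoded manually from a plain byte[].
    static long sumArray(byte[] data) {
        long sum = 0;
        for (int i = 0; i + 8 <= data.length; i += 8) {
            long v = 0;
            for (int j = 0; j < 8; j++) {
                v = (v << 8) | (data[i + j] & 0xFFL);
            }
            sum += v;
        }
        return sum;
    }

    // Same sum read through a ByteBuffer (heap or direct/mapped),
    // which is what an mmap'd field cache would go through.
    static long sumBuffer(ByteBuffer buf) {
        long sum = 0;
        while (buf.remaining() >= 8) {
            sum += buf.getLong();
        }
        return sum;
    }
}
```

Timing the two loops over the same data (byte[], heap ByteBuffer, direct ByteBuffer) would give the missing apples-to-apples number.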

 There's also RAM based SSDs whose performance could be comparable with
 well, RAM.

True, though it's through layers of abstraction designed originally
for serving files off of spinning magnets :)

 Also, with our heap based field caches, the first sorted
 search requires that they be loaded into RAM.  Then we don't unload
 them until the reader is closed?  With MMap the unloading would happen
 automatically?

True, but really if the app knows it won't need that FC entry for a
long time (ie, long enough to make it worth unloading/reloading) then
it should really unload it.  MMap would still have to write all those
pages to disk...

DocValues actually makes this a lot cheaper because loading DocValues
is much (like ~100 X from Simon's testing) faster than populating
FieldCache since FieldCache must do all the uninverting.

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036288#comment-13036288
 ] 

Doron Cohen commented on SOLR-2500:
---

FWIW, also the first clean run would fail if test's tearDown() is modified like 
this:

{noformat}
-persistedFile.delete();
+assertTrue("could not delete " + persistedFile, persistedFile.delete());
{noformat}

For some reason it fails to remove that file - in both Linux and Windows.

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036290#comment-13036290
 ] 

Yonik Seeley commented on LUCENE-3108:
--

bq. ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to 
it as DocValues.Source, then it's clear?

Both ValueSource and DocValues have long been used by function queries.

 Land DocValues on trunk
 ---

 Key: LUCENE-3108
 URL: https://issues.apache.org/jira/browse/LUCENE-3108
 Project: Lucene - Java
  Issue Type: Task
  Components: core/index, core/search, core/store
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-3108.patch


 It's time to move another feature from branch to trunk. I want to start this 
 process now while still a couple of issues remain on the branch. Currently I 
 am down to a single nocommit (javadocs on DocValues.java) and a couple of 
 testing TODOs (explicit multithreaded tests and unoptimized with deletions) 
 but I think those are not worth separate issues so we can resolve them as we 
 go. 
 The already created issues (LUCENE-3075 and LUCENE-3074) should not block 
 this process here IMO, we can fix them once we are on trunk. 
 Here is a quick feature overview of what has been implemented:
  * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, 
 Bytes (fixed / variable size each in sorted, straight and deref variations)
  * Integration into Flex-API, Codec provides a 
 PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read) 
  * Enabled by default in all codecs except PreFlex
  * Follows other flex-API patterns like non-segment reader throw UOE forcing 
 MultiPerDocValues if on DirReader etc.
  * Integration into IndexWriter, FieldInfos etc.
  * Random-testing enabled via RandomIW - injecting random DocValues into 
 documents
  * Basic checks in CheckIndex (which runs after each test)
  * FieldComparator for int and float variants (Sorting, currently directly 
 integrated into SortField, this might go into a separate DocValuesSortField 
 eventually)
  * Extended TestSort for DocValues
  * RAM-Resident random access API plus on-disk DocValuesEnum (currently only 
 sequential access) - Source.java / DocValuesEnum.java
  * Extensible Cache implementation for RAM-Resident DocValues (by-default 
 loaded into RAM only once and freed once IR is closed) - SourceCache.java
  
 PS: Currently the RAM resident API is named Source (Source.java) which seems 
 too generic. I think we should rename it into RamDocValues or something like 
 that, suggestion welcome!   
 Any comments, questions (rants :)) are very much appreciated.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036289#comment-13036289
 ] 

Michael McCandless commented on LUCENE-1877:


OK.  I would strongly recommend using the lock stress test 
(LockStressTest/LockVerifyServer) in Lucene to verify whichever locking you're 
trying is in fact working properly.
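The property NativeFSLockFactory relies on (OS-level locks that the kernel releases when the process dies, even on abnormal exit) can be shown with plain java.nio; this is an illustration of the mechanism only, not Lucene's implementation, and the class name is made up:

```java
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NativeLockSketch {
    // Acquires an OS-level exclusive lock on the channel's file, or
    // returns null if another process already holds it.  Unlike a plain
    // marker file, the OS releases this lock when the JVM dies.
    static FileLock tryLock(FileChannel ch) throws Exception {
        return ch.tryLock();
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("write", ".lock");
        try (FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            FileLock lock = tryLock(ch);
            assert lock != null && lock.isValid();
            lock.release();
        }
        Files.delete(lockFile);
    }
}
```

SimpleFSLockFactory's marker-file approach has no such kernel-side cleanup, which is why a crashed JVM leaves a stale lock behind.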

 Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
 --

 Key: LUCENE-1877
 URL: https://issues.apache.org/jira/browse/LUCENE-1877
 Project: Lucene - Java
  Issue Type: Improvement
  Components: general/javadocs
Reporter: Mark Miller
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, 
 LUCENE-1877.patch


 A user requested we add a note in IndexWriter alerting the availability of 
 NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm 
 exit). Seems reasonable to me - we want users to be able to easily stumble 
 upon this class. The below code looks like a good spot to add a note - could 
 also improve what's there a bit - opening an IndexWriter does not necessarily 
 create a lock file - that would depend on the LockFactory used.
 {code}<p>Opening an <code>IndexWriter</code> creates a lock file for the 
 directory in use. Trying to open
   another <code>IndexWriter</code> on the same directory will lead to a
   {@link LockObtainFailedException}. The {@link LockObtainFailedException}
   is also thrown if an IndexReader on the same directory is used to delete 
 documents
   from the index.</p>{code}
 Anyone remember why NativeFSLockFactory is not the default over 
 SimpleFSLockFactory?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-1964) Double-check and fix Maven POM dependencies

2011-05-19 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe resolved SOLR-1964.
---

   Resolution: Duplicate
Fix Version/s: 3.1
   3.2

See LUCENE-2657.

 Double-check and fix Maven POM dependencies
 ---

 Key: SOLR-1964
 URL: https://issues.apache.org/jira/browse/SOLR-1964
 Project: Solr
  Issue Type: Bug
  Components: Build
Reporter: Erik Hatcher
Priority: Minor
 Fix For: 3.2, 4.0, 3.1


 To include the velocity deps in solr-core-pom.xml.template, something like 
 this:
 <dependency>
   <groupId>velocity</groupId>
   <artifactId>velocity</artifactId>
   <version>1.6.1</version>
 </dependency>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2531) remove some per-term waste in SimpleFacets

2011-05-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036294#comment-13036294
 ] 

Yonik Seeley commented on SOLR-2531:


Yep - looks like dead code.

 remove some per-term waste in SimpleFacets
 --

 Key: SOLR-2531
 URL: https://issues.apache.org/jira/browse/SOLR-2531
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-2531.patch


 While looking at SOLR-2530,
 Seems like in the 'use filter cache' case of SimpleFacets we:
 1. convert the bytes from utf8->utf16
 2. create a string from the utf16
 3. create a Term object from the string
 doesn't seem like any of this is necessary, as the Term is unused...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2531) remove some per-term waste in SimpleFacets

2011-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036297#comment-13036297
 ] 

Robert Muir commented on SOLR-2531:
---

do the tests cover this >= minDF case well? 

If so, I'll commit.

 remove some per-term waste in SimpleFacets
 --

 Key: SOLR-2531
 URL: https://issues.apache.org/jira/browse/SOLR-2531
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-2531.patch


 While looking at SOLR-2530,
 Seems like in the 'use filter cache' case of SimpleFacets we:
 1. convert the bytes from utf8->utf16
 2. create a string from the utf16
 3. create a Term object from the string
 doesn't seem like any of this is necessary, as the Term is unused...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2531) remove some per-term waste in SimpleFacets

2011-05-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036299#comment-13036299
 ] 

Yonik Seeley commented on SOLR-2531:


Yep - the minDF (to use the filter cache) defaults to 0.

 remove some per-term waste in SimpleFacets
 --

 Key: SOLR-2531
 URL: https://issues.apache.org/jira/browse/SOLR-2531
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-2531.patch


 While looking at SOLR-2530,
 Seems like in the 'use filter cache' case of SimpleFacets we:
 1. convert the bytes from utf8->utf16
 2. create a string from the utf16
 3. create a Term object from the string
 doesn't seem like any of this is necessary, as the Term is unused...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036300#comment-13036300
 ] 

Doron Cohen commented on SOLR-2500:
---

Oops just noticed I was testing all this time TestSolrProperties and not 
TestSolrCoreProperties, and, because the error message was the same as in the 
issue description *No such core: core0* I was sure that this is the same 
test... Now this is confusing...

Hmmm.. the original exception reported above is 
[junit] at 
org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)

So perhaps I was working on the correct bug after all and just the JIRA issue 
title is inaccurate?
Or I need to call it a day... :)

Anyhow, TestSolrProperties consistently behaves as I described here, while 
TestSolrCoreProperties consistently passes (when run in standalone mode).

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036307#comment-13036307
 ] 

Michael McCandless commented on LUCENE-3123:


Does that repro line reproduce the failure for you Doron?  It's odd because 
that test doesn't make that many fields... oh I see it makes a 100 segment 
index. I'll drop that to 50...

The nightly build also hits too-many-open-files every so often, I suspect 
because our random-per-field-codec is making too many codecs... I wonder if we 
should throttle it?  Ie if it accumulates too many codecs, to start sharing 
them b/w fields?

 TestIndexWriter.testBackgroundOptimize fails with too many open files
 -

 Key: LUCENE-3123
 URL: https://issues.apache.org/jira/browse/LUCENE-3123
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
 Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 
 1.6.0_20 (32-bit)/cpus=1,threads=2
Reporter: Doron Cohen

 Recreate with this line:
 ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize 
 -Dtests.seed=-3981504507637360146:51354004663342240
 Might be related to LUCENE-2873 ?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036308#comment-13036308
 ] 

Michael McCandless commented on LUCENE-3123:


I dropped it from 100 to 50 segs.  Can you test if that works in your env Doron?

 TestIndexWriter.testBackgroundOptimize fails with too many open files
 -

 Key: LUCENE-3123
 URL: https://issues.apache.org/jira/browse/LUCENE-3123
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
 Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 
 1.6.0_20 (32-bit)/cpus=1,threads=2
Reporter: Doron Cohen

 Recreate with this line:
 ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize 
 -Dtests.seed=-3981504507637360146:51354004663342240
 Might be related to LUCENE-2873 ?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2531) remove some per-term waste in SimpleFacets

2011-05-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-2531.
---

Resolution: Fixed

Committed revision 1125011.

Thanks for reviewing Yonik.

 remove some per-term waste in SimpleFacets
 --

 Key: SOLR-2531
 URL: https://issues.apache.org/jira/browse/SOLR-2531
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-2531.patch


 While looking at SOLR-2530,
 Seems like in the 'use filter cache' case of SimpleFacets we:
 1. convert the bytes from utf8 to utf16
 2. create a string from the utf16
 3. create a Term object from the string
 doesn't seem like any of this is necessary, as the Term is unused...
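The three steps above can be sketched as follows. This is an illustrative stand-in, not Solr's actual SimpleFacets code: the class `PerTermWaste` and method `countTermBytes` are made up for the sketch, using only JDK classes.

```java
import java.nio.charset.StandardCharsets;

public class PerTermWaste {
    // Stand-in for a consumer that only ever needs the raw term bytes.
    static int countTermBytes(byte[] utf8) {
        return utf8.length;
    }

    // The wasteful path described above: decode utf8 -> utf16, build a
    // String (in Solr, then wrap it in a Term), only to end up back at
    // the same bytes.
    static int viaString(byte[] utf8) {
        String utf16 = new String(utf8, StandardCharsets.UTF_8);      // steps 1 + 2
        byte[] roundTripped = utf16.getBytes(StandardCharsets.UTF_8); // undo it again
        return countTermBytes(roundTripped);
    }

    public static void main(String[] args) {
        byte[] term = "facet-value".getBytes(StandardCharsets.UTF_8);
        // Same answer either way, two fewer allocations per term on the
        // direct path.
        System.out.println(viaString(term) == countTermBytes(term)); // prints true
    }
}
```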

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036318#comment-13036318
 ] 

Michael McCandless commented on SOLR-2500:
--

For me, it's TestSolrProperties that reliably fails if it's been run before.  
Ie, it passes on the first run after ant clean but then fails thereafter.

TestSolrCoreProperties seems to run fine.

(Fedora 13).

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Moving towards Lucene 4.0

2011-05-19 Thread Chris Hostetter

: I think we should focus on everything that's *infrastructure* in 4.0, so
: that we can develop additional features in subsequent 4.x releases. If we
: end up releasing 4.0 just to discover many things will need to wait to 5.0,
: it'll be a big loss.

the catch with that approach (i'm speaking generally here, not with any of 
these particular lucene examples in mind) is that it's hard to know that 
the infrastructure really makes sense until you've built a bunch of stuff 
on it -- i think Josh Bloch has a paper where he says that you shouldn't 
publish an API abstraction until you've built at least 3 *real* 
(ie: not just toy or example) implementations of that API.

it would be really easy to say "the infrastructure for X, Y, and Z is all 
in 4.0, features that leverage this infra will start coming in 4.1" and 
then discover on the way to 4.1 that we botched the APIs.

what does this mean concretely for the specific big ticket changes that 
we've got on trunk? ... i dunno, just my word of caution.

:  we just started the discussion about Lucene 3.2 and releasing more
:  often. Yet, I think we should also start planning for Lucene 4.0 soon.
:  We have tons of stuff in trunk that people want to have and we can't
:  just keep on talking about it - we need to push this out to our users.

I agree, but i think the other approach we should take is to be more 
aggressive about reviewing things that would be good candidates for 
backporting.

If we feel like some feature has a well defined API on trunk, and it's got 
good tests, and people have been using it and filing bugs and helping to 
make it better then we should consider it a candidate for backporting -- 
if the merge itself looks like it would be a huge pain in the ass we don't 
*have* to backport, but we should at least look.

That may not help for any of the big ticket infra changes discussed in 
this thread (where we know it really needs to wait for a major release)
but it would definitely help with the "get features out to users faster" 
issue.



-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036319#comment-13036319
 ] 

Robert Muir commented on SOLR-2500:
---

OK, I think you might be right... TestSolrProperties is the one that just 
failed for me.

I'll look into this test now (though I'm still confused about 
TestSolrCoreProperties, but I'll let that be).

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: SOLR-2500.patch, solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036322#comment-13036322
 ] 

Doron Cohen commented on LUCENE-3123:
-

Yes, thanks, now it passes (trunk) - with this seed, as well as quite a few 
times without specifying a seed. 
I'll now verify on 3x.

 TestIndexWriter.testBackgroundOptimize fails with too many open files
 -

 Key: LUCENE-3123
 URL: https://issues.apache.org/jira/browse/LUCENE-3123
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
 Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 
 1.6.0_20 (32-bit)/cpus=1,threads=2
Reporter: Doron Cohen

 Recreate with this line:
 ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize 
 -Dtests.seed=-3981504507637360146:51354004663342240
 Might be related to LUCENE-2873 ?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 20:43, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

 Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

 I don't know of a straight up comparison...
I did compare MMapDir vs RAMDir variant a couple of years ago.
Searches slowed down a teeny-weeny little bit. GC times went down
noticeably. For me it was a big win.

Whatever Mike might say, mmap is great for latency-conscious applications : )

If someone tries to create an artificial benchmark for byte[] vs
ByteBuffer, I'd recommend going through Lucene's abstraction layer.
If you simply read/write in a loop, the JIT will optimize away bounds
checks for byte[] in some cases. That never happened with the *Buffer
family for me.
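For illustration, here is the kind of raw read loop being described. The class and method names are made up for this sketch, and it makes no claim about actual measured numbers; going through Lucene's Directory abstraction, as Earwin recommends, is the fairer benchmark.

```java
import java.nio.ByteBuffer;

public class ReadLoopBench {
    // Plain array loop: HotSpot can prove i is in range and hoist the
    // bounds check out of this loop.
    static long sumArray(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // ByteBuffer loop: each absolute get(i) goes through index/limit
    // checks which, per the report above, the JIT of that era did not
    // reliably eliminate.
    static long sumBuffer(ByteBuffer data) {
        long sum = 0;
        for (int i = 0; i < data.limit(); i++) sum += data.get(i);
        return sum;
    }

    public static void main(String[] args) {
        byte[] raw = new byte[1 << 20];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) i;
        ByteBuffer buf = ByteBuffer.wrap(raw);
        // Identical results; only the per-access checking differs.
        System.out.println(sumArray(raw) == sumBuffer(buf)); // prints true
    }
}
```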

 There are also RAM-based SSDs whose performance could be comparable with, 
 well, RAM.

 True, though it's through layers of abstraction designed originally
 for serving files off of spinning magnets :)

 Also, with our heap based field caches, the first sorted
 search requires that they be loaded into RAM.  Then we don't unload
 them until the reader is closed?  With MMap the unloading would happen
 automatically?

 True, but really if the app knows it won't need that FC entry for a
 long time (ie, long enough to make it worth unloading/reloading) then
 it should really unload it.  MMap would still have to write all those
 pages to disk...

 DocValues actually makes this a lot cheaper because loading DocValues
 is much (like ~100 X from Simon's testing) faster than populating
 FieldCache since FieldCache must do all the uninverting.

 Mike

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-1143) Return partial results when a connection to a shard is refused

2011-05-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned SOLR-1143:
-

Assignee: (was: Grant Ingersoll)

 Return partial results when a connection to a shard is refused
 --

 Key: SOLR-1143
 URL: https://issues.apache.org/jira/browse/SOLR-1143
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Nicolas Dessaigne
 Fix For: 3.2

 Attachments: SOLR-1143-2.patch, SOLR-1143-3.patch, SOLR-1143.patch


 If any shard is down in a distributed search, a ConnectException is thrown.
 Here's a little patch that changes this behaviour: if we can't connect to a 
 shard (ConnectException), we get partial results from the active shards. As 
 with the TimeOut parameter (https://issues.apache.org/jira/browse/SOLR-502), we 
 set the parameter partialResults to true.
 This patch also addresses a problem expressed in the mailing list about a year 
 ago 
 (http://www.nabble.com/partialResults,-distributed-search---SOLR-502-td19002610.html).
 We have a use case that needs this behaviour and we would like to know your 
 thoughts about such a behaviour. Should it be the default behaviour for 
 distributed search?
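A hedged sketch of the behaviour the description asks for: query every shard, and on a refused connection mark the response partial instead of failing the whole request. The `Merged` class and `Callable`-based shards are stand-ins for this sketch, not Solr's actual distributed-search classes.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class PartialShardSearch {
    // Stand-in for a merged distributed response.
    static class Merged {
        final List<String> docs = new ArrayList<>();
        boolean partialResults = false;
    }

    // Collect results from each shard; a refused connection sets the
    // partialResults flag rather than aborting the request.
    static Merged search(List<Callable<List<String>>> shards) throws Exception {
        Merged merged = new Merged();
        for (Callable<List<String>> shard : shards) {
            try {
                merged.docs.addAll(shard.call());
            } catch (ConnectException e) {
                merged.partialResults = true; // mirrors partialResults=true
            }
        }
        return merged;
    }

    public static void main(String[] args) throws Exception {
        List<Callable<List<String>>> shards = new ArrayList<>();
        shards.add(() -> List.of("doc1", "doc2"));                     // healthy shard
        shards.add(() -> { throw new ConnectException("shard down"); }); // down shard
        Merged m = search(shards);
        System.out.println(m.docs.size() + " " + m.partialResults); // prints "2 true"
    }
}
```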

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3123) TestIndexWriter.testBackgroundOptimize fails with too many open files

2011-05-19 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036331#comment-13036331
 ] 

Doron Cohen commented on LUCENE-3123:
-

In fact, in 3x this is not reproducible with the same seed (expected, as Robert 
once explained), and I was not able to reproduce it with no seed; tried with 
-Dtest.iter=100 as well (though I am not sure: would a new seed be created in 
each iteration? Need to verify this...).
Anyhow, in 3x the test also passes after svn up with this fix.
So I think this can be resolved...

 TestIndexWriter.testBackgroundOptimize fails with too many open files
 -

 Key: LUCENE-3123
 URL: https://issues.apache.org/jira/browse/LUCENE-3123
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
 Environment: Linux 2.6.32-31-generic i386/Sun Microsystems Inc. 
 1.6.0_20 (32-bit)/cpus=1,threads=2
Reporter: Doron Cohen

 Recreate with this line:
 ant test -Dtestcase=TestIndexWriter -Dtestmethod=testBackgroundOptimize 
 -Dtests.seed=-3981504507637360146:51354004663342240
 Might be related to LUCENE-2873 ?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2500) TestSolrCoreProperties sometimes fails with no such core: core0

2011-05-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-2500:
--

Attachment: SOLR-2500.patch

The attached patch is a workaround for the issue for now, but we should fix the 
test to be cleaner, as I don't like what's going on here.

What's happening is that the test changes its solr.xml configuration file, which 
is in build/tests/solr/shared/solr.xml. The next time you run the tests, it 
won't copy over this file because it has a newer timestamp.

In my opinion the test should really make its own private home so it won't 
meddle with other tests or have problems like this (we can fix the test to do 
this), but this is a simple intermediate fix if you guys don't mind testing it.
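The "private home per test" idea could look roughly like the sketch below: copy the pristine solr.xml into a fresh temp directory before each run, so a test that mutates it can never poison later runs. The helper name and layout are hypothetical, not the actual patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PrivateSolrHome {
    // Create a throwaway home directory seeded with a pristine solr.xml.
    // The test would then point solr.solr.home at the returned directory.
    static Path makePrivateHome(Path pristineSolrXml) throws IOException {
        Path home = Files.createTempDirectory("solr-test-home-");
        Files.copy(pristineSolrXml, home.resolve("solr.xml"),
                   StandardCopyOption.REPLACE_EXISTING);
        return home;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the checked-in, pristine config file.
        Path pristine = Files.createTempFile("solr", ".xml");
        Files.write(pristine, "<solr/>".getBytes());

        Path home = makePrivateHome(pristine);
        System.out.println(Files.exists(home.resolve("solr.xml"))); // prints true
    }
}
```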

 TestSolrCoreProperties sometimes fails with no such core: core0
 -

 Key: SOLR-2500
 URL: https://issues.apache.org/jira/browse/SOLR-2500
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Robert Muir
 Attachments: SOLR-2500.patch, SOLR-2500.patch, 
 solr-after-1st-run.xml, solr-clean.xml


 [junit] Testsuite: 
 org.apache.solr.client.solrj.embedded.TestSolrProperties
 [junit] Testcase: 
 testProperties(org.apache.solr.client.solrj.embedded.TestSolrProperties): 
 Caused an ERROR
 [junit] No such core: core0
 [junit] org.apache.solr.common.SolrException: No such core: core0
 [junit] at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
 [junit] at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 [junit] at 
 org.apache.solr.client.solrj.embedded.TestSolrProperties.testProperties(TestSolrProperties.java:128)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1260)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1189)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2371) Add a min() function query, upgrade max() function query to take two value sources

2011-05-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved SOLR-2371.
---

Resolution: Fixed

 Add a min() function query, upgrade max() function query to take two value 
 sources
 --

 Key: SOLR-2371
 URL: https://issues.apache.org/jira/browse/SOLR-2371
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Priority: Trivial
 Fix For: 3.2, 4.0

 Attachments: SOLR-2371.patch


 There doesn't appear to be a min() function.  Also, max() only allows a value 
 source and a constant b/c it is from before we had more flexible parsing.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


