[jira] [Created] (SOLR-11306) Solr example schemas inaccurate comments on docValues and StrField

2017-08-31 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-11306:
--

 Summary: Solr example schemas inaccurate comments on  docValues 
and StrField
 Key: SOLR-11306
 URL: https://issues.apache.org/jira/browse/SOLR-11306
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: examples
Affects Versions: 6.6, 7.0
Reporter: Tom Burton-West
Priority: Minor


Several of the example managed-schema files have an outdated comment about 
docValues and StrField.  In Solr 6.6.0 these are under solr-6.6.0/solr/server 
and the lines where the comment starts for each file are:
solr/configsets/basic_configs/conf/managed-schema:216:   
solr/configsets/data_driven_schema_configs/conf/managed-schema:221:
solr/configsets/sample_techproducts_configs/conf/managed-schema:317

In the case of 
solr-6.6.0/server/solr/configsets/basic_configs/conf/managed-schema, shortly 
after the comment are some lines which seem to directly contradict it:

216  

On line 221 a StrField is declared with docValues that is multiValued:
221  

Also note that the comments above say that the field must either be required or 
have a default value, but line 221 appears to satisfy neither condition.

The Lucene JavaDocs indicate that doc values can be multi-valued (the SORTED_SET type):
https://lucene.apache.org/core/6_6_0//core/org/apache/lucene/index/DocValuesType.html
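
A minimal Lucene-level sketch of such a field (the field name "category" is made up, and
this assumes a multiValued StrField with docValues corresponds to the SORTED_SET
doc-values type):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

public class MultiValuedStringDocValuesExample {
  public static void main(String[] args) {
    Document doc = new Document();
    // Indexed terms for a multi-valued string field
    doc.add(new StringField("category", "fiction", Field.Store.NO));
    doc.add(new StringField("category", "history", Field.Store.NO));
    // SORTED_SET doc values accept several values per document for the same field
    doc.add(new SortedSetDocValuesField("category", new BytesRef("fiction")));
    doc.add(new SortedSetDocValuesField("category", new BytesRef("history")));
    System.out.println(doc);
  }
}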



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Error in Solr 6.6 Example schemas re: docValues for StrField type must be single-valued?

2017-08-30 Thread Tom Burton-West
Hello,

There appears to be an error in the comments in the Solr 6.6 example
schemas.  There were no responses to this on the Solr users list (except one
off-list reply), so I am posting to the dev list in the hope of fixing the
comments if they are indeed in error. See below. Should I open a JIRA issue?

The comments in the example schemas for Solr 6.6 state that the StrField
type must be single-valued to support doc values.

For example Solr-6.6.0/server/solr/configsets/basic_configs/conf/
managed-schema:

216  

However, on line 221 a StrField is declared with docValues that is
multiValued:
221  

Also note that the comments above say that the field must either be
required or have a default value, but line 221 appears to satisfy neither
condition.

The Lucene JavaDocs indicate that doc values can be multi-valued (the SORTED_SET type):
https://lucene.apache.org/core/6_6_0//core/org/apache/lucene/index/DocValuesType.html

Is the comment in the example schema file completely wrong, or is there
some issue with using docValues with a multiValued StrField?

Tom Burton-West

https://www.hathitrust.org/blogs/large-scale-search


[jira] [Commented] (SOLR-8841) edismax: minimum match and compound words

2016-03-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194276#comment-15194276
 ] 

Tom Burton-West commented on SOLR-8841:
---

This looks very similar to the bug that was fixed in Solr 4.1 for  SOLR-3589
https://issues.apache.org/jira/browse/SOLR-3589
I wonder if the fix somehow got lost in the move to Solr 5.5?  
Does the test labeled "SOLR-3589: Edismax parser does not honor mm parameter if 
analyzer splits a token"  in 
https://github.com/apache/lucene-solr/blob/branch_5_5/solr/core/src/test/org/apache/solr/search/TestExtendedDismaxParser.java
 run ok?

Tom




> edismax: minimum match and compound words
> -
>
> Key: SOLR-8841
> URL: https://issues.apache.org/jira/browse/SOLR-8841
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 5.5, trunk
> Environment: all
>Reporter: Christian Winkler
>
> Hi,
> when searching for a single word which is split by a compound word splitter 
> (very common in German), minimum match is not handled correctly. It is always 
> set to 1 (only a single search term), but as the word is split into several 
> single parts, one matching part is enough
> This also happens if mm is set to 100%.
> Probably mm should be set after the split has been performed. Similar 
> problems might arise with synonymization at search time.
> Regards
> Christian 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6828) Speed up requests for many rows

2015-10-07 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947487#comment-14947487
 ] 

Tom Burton-West commented on LUCENE-6828:
-

Thanks Erick,

I plan to add a docValues id field the next time we re-index all 14 million 
volumes.  After we do our next re-index, I'll give it a try, but I'll have to 
write some code to get the counts from all the shards.  I'll also look at the 
5.x streaming stuff.

Toke, sorry if this is off-topic:)

Tom

> Speed up requests for many rows
> ---
>
> Key: LUCENE-6828
> URL: https://issues.apache.org/jira/browse/LUCENE-6828
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 4.10.4, 5.3
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: memory, performance
>
> Standard relevance ranked searches for top-X results use the HitQueue class 
> to keep track of the highest scoring documents. The HitQueue is a binary heap 
> of ScoreDocs and is pre-filled with sentinel objects upon creation.
> Binary heaps of Objects in Java do not scale well: The HitQueue uses 28 
> bytes/element and memory access is scattered due to the binary heap algorithm 
> and the use of Objects. To make matters worse, the use of sentinel objects 
> means that even if only a tiny number of documents matches, the full amount 
> of Objects is still allocated.
> As long as the HitQueue is small (< 1000), it performs very well. If top-1M 
> results are requested, it performs poorly and leaves 1M ScoreDocs to be 
> garbage collected.
> An alternative is to replace the ScoreDocs with a single array of packed 
> longs, each long holding the score and the document ID. This strategy 
> requires only 8 bytes/element and is a lot lighter on the GC.
> Some preliminary tests have been done and published at 
> https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
> These indicate that a long[]-backed implementation is at least 3x faster than 
> vanilla HitDocs for top-1M requests.
> For smaller requests, such as top-10, the packed version also seems 
> competitive, when the amount of matched documents exceeds 1M. This needs to 
> be investigated further.
> Going forward with this idea requires some refactoring as Lucene is currently 
> hardwired to the abstract PriorityQueue. Before attempting this, it seems 
> prudent to discuss whether speeding up large top-X requests has any value? 
> Paging seems an obvious contender for requesting large result sets, but I 
> guess the two could work in tandem, opening up for efficient large pages.
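
A minimal sketch of the packed-long idea described above (independent of the actual
patch; the scores and doc ids are made up):

import java.util.Arrays;

public class PackedHitsSketch {

  // Pack a non-negative float score into the high 32 bits and the doc id into the low 32 bits.
  static long pack(float score, int docId) {
    return (((long) Float.floatToIntBits(score)) << 32) | (docId & 0xFFFFFFFFL);
  }

  static float score(long packed) { return Float.intBitsToFloat((int) (packed >>> 32)); }

  static int docId(long packed) { return (int) packed; }

  public static void main(String[] args) {
    long[] hits = { pack(0.3f, 17), pack(2.5f, 4), pack(1.1f, 42) };
    // For non-negative scores the raw float bits sort in value order, so an
    // ascending sort of the packed longs leaves the best-scoring hit at the end.
    Arrays.sort(hits);
    long best = hits[hits.length - 1];
    System.out.println("top doc=" + docId(best) + " score=" + score(best));
  }
}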



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6828) Speed up requests for many rows

2015-10-07 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947328#comment-14947328
 ] 

Tom Burton-West commented on LUCENE-6828:
-

We have a use case where some of our users want set-based results. They don't care 
about relevance ranking or sorting; they just want a list of all unique ids 
(external, not Lucene ids) that meet some search criteria. Sometimes these sets 
are in the millions.  We distribute our index over many shards, so an efficient 
method of grabbing all the result ids for large result sets would be extremely 
useful.

Tom Burton-West
https://www.hathitrust.org/blogs/large-scale-search

> Speed up requests for many rows
> ---
>
> Key: LUCENE-6828
> URL: https://issues.apache.org/jira/browse/LUCENE-6828
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 4.10.4, 5.3
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: memory, performance
>
> Standard relevance ranked searches for top-X results use the HitQueue class 
> to keep track of the highest scoring documents. The HitQueue is a binary heap 
> of ScoreDocs and is pre-filled with sentinel objects upon creation.
> Binary heaps of Objects in Java do not scale well: The HitQueue uses 28 
> bytes/element and memory access is scattered due to the binary heap algorithm 
> and the use of Objects. To make matters worse, the use of sentinel objects 
> means that even if only a tiny number of documents matches, the full amount 
> of Objects is still allocated.
> As long as the HitQueue is small (< 1000), it performs very well. If top-1M 
> results are requested, it performs poorly and leaves 1M ScoreDocs to be 
> garbage collected.
> An alternative is to replace the ScoreDocs with a single array of packed 
> longs, each long holding the score and the document ID. This strategy 
> requires only 8 bytes/element and is a lot lighter on the GC.
> Some preliminary tests have been done and published at 
> https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
> These indicate that a long[]-backed implementation is at least 3x faster than 
> vanilla HitDocs for top-1M requests.
> For smaller requests, such as top-10, the packed version also seems 
> competitive, when the amount of matched documents exceeds 1M. This needs to 
> be investigated further.
> Going forward with this idea requires some refactoring as Lucene is currently 
> hardwired to the abstract PriorityQueue. Before attempting this, it seems 
> prudent to discuss whether speeding up large top-X requests has any value? 
> Paging seems an obvious contender for requesting large result sets, but I 
> guess the two could work in tandem, opening up for efficient large pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Where Search Meets Machine Learning

2015-05-04 Thread Tom Burton-West
Hi Doug and Joaquin,

This is a really interesting discussion.  Joaquin, I'm looking forward to
taking your code for a test drive.  Thank you for making it publicly
available.

Doug,  I'm interested in your pyramid observation.  I work with academic
search, which has some of the problems of unique queries/information needs
and of data sparsity that you mention in your blog post.

This article makes a similar argument: massive amounts of user data are
so important for modern search engines that they are essentially a barrier to
entry for new web search engines.
Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
http://www.springerlink.com/index/58255K40151U036N.pdf

 Tom


> I noticed that information retrieval problems fall into a sort-of layered
> pyramid. At the topmost point is someone like Google, where the sheer
> amount of high quality user behavior data means that search truly is a machine
> learning problem, much as you propose. As you move down the pyramid the
> quality of user data diminishes.
>
> Eventually you get to a very thick layer of middle-class search
> applications that value relevance, but have very modest amounts or no user
> data. For most of them, even if they tracked their searches over a year,
> they *might* get good data over their top 50 searches. (I know cause they
> send me the spreadsheet and say fix it!). The best use they can make of analytics
> data is after-action troubleshooting. Actual user emails complaining about
> the search can be more useful than behavior data!
>
>
>


Re: Solr and non-default minBlockSize/maxBlockSize for PostingsFormat

2015-03-18 Thread Tom Burton-West
Sorry, I know Solr 10 won't be  released for quite some time, since 5 is
the current release...  I meant  Solr 4.10.2

On Wed, Mar 18, 2015 at 4:11 PM, Tom Burton-West  wrote:

> Hello,
>
> Using Solr 10.10.2 I created a wrapper class plugin that instantiates the
> Lucene41PostingsFormat with non-default parameters for the minBlockSize and
> maxBlockSize.   I have created a read-only index. (i.e. there will never be
> any updates to this index.)
>
> I have two questions.
>
> 1) I need to be able to give a copy of the index to several people and it
> would be nice if they didn't have to have a copy of my plugin to read it:
>
> Does the index have metadata in it so that a new Solr instance would not
> need the plugin to read the index properly, i.e. can the block sizes be
> detected and the index read correctly?
>
>
> 2) How can I tell that the plugin is working (other than lack of OOM
> errors)
> Is there anything in the indexwriter (infostream) log or anywhere else
> where I can confirm that the index that has been written is actually using
> the block sizes given?
> Alternatively is there code in a test case I might use to get the block
> sizes for a segment?
>
> Tom
>
> See this thread for background:
> http://lucene.472066.n3.nabble.com/How-to-configure-Solr-PostingsFormat-block-size-tt4179029.html
>
>
>
>
>


Solr and non-default minBlockSize/maxBlockSize for PostingsFormat

2015-03-18 Thread Tom Burton-West
Hello,

Using Solr 10.10.2 I created a wrapper class plugin that instantiates the
Lucene41PostingsFormat with non-default parameters for the minBlockSize and
maxBlockSize.   I have created a read-only index. (i.e. there will never be
any updates to this index.)

I have two questions.

1) I need to be able to give a copy of the index to several people and it
would be nice if they didn't have to have a copy of my plugin to read it:

Does the index have metadata in it so that a new Solr instance would not
need the plugin to read the index properly, i.e. can the block sizes be
detected and the index read correctly?


2) How can I tell that the plugin is working (other than lack of OOM errors)
Is there anything in the indexwriter (infostream) log or anywhere else
where I can confirm that the index that has been written is actually using
the block sizes given?
Alternatively is there code in a test case I might use to get the block
sizes for a segment?

Tom

See this thread for background:
http://lucene.472066.n3.nabble.com/How-to-configure-Solr-PostingsFormat-block-size-tt4179029.html


Re: Custom PostingsFormat SPILoader issues

2015-03-13 Thread Tom Burton-West
Hi Hoss,

Thanks for the detailed explanation. This all makes sense now, including the
specific error message and my multiple errors.

I put the correct org.apache.lucene.codecs.PostingsFormat in the jar,
 indexed and searched some documents and everything is working fine.

I'll push the code/configuration to our test machines, index a few hundred
GB of book OCR,  and see if this will now enable us to run our normal
indexing and searching with significantly less memory.

Tom


On Fri, Mar 13, 2015 at 12:26 PM, Chris Hostetter 
wrote:

>
> : I don't really understand SPI and class loaders, but you are right this
> : class is a subclass of PostingsFormat not Codecs.   So is there an issue
> : with the whole idea, or is there just some subtlety of class loading and
> : the SPILoader I'm not understanding?
>
> SPI is just a mechanism Java gives you to dynamically load Class
> implementations of an Abstraction using a symbolic name -- it's a
> factory pattern, where the Factory (provided by the JVM) doesn't have
> to know all of the implementations (which can be written by users and
> made available on the classpath in arbitrary Jar files).
>
> the files you put in your META-INF/services directory should match the
> class name that Lucene is expecting and that you are extending -- so if you
> are extending PostingsFormat, then you register that as a PostingsFormat
> implementation in a
> META-INF/services/org.apache.lucene.codecs.PostingsFormat file inside your
> jar.
>
> you don't need/want to specify any other classes in those service files.
> In the service file you mentioned trying to use...
>
> : Contents of  META-INF/services/org.apache.lucene.codecs.Codec in the jar
> : file:
> : org.apache.lucene.codecs.lucene49.Lucene49Codec
> : org.apache.lucene.codecs.lucene410.Lucene410Codec
> : # tbw adds custom wrapper here per Hoss e-mail
> : org.apache.lucene.codecs.HTPostingsFormatWrapper
>
> ...you're telling the JVM 3 things that are incorrect...
>
> 1) that your jar contains a Lucene49Codec class (it does not)
> 2) that your jar contains a Lucene410Codec class (also no)
> 3) that your jar contains an HTPostingsFormatWrapper (true but) which
> extends Codec (it does not)
>
>
> ...and at no point does your services file tell SPI that your
> HTPostingsFormatWrapper is/can-be a PostingsFormat.
>
> Your specific error seems to be coming from Lucene trying to scan the list
> of SPI loadable *Codec* implementations, and being confused because you've
> said HTPostingsFormatWrapper is an implementation of "Codec" but it can't
> be cast as a Codec.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Custom PostingsFormat SPILoader issues

2015-03-13 Thread Tom Burton-West
Thanks again Uwe (and Hoss and Mike)

Just reread your message and saw that I hadn't paid enough attention to :
 If you just want to create your own PostingsFormat, you have to put it
into the other META-INF file for org.apache.lucene.codecs.PostingsFormats.

I put the wrong META-INF file in the jar.   I'll try putting an entry in
META-INF/services/org.apache.lucene.codecs.PostingsFormat in the jar .

Sorry for not reading your message carefully enough before sending a
response.

Tom



On Fri, Mar 13, 2015 at 12:13 PM, Tom Burton-West 
wrote:

> Thanks Uwe,
>
> I'm pretty much going from what Hoss told me in the thread here::
> http://lucene.472066.n3.nabble.com/How-to-configure-Solr-PostingsFormat-block-size-tt4179029.html
>
> All I am really trying to do is instantiate the
> regular Lucene41PostingsFormat with non-default minTermBlockSize and
> maxTermBlockSize parameters.  However, that apparently can't be done in
> schema.xml.   So Hoss suggested a wrapper class around PostingsFormat that
> instantiates the Lucene41PostingsFormat with the desired parameters:
>
> "where does that leave you as a solr user who wants to write a plugin,
> since Solr only allows you to configure the SPI name (no constructor
> args) via 'postingFormat="foo"' the anwser is that instead of writing a
> subclass, you would have to write a small proxy class, something like...
>
> public final class MyPfWrapper extends PostingFormat {
>   PostingFormat pf = new Lucene50PostingsFormat(42, 9);
>   public MyPfWrapper() {
> super("MyPfWrapper");
>   }
> 
> rest of code skipped.
>
> I don't really understand SPI and class loaders, but you are right this
> class is a subclass of PostingsFormat not Codecs.   So is there an issue
> with the whole idea, or is there just some subtlety of class loading and
> the SPILoader I'm not understanding?
>
> Tom
>
>
>
>
>
>
> On Fri, Mar 13, 2015 at 11:35 AM, Uwe Schindler  wrote:
>
>> Hi,
>>
>>
>>
>> To me this looks like the implementing class is not a real subclass of
>> org.apache.lucene.codecs.Codec – because you said “PostingsFormat” not
>> “Codec” in your mail? If you just want to create your own PostingsFormat,
>> you have to put it into the other META-INF file for
>> org.apache.lucene.codecs.PostingsFormats. Creating own codecs is in most
>> cases not needed, most people are only interested in postings formats.
>>
>>
>>
>> Another reason for this could be that the JAR file with the codec is in a
>> different classloader than the one of lucene-core.jar.
>>
>>
>>
>> Uwe
>>
>>
>>
>> -
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: u...@thetaphi.de
>>
>>
>>


Re: Custom PostingsFormat SPILoader issues

2015-03-13 Thread Tom Burton-West
Thanks Uwe,

I'm pretty much going from what Hoss told me in the thread here::
http://lucene.472066.n3.nabble.com/How-to-configure-Solr-PostingsFormat-block-size-tt4179029.html

All I am really trying to do is instantiate the
regular Lucene41PostingsFormat with non-default minTermBlockSize and
maxTermBlockSize parameters.  However, that apparently can't be done in
schema.xml.   So Hoss suggested a wrapper class around PostingsFormat that
instantiates the Lucene41PostingsFormat with the desired parameters:

"where does that leave you as a solr user who wants to write a plugin,
since Solr only allows you to configure the SPI name (no constructor args)
via 'postingFormat="foo"' the anwser is that instead of writing a subclass,
you would have to write a small proxy class, something like...

public final class MyPfWrapper extends PostingFormat {
  PostingFormat pf = new Lucene50PostingsFormat(42, 9);
  public MyPfWrapper() {
super("MyPfWrapper");
  }

rest of code skipped.
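
For reference, a fuller sketch of this proxy idea against the Lucene 4.10 API, using the
HTPostingsFormatWrapper name from this thread (the block sizes 128/512 are placeholders,
not recommended values):

package org.apache.lucene.codecs;

import java.io.IOException;

import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Delegates everything to a Lucene41PostingsFormat constructed with non-default
// term block sizes.  The SPI name passed to super() is what schema.xml refers to,
// and the jar must list this class in
// META-INF/services/org.apache.lucene.codecs.PostingsFormat.
public final class HTPostingsFormatWrapper extends PostingsFormat {

  private final PostingsFormat delegate = new Lucene41PostingsFormat(128, 512);

  public HTPostingsFormatWrapper() {
    super("HTPostingsFormatWrapper");
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state);
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    return delegate.fieldsProducer(state);
  }
}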

I don't really understand SPI and class loaders, but you are right this
class is a subclass of PostingsFormat not Codecs.   So is there an issue
with the whole idea, or is there just some subtlety of class loading and
the SPILoader I'm not understanding?

Tom






On Fri, Mar 13, 2015 at 11:35 AM, Uwe Schindler  wrote:

> Hi,
>
>
>
> To me this looks like the implementing class is not a real subclass of
> org.apache.lucene.codecs.Codec – because you said “PostingsFormat” not
> “Codec” in your mail? If you just want to create your own PostingsFormat,
> you have to put it into the other META-INF file for
> org.apache.lucene.codecs.PostingsFormats. Creating own codecs is in most
> cases not needed, most people are only interested in postings formats.
>
>
>
> Another reason for this could be that the JAR file with the codec is in a
> different classloader than the one of lucene-core.jar.
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>


Custom PostingsFormat SPILoader issues

2015-03-13 Thread Tom Burton-West
Hello,

I'm trying to configure Solr to use a custom Postings Format using the
SPILoader.

I specified my custom postings format in the  schema.xml file:



Then I created a custom postings format class (it's actually a simple
wrapper class), compiled a jar and included an
org.apache.lucene.codecs.Codec file in META-INF/services in the jar file
with an entry for the wrapper class: HTPostingsFormatWrapper.   I created a
collection1/lib directory and put the jar there. (see below)

I'm getting a "ClassCastException ... Class.asSubclass(Unknown Source)" error
(see below).

My first thought is that maybe putting a plugin class in collection1/lib is
no longer the best option and something about the order of loading classes
is causing a problem.

Any suggestions on how to troubleshoot this?.

For background see this thread on the Solr mailing list:
http://lucene.472066.n3.nabble.com/How-to-configure-Solr-PostingsFormat-block-size-tt4179029.html


Tom



error:
Caused by: java.lang.ClassCastException: class
org.apache.lucene.codecs.HTPostingsFormatWrapper
 at java.lang.Class.asSubclass(Unknown Source)
 at org.apache.lucene.util.SPIClassIterator.next(SPIClassIterator.java:141)


---
Contents of the jar file:

C:\d\solr\lucene_solr_4_10_2\solr\example\solr\collection1\lib>jar -tvf
HTPostingsFormatWrapper.jar
25 Thu Mar 12 10:37:04 EDT 2015 META-INF/MANIFEST.MF
  1253 Thu Mar 12 10:37:04 EDT 2015
org/apache/lucene/codecs/HTPostingsFormatWrapper.class
  1276 Thu Mar 12 10:49:06 EDT 2015
META-INF/services/org.apache.lucene.codecs.Codec




Contents of  META-INF/services/org.apache.lucene.codecs.Codec in the jar
file:
org.apache.lucene.codecs.lucene49.Lucene49Codec
org.apache.lucene.codecs.lucene410.Lucene410Codec
# tbw adds custom wrapper here per Hoss e-mail
org.apache.lucene.codecs.HTPostingsFormatWrapper

-
log file excerpt with stack trace:

12821 [main] INFO  org.apache.solr.core.CoresLocator  – Looking for core
definitions underneath C:\d\solr\lucene_solr_4_10_2\solr\example\solr
12838 [main] INFO  org.apache.solr.core.CoresLocator  – Found core
collection1 in C:\d\solr\lucene_solr_4_10_2\solr\example\solr\collection1\
12839 [main] INFO  org.apache.solr.core.CoresLocator  – Found 1 core
definitions
12841 [coreLoadExecutor-5-thread-1] INFO
 org.apache.solr.core.SolrResourceLoader  – new SolrResourceLoader for
directory: 'C:\d\solr\lucene_solr_4_10_2\solr\example\solr\collection1\'
12842 [coreLoadExecutor-5-thread-1] INFO
 org.apache.solr.core.SolrResourceLoader  – Adding
'file:/C:/d/solr/lucene_solr_4_10_2/solr/example/solr/collection1/lib/HTPostingsFormatWrapper.jar'
to classloader
12870 [coreLoadExecutor-5-thread-1] ERROR
org.apache.solr.core.CoreContainer  – Error creating core [collection1]:
class org.apache.lucene.codecs.HTPostingsFormatWrapper
java.lang.ClassCastException: class
org.apache.lucene.codecs.HTPostingsFormatWrapper
at java.lang.Class.asSubclass(Unknown Source)
at org.apache.lucene.util.SPIClassIterator.next(SPIClassIterator.java:141)
at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:65)
at org.apache.lucene.codecs.Codec.reloadCodecs(Codec.java:119)
at
org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:206)
at
org.apache.solr.core.SolrResourceLoader.<init>(SolrResourceLoader.java:142)
at
org.apache.solr.core.ConfigSetService$Default.createCoreResourceLoader(ConfigSetService.java:144)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:58)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:489)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)


[jira] [Closed] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-03-06 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West closed SOLR-7175.
-
Resolution: Not a Problem

Problem was in our client code erroneously sending items to Solr to index after 
sending the optimize command.  Not a Solr issue.

>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
> Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz, 
> build-4.iw.2015-02-25.txt.gz, solr4.shotz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-03-06 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350835#comment-14350835
 ] 

Tom Burton-West commented on SOLR-7175:
---

Hi Mike,
Thanks for taking a look.  We found a race condition in our code that resulted 
in the driver thinking all the indexers were finished when they sometimes 
weren't.  It just happened that we inserted this bug in the code about the time 
we switched from Solr 3.6 to Solr 4.10.2 so I jumped to the wrong conclusion.  
I'll go ahead and close the issue.

Tom

>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
>     Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz, 
> build-4.iw.2015-02-25.txt.gz, solr4.shotz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-03-06 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350575#comment-14350575
 ] 

Tom Burton-West edited comment on SOLR-7175 at 3/6/15 4:56 PM:
---

Hi Mike,

Our code is supposed to completely finish indexing before then calling a commit 
and optimize.
I was trying to figure out how indexed documents could be in RAM after we 
called a commit and the resulting flush finished. Indexing should have 
completed prior to our code calling a commit and then optimize (ie. force 
merge).  We will double check our code and of course if we find a bug in the 
code we'll fix the bug, test, and  close the issue.   The reason we suspected 
something on the Solr4/Lucene4 end is that we haven't made any changes to the 
indexing/optimizing code in quite a while and we were not seeing this issue 
with Solr 3.6.




was (Author: tburtonwest):
Hi Mike,

Our code is supposed to completely finish indexing before then calling a commit 
and optimize.
I was trying to figure out how indexed documents could be in RAM after we 
called a commit and the resulting flush finished. Indexing should have 
completed prior to our code calling a commit and then optimize (ie. force 
merge).  We will double check our code and of course if we find a bug in the 
code we'll fix the bug, test, and  close the issue.   The reason we suspected 
something on the Solr4/Lucene4 end is that we haven't made any changes to the 
indexing/optimizing code in quite a while and we were not seeing this issue 
with Solr 4.6.



>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
>     Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz, 
> build-4.iw.2015-02-25.txt.gz, solr4.shotz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-03-06 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350575#comment-14350575
 ] 

Tom Burton-West commented on SOLR-7175:
---

Hi Mike,

Our code is supposed to completely finish indexing before then calling a commit 
and optimize.
I was trying to figure out how indexed documents could be in RAM after we 
called a commit and the resulting flush finished. Indexing should have 
completed prior to our code calling a commit and then optimize (ie. force 
merge).  We will double check our code and of course if we find a bug in the 
code we'll fix the bug, test, and  close the issue.   The reason we suspected 
something on the Solr4/Lucene4 end is that we haven't made any changes to the 
indexing/optimizing code in quite a while and we were not seeing this issue 
with Solr 4.6.



>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
>     Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz, 
> build-4.iw.2015-02-25.txt.gz, solr4.shotz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Optimize maxSegments="2" not working right with Solr 4.10.2

2015-03-05 Thread Tom Burton-West
Hello all,

We are continuing to see inconsistent behavior with <optimize maxSegments="2"/>.
Out of 12 shards, one or two of them end up with more than 2 segments at
the finish of the optimize command. (and we see no errors in the logs)  So
far we have found no consistent pattern in which of the shards end up with
more than two segments.

I've opened a ticket, SOLR-7175, and attached some sample indexwriter logs
and our solrconfig.xml file.

Any suggestions on how to troubleshoot this would be appreciated.

Tom


[jira] [Updated] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-02-27 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-7175:
--
Attachment: build-4.iw.2015-02-25.txt.gz

Previous file did not have an explicit commit.
This file: build-4.iw.2015-02-25.txt includes a restart of Solr, a commit, and 
then the optimize maxSegments=2.   Same scenario where after the major merge 
down to 2 segments a flush finds docs in ram and additional segments are 
written to disk.

>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
> Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz, 
> build-4.iw.2015-02-25.txt.gz, solr4.shotz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-02-27 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-7175:
--
Attachment: solr4.shotz

solrconfig.xml file

>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
> Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz, solr4.shotz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-02-27 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-7175:
--
Attachment: build-1.indexwriterlog.2015-02-23.gz

Attached is an indexwriter log where, after a large merge down to 2 segments, 
startFullFlush was called and found additional docs in RAM which were then 
written to 2 new segments.  These new segments were not merged, so the end result 
of calling <optimize maxSegments="2"/> was a shard with 4 segments.

Attached also is our solrconfig.xml file in case the problem is caused by some 
configuration error that overrides the maxSegments=2.


>  results in more than 2 segments after optimize 
> finishes
> ---
>
> Key: SOLR-7175
> URL: https://issues.apache.org/jira/browse/SOLR-7175
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.10.2
> Environment: linux
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: build-1.indexwriterlog.2015-02-23.gz
>
>
> After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
> (out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
> errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-7175) results in more than 2 segments after optimize finishes

2015-02-27 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-7175:
-

 Summary:  results in more than 2 
segments after optimize finishes
 Key: SOLR-7175
 URL: https://issues.apache.org/jira/browse/SOLR-7175
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10.2
 Environment: linux
Reporter: Tom Burton-West
Priority: Minor


After finishing indexing and running a commit, we issue an <optimize maxSegments="2"/> to Solr.  With Solr 4.10.2 we are seeing one or two shards 
(out of 12) with 3 or 4 segments after the optimize finishes.  There are no 
errors in the Solr logs or indexwriter logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Optimize maxSegments="2" not working right with Solr 4.10.2

2015-02-25 Thread Tom Burton-West
Hi Toke,

We are using TieredMergePolicy and have had no problems with Solr 3.6
merging to 60GB or larger.  With Solr 4.10.2 our second shards are on the
order of 60-100GB so a 2GB limit is unlikely to be the cause of the problem
(unless I need to read the code to see what the context of the 2GB limit is
:)

As far as 2 vs 10 segments, it probably doesn't make too much difference,
but we have been optimizing down to two for several years now.  (On the
other hand, I assume that more segments means more disk seeks, but given our
huge index extra seeks are probably dwarfed by the disk transfer time to
pull data from the 600GB segment.)  At this point we are just trying to
understand why Solr isn't consistently doing what we tell it, i.e.
optimize down to two segments.



Tom

On Wed, Feb 25, 2015 at 12:40 PM, Toke Eskildsen 
wrote:

> I'm guessing that you are using LogByteSizeMergePolicy? As far as I can
> see from the code for Solr 4.8, normal maxMergeSize defaults to 2GB.
>
> There also seems to be a bug that effectively disables forced merge as the
> default value for maxMergeSizeForForcedMerge (used when forcing a merge) is
> set with
>
>   public static final double DEFAULT_MAX_MERGE_MB_FOR_FORCED_MERGE =
> Long.MAX_VALUE;
> maxMergeSizeForForcedMerge = (long)
> (DEFAULT_MAX_MERGE_MB_FOR_FORCED_MERGE*1024*1024);
>
> which overflows to -1048576. You should be able to fix this by setting
> maxMergeSizeForForcedMerge explicitly.
>
> Guessing here: Your forced merge attempt does nothing and the standard
> merge maxMergeSize is too low?
> (I should fine-read the code and make a JIRA if the error is present in
> Solr 5)
>
> BTW: Why 2 segments? Single segment indexes have memory and performance
> benefits, but there is not - to my knowledge - much difference between 2 or
> 4 (or 10) segments.
>
> - Toke Eskildsen
> 
> From: Tom Burton-West [tburt...@umich.edu]
> Sent: 25 February 2015 18:11
> To: dev@lucene.apache.org
> Subject: Fwd: Optimize maxSegments="2" not working right with Solr 4.10.2
>
> No replies on the Solr users list, so I thought I would repost to dev.
>
> We are continuing to see inconsistent behavior with <optimize maxSegments="2"/>
> Out of 12 shards 1-3 of them end up with more than 2 segments at the
> finish of the optimize command. (and we see no errors in the logs)
>
>  The pattern seems the same in that after almost all of the segments are
> merged, one or two new segments are created when a startFullFlush happens
> after the big merge.
>
> Any suggestions on how to troubleshoot this would be appreciated.
>
> Tom
>
> -- Forwarded message --
> From: Tom Burton-West mailto:tburt...@umich.edu>>
> Date: Mon, Feb 23, 2015 at 12:41 PM
> Subject: Optimize maxSegments="2" not working right with Solr 4.10.2
> To: "solr-u...@lucene.apache.org<mailto:solr-u...@lucene.apache.org>" <
> solr-u...@lucene.apache.org<mailto:solr-u...@lucene.apache.org>>
> Cc: Phillip Farber mailto:pfar...@umich.edu>>,
> Sebastien Korner mailto:skor...@umich.edu>>
>
>
> Hello,
>
> We normally run an optimize with maxSegments="2"  after our daily
> indexing. This has worked without problem on Solr 3.6.  We recently moved
> to Solr 4.10.2 and on several shards the optimize completed with no errors
> in the logs, but left more than 2 segments.
>
> We send this xml to Solr:
> <optimize maxSegments="2"/>
>
> I've attached a copy of the indexwriter log for one of the segments where
> there were 4 segments rather than the requested number (i.e. there should
> have been only 2 segments) at the end of the optimize.It looks like a
> merge was done down to two segments and then somehow another process
> flushed some postings to disk creating two more segments.  Then there are
> messages about 2 of the remaining 4 segments being too big. (See below)
>
> What we expected is that the remaining 2 small segments (about 40MB) would
> get merged with the smaller of the two large segments, i.e. with the 56GB
> segment, since we gave the argument maxSegments=2.   This didn't happen.
>
>
> Any suggestions about how to troubleshoot this issue would be appreciated.
>
> Tom
>
> ---
> Excerpt from indexwriter log:
>
> TMP][http-8091-Processor5]: findForcedMerges maxSegmentCount=2  ...
> ...
> [IW][Lucene Merge Thread #0]: merge time 3842310 msec for 65236 docs
> ...
> [TMP][http-8091-Processor5]: findMerges: 4 segments
>  [TMP][http-8091-Processor5]:   seg=_1fzb(4.10.2):C1081559/24089:delGen=9
> size=672402.066 MB [skip: too large]
>  [TMP][http-8091-Processor5]:   seg=_1gj2(4.10.2):C65236/2:delGen=1
> size=56179.245 MB [skip: too large]

Fwd: Optimize maxSegments="2" not working right with Solr 4.10.2

2015-02-25 Thread Tom Burton-West
No replies on the Solr users list, so I thought I would repost to dev.

We are continuing to see inconsistent behavior with <optimize maxSegments="2"/>.
Out of 12 shards 1-3 of them end up with more than 2 segments at the finish
of the optimize command. (and we see no errors in the logs)

 The pattern seems the same in that after almost all of the segments are
merged, one or two new segments are created when a startFullFlush happens
after the big merge.

Any suggestions on how to troubleshoot this would be appreciated.

Tom

-- Forwarded message --
From: Tom Burton-West 
Date: Mon, Feb 23, 2015 at 12:41 PM
Subject: Optimize maxSegments="2" not working right with Solr 4.10.2
To: "solr-u...@lucene.apache.org" 
Cc: Phillip Farber , Sebastien Korner 


Hello,

We normally run an optimize with maxSegments="2"  after our daily indexing.
This has worked without problem on Solr 3.6.  We recently moved to Solr
4.10.2 and on several shards the optimize completed with no errors in the
logs, but left more than 2 segments.

We send this xml to Solr:
<optimize maxSegments="2"/>

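
For reference, the SolrJ equivalent of that request would be a sketch like the following
(the core URL is hypothetical):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ForceMergeToTwoSegments {
  public static void main(String[] args) throws Exception {
    // Equivalent to POSTing <optimize maxSegments="2"/> to the update handler
    SolrServer server = new HttpSolrServer("http://localhost:8091/solr/core-1");
    server.optimize(true /* waitFlush */, true /* waitSearcher */, 2 /* maxSegments */);
    server.shutdown();
  }
}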

I've attached a copy of the indexwriter log for one of the shards where
there were 4 segments rather than the requested number (i.e. there should
have been only 2 segments) at the end of the optimize.  It looks like a
merge was done down to two segments and then somehow another process
flushed some postings to disk creating two more segments.  Then there are
messages about 2 of the remaining 4 segments being too big. (See below)

What we expected is that the remaining 2 small segments (about 40MB) would
get merged with the smaller of the two large segments, i.e. with the 56GB
segment, since we gave the argument maxSegments=2.   This didn't happen.


Any suggestions about how to troubleshoot this issue would be appreciated.

Tom

---
Excerpt from indexwriter log:

TMP][http-8091-Processor5]: findForcedMerges maxSegmentCount=2  ...
...
[IW][Lucene Merge Thread #0]: merge time 3842310 msec for 65236 docs
...
[TMP][http-8091-Processor5]: findMerges: 4 segments
 [TMP][http-8091-Processor5]:   seg=_1fzb(4.10.2):C1081559/24089:delGen=9
size=672402.066 MB [skip: too large]
 [TMP][http-8091-Processor5]:   seg=_1gj2(4.10.2):C65236/2:delGen=1
size=56179.245 MB [skip: too large]
 [TMP][http-8091-Processor5]:   seg=_1gj0(4.10.2):C16 size=44.280 MB
 [TMP][http-8091-Processor5]:   seg=_1gj1(4.10.2):C8 size=40.442 MB
 [TMP][http-8091-Processor5]:   allowedSegmentCount=3 vs count=4 (eligible
count=2) tooBigCount=2


build-1.iw.2015-02-23.txt.gz
Description: GNU Zip compressed data

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6192) Long overflow in LuceneXXSkipWriter can corrupt skip data

2015-01-26 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292536#comment-14292536
 ] 

Tom Burton-West commented on LUCENE-6192:
-

Patch works!  Thanks Mike!

Deployed Solr war with the patch and ran optimize on 12 shards.  All  
CheckIndexes passed. 
Below are some of the stats on one of the shards. 

Tom

 About 1 million docs and 700GB index with about 4 billion unique terms, 270 
billion tokens

docCount=1086381
 size (MB)=693,308.47

   test: terms, freq, prox...OK [4113882974 terms; 61631126560 terms/docs 
pairs; 270670957886 tokens]

 field "ocr":
index FST:
  27157406 nodes
  77300582 arcs
  1262090664 bytes
terms:
  4087713620 terms
  50227574334 bytes (12.3 bytes/term)
blocks:
  132202225 blocks
  96419097 terms-only blocks
  40757 sub-block-only blocks
  35742371 mixed blocks
  27202047 floor blocks
  44718055 non-floor blocks
  87484170 floor sub-blocks
  23560113026 term suffix bytes (178.2 suffix-bytes/block)
  8227225977 term stats bytes (62.2 stats-bytes/block)
  19664735257 other bytes (148.7 other-bytes/block)
  by prefix length:


> Long overflow in LuceneXXSkipWriter can corrupt skip data
> -
>
> Key: LUCENE-6192
> URL: https://issues.apache.org/jira/browse/LUCENE-6192
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 5.0, Trunk, 4.x
>
> Attachments: LUCENE-6192.patch
>
>
> I've been iterating with Tom on this corruption that CheckIndex detects in 
> his rather large index (720 GB in a single segment):
> {noformat}
>  java -Xmx16G -Xms16G -cp $JAR -ea:org.apache.lucene... 
> org.apache.lucene.index.CheckIndex //shards/4/core-1/data/test_index 
> -verbose 2>&1 |tee -a shard4_reoptimizedNewJava
> Opening index @ /htsolr/lss-reindex/shards/4/core-1/data/test_index
> Segments file=segments_e numSegments=1 version=4.10.2 format= 
> userData={commitTimeMSec=1421479358825}
>   1 of 1: name=_8m8 docCount=1130856
> version=4.10.2
> codec=Lucene410
> compound=false
> numFiles=10
> size (MB)=719,967.32
> diagnostics = {timestamp=1421437320935, os=Linux, 
> os.version=2.6.18-400.1.1.el5, mergeFactor=2, source=merge, 
> lucene.version=4.10.2, os.arch=amd64, mergeMaxNumSegments=1, 
> java.version=1.7.0_71, java.vendor=Oracle Corporation}
> no deletions
> test: open reader.OK
> test: check integrity.OK
> test: check live docs.OK
> test: fields..OK [80 fields]
> test: field norms.OK [23 fields]
> test: terms, freq, prox...ERROR: java.lang.AssertionError: -96
> java.lang.AssertionError: -96
> at 
> org.apache.lucene.codecs.lucene41.ForUtil.skipBlock(ForUtil.java:228)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.skipPositions(Lucene41PostingsReader.java:925)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.nextPosition(Lucene41PostingsReader.java:955)
> at 
> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1100)
> at 
> org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1357)
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:655)
> at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)
> test: stored fields...OK [67472796 total field count; avg 59.665 
> fields per doc]
> test: term vectorsOK [0 total vector count; avg 0 term/freq 
> vector fields per doc]
> test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
> SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
> FAILED
> WARNING: fixIndex() would remove reference to this segment; full 
> exception:
> java.lang.RuntimeException: Term Index test failed
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:670)
> at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)
> WARNING: 1 broken segments (containing 1130856 documents) detected
> WARNING: would write new segments file, and 1130856 documents would be lost, 
> if -fix were specified
> {noformat}
> And Rob spotted long -> int casts in our skip list writers that look like 
> they could cause such corruption if a single high-freq term with many 
> positions required > 2.1 GB to write its positions into .pos.
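
A small illustration of that failure mode (not the actual Lucene code): a byte offset
larger than Integer.MAX_VALUE wraps negative when narrowed to int.

public class LongToIntOverflow {
  public static void main(String[] args) {
    long posFileOffset = 2_200_000_000L;   // > 2.1 GB of positions written for one term
    int truncated = (int) posFileOffset;   // narrowing cast keeps only the low 32 bits
    System.out.println(truncated);         // prints -2094967296, a corrupt skip pointer
  }
}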



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6192) Long overflow in LuceneXXSkipWriter can corrupt skip data

2015-01-21 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285955#comment-14285955
 ] 

Tom Burton-West commented on LUCENE-6192:
-

I'll report as soon as I have some results.   Still have about 10% (about 1.3 
million books or slightly less than a terabyte of OCR) to index.  Once that is 
done we will deploy a Solr war with the patch and optimize.  That will take 
overnight. When the optimize is done we will then run CheckIndex.   So 
hopefully by Friday I will have something to report. 

> Long overflow in LuceneXXSkipWriter can corrupt skip data
> -
>
> Key: LUCENE-6192
> URL: https://issues.apache.org/jira/browse/LUCENE-6192
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 5.0, Trunk, 4.x
>
> Attachments: LUCENE-6192.patch
>
>
> I've been iterating with Tom on this corruption that CheckIndex detects in 
> his rather large index (720 GB in a single segment):
> {noformat}
>  java -Xmx16G -Xms16G -cp $JAR -ea:org.apache.lucene... 
> org.apache.lucene.index.CheckIndex //shards/4/core-1/data/test_index 
> -verbose 2>&1 |tee -a shard4_reoptimizedNewJava
> Opening index @ /htsolr/lss-reindex/shards/4/core-1/data/test_index
> Segments file=segments_e numSegments=1 version=4.10.2 format= 
> userData={commitTimeMSec=1421479358825}
>   1 of 1: name=_8m8 docCount=1130856
> version=4.10.2
> codec=Lucene410
> compound=false
> numFiles=10
> size (MB)=719,967.32
> diagnostics = {timestamp=1421437320935, os=Linux, 
> os.version=2.6.18-400.1.1.el5, mergeFactor=2, source=merge, 
> lucene.version=4.10.2, os.arch=amd64, mergeMaxNumSegments=1, 
> java.version=1.7.0_71, java.vendor=Oracle Corporation}
> no deletions
> test: open reader.OK
> test: check integrity.OK
> test: check live docs.OK
> test: fields..OK [80 fields]
> test: field norms.OK [23 fields]
> test: terms, freq, prox...ERROR: java.lang.AssertionError: -96
> java.lang.AssertionError: -96
> at 
> org.apache.lucene.codecs.lucene41.ForUtil.skipBlock(ForUtil.java:228)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.skipPositions(Lucene41PostingsReader.java:925)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.nextPosition(Lucene41PostingsReader.java:955)
> at 
> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1100)
> at 
> org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1357)
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:655)
> at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)
> test: stored fields...OK [67472796 total field count; avg 59.665 
> fields per doc]
> test: term vectorsOK [0 total vector count; avg 0 term/freq 
> vector fields per doc]
> test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
> SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
> FAILED
> WARNING: fixIndex() would remove reference to this segment; full 
> exception:
> java.lang.RuntimeException: Term Index test failed
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:670)
> at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)
> WARNING: 1 broken segments (containing 1130856 documents) detected
> WARNING: would write new segments file, and 1130856 documents would be lost, 
> if -fix were specified
> {noformat}
> And Rob spotted long -> int casts in our skip list writers that look like 
> they could cause such corruption if a single high-freq term with many 
> positions required > 2.1 GB to write its positions into .pos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



OOM errors and indexwriter log

2015-01-08 Thread Tom Burton-West
Hi all,


I'm experimenting with memory use in Solr 4.10.2.   Our index is currently
about 250GB and we have allocated 4GB to solr.  I'm getting  OOM  errors:

"java.lang.OutOfMemoryError: Java heap space at
org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray."
(More details appended below)

I'm not up to speed yet on the data structures the Solr 4 codecs use.  My
first guess is that this is occurring during a large merge where the size
of the array gets huge.  (We have about 3 billion unique terms per shard)

Can someone tell me what the FreqProxPostingsArray is and whether my guess
is in the right ballpark?

I can't seem to find anything about these errors in the indexwriter
(InfoStream) log.  What should I be looking for?  Should I set the
indexwriter log level to something other than INFO?

In Solr 3 we could set the TermIndexInterval to reduce the memory needed by
our huge number of terms.   Is there a similar setting for Solr 4?

Tom




---
summary error trace:

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
Exception writing document id mdp.39015026399660 to the index; possible
analysis error.
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168)

Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter
is closed
at
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)

java.lang.OutOfMemoryError: Java heap space at
org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>


Re: Security hole in Solr 4.10.2 example:Solrconfig turns on enableRemoteStreaming

2014-12-11 Thread Tom Burton-West
Thanks Hoss,

Ah, I didn't look at the timestamps on those revisions!

Personally, I'd prefer having the default set to false rather than true
because people don't always read the entire config file, but if there has
been discussion for several years, and it's been decided to leave it enabled
in the example solrconfig.xml, I'll go along with it.

However, it might be good to fix the documentation for 4.10 because it
contradicts the code.
The current 4.10 ref guide says it is "disabled by default",
which apparently has not been true for several years.  I just put a comment
in the current ref guide to this effect.

Tom


On Thu, Dec 11, 2014 at 3:02 PM, Chris Hostetter 
wrote:

>
> : In revision   743163 of  the Solr 4.10  example solrconfig.xml file
> : enableRemoteStreaming was (accidentally?)  changed from "false" to true.
>
> yeah ... that was 5 years ago.
>
> I don't remember specifically if it was an accident at the time, but the
> inclusion in release versions since has been intentional given the
> "example" nature of the file -- which is why SOLR-2397 added a very
> specific warning about it (starting with Solr 3.1) ...
>
>  *** WARNING ***
>  The settings below authorize Solr to fetch remote files, You
>  should make sure your system has some authentication before
>  using enableRemoteStreaming="true"
>
> (i don't have any links to mailing list discussions handy, but i do recall
> it was discussed repeatedly.)
>
>
> : Should I open a JIRA?
>
> Given SOLR-3619, i think it would probably be a good idea to change this
> to false in the new configset/data_driven_schema_configs &
> configset/basic_configs that we ship -- so yes, please open a jira for
> discussion ... but i don't really think it's a "security hole" or
> something that needs attention in a 4.10.x release.
>
>
> -Hoss
> http://www.lucidworks.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Security hole in Solr 4.10.2 example:Solrconfig turns on enableRemoteStreaming

2014-12-11 Thread Tom Burton-West
Hello,
In the released version, as well as previous revisions starting at revision
743163 of the example solrconfig.xml file for Solr 4.10.2,
enableRemoteStreaming is set to "true".

Released version (See line 748)
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_10_2/solr/example/solr/collection1/conf/solrconfig.xml?revision=1635125&view=markup

In revision 743163 of the Solr 4.10 example solrconfig.xml file,
enableRemoteStreaming was (accidentally?) changed from "false" to "true".

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/solr/example/solr/collection1/conf/solrconfig.xml?revision=734796&view=markup

Should I open a JIRA?

Tom


Background:


There is a warning in the solrconfig.xml example  file   " The settings
below authorize Solr to fetch remote files, You   should make sure your
system has some authentication before  using enableRemoteStreaming="true" "

The 4.10 reference guide p 204 says
" For security reasons, remote streaming is disabled in the solrconfig.xml
included in the example directory."

However in the latest revision of 4.10.2 it appears to be enabled as the
setting is

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/solr/example/solr/collection1/conf/solrconfig.xml?revision=1638496&view=markup
  Line 748:
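For illustration only (attribute values other than enableRemoteStreaming are
not taken from that revision), the setting in question is the requestParsers
element inside requestDispatcher, along the lines of:

  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="2048000" />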



Tom


Re: Performance hit of Solr checkIntegrityAtMerge

2014-12-10 Thread Tom Burton-West
Thanks Robert!

Tom


> Start at SegmentMerger in both places.
>
> In 4.10.x you can see how it just validates every part of every reader
> in a naive loop:
>
> https://github.com/apache/lucene-solr/blob/lucene_solr_4_10/lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java#L58
>
> in 5.x it is not done with this loop, instead responsibility for the
> merge is in the codec API.
> So this is done "fine-grained" for each part of the index, for example
> in stored fields, we verify each reader's stored fields portion right
> before we merge it in that individual piece:
>
> https://github.com/apache/lucene-solr/blob/branch_5x/lucene/core/src/java/org/apache/lucene/codecs/StoredFieldsWriter.java#L82
>
> Note the default codec optimizes merge() more for stored fields and
> term vectors with a bulk byte copy that verifies as it copies.
> This bulk copy case is the typical case, when you aren't "upgrading"
> old segments, using something like SortingMergePolicy, etc:
>
> https://github.com/apache/lucene-solr/blob/branch_5x/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsWriter.java#L355
>
>


Re: Performance hit of Solr checkIntegrityAtMerge

2014-12-10 Thread Tom Burton-West
Thanks Robert,

With indexes close to 1 TB in size, I/O is usually our big bottleneck.

Can you point me to where in the 4.x codebase and/or 5.x codebase I should
look to get a feel for what you mean by i/o locality?  Or should I be
looking at a JIRA issue?
is there a short explanation you might be able to supply?

Tom



On Wed, Dec 10, 2014 at 3:31 PM, Robert Muir  wrote:

> There are two costs: cpu and i/o.
>
> The cpu cost is not much anyway but can be made basically trivial by
> using java 8.
> The i/o cost is because the check is not done with any i/o locality to
> the data being merged. so it could be a perf hit for an extremely
> large merge.
>
> In 5.0 the option is removed: we reworked this computation in merging
> to always have locality and so on, the checking always happens.
>
> On Wed, Dec 10, 2014 at 2:51 PM, Tom Burton-West 
> wrote:
> > Hello all,
> >
> > In the example solrconfig.xml file for Solr 4.10.2 there is the comment
> > (appended below) that says that  setting checkIntegrityAtMerge to true
> > reduces risk of index corruption at the expense of slower merging.
> >
> > Can someone please point me to any benchmarks or details about the
> > trade-offs?   What kind of a slowdown occurs and what are the factors
> > affecting the magnitude of the slowdown?
> >
> > I have huge indexes with huge merges, so  I would really love to enable
> > integrity checking.  On the other hand, we have very rarely ever had a
> > problem with a corrupt index and we always do checkIndexes  at the end
> of
> > the indexing process  when we are re-indexing the entire corpus.
> >
> > I'd like to get some kind of understanding of how much this will cost us
> in
> > merge speeds since re-indexing our corpus takes about 10 days and much of
> > that time is spent on merging.
> >
> > We index 13 million books (nearly 4 billion pages) averaging about 100,000
> > tokens/book.  We now have about 1 million books per shard.   Merging
> 30,000
> > volumes takes about  30 minutes, with larger merges taking longer.)
> >
> >
> >   
> > <checkIntegrityAtMerge>false</checkIntegrityAtMerge>
> >
> > Tom Burton-West
> > http://www.hathitrust.org/blogs/Large-scale-Search
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Performance hit of Solr checkIntegrityAtMerge

2014-12-10 Thread Tom Burton-West
Hello all,

In the example solrconfig.xml file for Solr 4.10.2 there is the comment
(appended below) that says that  setting checkIntegrityAtMerge to true
reduces risk of index corruption at the expense of slower merging.

Can someone please point me to any benchmarks or details about the
trade-offs?   What kind of a slowdown occurs and what are the factors
affecting the magnitude of the slowdown?

I have huge indexes with huge merges, so  I would really love to enable
integrity checking.  On the other hand, we have very rarely ever had a
problem with a corrupt index and we always do checkIndexes  at the end of
the indexing process  when we are re-indexing the entire corpus.

I'd like to get some kind of understanding of how much this will cost us in
merge speeds since re-indexing our corpus takes about 10 days and much of
that time is spent on merging.

We index 13 million books (nearly 4 billion pages) averaging about 100,000
tokens/book.  We now have about 1 million books per shard.  (Merging 30,000
volumes takes about 30 minutes, with larger merges taking longer.)


  
 <checkIntegrityAtMerge>false</checkIntegrityAtMerge>

Tom Burton-West
http://www.hathitrust.org/blogs/Large-scale-Search


Re: queryResultMaxDocsCached vs. queryResultWindowSize

2014-09-29 Thread Tom Burton-West
Thanks for your help Yonik and Tomas,

I had several mistaken assumptions about how caching worked which were
resolved by walking through the code in the debugger after reading your
replies.

Tom


On Fri, Sep 26, 2014 at 4:55 PM, Yonik Seeley  wrote:

> On Fri, Sep 26, 2014 at 4:38 PM, Tom Burton-West 
> wrote:
> > Hi Yonik,
> >
> > I'm still confused.
> >
> > I suspect I don't understand how paging and caching interact.  I probably
> need
> > to walk through the code.  Is there a unit test that exercises
> > SolrIndexSearcher.getDocListC or a good unit test to use as a base to
> write
> > one?
> >
> >
> > Part of what confuses me is whether what gets cached always starts at
> row 1
> > of results.
>
> Yes, we always cache from the first row.
> Asking for rows 91-100 requires collecting 1-100 (and it's the latter
> we cache - ignoring deep paging).
> It's also just ids (and optionally scores) that are cached... so
> either 4 bytes or 8 bytes per document cached, depending on if you ask
> for scores back.
>
> queryWindowSize just rounds up the upper bound.
>
> > I'll try to explain my confusion.
> > Using the defaults in the solrconfig example:
> > <queryResultWindowSize>20</queryResultWindowSize>
> > <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
> >
> > If I query for start=0, rows =10  Solr fetches 20 results and caches
> them.
> > If I query for start =11 rows =10 Solr read rows 11-20 from cache
>
> Correct.
>
> > What happens when I query for start = 21 rows= 10?
> > I thought that Solr would then fetch rows 21-40 into the
> queryResultCache.
> > Is this wrong?
>
> It will result in a cache miss and we'll collect 0-40 and cache that.
>
> > If I query for start =195 rows =10  does Solr cache rows 195-200 but go
> to
> > disk for rows over 200 (queryResultMaxDocsCached=200)?   Or does Solr
> skip
> > caching altogether for rows over 200
>
> Probably the latter... it's an edge case so I'd have to check the code
> to know for sure if the check is pre or post rounding up.
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions,
> sub-facets, off-heap data
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: queryResultMaxDocsCached vs. queryResultWindowSize

2014-09-26 Thread Tom Burton-West
Hi Yonik,

I'm still confused.

I suspect I don't understand how paging and caching interact.  I probably need
to walk through the code.  Is there a unit test that exercises
SolrIndexSearcher.getDocListC
or a good unit test to use as a base to write one?


Part of what confuses me is whether what gets cached always starts at row 1
of results.  I did not think this was true, but your example of start=1
rows = 10,010 (i.e. rows 1 through 10,010) triggering the
queryResultMaxDocsCached limit of 200 makes it sound like the cache always
starts at row 1.  I would have thought that a request for start= 10,000
 rows=10,010 would result in Solr caching rows 10,000-10,020.

I'll try to explain my confusion.
Using the defaults in the solrconfig example:
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

If I query for start=0, rows=10, Solr fetches 20 results and caches them.
If I query for start=11, rows=10, Solr reads rows 11-20 from cache.
What happens when I query for start = 21 rows= 10?
I thought that Solr would then fetch rows 21-40 into the queryResultCache.
Is this wrong?

If I query for start=195, rows=10, does Solr cache rows 195-200 but go to
disk for rows over 200 (queryResultMaxDocsCached=200)?   Or does Solr skip
caching altogether for rows over 200?



Tom

On Wed, Sep 24, 2014 at 7:12 PM, Yonik Seeley  wrote:

> On Wed, Sep 24, 2014 at 5:27 PM, Tomás Fernández Löbbe
>  wrote:
> > I think you are right. I think the name is this because it’s considering
> a
> > series of queries paging a result. The first X pages are going to be
> cached,
> > but once the limit is reached, no further pages are and the last superset
> > that fitted remains in cache.
>
> I was confused about the confusion ;-)  But your summary seems correct.
>
> queryResultWindowSize rounds up to a multiple of the window size for
> caching purposes.
> So if you ask for top 10, and the queryResultWindowSize is 20, then
> the top 20 will be cached (so if a user hits "next" to get to the next
> 10, it will still result in a cache hit).
>
> queryResultMaxDocsCached sets a limit beyond which the resulting docs
> aren't cached (so if a user asks for docs 1 through 10010, we skip
> caching logic).
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions,
> sub-facets, off-heap data
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


queryResultMaxDocsCached vs. queryResultWindowSize

2014-09-24 Thread Tom Burton-West
Hello,

No response on the Solr user list so I thought I would try the dev list.


queryResultWindowSize sets the number of documents  to cache for each query
in the queryResult cache.  So if you normally output 10 results per page,
and users don't go beyond page 3 of results, you could set
queryResultWindowSize to 30 and the second and third page requests will
read from cache, not from disk.  This is well documented in both the Solr
example solrconfig.xml file and the Solr documentation.
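For illustration only (this is not the exact SolrIndexSearcher code), the
window size effectively rounds the requested upper bound up to the next
multiple of queryResultWindowSize, and that whole superset of ids is what
gets cached:

  // illustrative sketch; start, rows and the window size are example values
  int start = 20, rows = 10;
  int queryResultWindowSize = 30;
  int maxDocRequested = start + rows;            // 30
  int supersetMaxDoc = ((maxDocRequested + queryResultWindowSize - 1)
      / queryResultWindowSize) * queryResultWindowSize;
  // supersetMaxDoc == 30 here; a request reaching up to row 31-60 would round up to 60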

However, the example in solrconfig.xml and the documentation in the
reference manual for Solr 4.10 say that queryResultMaxDocsCached :

"sets the maximum number of documents to cache for any entry in the
queryResultCache".

Looking at the code, it appears that the queryResultMaxDocsCached parameter
actually tells Solr not to cache any results list that has a size over
queryResultMaxDocsCached.

From:  SolrIndexSearcher.getDocListC
// lastly, put the superset in the cache if the size is less than or equal
// to queryResultMaxDocsCached
if (key != null && superset.size() <= queryResultMaxDocsCached &&
!qr.isPartialResults()) {
  queryResultCache.put(key, superset);
}

Deciding whether or not to cache a DocList if its size is over N (where N =
queryResultMaxDocsCached) is very different than caching only N items from
the DocList which is what the current documentation (and the variable name)
implies.

Looking at the JIRA issue https://issues.apache.org/jira/browse/SOLR-291
the original intent was to control memory use and the variable name
originally suggested was  "noCacheIfLarger"

Can someone please let me know if it is true that the
queryResultMaxDocsCached parameter actually tells Solr not to cache any
results list that contains more than queryResultMaxDocsCached documents?

If so, I will add a comment to the Cwiki doc and open a JIRA and submit a
patch to the example file.

I tried to find a test case that exercises SolrIndexSearcher.getDocListC
so I could see how  queryResultWindowSize or queryResultMaxDocsCached
actually work in the debugger but could not find a test case.  Could
someone please point me to a good test case that either exercises
SolrIndexSearcher.getDocListC or would be a good starting point for writing
one?


Tom



---

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/solr/example/solr/collection1/conf/solrconfig.xml?revision=1624269&view=markup

635 
638 <queryResultMaxDocsCached>200</queryResultMaxDocsCached>


[jira] [Updated] (SOLR-6560) Solr example file has outdated termIndexInterval entry

2014-09-24 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-6560:
--
Attachment: SOLR-6560.patch

Patch removes offending lines in example solrconfig.xml

> Solr example file has outdated termIndexInterval entry
> --
>
> Key: SOLR-6560
> URL: https://issues.apache.org/jira/browse/SOLR-6560
> Project: Solr
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 4.10
>    Reporter: Tom Burton-West
>Priority: Minor
> Attachments: SOLR-6560.patch
>
>
> The termIndexInterval comment and example settings in the example 
> solrconfig.xml file is left over from Solr 3.x versions.  It does not apply 
> to the default Solr  4.x installation and its presence in the example is 
> confusing.  
> According to the JavaDocs for IndexWriterConfig, the Lucene level
> implementations of setTermIndexInterval and setReaderTermsIndexDivisor these 
> do not apply to the default Solr4 PostingsFormat implementation.  
> From 
> (http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
>  )
> "This parameter does not apply to all PostingsFormat implementations, 
> including the default one in this release. It only makes sense for term 
> indexes that are implemented as a fixed gap between terms. For example, 
> Lucene41PostingsFormat implements the term index instead based upon how terms 
> share prefixes. To configure its parameters (the minimum and maximum size for 
> a block), you would instead use 
> Lucene41PostingsFormat.Lucene41PostingsFormat(int, int). which can also be 
> configured on a per-field basis:"
> The (soon to be ) attached patch just removes the outdated example. 
> Documentation on the wiki and Solr ref guide should also be updated.
> If the latest Solr default postings format can be configured from Solr, 
> perhaps someone with knowledge of the use case and experience configuring it 
> could provide a suitable example.   Since the Solr 4 default postingsformat 
> is so much more efficient than Solr 3.x, there might no longer be a use case 
> for messing with the parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-6560) Solr example file has outdated termIndexInterval entry

2014-09-24 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-6560:
-

 Summary: Solr example file has outdated termIndexInterval entry
 Key: SOLR-6560
 URL: https://issues.apache.org/jira/browse/SOLR-6560
 Project: Solr
  Issue Type: Bug
  Components: documentation
Affects Versions: 4.10
Reporter: Tom Burton-West
Priority: Minor


The termIndexInterval comment and example settings in the example 
solrconfig.xml file are left over from Solr 3.x versions.  They do not apply to 
the default Solr 4.x installation and their presence in the example is 
confusing.  

According to the JavaDocs for IndexWriterConfig, the Lucene-level
implementations of setTermIndexInterval and setReaderTermsIndexDivisor do 
not apply to the default Solr 4 PostingsFormat implementation.  

From 
(http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
 )
"This parameter does not apply to all PostingsFormat implementations, including 
the default one in this release. It only makes sense for term indexes that are 
implemented as a fixed gap between terms. For example, Lucene41PostingsFormat 
implements the term index instead based upon how terms share prefixes. To 
configure its parameters (the minimum and maximum size for a block), you would 
instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int). which can 
also be configured on a per-field basis:"

The (soon to be ) attached patch just removes the outdated example. 
Documentation on the wiki and Solr ref guide should also be updated.

If the latest Solr default postings format can be configured from Solr, perhaps 
someone with knowledge of the use case and experience configuring it could 
provide a suitable example.   Since the Solr 4 default postingsformat is so 
much more efficient than Solr 3.x, there might no longer be a use case for 
messing with the parameters.
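For anyone looking for the Lucene-level equivalent of the old setting, a
minimal sketch (Lucene 4.10 APIs; the field name "ocr" is just an example,
and 25/48 happen to be the normal default min/max block sizes):

  // assumes the usual imports from org.apache.lucene.codecs,
  // org.apache.lucene.codecs.lucene41, org.apache.lucene.codecs.lucene410,
  // org.apache.lucene.index and org.apache.lucene.util, plus an existing Analyzer
  Codec codec = new Lucene410Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("ocr".equals(field)) {
        // custom min/max term-block sizes for one (example) field
        return new Lucene41PostingsFormat(25, 48);
      }
      return super.getPostingsFormatForField(field);
    }
  };
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
  iwc.setCodec(codec);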








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re:

2014-05-17 Thread Tom Burton-West
Thanks Mikhail,

I understand it's expensive, but it appears that it is not freeing up memory
after each debugQuery is run.  That seems like it should be avoidable (I
say that without having looked at the code).  Should I open a JIRA about a
possible memory leak?

Tom


On Sat, May 17, 2014 at 8:20 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> For sure. Lucene's explain is really expensive and is not intended for
> production use, only for occasional troubleshooting. As a mitigation measure
> you can scroll through the result set in small portions more efficiently, as Hoss
> recently explained at SearchHub. For this kind of problem it's usually
> possible to create specialized custom collectors that do something
> particular.
>
> Have a good day!
>
>
> On Sat, May 17, 2014 at 3:01 AM, Tom Burton-West wrote:
>
>> Hello all,
>>
>>
>> I'm trying to get relevance scoring information for each of 1,000 docs
>> returned for each of 250 queries.If I run the query (appended below)
>> without debugQuery=on, I have no problem with getting all the results
>> with under 4GB of memory use.  If I add the parameter &debugQuery=on,
>> memory use goes up continuously and after about 20 queries (with 1,000
>> results each), memory use reaches about 29.1 GB and the garbage collector
>> gives up:
>>
>> " org.apache.solr.common.SolrException; null:java.lang.RuntimeException:
>> java.lang.OutOfMemoryError: GC overhead limit exceeded"
>>
>> I've attached a jmap -histo, exgerpt below.
>>
>> Is this a known issue with debugQuery?
>>
>> Tom
>> 
>> query:
>>
>>
>> q=Abraham+Lincoln&fl=id,score&indent=on&wt=json&start=0&rows=1000&version=2.2&
>> debugQuery=on
>>
>> without debugQuery=on:
>>
>>
>> q=Abraham+Lincoln&fl=id,score&indent=on&wt=json&start=0&rows=1000&version=2.2
>>
>> num   #instances#bytes  Class description
>> --
>> 1:  585,559 10,292,067,456  byte[]
>> 2:  743,639 18,874,349,592  char[]
>> 3:  53,821  91,936,328  long[]
>> 4:  70,430  69,234,400  int[]
>> 5:  51,348  27,111,744
>>  org.apache.lucene.util.fst.FST$Arc[]
>> 6:  286,357 20,617,704  org.apache.lucene.util.fst.FST$Arc
>> 7:  715,364 17,168,736  java.lang.String
>> 8:  79,561  12,547,792  * ConstMethodKlass
>> 9:  18,909  11,404,696  short[]
>> 10: 345,854 11,067,328  java.util.HashMap$Entry
>> 11: 8,823   10,351,024  * ConstantPoolKlass
>> 12: 79,561  10,193,328  * MethodKlass
>> 13: 228,587 9,143,480
>> org.apache.lucene.document.FieldType
>> 14: 228,584 9,143,360   org.apache.lucene.document.Field
>> 15: 368,423 8,842,152   org.apache.lucene.util.BytesRef
>> 16: 210,342 8,413,680   java.util.TreeMap$Entry
>> 17: 81,576  8,204,648   java.util.HashMap$Entry[]
>> 18: 107,921 7,770,312   org.apache.lucene.util.fst.FST$Arc
>> 19: 13,020  6,874,560
>> org.apache.lucene.util.fst.FST$Arc[]
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>


[jira] [Updated] (SOLR-5978) Warning for SOLR-5522 (file/edit) should be removed from example solrconfig.xml

2014-04-09 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-5978:
--

Attachment: SOLR-5522.patch

Patch to example solrconfig.xml removes confusing comment

> Warning for SOLR-5522 (file/edit) should be removed from example 
> solrconfig.xml
> ---
>
> Key: SOLR-5978
> URL: https://issues.apache.org/jira/browse/SOLR-5978
> Project: Solr
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 4.7.1
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: SOLR-5522.patch
>
>
> In SOLR-5522 the handler configuration code for the admin/fileedit request 
> handler which would allow modification of Solr config files was removed 
> from the example solrconfig.xml, but the comments were left in the example 
> file.   New users may be confused by a warning about a possible security 
> vulnerability which actually applies to a handler configuration that was 
> removed from the example.
> Patch coming as soon as I get an issue number



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-5978) Warning for SOLR-5522 (file/edit) should be removed from example solrconfig.xml

2014-04-09 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-5978:
-

 Summary: Warning for SOLR-5522 (file/edit) should be removed from 
example solrconfig.xml
 Key: SOLR-5978
 URL: https://issues.apache.org/jira/browse/SOLR-5978
 Project: Solr
  Issue Type: Bug
  Components: documentation
Affects Versions: 4.7.1
Reporter: Tom Burton-West
Priority: Trivial


In SOLR-5522 the handler configuration code for the admin/fileedit request 
handler, which would allow modification of Solr config files, was removed from 
the example solrconfig.xml, but the comments were left in the example file.   
New users may be confused by a warning about a possible security vulnerability 
which actually applies to a handler configuration that was removed from the 
example.

Patch coming as soon as I get an issue number



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr 4.7 example solrconfig.xml has confusing comments about a security vulnerability

2014-04-09 Thread Tom Burton-West
Hi Shawn,

OK, will do.  (but later today, since I have to eat lunch and then go to a
meeting).

Tom


On Wed, Apr 9, 2014 at 1:19 PM, Shawn Heisey  wrote:

> On 4/9/2014 11:13 AM, Tom Burton-West wrote:
>
>> In SOLR-5522 the handler configuration code for the admin/fileedit
>> request handler which would allow modification of Solr config files was
>> removed from the example solrconfig.xml, but the comments were left in the
>> example file.
>>
>> http://svn.apache.org/viewvc/lucene/dev/branches/lucene_
>> solr_4_7/solr/example/solr/collection1/conf/solrconfig.
>> xml?r1=1547261&r2=1547270
>>
>>  Thus the warning (appended below) was left in the example
>> solrconfig.xml.   I spent a bit of time trying to figure out how the
>> ping/healthcheck request handler would allow the Solr UI to edit config
>> files before I figured out that the comment applied to a request handler
>> that had been removed from the example file.
>>
>> Should I open a JIRA issue and provide a patch?
>>
>
> Definitely.  If there's a respin on the 4.7.2 release, I'll try to get it
> in there.
>
> Thanks,
> Shawn
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Solr 4.7 example solrconfig.xml has confusing comments about a security vulnerability

2014-04-09 Thread Tom Burton-West
In SOLR-5522 the handler configuration code for the admin/fileedit
request handler, which would allow modification of Solr config files, was
removed from the example solrconfig.xml, but the comments were left in the
example file.

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_7/solr/example/solr/collection1/conf/solrconfig.xml?r1=1547261&r2=1547270

 Thus the warning (appended below) was left in the example solrconfig.xml.
  I spent a bit of time trying to figure out how the ping/healthcheck
request handler would allow the Solr UI to edit config files before I
figured out that the comment applied to a request handler that had been
removed from the example file.

Should I open a JIRA issue and provide a patch?


Tom


  
  
  



Re: Solr Block-Join requires uniqueKey field to be int?

2014-03-04 Thread Tom Burton-West
Thanks Yonik,

It works fine with a String.

How embarrassing.  Somehow I managed to accidentally set _root_ to an int in
my schema. Don't know how I did it.

Tom



On Tue, Mar 4, 2014 at 11:56 AM, Yonik Seeley  wrote:

> On Tue, Mar 4, 2014 at 11:51 AM, Tom Burton-West 
> wrote:
> > We have been using strings for our uniqueKey field and discovered that
> Solr
> > Block-Join requires the uniqueKey field to be an int.   This is because
> the
> > magic field _root_ is required to be an int, and for children it gets
> > populated from the uniqueKey field of the parent record.
>
> Are you sure it doesn't work with String?  The example field has
> _root_ defined to be:
>
>
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Solr Block-Join requires uniqueKey field to be int?

2014-03-04 Thread Tom Burton-West
Hello all,

We have been using strings for our uniqueKey field and discovered that Solr
Block-Join requires the uniqueKey field to be an int.   This is because the
magic field _root_ is required to be an int, and for children it gets
populated from the uniqueKey field of the parent record.
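
For context, a minimal sketch of the kind of indexing and query we are doing
(field names here are illustrative, not our actual schema):

  <add>
    <doc>
      <field name="id">book-1</field>
      <field name="content_type">parent</field>
      <doc>
        <field name="id">book-1-page-7</field>
        <field name="page">7</field>
      </doc>
    </doc>
  </add>

  q={!parent which="content_type:parent"}page:7

Here each child document gets _root_ set to the parent's uniqueKey, "book-1".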

Would it be possible to change this to accommodate uniqueKeys that are
strings or is there some operation that requires ints for the field that
links the parent and child records  in the Solr Block-Join logic?

Tom


Re: Trade-offs in choosing DocValuesFormat

2014-02-01 Thread Tom Burton-West
Thanks Shawn, Joel, and Robert,

Shawn, thanks for mentioning the caveat of having to re-index when
upgrading Solr.  We almost always re-index when we upgrade Solr.


>>There is a ton of misinformation in this thread.
I think this might be because the DocValues implementation is a moving
target, and that the documentation has not kept up.

>>As of lucene 4.5, the default docvalues are disk-based >>(mostly, some
small stuff in ram).
>>You probably don't need to change anything from the defaults, unless:

>>if you want everything in RAM, use Memory.
>>If you want to waste RAM, use Direct.
>>If you have no RAM, use Disk.

Should I try to edit the Solr wiki (which talks about 4.2 and says the
default is to put everything in memory)  or is the idea that the cwiki is
where people should look for current documentation?
One of the things that confused me was that the cwiki pointed to the
outdated Solr wiki entry on DocValues.

I think I understand the use cases where someone would want everything in
RAM or everything on Disk.  I'm assuming that the default (4.5) makes some
trade-off by putting some important data structures in RAM.

Where should I look (maybe a JIRA issue?) to understand the use case for
Direct?   Maybe adding a sentence to the JavaDoc for Direct explaining why
someone would want to use it would be useful.

p.s. Robert, I saw your edits on the cwiki and I really appreciate that,
with all the time you spend working on code, you take the time to help
with the docs.


Tom


Trade-offs in choosing DocValuesFormat

2014-01-31 Thread Tom Burton-West
When trying to facet on 200 million documents with a facet field that has a
very large number of unique values, we are running into OOM's.  See this
thread for background:
http://lucene.472066.n3.nabble.com/Estimating-peak-memory-use-for-UnInvertedField-faceting-tt4100044.html

Otis suggested that using DocValues might solve the memory issues.

There seem to be several options for setting the DocValuesFormat.  Can
someone please clarify what the choices are for Solr 4.6 and what the
trade-offs are in terms of memory use and faceting performance?

Without digging into the code and doing some performance testing, it's
difficult to understand the existing documentation.   I'd really appreciate
hearing from people familiar with the issues before I create 3 different
indexes of 200 million documents to compare each of the options for
DocValuesFormat.

Some details of the documentation are appended below.

My apologies if this question should go to Lucene user instead of dev.  If
it should, please let me know and also let me know how I can tell which
list to ask.


Tom Burton-West

--

The documentation on the Solr wiki seems to be for Solr 4.2 and contradicts
the cwiki reference guide:

Cwiki ref guide:
https://cwiki.apache.org/confluence/display/solr/DocValues

"The default implementation employs a mixture of loading some things into
memory and keeping some on disk. In some cases, however, you may choose to
either keep everything on disk or keep it in memory. You can do this by
defining docValuesFormat="Disk" or docValuesFormat="Memory" on the field
type. This example shows defining the format as "Disk"

Solr Wiki:
http://wiki.apache.org/solr/DocValues
docValuesFormat="Lucene42": This is the default, which loads everything
into heap memory.

docValuesFormat="Disk": This implementation has a different layout, to try
to keep most data on disk but with reasonable performance.

docValuesFormat="SimpleText": Plain-text, slow, and not for production.

On the other hand, the Lucene JavaDocs for Lucene 4.6 show both a
DiskDocValuesFormat
http://lucene.apache.org/core/4_6_1/codecs/org/apache/lucene/codecs/diskdv/DiskDocValuesFormat.html

 and a DirectDocValuesFormat
http://lucene.apache.org/core/4_6_1/codecs/org/apache/lucene/codecs/memory/DirectDocValuesFormat.html

as well as Lucene4(0|2|5) and PerFieldDocValues Formats.
http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/codecs/DocValuesFormat.html?is-external=true


Re: Estimating peak memory use for UnInvertedField faceting

2013-11-11 Thread Tom Burton-West
Thanks Otis,

 I'm looking forward to the presentation videos.

I'll look into using DocValues.  Re-indexing 200 million docs will take a
while though :).
Will Solr automatically use DocValues for faceting if you have DocValues
for the field or is there some configuration or parameter that needs to be
set?
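
For reference, what I'd be changing in schema.xml is something along these
lines (attributes illustrative; topicStr is the multi-valued facet field from
my earlier mail):

  <field name="topicStr" type="string" indexed="true" stored="false"
         multiValued="true" docValues="true"/>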

Tom


On Sat, Nov 9, 2013 at 9:57 AM, Otis Gospodnetic  wrote:

> Hi Tom,
>
> Check http://blog.sematext.com/2013/11/09/presentation-solr-for-analytics/
> .  It includes info about our experiment with DocValues, which clearly
> shows lower heap usage, which means you'll get further without getting
> this OOM.  In our experiments we didn't sort, facet, or group, and I
> see you are faceting, which means that DocValues, which are more
> efficient than FieldCache, should help you even more than it helped
> us.
>
> The graphs are from SPM, which you could use to monitor your Solr
> cluster, at least while you are tuning it.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Nov 8, 2013 at 2:41 PM, Tom Burton-West 
> wrote:
> > Hi Yonik,
> >
> > I don't know enough about JVM tuning and monitoring to do this in a clean
> > way, so I just tried setting the max heap at 8GB and then 6GB to force
> > garbage collection.  With it set to 6GB it goes into  a long GC loop and
> > then runs out of heap (See below) .  The stack trace says the issue is
> with
> > DocTermOrds.uninvert:
> > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:405)
> >
> >  I'm guessing the actual peak is somewhere between 6 and 8 GB.
> >
> > BTW: is there some documentation somewhere that explains what the stats
> > output to INFO mean?
> >
> > Tom
> >
> >
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > java.lang.RuntimeException: java.lang.OutOfMemoryError: GC
> > overhead limit exceeded
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
> > at
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
> > at
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
> > at
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
> > at
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
> > at
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
> > at
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > at
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
> > at
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
> > at
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
> > at
> >
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
> > at
> >
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
> > at
> >
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
> > at
> >
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
> > at
> >
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
> > at java.lang.Thread.run(Thread.java:724)
> > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:405)
> > at
> org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
> > at
> >
> org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)
> > at
> org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:426)
> > at
> >
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:517)
> > at
> >
> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:252)
> > at
> >
> org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:78)
> > at
> >
> org.apache.solr

Re: Estimating peak memory use for UnInvertedField faceting

2013-11-08 Thread Tom Burton-West
Hi Yonik,

I don't know enough about JVM tuning and monitoring to do this in a clean
way, so I just tried setting the max heap at 8GB and then 6GB to force
garbage collection.  With it set to 6GB it goes into  a long GC loop and
then runs out of heap (see below).  The stack trace says the issue is with
DocTermOrds.uninvert:
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:405)

 I'm guessing the actual peak is somewhere between 6 and 8 GB.

BTW: is there some documentation somewhere that explains what the stats
output to INFO mean?

Tom


java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.RuntimeException: java.lang.OutOfMemoryError: GC
overhead limit exceeded
at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:405)
at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
at
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:426)
at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:517)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:252)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:78)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
... 16 more


---
Nov 08, 2013 1:39:26 PM org.apache.solr.request.UnInvertedField 
INFO: UnInverted multi-valued field {field=topicStr,
memSize=1,768,101,824,
tindexSize=86,028,
time=45,854,
phase1=41,039,
nTerms=271,987,
bigTerms=0,
termInstances=569,429,716,
uses=0}
Nov 08, 2013 1:39:28 PM org.apache.solr.core.SolrCore execute

INFO: [core] webapp=/dev-3 path=/select
params={facet=true&facet.mincount=100&indent=true&q=ocr:the&facet.limit=30&facet.field=topicStr&wt=xml}
hits=138,605,690 status=0 QTime=49,797



On Fri, Nov 8, 2013 at 2:01 PM, Yonik Seeley  wrote:

> On Fri, Nov 8, 2013 at 1:56 PM, Tom Burton-West 
> wrote:
> > When testing an index of about 200 million documents, when we do a first
> > faceting on one field (query appended below), the memory use rises from
> > about 2.5 GB to 13GB.  If I run GC after the query the memory use goes
> down
> > to about 3GB and subsequent queries don't significantly increase the
> memory
> > use.
>
> Is there a way to tell what the real max memory usage is?  I assume
> 13GB is just the peak heap usage, but that could include a lot of
> garbage.
>
> -Yonik
> http://heliosearch.com -- making solr shine
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Estimating peak memory use for UnInvertedField faceting

2013-11-08 Thread Tom Burton-West
We are considering indexing our 11 million books at a page level, which
comes to about 3 billion Solr documents.

Our subject field  by necessity is multi-valued so the UnInvertedField is
used for faceting.

When testing an index of about 200 million documents, when we do a first
faceting on one field (query appended below), the memory use rises from
about 2.5 GB to 13GB.  If I run GC after the query the memory use goes down
to about 3GB and subsequent queries don't significantly increase the memory
use.

After the query is run various statistics from UnInvertedField are sent to
the log (see below), but they seem to represent the final data structure
rather than the peak.  For example memSize is listed as 1.8GB, while the
temporary data structure was probably closer to 10GB (total 13GB).

Is there a formula for estimating the peak memory size?
Can the statistics spit out to INFO be used to somehow estimate the peak
memory size?

Tom
-

Nov 08, 2013 1:39:26 PM org.apache.solr.request.UnInvertedField 
INFO: UnInverted multi-valued field {field=topicStr,
memSize=1,768,101,824,
tindexSize=86,028,
time=45,854,
phase1=41,039,
nTerms=271,987,
bigTerms=0,
termInstances=569,429,716,
uses=0}
Nov 08, 2013 1:39:28 PM org.apache.solr.core.SolrCore execute

INFO: [core] webapp=/dev-3 path=/select
params={facet=true&facet.mincount=100&indent=true&q=ocr:the&facet.limit=30&facet.field=topicStr&wt=xml}
hits=138,605,690 status=0 QTime=49,797


Lucene40TermVectorsReader TVTermsEnum totalTermFreq() is not a total

2013-10-25 Thread Tom Burton-West
Hi all,

I was reading some code that calls Lucene40TermVectorsReader's
TVTermsEnum.

The method totalTermFreq() actually returns freq and the method docFreq()
returns 1.
Once you think about the context this sort of makes sense but I found this
confusing.

I'm guessing there is a good reason for the method to be called
totalTermFreq(), but I would like to know what that is.  Also is there
documentation somewhere in the javadocs that explains this?

Better yet, is there a good example of how to use the Lucene 4.x
TermVectors API?
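
For reference, the kind of minimal sketch I have in mind (Lucene 4.x APIs;
assumes an open IndexReader "reader", a document id "docId", and a field
"body" that was indexed with term vectors):

  Terms vector = reader.getTermVector(docId, "body");
  if (vector != null) {
    TermsEnum termsEnum = vector.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      // within a single-document term vector, totalTermFreq() is the term's
      // frequency in that one document, and docFreq() is always 1
      System.out.println(term.utf8ToString() + " -> " + termsEnum.totalTermFreq());
    }
  }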


Tom


Is it possible to correct the Changes list re: Block-Join available in 4.4 vs 4.5

2013-10-07 Thread Tom Burton-West
Hello,

The JIRA issue SOLR-3076 includes a note stating that Solr Block-Join
capability was mistakenly listed in the 4.4 section of CHANGES.txt, but is
first available in Solr 4.5.

Is it possible to revise the CHANGES.txt so that people looking at
changes for 4.5 will realize that Block-Join functionality is available?
 (Not everyone expands all the entries in the web version of CHANGES.txt to
look at previous versions.)

Tom


Luceneutil high variability between runs

2013-08-16 Thread Tom Burton-West
Hello,

I'm trying to benchmark a change to BM25Similarity (LUCENE-5175) using
luceneutil.

I'm running this on a lightly loaded machine with a load average (top) of
about 0.01 when the benchmark is not running.

I made the following changes:
1) localrun.py changed Competition(debug=True) to Competition(debug=False)
2) made the following changes to localconstants.py per Robert Muir's
suggestion:
JAVA_COMMAND = 'java -server -Xms4g -Xmx4g'
SEARCH_NUM_THREADS = 1
3) for the BM25 tests set SIMILARITY_DEFAULT='BM25Similarity'
4) for the BM25 tests uncommented the following line from searchBench.py
#verifyScores = False

Attached is output from iter 19 of several runs

The first 4 runs show consistently that the modified version is somewhere
between 6% and 8% slower on the tasks with the highest difference between
trunk and patch.
However if you look at the baseline TaskQPS, for HighTerm, for example,
 run 3 is about 55 and run 1 is about 88.  So the difference for this task
 between different runs of the bench program is very much higher than the
differences between trunk and modified/patch within a run.

Is this to be expected?   Is there a reason I should believe  the
differences shown within a run reflect the true differences?

Seeing this variability, I then switched DEFAULT_SIMILARITY back to
"DefaultSimilarity".  In this case trunk and my_modified, should be
exercising exactly the same code, since the only changes in the patch are
the addition of a test case for BM25Similarity and a change to
BM25Similarity.

In this case the "modified" version varies from -6.2% difference from the
base to +4.4% difference from the base for LowTerm.
Comparing  QPS for the base case for HighTerm between different runs we can
see it varies from about 21 for run 1 to 76 for run 3.

Is this kind of  variation between runs of the benchmark to be expected?

Any suggestions about where to look to reduce the variations between runs?

Tom

BM25Similarity runs where "my_modified_version" is LUCENE-


 tail -33 BM25SimRun1 |head -5
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
HighTerm   87.91 (13.2%)   81.02  (8.5%)   
-7.8% ( -26% -   16%)
 MedTerm  111.81 (13.2%)  103.11  (8.4%)   
-7.8% ( -25% -   15%)
 LowTerm  411.44 (17.7%)  382.47 (14.5%)   
-7.0% ( -33% -   30%)
[tburtonw@alamo runs]$ tail -33 BM25SimRun2 |head -5
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
HighTerm   62.15  (6.4%)   58.10  (7.1%)   
-6.5% ( -18% -7%)
 MedTerm  139.11  (4.5%)  130.22  (7.5%)   
-6.4% ( -17% -5%)
 LowTerm  391.93 (10.5%)  373.71 (13.1%)   
-4.6% ( -25% -   21%)
[tburtonw@alamo runs]$ tail -33 BM25SimRun3 |head -5
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
HighTerm   54.85  (6.5%)   50.18  (1.6%)   
-8.5% ( -15% -0%)
 MedTerm  146.04  (8.6%)  137.31  (4.7%)   
-6.0% ( -17% -8%)
OrNotHighLow   45.85 (11.1%)   43.37 (10.6%)   
-5.4% ( -24% -   18%)
[tburtonw@alamo runs]$ tail -33 BM25SimRun4 |head -5
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
OrNotHighMed   49.40  (8.7%)   45.37  (8.8%)   
-8.2% ( -23% -   10%)
OrNotHighLow   65.48  (8.7%)   60.19  (9.0%)   
-8.1% ( -23% -   10%)
   OrNotHighHigh   37.06  (8.2%)   34.18  (8.2%)   
-7.8% ( -22% -9%)

==
Default similarity, which is not modified by the BM25 patch

DefaultSimRun1
 LowTerm  398.97 (17.9%)  398.94 (18.1%)   
-0.0% ( -30% -   43%)
HighTerm   21.13 (12.1%)   21.45 (12.2%)
1.5% ( -20% -   29%)
DefaultSimRun2
 LowTerm  406.93 (17.1%)  381.51 (15.8%)   
-6.2% ( -33% -   32%)
HighTerm   59.21  (2.5%)   59.70  (3.5%)
0.8% (  -5% -7%)
DefaultSimRun3
 LowTerm  431.59 (18.5%)  450.55 (16.8%)
4.4% ( -26% -   48%)
HighTerm   76.45  (2.0%)   76.45  (1.7%)
0.0% (  -3% -3%)



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-15 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741582#comment-13741582
 ] 

Tom Burton-West commented on LUCENE-5175:
-

Hi Robert,

I tried running luceneutil with the default wikimedium10m collection and 
tasks.   I ran it first on the DefaultSimilarity, which shouldn't be affected 
by the patch to BM25Similarity, and it showed about a -2.3% difference.  I'm 
guessing there is some inaccuracy in the tests.   When I changed 
DEFAULT_SIMILARITY to BM25Similarity, the worst change was a difference of 
-8.8%.  

Is there a separate mailing list for questions about luceneutil, or should I 
write to the java-dev list? Or directly to Mike or you?

Tom

> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-14 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-5175:


Comment: was deleted

(was: I downloaded the luceneutils benchmark suite and the enwiki data and 
tried to run the out-of-the-box demo.  I need to ask our sysadmins to upgrade 
python versions on our dev machines. 

Tom )

> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>    Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740186#comment-13740186
 ] 

Tom Burton-West edited comment on LUCENE-5175 at 8/14/13 9:09 PM:
--

I downloaded the luceneutils benchmark suite and the enwiki data and tried to 
run the out-of-the-box demo.  I need to ask our sysadmins to upgrade python 
versions on our dev machines. 

Tom 

  was (Author: tburtonwest):
I downloaded the luceneutils benchmark suite and the enwiki data and tried 
to run the default out-of-the-box demo that gets copied into localrun.py and 
got the errors below.   I suspect maybe the python version on our dev machine 
is old or I need to set up some env variables.

Is there anything obvious in the message below?   Should I be asking questions 
about benchmark on the dev list or is there a separate list for that?

Tom

python localrun.py 
Traceback (most recent call last):
  File "localrun.py", line 18, in ?
from competition import Competition
  File "/htsolr/lss-dev/data/4/LuceneBench/bench/util/competition.py", line 19, 
in ?
import searchBench
  File "/htsolr/lss-dev/data/4/LuceneBench/bench/util/searchBench.py", line 25, 
in ?
import benchUtil
  File "/htsolr/lss-dev/data/4/LuceneBench/bench/util/benchUtil.py", line 761
with open(fullLogFile, 'rb') as f:

  
> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740186#comment-13740186
 ] 

Tom Burton-West commented on LUCENE-5175:
-

I downloaded the luceneutils benchmark suite and the enwiki data and tried to 
run the default out-of-the-box demo that gets copied into localrun.py and got 
the errors below.   I suspect maybe the python version on our dev machine is 
old or I need to set up some env variables.

Is there anything obvious in the message below?   Should I be asking questions 
about benchmark on the dev list or is there a separate list for that?

Tom

python localrun.py 
Traceback (most recent call last):
  File "localrun.py", line 18, in ?
from competition import Competition
  File "/htsolr/lss-dev/data/4/LuceneBench/bench/util/competition.py", line 19, 
in ?
import searchBench
  File "/htsolr/lss-dev/data/4/LuceneBench/bench/util/searchBench.py", line 25, 
in ?
import benchUtil
  File "/htsolr/lss-dev/data/4/LuceneBench/bench/util/benchUtil.py", line 761
with open(fullLogFile, 'rb') as f:


> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-13 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738818#comment-13738818
 ] 

Tom Burton-West commented on LUCENE-5175:
-

I wondered about that "crazy cache", in that it makes the implementation 
dependent on the norms implementation.  

BTW: It looks to me, with Lucene's default norms, that there are only about 130 
or so distinct "document lengths".  If there is no boosting going on, the byte 
value has to get to 124 for a doclength = 1, so there are only 255-124 = 131 
possible different lengths.

i=124 norm=1.0,doclen=1.0
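
For reference, a minimal sketch (my own, assuming the SmallFloat byte315 encoding 
that BM25Similarity's norm cache is built from) of how one might dump that table:

{code}
import org.apache.lucene.util.SmallFloat;

public class DumpNormTable {
  public static void main(String[] args) {
    // One entry per possible norm byte: decode the byte back to the encoded
    // 1/sqrt(length) value and invert it to get the approximate field length.
    for (int i = 0; i < 256; i++) {
      float norm = SmallFloat.byte315ToFloat((byte) i);
      float doclen = 1.0f / (norm * norm);
      System.out.println("i=" + i + " norm=" + norm + ",doclen=" + doclen);
    }
  }
}
{code}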

> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>      Components: core/search
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-13 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738627#comment-13738627
 ] 

Tom Burton-West commented on LUCENE-5175:
-

Thanks Robert,

In the article, they claim that the change doesn't have a performance impact.  
On the other hand, I'm not familiar enough with Java performance to be able to 
eyeball it,  and it looks to me like we added one or more floating point 
operations, so it would be good to benchmark, especially since the scoring alg 
gets run against every hit, and we might have millions of hits for a poorly 
chosen query. (And if we switch to page-level indexing we could have hundreds 
of millions of hits).

I was actually considering making it a subclass instead of just modifying 
BM25Similarity, so that it would be easy to benchmark, and if it turns out that 
there is a significant perf difference, that users could choose which 
implementation to use.   I saw that computeWeight in BM25Similarity was final 
and decided I didn't know enough about why this is final to either refactor to 
create a base class, or change the method  in order to subclass.

Is luceneutil the same as lucene benchmark?   I've been wanting to learn how to 
use lucene benchmark for some time.  

Tom


> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-13 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-5175:


Attachment: LUCENE-5175.patch

Patch adds optional parameter delta to lower-bound tf normalization.  Attached 
also are unit tests. 

Still need to add tests of the explanation/scoring for cases 1) no norms, and 
2) no delta

If no delta parameter is supplied, the math works out to the equivalent of the 
regular BM25 formula as far as the score is concerned, but I think there is an 
extra step or two to get there.  I'll see if I can get some benchmarks running 
to see if there is any significant performance issue.
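
For anyone following along, here is a minimal sketch (not the attached patch 
itself) of the lower-bounded tf normalization from the paper: the 
length-normalized term frequency is shifted by delta before the usual BM25 
saturation is applied.

{code}
// Hypothetical helper, not the committed code: per-term BM25L-style score.
float scoreTerm(float freq, float idf, float docLen, float avgDocLen,
                float k1, float b, float delta) {
  float norm = 1 - b + b * (docLen / avgDocLen);
  float ctd = freq / norm;                 // length-normalized term frequency
  if (ctd <= 0) {
    return 0f;                             // delta only applies when the term occurs
  }
  return idf * (k1 + 1) * (ctd + delta) / (k1 + ctd + delta);
}
{code}

With delta = 0 this reduces to the usual BM25 term score, which matches the 
observation above that the math works out to the regular formula when no delta 
is supplied.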

> Add parameter to lower-bound TF normalization for BM25 (for long documents)
> ---
>
> Key: LUCENE-5175
> URL: https://issues.apache.org/jira/browse/LUCENE-5175
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>    Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-5175.patch
>
>
> In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
> problem is documented.  There was a TODO note in BM25Similarity to add this 
> fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5175) Add parameter to lower-bound TF normalization for BM25 (for long documents)

2013-08-13 Thread Tom Burton-West (JIRA)
Tom Burton-West created LUCENE-5175:
---

 Summary: Add parameter to lower-bound TF normalization for BM25 
(for long documents)
 Key: LUCENE-5175
 URL: https://issues.apache.org/jira/browse/LUCENE-5175
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Tom Burton-West
Priority: Minor


In the article "When Documents Are Very Long, BM25 Fails!" a fix for the 
problem is documented.  There was a TODO note in BM25Similarity to add this 
fix. I will attach a patch that implements the fix shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4763) Performance issue when using group.facet=true

2013-07-26 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720994#comment-13720994
 ] 

Tom Burton-West commented on SOLR-4763:
---

I have similar problems with performance, but in my case memory use is an issue 
as well. This is probably an extreme use case, but I thought it might be 
helpful to add to the discussion.

We currently index close to 11 million books with the entire book being a Solr 
document.  We are considering instead indexing pages as the Solr document and 
using grouping to return results organized by book.

I'm currently testing an index of about 1 million books indexed on a page 
level, spread out over 3 shards.  There are about 360 million pages.  For a 
worst-case query that returns about 200 million documents, group.truncate takes 
about 10 seconds (which is acceptable for us as a worst-case).  However, 
group.facet takes closer to 15 minutes.  We are running on a server with 74GB 
of memory with 32GB dedicated to the JVM.  What I see for this query with 
group.facet is that memory use goes up above about 30GB and then multiple full 
garbage collections occur.  

In contrast, using normal rather than the worst case queries, our 90th 
percentile queries (which return only a few million hits rather than 200 
million) took about 700 ms with group.truncate and 2000 ms with group.facet.

I'm wondering how much of the performance issues others are observing might be 
due to memory requirements and slowdowns due to garbage collection.

Tom


> Performance issue when using group.facet=true
> -
>
> Key: SOLR-4763
> URL: https://issues.apache.org/jira/browse/SOLR-4763
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.2
>Reporter: Alexander Koval
>
> I do not know whether this is bug or not. But calculating facets with 
> {{group.facet=true}} is too slow.
> I have query that:
> {code}
> "matches": 730597,
> "ngroups": 24024,
> {code}
> 1. All queries with {{group.facet=true}}:
> {code}
> "QTime": 5171
> "facet": {
> "time": 4716
> {code}
> 2. Without {{group.facet}}:
> * First query:
> {code}
> "QTime": 3284
> "facet": {
> "time": 3104
> {code}
> * Next queries:
> {code}
> "QTime": 230,
> "facet": {
> "time": 76
> {code}
> So I think with {{group.facet=true}} Solr doesn't use cache to calculate 
> facets.
> Is it possible to improve performance of facets when {{group.facet=true}}?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-4938) Solr should be able to use Lucene's BlockGroupingCollector for field-collapsing

2013-06-18 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-4938:
-

 Summary: Solr should be able to use Lucene's 
BlockGroupingCollector for field-collapsing
 Key: SOLR-4938
 URL: https://issues.apache.org/jira/browse/SOLR-4938
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.3.1
Reporter: Tom Burton-West
Priority: Minor


In Lucene it is possible to use the BlockGroupingCollector  for grouping in 
order to take advantage of indexing document blocks: 
IndexWriter.addDocuments().   With SOLR-3076 and SOLR-3535, it is possible to 
index document blocks.   It would be nice to have an option to use the 
BlockGroupingCollector with Solr field-collapsing/grouping.   
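
A minimal sketch (hypothetical field names, assuming an already-open IndexWriter) 
of the block indexing that BlockGroupingCollector relies on, with the child 
documents written first and the parent last so the block stays contiguous in the 
segment:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// Sketch only: add one book as a block of page documents plus a parent document.
void addBookBlock(IndexWriter writer, List<Document> pages) throws IOException {
  List<Document> block = new ArrayList<Document>(pages);       // children (pages) first
  Document book = new Document();                               // parent doc for the book
  book.add(new StringField("type", "parent", Field.Store.NO));  // marker usable as the parent filter
  block.add(book);                                              // parent must be last in the block
  writer.addDocuments(block);                                   // indexed as one contiguous block
}
{code}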

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5065) Refactor TestGrouping.java to break TestRandom into separate tests

2013-06-18 Thread Tom Burton-West (JIRA)
Tom Burton-West created LUCENE-5065:
---

 Summary: Refactor TestGrouping.java to break TestRandom into 
separate tests
 Key: LUCENE-5065
 URL: https://issues.apache.org/jira/browse/LUCENE-5065
 Project: Lucene - Core
  Issue Type: Test
  Components: modules/grouping
Affects Versions: 4.3.1
Reporter: Tom Burton-West
Priority: Minor


 lucene/grouping/src/test/org/apache/lucene/search/grouping
TestGrouping.java combines multiple tests inside of one test: TestRandom(). 
This makes it difficult to understand, and hard for new users to use 
TestGrouping.java as an entry point to understanding grouping functionality.

Either break TestRandom into separate tests or add small separate tests for the 
most important parts of TestRandom.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3076) Solr should support block joins

2013-06-11 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3076:
--

Attachment: SOLR-3076.patch

Patch against trunk (SVN style, patch -p0) that adds testXML(), which 
illustrates XML block indexing syntax and exercises the XMLLoader

> Solr should support block joins
> ---
>
> Key: SOLR-3076
> URL: https://issues.apache.org/jira/browse/SOLR-3076
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
> Fix For: 5.0, 4.4
>
> Attachments: 27M-singlesegment-histogram.png, 27M-singlesegment.png, 
> bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, 
> child-bjqparser.patch, dih-3076.patch, dih-config.xml, 
> parent-bjq-qparser.patch, parent-bjq-qparser.patch, Screen Shot 2012-07-17 at 
> 1.12.11 AM.png, SOLR-3076-childDocs.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-7036-childDocs-solr-fork-trunk-patched, 
> solrconf-bjq-erschema-snippet.xml, solrconfig.xml.patch, 
> tochild-bjq-filtered-search-fix.patch
>
>
> Lucene has the ability to do block joins, we should add it to Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Solr field-collapsing should be able to use Lucene's BlockGroupingCollector

2013-06-11 Thread Tom Burton-West
In Lucene it is possible to use the BlockGroupingCollector  for grouping in
order to take advantage of indexing document blocks (
IndexWriter.addDocuments()
).   With SOLR-3076 and SOLR-3535, it is possible to index document blocks.
  I would like to have an option to use the BlockGroupingCollector with
Solr field-collapsing/grouping.  Should I open a JIRA issue, or is there
something that would prevent using Lucene's BlockGroupingCollector with
Solr?

Tom


Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

2013-06-05 Thread Tom Burton-West
Hi Mike,

13 Billion unique terms.  (CheckIndex output appended below)

Tom
--

 test: terms, freq, prox...OK [13,068,302,002 terms; 187,284,275,343
terms/docs pairs; 786,014,075,745 tokens]

Segments file=segments_6 numSegments=2 version=4.0.0.2 format=
userData={commitTimeMSec=1357596564850}
  1 of 2: name=_uhj docCount=866984
codec=Lucene40
compound=false
numFiles=10
size (MB)=2,048,537.68
diagnostics = {os=Linux, os.version=2.6.18-308.24.1.el5, mergeFactor=8,
source=merge, lucene.version=4.0.0 1394950 - rmuir - 2012-10-06 03:00:40,
os.arch=amd64, mergeMaxNumSegments=1, java.version=1.6.0_16,
java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields..OK [92 fields]
test: field norms.OK [46 fields]
test: terms, freq, prox...OK [13068302002 terms; 187284275343
terms/docs pairs; 786014075745 tokens]
test: stored fields...OK [34172522 total field count; avg 39.415
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]
test: DocValuesOK [0 total doc Count; Num DocValues Fields 0



On Tue, Jun 4, 2013 at 1:00 PM, Tom Burton-West  wrote:

> Thanks Mike.
>
> I'm running CheckIndex on the 2TB index right now.  Hopefully it will
> finish running by tomorrow.  I'll send you a copy of the output.
>
> Tom
>
>
> On Mon, Jun 3, 2013 at 9:04 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Tom,
>>
>> On Mon, Jun 3, 2013 at 12:11 PM, Tom Burton-West 
>> wrote:
>>
>> > What is the current limit?
>>
>> I *think* (but would be nice to hear back how many terms you were able
>> to index into one segment ;) ) there is no hard limit to the max
>> number of terms, now that FSTs can handle more than 2.1 B
>> bytes/nodes/arcs.
>>
>> I'll update those javadocs, thanks!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>


Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

2013-06-04 Thread Tom Burton-West
Thanks Mike.

I'm running CheckIndex on the 2TB index right now.  Hopefully it will
finish running by tomorrow.  I'll send you a copy of the output.
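
In case it is useful to anyone else, a minimal sketch (paths and class name are 
placeholders) of checking an index, either from the command line or 
programmatically:

{code}
import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CheckMyIndex {
  public static void main(String[] args) throws Exception {
    // Roughly the same as:
    //   java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index
    Directory dir = FSDirectory.open(new File(args[0]));
    CheckIndex checker = new CheckIndex(dir);
    checker.setInfoStream(System.out);             // stream the per-segment report as it runs
    CheckIndex.Status status = checker.checkIndex();
    System.out.println("index clean? " + status.clean);
  }
}
{code}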

Tom


On Mon, Jun 3, 2013 at 9:04 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hi Tom,
>
> On Mon, Jun 3, 2013 at 12:11 PM, Tom Burton-West 
> wrote:
>
> > What is the current limit?
>
> I *think* (but would be nice to hear back how many terms you were able
> to index into one segment ;) ) there is no hard limit to the max
> number of terms, now that FSTs can handle more than 2.1 B
> bytes/nodes/arcs.
>
> I'll update those javadocs, thanks!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

2013-06-03 Thread Tom Burton-West
Hello,

The current documentation for Lucene 4.3 file formats says

When referring to term numbers, Lucene's current implementation uses a Java
int to hold the term index, which means the maximum number of unique terms
in any single index segment is ~2.1 billion times the term index interval
(default 128) = ~274 billion. This is technically not a limitation of the
index file format, just of Lucene's current implementation.

(
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#Limitations
)

I believe that the termIndexInterval is not used in the default codec:
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
 and instead the terms index is now in an FST.

So the above limit does not apply to the default codec.
What is the current limit?

I suspect it may be related to the maximum number of nodes in the FST, but
I don't know what that is or how it would translate to number of unique
terms, since prefix sharing among terms probably affects the number of
nodes in the FST.

Tom.


Re: SOLR-3076 and IndexWriter.addDocuments()

2013-05-20 Thread Tom Burton-West
Found it.  In AddBlockUpdateTest.testSmallBlockDirect

 assertEquals(2, h.getCore().getUpdateHandler().addBlock(cmd));
and in the patched code DirectUpdateHandler2.addBlock()

Tom


On Mon, May 20, 2013 at 5:49 PM, Tom Burton-West  wrote:

> My understanding of Lucene Block-Join indexing is that at some point
> IndexWriter.addDocuments() or IndexWriter.updateDocuments() need to be
> called to actually write a block of documents to disk.
>
>I'm trying to understand how SOLR-3076 (Solr should support block
> joins) works and haven't been able to trace out how or where it calls
> IndexWriter.addDocuments() or IndexWriter.updateDocuments.
>
> Can someone point me to the right place in the patch code?
>
> (If I should be asking this in the JIRA instead of the dev list please let
> me know)
>
> Tom
>


SOLR-3076 and IndexWriter.addDocuments()

2013-05-20 Thread Tom Burton-West
My understanding of Lucene Block-Join indexing is that at some point
IndexWriter.addDocuments() or IndexWriter.updateDocuments() need to be
called to actually write a block of documents to disk.

   I'm trying to understand how SOLR-3076 (Solr should support block
joins) works and haven't been able to trace out how or where it calls
IndexWriter.addDocuments() or IndexWriter.updateDocuments.

Can someone point me to the right place in the patch code?

(If I should be asking this in the JIRA instead of the dev list please let
me know)

Tom


[jira] [Updated] (SOLR-3076) Solr should support block joins

2013-05-17 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3076:
--

Attachment: SOLR-7036-childDocs-solr-fork-trunk-patched

Thanks Vadim,

I haven't used SolrJ, so I needed to translate to the XMLLoader/XML update 
handler.

I pulled your trunk-patched version, added an XMLLoader test to 
AddBlockUpdateTest. It's a brain-dead copy of the testSolrJXML test.  I don't 
know if it is testing much, but at least for Solr users like me who are 
unfamiliar with SolrJ, it provides an executable example of the XML syntax 
currently being used.

p.s.
The attached patch is a git diff against your version.  (I don't quite know how 
to make a correct patch against the right version of Solr trunk.)  



> Solr should support block joins
> ---
>
> Key: SOLR-3076
> URL: https://issues.apache.org/jira/browse/SOLR-3076
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
> Fix For: 5.0, 4.4
>
> Attachments: 27M-singlesegment-histogram.png, 27M-singlesegment.png, 
> bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, 
> child-bjqparser.patch, dih-3076.patch, dih-config.xml, 
> parent-bjq-qparser.patch, parent-bjq-qparser.patch, Screen Shot 2012-07-17 at 
> 1.12.11 AM.png, SOLR-3076-childDocs.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-7036-childDocs-solr-fork-trunk-patched, 
> solrconf-bjq-erschema-snippet.xml, solrconfig.xml.patch, 
> tochild-bjq-filtered-search-fix.patch
>
>
> Lucene has the ability to do block joins, we should add it to Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3076) Solr should support block joins

2013-05-16 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660124#comment-13660124
 ] 

Tom Burton-West commented on SOLR-3076:
---


I'd like to test this out with some real data and would like to use the 
XmlUpdateRequestHandler.  Since  SOLR-3535 was folded in here, I looked here to 
try to find the XML syntax to use. I couldn't tell from a quick read of the 
code what the XML syntax would be to actually use to add a parent and children. 
  Would it be possible to add a test similar to 
solr/core/src/test/org.apache.solr.handler/XmlUpdateRequestHandlerTest?

I would assume that all the xml in XmlUpdateRequestHandlerTest could be 
replaced with the proper xml to index a block consisting of a parent and its 
children.

i.e. in the test replace:
String xml = 
  "" +
  "  12345" +
  "  kitten" +
  "  aaa" +
  "  bbb" +
  "  bbb" +
  "  a&b" +
  "";

with whatever xml is needed to index a block (parent and children).



> Solr should support block joins
> ---
>
> Key: SOLR-3076
> URL: https://issues.apache.org/jira/browse/SOLR-3076
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
> Fix For: 5.0, 4.4
>
> Attachments: 27M-singlesegment-histogram.png, 27M-singlesegment.png, 
> bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, 
> child-bjqparser.patch, dih-3076.patch, dih-config.xml, 
> parent-bjq-qparser.patch, parent-bjq-qparser.patch, Screen Shot 2012-07-17 at 
> 1.12.11 AM.png, SOLR-3076-childDocs.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
> solrconf-bjq-erschema-snippet.xml, solrconfig.xml.patch, 
> tochild-bjq-filtered-search-fix.patch
>
>
> Lucene has the ability to do block joins, we should add it to Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Help working with patch for SOLR-3076 (Block Joins)

2013-05-16 Thread Tom Burton-West
Thanks Shawn and Vadim,

I'll try the July patch against  r1351040 of 4_x for now.
 Vadim, I'm in no hurry, but I'll watch 3076 for your patch and work with
that when you post it.


Tom


On Thu, May 16, 2013 at 2:14 AM, Vadim Kirilchuk <
vkirilc...@griddynamics.com> wrote:

> Hi,
>
> As far as i know, patch from 16/Jul/12 was created for branch 4.x, and
> SOLR-3076-childDocs.patch
> from 12/Oct/12 is a little bit reworked SOLR-3076 (for branch 4.x too).
>
> However, they may be not up to date even for 4.x, because of trunk back
> merges (i'am not sure).
>
> Also as mentioned by Shawn, you should use p1 instead of p0.
>
> P/s actually i have reworked version for trunk, i can post it in a week if
> you need.
>
> On Thu, May 16, 2013 at 3:46 AM, Shawn Heisey  wrote:
>
>> On 5/15/2013 5:42 PM, Shawn Heisey wrote:
>>
>>> Through a little detective work, I figured out that it would apply
>>> cleanly to revision 1351040 of trunk.  When I then tried to do 'svn up'
>>> to bring the tree current, there were merge conflicts that will have to
>>> be manually fixed.
>>>
>>
>> It applied also to that revision of branch_4x, and I think there were
>> fewer merge conflicts there, too.  It looks like you want 4x, so that's
>> probably a good thing.
>>
>>
>> Thanks,
>> Shawn
>>
>>
>> --**--**-
>> To unsubscribe, e-mail: 
>> dev-unsubscribe@lucene.apache.**org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>


Help working with patch for SOLR-3076 (Block Joins)

2013-05-15 Thread Tom Burton-West
Hello,

I would like to build Solr with  the July 12th Solr-3076 patch.   How do I
determine what version/revision of Solr I need to check out to build this
patch against?   I tried using the latest branch_4x and got a bunch of
errors.  I suspect I need an earlier revision or maybe trunk.  This patch
also seems to be created with git instead of svn, so maybe I am doing
something wrong.  (Error messages appended below)


Tom


branch_4x
Checked out revision 1483079

patch -p0 -i SOLR-3076.patch --dry-run
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git a/lucene/module-build.xml b/lucene/module-build.xml
|index 62cfd96..2a746fc 100644
|--- a/lucene/module-build.xml
|+++ b/lucene/module-build.xml
--

Tried again with the git patch command
 patch -p1 -i SOLR-3076.patch --dry-run
patching file lucene/module-build.xml
Hunk #1 succeeded at 433 (offset 85 lines).
patching file solr/common-build.xml
Hunk #1 FAILED at 89.
Hunk #2 FAILED at 134.
Hunk #3 FAILED at 151.
3 out of 3 hunks FAILED -- saving rejects to file solr/common-build.xml.rej
patching file
solr/core/src/java/org/apache/solr/handler/UpdateRequestHandler.java
...
lots of more failed messages


Re: Ability to specify 2 different query analyzers for same indexed field in Solr

2013-03-07 Thread Tom Burton-West
Thanks Jan,

The blog post is very good; I didn't quite realize all those various
pitfalls with synonyms.

  I would still like the ability to specify two different query analysis
chains with one index, rather than having to write a custom parser for each
use case.   For example the Traditional/Simplified Chinese use case in my
previous message could probably be solved with a custom query parser along
the lines of the synonym solution in the blog post but if there were a way
to specify two different query analysis chains for the same indexed field,
I would not have to write a custom query parser.

Tom



On Tue, Mar 5, 2013 at 5:39 PM, Jan Høydahl  wrote:

> Hi,
>
> Please have a look at
> http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ and a
> working plugin to Solr to deboost the expanded synonyms. The plugin code
> currently lacks ability to configure different dictionaries for each field,
> but that could be added. Also see SOLR-4381 for eventual inclusion in Solr.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> 5. mars 2013 kl. 17:26 skrev Tom Burton-West :
>
> Thanks Erick,
>
> Payloads might work but I'm looking at a more general problem
>
> Here is another use case:
>
> We have a mix of Traditional and Simplified Chinese documents indexed in
> the same OCR field.
>  When a user searches using Traditional Chinese, I would like to also
> search in Simplified Chinese, but rank the results matching Traditional
> Chinese higher.   Similarly, if a user enters a query in Simplified
> Chinese, I want to also search in Traditional Chinese but rank matches of
> the Simplified Chinese query terms higher.
>
> Since it is not always possible to determine whether a short query is in
> Simplified or Traditional Chinese, here is what I would like to do.
>
> 1) Convert the query to Traditional Chinese
> 2) Convert the query to Simplified Chinese
> (One of these two steps would not be necessary if I could reliably
> determine the nature of the query)
>
> q1=QueryAsEntered^10 OR QueryTraditional^1 OR QuerySimplifed^1.
>
> Again, this could be done with copy fields, but that would increase my
> index size too much.  What I really want to be able to do is to query the
> same index (i.e. document as created ) with the user's query
> processed/analyzed in 3 different ways.
>
> I could do this myself in the app layer, but I would really like to be
> able to use Solr.
>
>
> Tom
>
>
>
> On Mon, Mar 4, 2013 at 8:19 PM, Erick Erickson wrote:
>
>> Tom:
>>
>> I wonder if you could do something with payloads here. Index all terms
>> with payloads of 10, but synonyms with 1?
>>
>> Random thought off the top of my head.
>>
>> Erick
>>
>>
>>> [the fieldType/analyzer XML quoted here was stripped from the archived
>>> message; only the fragment ignoreCase="true" expand="true" survives]
>>>
>>> On Mon, Mar 4, 2013 at 4:43 PM, Jack Krupansky 
>>> wrote:
>>>
>>>>   Please clarify, and try providing a couple more use cases. I mean,
>>>> the case you provided suggests that the contents of the index will be
>>>> different between the two fields, while you told us that you wanted to
>>>> share the same indexed field. In other words, it sounds like you will have
>>>> two copies of similar data anyway.
>>>>
>>>> Maybe you simply want one copy of the stored value for the field and
>>>> then have one or more copyfields that index the same source data
>>>> differently, but don’t re-store the copied source data.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>>  *From:* Tom Burton-West 
>>>> *Sent:* Monday, March 04, 2013 3:57 PM
>>>> *To:* dev@lucene.apache.org
>>>> *Subject:* Ability to specify 2 different query analyzers for same
>>>> indexed field in Solr
>>>>
>>>> Hello,
>>>>
>>>> We would like to be able to specify two different fields that both use
>>>> the same indexed field but use different analyzers.   An example use-case
>>>> for this might be doing query-time synonym expansion with the synonyms
>>>> weighted lower than an exact match.
>>>>
>>>> q=exact_field^10 OR synonyms^1
>>>>
>>>> The normal way to do this in Solr, which is just to set up separate
>>>> analyzer chains and use a copyfield, will not work for us because the field
>>>> in question is huge.  It is about 7 TB of OCR.
>>>>
>>>> Is there a way to do this currently in Solr?   If not ,
>>>>
>>>> 1) should I open a JIRA issue?
>>>> 2) can someone point me towards the part of the code I might need to
>>>> modify?
>>>>
>>>> Tom
>>>>
>>>>  Tom Burton-West
>>>> Information Retrieval Programmer
>>>> Digital Library Production Service
>>>> University of Michigan Library
>>>> http://www.hathitrust.org/blogs/large-scale-search
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>


Re: Ability to specify 2 different query analyzers for same indexed field in Solr

2013-03-05 Thread Tom Burton-West
Thanks Erick,

Payloads might work but I'm looking at a more general problem

Here is another use case:

We have a mix of Traditional and Simplified Chinese documents indexed in
the same OCR field.
 When a user searches using Traditional Chinese, I would like to also
search in Simplified Chinese, but rank the results matching Traditional
Chinese higher.   Similarly, if a user enters a query in Simplified
Chinese, I want to also search in Traditional Chinese but rank matches of
the Simplified Chinese query terms higher.

Since it is not always possible to determine whether a short query is in
Simplified or Traditional Chinese, here is what I would like to do.

1) Convert the query to Traditional Chinese
2) Convert the query to Simplified Chinese
(One of these two steps would not be necessary if I could reliably
determine the nature of the query)

q1=QueryAsEntered^10 OR QueryTraditional^1 OR QuerySimplifed^1.

Again, this could be done with copy fields, but that would increase my
index size too much.  What I really want to be able to do is to query the
same index (i.e. document as created ) with the user's query
processed/analyzed in 3 different ways.

I could do this myself in the app layer, but I would really like to be able
to use Solr.


Tom



On Mon, Mar 4, 2013 at 8:19 PM, Erick Erickson wrote:

> Tom:
>
> I wonder if you could do something with payloads here. Index all terms
> with payloads of 10, but synonyms with 1?
>
> Random thought off the top of my head.
>
> Erick
>
>
>> [the fieldType/analyzer XML quoted here was stripped from the archived
>> message; only the fragment ignoreCase="true" expand="true" survives]
>>
>> On Mon, Mar 4, 2013 at 4:43 PM, Jack Krupansky 
>> wrote:
>>
>>>   Please clarify, and try providing a couple more use cases. I mean,
>>> the case you provided suggests that the contents of the index will be
>>> different between the two fields, while you told us that you wanted to
>>> share the same indexed field. In other words, it sounds like you will have
>>> two copies of similar data anyway.
>>>
>>> Maybe you simply want one copy of the stored value for the field and
>>> then have one or more copyfields that index the same source data
>>> differently, but don’t re-store the copied source data.
>>>
>>> -- Jack Krupansky
>>>
>>>  *From:* Tom Burton-West 
>>> *Sent:* Monday, March 04, 2013 3:57 PM
>>> *To:* dev@lucene.apache.org
>>> *Subject:* Ability to specify 2 different query analyzers for same
>>> indexed field in Solr
>>>
>>> Hello,
>>>
>>> We would like to be able to specify two different fields that both use
>>> the same indexed field but use different analyzers.   An example use-case
>>> for this might be doing query-time synonym expansion with the synonyms
>>> weighted lower than an exact match.
>>>
>>> q=exact_field^10 OR synonyms^1
>>>
>>> The normal way to do this in Solr, which is just to set up separate
>>> analyzer chains and use a copyfield, will not work for us because the field
>>> in question is huge.  It is about 7 TB of OCR.
>>>
>>> Is there a way to do this currently in Solr?   If not ,
>>>
>>> 1) should I open a JIRA issue?
>>> 2) can someone point me towards the part of the code I might need to
>>> modify?
>>>
>>> Tom
>>>
>>>  Tom Burton-West
>>> Information Retrieval Programmer
>>> Digital Library Production Service
>>> University of Michigan Library
>>> http://www.hathitrust.org/blogs/large-scale-search
>>>
>>>
>>>
>>
>>
>


Re: Ability to specify 2 different query analyzers for same indexed field in Solr

2013-03-04 Thread Tom Burton-West
Hi Jack,

Sorry the example is not clear.  Below is the normal way to accomplish what
I am trying to do using a copyField and two separate fieldTypes with the
index analyzer the same but the query time analyzer different.

So the query would be something like q=plain:foobar^10 OR syn:foobar^1  to
get synonyms but scored much lower than an exact match.

The problem with this is that, since the analysis chain used for indexing is
the same in both cases, I would rather not have to actually index the exact
same content in the exact same way twice.

Does that make it any clearer or do I need a more compelling use case?

Tom

[the copyField and the two fieldType/field definitions were stripped from the
archived message]


On Mon, Mar 4, 2013 at 4:43 PM, Jack Krupansky wrote:

>   Please clarify, and try providing a couple more use cases. I mean, the
> case you provided suggests that the contents of the index will be different
> between the two fields, while you told us that you wanted to share the same
> indexed field. In other words, it sounds like you will have two copies of
> similar data anyway.
>
> Maybe you simply want one copy of the stored value for the field and then
> have one or more copyfields that index the same source data differently,
> but don’t re-store the copied source data.
>
> -- Jack Krupansky
>
>  *From:* Tom Burton-West 
> *Sent:* Monday, March 04, 2013 3:57 PM
> *To:* dev@lucene.apache.org
> *Subject:* Ability to specify 2 different query analyzers for same
> indexed field in Solr
>
> Hello,
>
> We would like to be able to specify two different fields that both use the
> same indexed field but use different analyzers.   An example use-case for
> this might be doing query-time synonym expansion with the synonyms weighted
> lower than an exact match.
>
> q=exact_field^10 OR synonyms^1
>
> The normal way to do this in Solr, which is just to set up separate
> analyzer chains and use a copyfield, will not work for us because the field
> in question is huge.  It is about 7 TB of OCR.
>
> Is there a way to do this currently in Solr?   If not ,
>
> 1) should I open a JIRA issue?
> 2) can someone point me towards the part of the code I might need to
> modify?
>
> Tom
>
>  Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Service
> University of Michigan Library
> http://www.hathitrust.org/blogs/large-scale-search
>
>
>


Ability to specify 2 different query analyzers for same indexed field in Solr

2013-03-04 Thread Tom Burton-West
Hello,

We would like to be able to specify two different fields that both use the
same indexed field but use different analyzers.   An example use-case for
this might be doing query-time synonym expansion with the synonyms weighted
lower than an exact match.

q=exact_field^10 OR synonyms^1

The normal way to do this in Solr, which is just to set up separate
analyzer chains and use a copyfield, will not work for us because the field
in question is huge.  It is about 7 TB of OCR.

Is there a way to do this currently in Solr?   If not ,

1) should I open a JIRA issue?
2) can someone point me towards the part of the code I might need to modify?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search


default updateLog setting in Solr 4 example solrconfig.xml needs warning documentation for possible very large logs

2013-01-15 Thread Tom Burton-West
Hello all,

We have been using Solr 4.0 for a while and suddenly we couldn't get Solr
to come up.   As Solr was starting up, it hung after opening a Searcher.
 There wasn't anything else obvious in the logs.  Eventually we realized
that the problem was that the updateLog was being read and that the update
log contained the entire text of all 800,000+ books that we indexed (about
837GB).

We looked and didn't find any obvious note in the Solr 4.0 Release notes on
upgrading from 3.6 or any documentation in the example solrconfig.xml that
mentioned that perhaps if you have large documents and you aren't using
real-time get, you may want to turn this off/comment this out to avoid
transactions logs that can exceed the size of your index.

In the latest 4.0 example/solrconfig.xml (r1433064), updateLog is enabled in the
default Solr updateHandler by default, and the only comment is:

[the updateLog snippet and its comment were stripped from the archived message]

 Some users who are either new to Solr or upgrading from earlier versions
of Solr may not understand whether or not they need "real-time get", and
they may not want to delve into the details of near-realtime search or
using Solr as a NoSQL server in order to determine whether they should
comment out the updateLog entry.

I think that either the updateLog should not be enabled by default (don't
know the pros and cons of this), or at the very least, something should
mention that this can lead to large transaction logs and there should be a
pointer to some documentation that would enable the user to decide whether
or not to enable/disable this.

Is there documentation of this in some obvious place that I just missed?

I did find the text below on the wiki
http://wiki.apache.org/solr/SolrConfigXml#Update_Handler_Section, but a
user-friendly translation, or a pointer to where someone could read to
determine what this means, would be helpful.

[the quoted config text from the wiki was stripped from the archived message;
only the value "false" survives]

I did see that several new Solr 4 users created very large logs before they
asked the mailing list how to avoid this:
http://lucene.472066.n3.nabble.com/Documentation-on-the-new-updateLog-transaction-log-feature-tc4000537.html#a4000538

Perhaps some of the information in this thread on the mailing list might be
added to the documentation somewhere.

http://lucene.472066.n3.nabble.com/Testing-Solr4-first-impressions-and-problems-tc4013628.html#a4013814

I think I almost understand the
hard-commit/soft-commit/autocommit/opensearcher discussion in the above
thread and it would seem that this could be put in the wiki or the comments
in the config file as appropriate.

Should I open a JIRA issue?

Tom




Log entry.
"Jan 14, 2013 12:40:31 PM org.apache.solr.search.SolrIndexSearcher 
INFO: Opening Searcher@59db9f45 main


[jira] [Commented] (LUCENE-2187) improve lucene's similarity algorithm defaults

2013-01-04 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543985#comment-13543985
 ] 

Tom Burton-West commented on LUCENE-2187:
-

Hi Robert,

Is this implementation made moot by the new GSOC work, or would it still be 
worth testing this as well as BM25, DFR and INF?

I can't seem to find a link to the ORP collections.  Can you point me to it?
(I plan to test with our long docs, but thought I would try out some of the ORP 
collections as well)


Tom

> improve lucene's similarity algorithm defaults
> --
>
> Key: LUCENE-2187
> URL: https://issues.apache.org/jira/browse/LUCENE-2187
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Robert Muir
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-2187.patch, scoring.pdf, scoring.pdf, scoring.pdf
>
>
> First things first: I am not an IR guy. The goal of this issue is to make 
> 'surgical' tweaks to lucene's formula to bring its performance up to that of 
> more modern algorithms such as BM25.
> In my opinion, the concept of having some 'flexible' scoring with good speed 
> across the board is an interesting goal, but not practical in the short term.
> Instead here I propose incorporating some work similar to lnu.ltc and 
> friends, but slightly different. I noticed this seems to be in line with that 
> paper published before about the trec million queries track... 
> Here is what I propose in pseudocode (overriding DefaultSimilarity):
> {code}
>   @Override
>   public float tf(float freq) {
> return 1 + (float) Math.log(freq);
>   }
>   
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
> return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
>   }
> {code}
> Where slope is a constant (I used 0.25 for all relevance evaluations: the 
> goal is to have a better default), and pivot is the average field length. 
> Obviously we shouldnt make the user provide this but instead have the system 
> provide it.
> These two pieces do not improve lucene much independently, but together they 
> are competitive with BM25 scoring with the test collections I have run so 
> far. 
> The idea here is that this logarithmic tf normalization is independent of the 
> tf / mean TF that you see in some of these algorithms, in fact I implemented 
> lnu.ltc with cosine pivoted length normalization and log(tf)/log(mean TF) 
> stuff and it did not fare as well as this method, and this is simpler, we do 
> not need to calculate this mean TF at all.
> The BM25-like "binary" pivot here works better on the test collections I have 
> run, but of course only with the tf modification.
> I am uploading a document with results from 3 test collections (Persian, 
> Hindi, and Indonesian). I will test at least 3 more languages... yes 
> including English... across more collections and upload those results also, 
> but i need to process these corpora to run the tests with the benchmark 
> package, so this will take some time (maybe weeks)
> so, please rip it apart with scoring theory etc, but keep in mind 2 of these 
> 3 test collections are in the openrelevance svn, so if you think you have a 
> great idea, don't hesitate to test it and upload results, this is what it is 
> for. 
> also keep in mind again I am not a scoring or IR guy, the only thing i can 
> really bring to the table here is the willingness to do a lot of relevance 
> testing!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: VOTE: release 3.6.2

2012-12-19 Thread Tom Burton-West
Thanks Robert,

Ok, I can see that logic.   People who want the "new feature" can just
apply the patch.

Tom

On Wed, Dec 19, 2012 at 5:53 PM, Robert Muir  wrote:

> On Wed, Dec 19, 2012 at 5:50 PM, Tom Burton-West 
> wrote:
> > Hi Robert,
> >
> > Would it be possible to fold in also LUCENE-4286?
> > I don't see the 3.6 backport listed in the JIRA issue, but it would be
> nice
> > to have that flag available for people still on the 3.6.x branch.
> >
>
> I think i would prefer not to: even though its just a "new
> option/feature", 3.6.x is basically in bugfix mode.
>
> I know this isn't great for people that want the feature, but i feel
> we should be really careful with these bugfix releases myself.
>
> Terms index bug is a whole different story though!
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: VOTE: release 3.6.2

2012-12-19 Thread Tom Burton-West
Hi Robert,

Would it be possible to fold in also LUCENE-4286?
I don't see the 3.6 backport listed in the JIRA issue, but it would be nice
to have that flag available for people still on the 3.6.x branch.

Tom

On Wed, Dec 19, 2012 at 3:46 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Wed, Dec 19, 2012 at 2:01 PM, Robert Muir  wrote:
> > On Wed, Dec 19, 2012 at 1:48 PM, Michael McCandless
> >  wrote:
> >> Hmm my smoke test run was angry about this javadocs warning:
> >>
> >>   [javadoc]
> /l/36x/tmp/unpack/apache-solr-3.6.2/lucene/contrib/facet/src/java/org/apache/lucene/facet/taxonomy/writercache/lru/NameIntCacheLRU.java:76:
> >> warning - @return tag has no arguments.
> >>
> >> It hit this when running javadocs with 1.7.0_07.
> >
> > Thanks Mike... can you fix? I have no idea why different versions of
> > java7 have different levels of pickiness.
> >
> > This is no issue for e.g. 4.x+, because we have the eclipse checker as
> > part of our build which fails on this. But for 3.x we don't have as
> > many tools unfortunately.
> >
> > I dont think we should put a lot of effort into this: but when
> > backporting bugfixes to old branches like this please be really
> > careful about this stuff.
>
> OK I committed that fix and a couple other javadocs warnings.
>
> Thanks for spinning 3.6.2 Robert!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Updated] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

2012-11-29 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-4286:


Attachment: LUCENE-4286.patch_3.x

We are still using Solr 3.6 in production so I backported the patch to 
Lucene/Solr 3.6.  Attached as LUCENE-4286.patch_3.x
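
A minimal sketch (assuming the flag as committed on 4.x, where it is exposed as 
outputUnigrams rather than the indexUnigrams name used in the description below) 
of an index-time analyzer that emits both bigrams and unigrams:

{code}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.cjk.CJKWidthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Sketch only: the query-time analyzer would be the same chain with the flag left off.
Analyzer indexAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
    TokenStream sink = new CJKWidthFilter(source);
    // HAN flag: only bigram Han characters; the last argument asks for unigrams as well.
    sink = new CJKBigramFilter(sink, CJKBigramFilter.HAN, true);
    return new TokenStreamComponents(source, sink);
  }
};
{code}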

> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.0-ALPHA, 3.6.1
>    Reporter: Tom Burton-West
>Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-4286.patch, LUCENE-4286.patch, 
> LUCENE-4286.patch_3.x
>
>
> Add an optional  flag to the CJKBigramFilter to tell it to also output 
> unigrams.   This would allow indexing of both bigrams and unigrams and at 
> query time the analyzer could analyze queries as bigrams unless the query 
> contained a single Han unigram.
> As an example here is a configuration of a Solr fieldType with the analyzer for 
> indexing with the "indexUnigrams" flag set and the analyzer for querying 
> without the flag. 
> [the example fieldType XML was stripped from the archived message; only the 
> fragment han="true" survives]
> Use case: About 10% of our queries that contain Han characters are single 
> character queries.   The CJKBigram filter only outputs single characters when 
> there are no adjacent bigrammable characters in the input.  This means we 
> have to create a separate field to index Han unigrams in order to address 
> single character queries and then write application code to search that 
> separate field if we detect a single character Han query.  This is rather 
> kludgey.  With the optional flag, we could configure Solr as above  
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter 
> used to allow single word queries (although that uses word n-grams rather 
> than character n-grams.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-11-07 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492591#comment-13492591
 ] 

Tom Burton-West commented on SOLR-3589:
---

Hi Robert,

I just put the backport to 3.6 up on our test server and pointed it to one of 
our production shards.  The improvements for Chinese queries are dramatic.  
(Especially for longer queries like the TREC 5 queries, see examples below)

When you have time, please look over the backport of the patch.  I think it is 
fine but I would appreciate you looking it over.  My understanding of your 
patch is that it just affects a small portion of the edismax logic, but I don't 
understand the edismax parser well enough to be sure there isn't some 
difference between 3.6 and 4.0 that I didn't account for in the patch.

Thanks for working on this.   Naomi and I are both very excited about this bug 
finally being fixed and want to put the fix into production soon.
---
Example TREC 5 Chinese queries:

 Number: CH4
 The newly discovered oil fields in China.
 中国大陆新发现的油田   
40,135 items found for 中国大陆新发现的油田 with current implementation (due to dismax 
bug)
78 items found for 中国大陆新发现的油田 with patch

 Number: CH10
 Border Trade in Xinjiang
 新疆的边境贸易  
20,249 items found for 新疆的边境贸易  current implementation (with bug)
243 items found for 新疆的边境贸易  with patch.


> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>Reporter: Tom Burton-West
>Assignee: Robert Muir
> Attachments: SOLR-3589-3.6.PATCH, SOLR-3589.patch, SOLR-3589.patch, 
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, 
> testSolr3589.xml.gz, testSolr3589.xml.gz
>
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-11-07 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3589:
--

Attachment: SOLR-3589-3.6.PATCH

Backport to 3.6 r1406713. Includes synonyms test.

Will test it against production later today.

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>    Reporter: Tom Burton-West
>Assignee: Robert Muir
> Attachments: SOLR-3589-3.6.PATCH, SOLR-3589.patch, SOLR-3589.patch, 
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, 
> testSolr3589.xml.gz, testSolr3589.xml.gz
>
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-11-07 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492487#comment-13492487
 ] 

Tom Burton-West commented on SOLR-3589:
---

Forgot to work from your latest patch with the synonyms test.   I'll post a new 
backport of the patch with the synonyms test, against the latest 3.6.x in svn, 
shortly.

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>Reporter: Tom Burton-West
>Assignee: Robert Muir
> Attachments: SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, 
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, testSolr3589.xml.gz, 
> testSolr3589.xml.gz
>
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-11-06 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491922#comment-13491922
 ] 

Tom Burton-West commented on SOLR-3589:
---

I back-ported to the 3.6 branch.  I forgot to change the name from SOLR-3589.patch, 
so the 6/Nov/12 patch is the 3.6 patch against yesterday's svn version of 3.6.

The main difference I saw between 3.6 and 4.0 is that Solr 4.0 uses 
DisMaxQParser.parseMinShouldMatch() to set the default at 0% if q.op=OR and 
100% if q.op=AND.

I just kept the 3.6 behavior, which uses the 3.6 default of 100% (if mm is not set).

I'll test the 3.6 patch against a production index tomorrow.
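
(To illustrate the difference with a hypothetical two-term query and no explicit mm:
under the 3.6 default of 100%, q=fire fly requires both terms to match; under the
4.0 defaulting rule described above, the same query requires both terms only when
q.op=AND and matches on either term alone when q.op=OR.)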
 



> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>Reporter: Tom Burton-West
>Assignee: Robert Muir
> Attachments: SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, 
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, testSolr3589.xml.gz, 
> testSolr3589.xml.gz
>
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-11-06 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3589:
--

Attachment: SOLR-3589.patch

Back-port to 3.6 branch

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>    Reporter: Tom Burton-West
>Assignee: Robert Muir
> Attachments: SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, 
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, testSolr3589.xml.gz, 
> testSolr3589.xml.gz
>
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-4023) Solr query parser does not correctly handle Boolean precedence

2012-10-31 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-4023:
-

 Summary: Solr query parser does not correctly handle Boolean 
precedence
 Key: SOLR-4023
 URL: https://issues.apache.org/jira/browse/SOLR-4023
 Project: Solr
  Issue Type: Bug
  Components: query parsers
Affects Versions: 4.0, 3.6
Reporter: Tom Burton-West


The default query parser in Solr does not handle precedence of Boolean 
operators in the way most people expect.
 
“A AND B OR C” gets interpreted as “A AND (B OR C)”.  There are numerous other 
examples in the JIRA ticket for LUCENE-167, in this article on the wiki 
http://wiki.apache.org/lucene-java/BooleanQuerySyntax and in this blog post: 
http://robotlibrarian.billdueber.com/solr-and-boolean-operators/

See also : 
http://lucene.472066.n3.nabble.com/re-LUCENE-167-and-Solr-default-handling-of-Boolean-operators-is-broken-tc3552321.html#a3552416

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-08-23 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3589:
--

Attachment: testSolr3589.xml.gz

See above note

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>    Reporter: Tom Burton-West
> Attachments: testSolr3589.xml.gz, testSolr3589.xml.gz
>
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-08-23 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440669#comment-13440669
 ] 

Tom Burton-West commented on SOLR-3589:
---

I'm not at the point where I understand the test cases for Edismax enough to 
write unit tests. If someone can point me to an example unit test somewhere 
that I could use to model a test, please do.
In the meantime, attached is a file which can be put in the Solr exampledocs 
directory and indexed.  Sample queries demonstrating the problem with English 
hyphenated words and with CJK are included.

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>Reporter: Tom Burton-West
> Attachments: testSolr3589.xml.gz
>
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-08-23 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3589:
--

Attachment: testSolr3589.xml.gz

File is gzipped. Unix line endings. Put document in solr/example/exampledocs.  
Queries listed in file.

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>    Reporter: Tom Burton-West
> Attachments: testSolr3589.xml.gz
>
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-08-23 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440583#comment-13440583
 ] 

Tom Burton-West commented on SOLR-3589:
---

Just repeated the tests in Solr 4.0 Beta and the bug behaves the same.

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>    Reporter: Tom Burton-West
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-08-23 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated SOLR-3589:
--

Affects Version/s: 4.0-BETA

> Edismax parser does not honor mm parameter if analyzer splits a token
> -
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.6, 4.0-BETA
>    Reporter: Tom Burton-West
>
> With edismax mm set to 100%  if one of the tokens is split into two tokens by 
> the analyzer chain (i.e. "fire-fly"  => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-3753) Core admin and solr.xml documentation for 4.0 needs to be updated for 4.0 changes

2012-08-23 Thread Tom Burton-West (JIRA)
Tom Burton-West created SOLR-3753:
-

 Summary: Core admin and solr.xml documentation for 4.0 needs to be 
updated for 4.0 changes
 Key: SOLR-3753
 URL: https://issues.apache.org/jira/browse/SOLR-3753
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0-BETA
Reporter: Tom Burton-West


The existing documentation on Solr Cores needs to be updated to reflect changes 
in Solr 4.0.

If having at least one solr core declared is mandatory for Solr 4.0, that needs 
to be stated in the release notes, in the example solr.xml file, and on the 
wiki page for CoreAdmin. http://wiki.apache.org/solr/CoreAdmin.

In the absence of a solr.xml file, the current 4.0 behavior is to use defaults 
declared in CoreContainer.java.  This needs to be documented, probably in 
solr.xml and/or on the CoreAdmin page.  (See line 94 of CoreAdmin.java, where 
the default name "collection1" is declared.)  Without this documentation, users 
can get confused about where the "collection1" core name is coming from (I'm 
one of them).

The solr.xml file states that paths are relative to the "installation 
directory".  This needs to be clarified.  In addition, it appears that currently 
relative paths specified using "." or ".." are interpreted as string literals.  
If that is not a bug, then this behavior needs to be documented.  If it is a 
bug, please let me know and I'll open another issue.

The example/solr/README.txt needs to clarify which files need to be in Solr 
Home and which files are mandatory or optional in the directories containing 
configuration files (and data files) for Solr cores.
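
For reference, a minimal solr.xml along the lines of the 4.0 example looks roughly 
like this (the core name and instanceDir are just the example defaults, not 
requirements):

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>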


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Solr 4.0 Beta Documentation issues: Is it mandatory with 4.0 to run at least one core? /example/solr/README.txt needs updating

2012-08-23 Thread Tom Burton-West
Hello,

The CoreAdmin wiki page (http://wiki.apache.org/solr/CoreAdmin) implies
that setting up at least one core is not mandatory and neither is
solr.xml.  However, when trying to migrate from 3.6 to 4.0 beta, I got a
message in the admin console: "There are no SolrCores running — for the
current functionality we require at least one SolrCore, sorry :)"

Here are a few questions that probably need to be cleared up in the
documentation.
1) Is running at least one core required, or is the message above referring
to some admin console functionality that won't work without at least one
core?  If running at least one core is required, perhaps this also needs
to go in the release notes/CHANGES.

2) The README.txt file at example/solr/README.txt needs revision.  Is
example/solr the Solr Home directory?  If so, what is the relationship
between Solr Home and the subdirectories for different cores?  Do lib
files go only in example/solr/lib or in example/solr/collection1/lib?
Which files/directories are shared by cores, and which need to be in
separate core directories?
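
(For what it's worth, my reading of the 4.0 example layout, which may well be wrong,
is roughly:

example/solr/                  the Solr Home directory
  solr.xml                     lists the cores
  collection1/                 instanceDir for the "collection1" core
    conf/                      solrconfig.xml, schema.xml, etc.
    data/                      index files
    lib/                       per-core jars, if any

but this is exactly the kind of thing the README should spell out.)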

I'll be happy to add these as a comment to SOLR-3288 if that is
appropriate.  Please let me know.

Thanks for all your work on Solr 4!

Tom


Re: [jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

2012-08-17 Thread Tom Burton-West
I just wanted to mention that the problem is not only with mm=100% but
also with other values of mm, whenever the number of tokens resulting from
splitting (CJK or otherwise) falls within the mm limit.

For example, with mm=2 the query [fire fly] and the query [fire-fly] (which
WDF splits into two tokens) should both end up with both words being
required to match.  However, when the split occurs, [fire-fly] is instead
treated as "fire" OR "fly".
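
(A sketch of what I mean, assuming a hypothetical field named "text" with WDF in its
chain: with defType=edismax and mm=2, q=fire fly parses to roughly
+((text:fire) (text:fly))~2, so both terms are required, while q=fire-fly ends up as
roughly +((text:fire text:fly)) with no minimum applied to the split tokens, so either
term alone matches.)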

Tom


Re: [jira] [Commented] (SOLR-3723) Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

2012-08-09 Thread Tom Burton-West
Regardless of how you change or don't change the examples, I just want to
put in a plug for better documentation.  A number of Solr users were taken by
surprise when the default was changed in Solr/Lucene 3.5.  I tried to find
out how to modify/change the release notes to call attention to this but
gave up too soon.  See:
http://lucene.472066.n3.nabble.com/autoGeneratePhraseQueries-sort-of-silently-set-to-false-tc3770638.html
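
For anyone else caught out by this, the relevant attribute is set on the fieldType.
Something along these lines (a sketch trimmed to just the attribute, not the full
example fieldType) restores the old behavior for an English text field:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- analyzer chains omitted; autoGeneratePhraseQueries is the point here -->
</fieldType>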
Tom Burton-West
On Thu, Aug 9, 2012 at 1:25 PM, Yonik Seeley (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432003#comment-13432003]
>
> Yonik Seeley commented on SOLR-3723:
> 
>
> bq. I think apps that want this behaviour should simply use
> text_en_splitting. That's why we have that field type.
>
> We could also create a text_en_pureOr (or whatever name fits better) field
> type that always interpreted a-b as (a OR B) and then apps that want that
> behavior could use that.
>
> But we're also talking about what the best default for english (i.e.
> text_en) in general is.
> The defaults for "text" in general are a different question.  Looking at
> all of the arguments so far, my judgement is still that for text_en,
> interpreting a-team as "a team" is far preferable to (a OR team)
>
>
> > Improve OOTB behavior: English word-splitting should default to
> autoGeneratePhraseQueries=true
> >
> --
> >
> > Key: SOLR-3723
> > URL: https://issues.apache.org/jira/browse/SOLR-3723
> > Project: Solr
> >  Issue Type: Improvement
> >  Components: Schema and Analysis
> >Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
> >Reporter: Jack Krupansky
> >
> > Digging through the Jira and revision history, I discovered that back at
> the end of May 2011, a change was made to Solr that fairly significantly
> degrades the OOTB behavior for English Solr queries, namely for
> word-splitting of terms with embedded punctuation, so that they end up, by
> default, doing the OR of the sub-terms, rather than doing the obvious
> phrase query of the sub-terms.
> > Just a couple of examples:
> > 1. CD-ROM => CD OR ROM rather than “CD ROM”
> > 2. 1,000 => 1 OR 000 rather than “1 000” (when using the
> WordDelimiterFilter innocently added to text_general or text_en)
> > 3. out-of-the-box => out OR of OR the OR box rather than “out of the box”
> > 4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter
> innocently added to text_general or text_en)
> > 5. docid-001 => docid OR 001 rather than "DOCID 001"
> > All of those queries will give surprising and unexpected results.
> > Note: The hyphen issue is present in StandardTokenizer, even if WDF is
> not used. Side note: The full behavior of StandardTokenizer should be more
> fully documented on the Analyzers wiki.
> > Back to the history of the change, there was a lot of lively discussion
> on SOLR-2015 - add a config hook for autoGeneratePhraseQueries.
> > And the actual change to default to the behavior described above was
> SOLR-2519 - improve defaults for text_* field types.
> > (Consider the entire discussion in those two issues incorporated here
> for reference. Anyone wishing to participate in discussion on this issue
> would be well-advised to study those two issues first.)
> > I gather that the original motivation was for non-European languages,
> and that even some European languages might search better without
> auto-phrase generation, but the decision to default English terms to NOT
> automatically generate phrase queries and to generate OR queries instead is
> rather surprising and unexpected and outright undesirable, as my examples
> above show.
> > I had been aware of the behavior for quite some time, but I had thought
> it was simply a lingering bug so I paid little attention to it, until I
> stumbled across this autoGeneratePhraseQueries "feature" while looking at
> the query parser code. I can understand the need to disable automatic
> phrase queries for SOME languages, but to disable it by default for English
> seems rather bizarre, as my simple use cases above show.
> > Even if no action is taken on this Jira, I feel that it is important
> that there be a wider awareness of the significant and unexpected impact
> from SOLR-2519, and that what had seemed like buggy behavior was done
> intentionally.
> > Unless there has been a change of hear

[jira] [Commented] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

2012-08-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431297#comment-13431297
 ] 

Tom Burton-West commented on LUCENE-4286:
-

Thanks Robert for all your work on non-English searching and for your quick 
response on this issue.

>>If you do unigrams and bigrams in separate fields, you can bias bigrams over 
>>unigrams.
That was our original intention. 

>>The combined unigram+bigram technique is a general technique, which I think 
>>is useful to support. ...Tom would have to do tests for his "index-time-only" 
>>approach: I can't speak for that.

Originally I was going to use the combined unigram+bigram technique (with a 
boost for the bigram fields) and wrote some custom code to implement it.  
However, I started thinking about the size of our documents. With one 
exception, all the literature I found that got better results with a 
combination of bigrams and unigrams used newswire size documents (somewhere in 
the range of a few hundred words).  Our documents are several orders of 
magnitude larger (around 100,000 words).  

My understanding is that the main reason adding unigrams to bigrams increases 
relevance is that often the unigram will have a related meaning to the larger 
word.  So using unigrams is somewhat analogous to decompounding or stemming.  I 
haven't done any tests, but my guess is that with our very large documents the 
additional recall added by unigrams will be offset by a decrease in precision.

After I get a test suite set up for relevance ranking in English, I'll take a 
look at testing CJK :)

Tom

> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.0-ALPHA, 3.6.1
>Reporter: Tom Burton-West
>Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-4286.patch, LUCENE-4286.patch
>
>
> Add an optional  flag to the CJKBigramFilter to tell it to also output 
> unigrams.   This would allow indexing of both bigrams and unigrams and at 
> query time the analyzer could analyze queries as bigrams unless the query 
> contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType with the analyzer for 
> indexing with the "indexUnigrams" flag set and the analyzer for querying 
> without the flag. 
> 
> [fieldType XML stripped in the plain-text archive; only the filter's han="true" attribute survives]
> 
> Use case: About 10% of our queries that contain Han characters are single 
> character queries.   The CJKBigram filter only outputs single characters when 
> there are no adjacent bigrammable characters in the input.  This means we 
> have to create a separate field to index Han unigrams in order to address 
> single character queries and then write application code to search that 
> separate field if we detect a single character Han query.  This is rather 
> kludgey.  With the optional flag, we could configure Solr as above  
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter 
> used to allow single word queries (although that uses word n-grams rather 
> than character n-grams.)
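
The fieldType XML in the quoted description above was stripped to fragments in the
plain-text archive; a rough reconstruction follows. The field name, tokenizer, and
positionIncrementGap are assumptions on my part; only the indexUnigrams and han
attributes on the CJKBigramFilter come from the original text.

<fieldType name="cjk_bigram_unigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- index both bigrams and unigrams -->
    <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- query side: bigrams only, except for a lone Han character -->
    <filter class="solr.CJKBigramFilterFactory" han="true"/>
  </analyzer>
</fieldType>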

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

2012-08-06 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429217#comment-13429217
 ] 

Tom Burton-West commented on LUCENE-4286:
-

We haven't had a request for this specific feature from readers; we are just 
assuming that the 10% of Han queries in our logs that consist of a single 
character represent real use cases, and we don't want such queries to produce 
zero results or misleading results.

Tom

> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.0-ALPHA, 3.6.1
>Reporter: Tom Burton-West
>Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-4286.patch, LUCENE-4286.patch
>
>
> Add an optional  flag to the CJKBigramFilter to tell it to also output 
> unigrams.   This would allow indexing of both bigrams and unigrams and at 
> query time the analyzer could analyze queries as bigrams unless the query 
> contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType with the analyzer for 
> indexing with the "indexUnigrams" flag set and the analyzer for querying 
> without the flag. 
> 
> [fieldType XML stripped in the plain-text archive; only the filter's han="true" attribute survives]
> 
> Use case: About 10% of our queries that contain Han characters are single 
> character queries.   The CJKBigram filter only outputs single characters when 
> there are no adjacent bigrammable characters in the input.  This means we 
> have to create a separate field to index Han unigrams in order to address 
> single character queries and then write application code to search that 
> separate field if we detect a single character Han query.  This is rather 
> kludgey.  With the optional flag, we could configure Solr as above  
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter 
> used to allow single word queries (although that uses word n-grams rather 
> than character n-grams.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Add flag to CJKBigramFilter to also output unigrams (Single character Han queries)

2012-08-03 Thread Tom Burton-West
Thanks Robert,

Opened: LUCENE-4286 <https://issues.apache.org/jira/browse/LUCENE-4286>

Tom

On Fri, Aug 3, 2012 at 6:22 PM, Robert Muir  wrote:

> Tom, please open an issue for this.
>
> On Fri, Aug 3, 2012 at 6:19 PM, Tom Burton-West 
> wrote:
> > Hello all,
> >
> > About 10% of our queries that contain Han characters are single character
> > queries.   It looks like the CJKBigram filter only outputs single
> characters
> > when there are no adjacent bigrammable characters in the input.   This
> means
> > we have to create a separate field to index Han unigrams in order to
> address
> > single character queries and then write application code to search that
> > separate field if we detect a single character Han query.  This is rather
> > kludgey.  As an alternative approach to dealing with single character Han
> > queries, would it be possible to add an optional flag to the
> > CJKBigramFilter to tell it to also output unigrams?
> >
> > That way on indexing we could set the flag so that both unigrams and
> bigrams
> > would be indexed.  On querying we would not set the flag so that the
> current
> > logic which outputs bigrams unless there is a single Han character (in
> which
> > case that gets output) would take care of queries containing a single Han
> > unigram.
> >
> > This is somewhat analogous to the flags in LUCENE-1370 for the
> > ShingleFilter.
> >
> > If this makes sense I'll open a JIRA issue.
> >
> > Tom Burton-West
>
>
>
> --
> lucidimagination.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

