Outdated information on JVM heap sizes in Solr 8.3 documentation?

2020-02-14 Thread Tom Burton-West
Hello, In the section on JVM tuning in the Solr 8.3 documentation ( https://lucene.apache.org/solr/guide/8_3/jvm-settings.html#jvm-settings) there is a paragraph which cautions about setting heap sizes over 2 GB: "The larger the heap the longer it takes to do garbage collection. This can mean

Re: loadOnStartup=false doesn't appear to work for Solr 6.6

2018-08-17 Thread Tom Burton-West
ad (autowarming > in solrconfig doesn't count). > > On Fri, Aug 17, 2018 at 8:57 AM, Tom Burton-West > wrote: > > Hello, > > > > I'm not using SolrCloud and want to have some cores not load when Solr > > starts up. > > I tried loadOnStartup=false, but the co

loadOnStartup=false doesn't appear to work for Solr 6.6

2018-08-17 Thread Tom Burton-West
Hello, I'm not using SolrCloud and want to have some cores not load when Solr starts up. I tried loadOnStartup=false, but the cores seem to start up anyway. Is the loadOnStartup parameter still usable with Solr 6.6 or does the documentation need updating? Or is there something else I need to

Re: Can the export handler be used with the edismax or dismax query handler

2018-07-29 Thread Tom Burton-West
> > https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html > > > > Best, > > Erick > > > > On Fri, Jul 27, 2018 at 9:47 AM, Tom Burton-West > > wrote: > > > Thanks Joel, > > > > > > My use case is that I have a complex edi

Re: Can the export handler be used with the edismax or dismax query handler

2018-07-27 Thread Tom Burton-West
score at this time. It only > supports sorting on fields. So the edismax qparser won't currently work > with the export handler. > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Thu, Jul 26, 2018 at 5:52 PM, Tom Burton-West > wrote: > > > Hello all, > >

Can the export handler be used with the edismax or dismax query handler

2018-07-26 Thread Tom Burton-West
Hello all, I am completely new to the export handler. Can the export handler be used with the edismax or dismax query handler? I tried using local params : q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50' mm='100%25' tie='0.9' } art" which does not seem to be working. Tom

Error in Solr 6.6 Example schemas re: DocValues for StrField type must be single-valued?

2017-08-15 Thread Tom Burton-West
/DocValuesType.html Is the comment in the example schema file completely wrong, or is there some issue with using a docValues with a multivalued StrField? Tom Burton-West https://www.hathitrust.org/blogs/large-scale-search

Re: Solr Support for BM25F

2016-04-18 Thread Tom Burton-West
Hi David, It may not matter for your use case but just in case you really are interested in the "real BM25F" there is a difference between configuring K1 and B for different fields in Solr and a "real" BM25F implementation. This has to do with Solr's model of fields being mini-documents (i.e.

Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Tom Burton-West
Hello all, The last time I worked with changing Similarities was with Solr 4.1 and at that time, it was possible to simply change the schema to specify the use of a different Similarity without re-indexing. This allowed me to experiment with several different ranking algorithms without having to

Re: How to configure Solr PostingsFormat block size

2015-03-12 Thread Tom Burton-West
Hi Hoss, I created a wrapper class, compiled a jar and included an org.apache.lucene.codecs.Codec file in META-INF/services in the jar file with an entry for the wrapper class :HTPostingsFormatWrapper. I created a collection1/lib directory and put the jar there. (see below) I'm getting the

Re: Basic Multilingual search capability

2015-02-25 Thread Tom Burton-West
Hi Rishi, As others have indicated Multilingual search is very difficult to do well. At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to deal with having materials in 400 languages. We also added the CJKBigramFilter to get better precision on CJK queries. We don't use stop

Optimize maxSegments=2 not working right with Solr 4.10.2

2015-02-23 Thread Tom Burton-West
Hello, We normally run an optimize with maxSegments=2 after our daily indexing. This has worked without problem on Solr 3.6. We recently moved to Solr 4.10.2 and on several shards the optimize completed with no errors in the logs, but left more than 2 segments. We send this xml to Solr

Re: Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
Thanks Hoss, Protection from misconfiguration and/or starting separate solr instances pointing to the same index dir I can understand. The current documentation on the wiki and in the ref guide (along with just enough understanding of Solr/Lucene indexing to be dangerous) left me wondering if

Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
Hello, We don't want to use locktype=native (we are using NFS) or locktype=simple (we mount a read-only snapshot of the index on our search servers and with locktype=simple, Solr refuses to start up because it sees the lock file.) However, we don't quite understand the warnings about using

Solr example for Solr 4.10.2 gives warning about Multiple request handlers with same name

2015-01-16 Thread Tom Burton-West
Hello, I'm running Solr 4.10.2 out of the box with the Solr example. i.e. ant example cd solr/example java -jar start.jar in /example/log At start-up the example gives this message in the log: WARN - 2015-01-16 12:31:40.895; org.apache.solr.core.RequestHandlers; Multiple requestHandler

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Michael and Hoss, assuming I've written the subclass of the postings format, I need to tell Solr to use it. Do I just do something like: <fieldType name="ocr" class="solr.TextField" postingsFormat="MySubclass" /> Is there a way to set this for all fieldtypes or would that require writing a

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Hoss, This is starting to sound pretty complicated. Are you saying this is not doable with Solr 4.10? ...or at least: that's how it *should* work :) makes me a bit nervous about trying this on my own. Should I open a JIRA issue or am I probably the only person with a use case for

How to configure Solr PostingsFormat block size

2015-01-12 Thread Tom Burton-West
Hello all, Our indexes have around 3 billion unique terms, so for Solr 3, we set TermIndexInterval to about 8 times the default. The net effect of this is to reduce the size of the in-memory index to about 1/8th. (For background see for
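
The arithmetic behind that message can be sketched roughly as follows. This is only a back-of-the-envelope model, assuming the Lucene 3.x behaviour of holding every termIndexInterval-th term in memory and the Lucene 3.x default interval of 128; the function name is made up for illustration and is not a Solr or Lucene API.

```python
# Back-of-the-envelope model, not Lucene code.
# Assumption: Lucene 3.x keeps every termIndexInterval-th term in memory,
# and the default termIndexInterval is 128.
DEFAULT_TERM_INDEX_INTERVAL = 128


def indexed_terms_in_memory(unique_terms: int, term_index_interval: int) -> int:
    """Approximate number of terms held in the in-memory terms index."""
    return unique_terms // term_index_interval


unique_terms = 3_000_000_000  # "around 3 billion unique terms" from the message

default_count = indexed_terms_in_memory(unique_terms, DEFAULT_TERM_INDEX_INTERVAL)
tuned_count = indexed_terms_in_memory(unique_terms, 8 * DEFAULT_TERM_INDEX_INTERVAL)

print("terms in memory at default interval:", default_count)
print("terms in memory at 8x the default:  ", tuned_count)
print("reduction factor:", round(default_count / tuned_count, 2))
```

The model ignores per-term overhead and term length, so it only illustrates why multiplying the interval by 8 shrinks the in-memory structure to roughly one eighth of its default size.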

Re: Details on why ConcurrentUpdateSolrServer is recommended for maximum index performance

2014-12-12 Thread Tom Burton-West
Thanks everybody for the information. Shawn, thanks for bringing up the issues around making sure each document is indexed ok. With our current architecture, that is important for us. Yonik's clarification about streaming really helped me to understand one of the main advantages of CUSS: When

Re: Details on why ConcurrentUpdateSolrServer is recommended for maximum index performance

2014-12-11 Thread Tom Burton-West
Thanks Eric, That is helpful. We already have a process that works similarly. Each thread/process that sends a document to Solr waits until it gets a response in order to make sure that the document was indexed successfully (we log errors and retry docs that don't get indexed successfully),

Details on why ConcurrentUpdateSolrServer is recommended for maximum index performance

2014-12-10 Thread Tom Burton-West
Hello all, In the example schema.xml for Solr 4.10.2 this comment is listed under the PERFORMANCE NOTE: For maximum indexing performance, use the ConcurrentUpdateSolrServer java client. Is there some documentation somewhere that explains why this will maximize indexing performance? In

Re: Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-24 Thread Tom Burton-West
Thanks Hoss, Just opened SOLR-6560 and attached a patch which removes the offending section from the example solrconfig.xml file. We suspect that with the much more efficient block and FST based Solr 4 default postings format that the need to mess with the parameters in order to reduce memory

queryResultMaxDocsCached vs queryResultWindowSize

2014-09-23 Thread Tom Burton-West
Hello, queryResultWindowSize sets the number of documents to cache for each query in the queryResult cache. So if you normally output 10 results per page, and users don't go beyond page 3 of results, you could set queryResultWindowSize to 30 and the second and third page requests will read
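
The paging example in that message can be modelled with a few lines. This is a simplified sketch, assuming the requested range (start + rows) is rounded up to the next multiple of queryResultWindowSize before being cached; it ignores queryResultMaxDocsCached and everything else Solr does, and the function names are invented for illustration.

```python
# Simplified model of the queryResult cache window, not Solr code.
# Assumption: start + rows is rounded up to a multiple of the window size.

def cached_window(start: int, rows: int, window_size: int) -> int:
    """Number of leading document ids collected and cached for one request."""
    requested = start + rows
    if requested <= 0:
        return 0
    return ((requested - 1) // window_size + 1) * window_size


def served_from_cache(start: int, rows: int, cached_docs: int) -> bool:
    """True if a later page falls entirely inside the cached window."""
    return start + rows <= cached_docs


window = cached_window(start=0, rows=10, window_size=30)  # first page fills the cache
for page in range(1, 5):
    start = (page - 1) * 10
    print("page", page, "served from cache:", served_from_cache(start, 10, window))
```

Under these assumptions pages 2 and 3 fall inside the 30-document window filled by the first request, while page 4 does not.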

How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-17 Thread Tom Burton-West
The Solr wiki says A repeated question is how can I have the original term contribute more to the score than the stemmed version? In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this functionality. https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming

Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-16 Thread Tom Burton-West
Hello, I think the documentation and example files for Solr 4.x need to be updated. If someone will let me know I'll be happy to fix the example and perhaps someone with edit rights could fix the reference guide. Due to dirty OCR and over 400 languages we have over 2 billion unique terms in our

spam detection issue on sending legitimate mail to Solr list

2014-09-15 Thread Tom Burton-West
(6.2) exceeded threshold (HTML_MESSAGE,RCVD_IN_DNSWL_ LOW,SPF_NEUTRAL,URIBL_SBL Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search

Re: How to implement multilingual word components fields schema?

2014-09-15 Thread Tom Burton-West
Hi Ilia, I see that Trey answered your question about how you might stack language specific filters in one field. If I remember correctly, his approach assumes you have identified the language of the query. That is not the same as detecting the script of the query and is much harder. Trying to

Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Tom Burton-West
/10.1145/2600428.2609622 Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search On Fri, Sep 5, 2014 at 10:06 AM

Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-21 Thread Tom Burton-West
Hi Ken, Given the comments which seemed to describe using NRT for the opposite of our use case, I just set our Solr 4 to use the solr.MMapDirectoryFactory. Didn't bother to test whether NRT would be better for our use case, mostly because it didn't sound like there was an advantage and I've

Re: tf and very short text fields

2014-04-04 Thread Tom Burton-West
Thanks Marcus, I was thinking about normalization and was absolutely wrong about setting K1 to zero. I should have taken a look at the algorithm and walked through setting K=0. (This is easier to do looking at the formula in wikipedia http://en.wikipedia.org/wiki/Okapi_BM25 than walking through

Re: Analysis of Japanese characters

2014-04-03 Thread Tom Burton-West
Hi Shawn, For an input of 田中角栄 the bigram filter works like you described, and what I would expect. If I add a space at the point where the ICU tokenizer would have split them anyway, the bigram filter output is very different. If I'm understanding what you are reporting, I suspect this is

Re: tf and very short text fields

2014-04-03 Thread Tom Burton-West
Hi Markus and Wunder, I'm missing the original context, but I don't think BM25 will solve this particular problem. The k1 parameter sets how quickly the contribution of tf to the score falls off with increasing tf. It would be helpful for making sure really long documents don't get too high a
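
The role of k1 described above can be seen by evaluating the term-frequency part of the standard Okapi BM25 formula. This is a sketch of the textbook formula only, not Lucene's BM25Similarity; the default values and the document lengths are illustrative assumptions.

```python
# Term-frequency component of textbook Okapi BM25 (not Lucene's implementation).
# Assumptions: k1=1.2 and b=0.75 as illustrative defaults, and a document of
# average length unless stated otherwise.

def bm25_tf(tf: float, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 100.0, avg_doc_len: float = 100.0) -> float:
    """Contribution of term frequency to the score, before multiplying by idf."""
    if tf <= 0:
        return 0.0
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + length_norm)


for tf in (1, 2, 5, 10, 50):
    print(tf, round(bm25_tf(tf), 3), round(bm25_tf(tf, k1=0.0), 3))
```

With k1 > 0 the contribution grows with tf but levels off below k1 + 1. With k1 = 0 every matching document gets the same contribution regardless of tf or length, which is why setting k1 to zero removes tf from the score rather than tuning it.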

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn, I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you perhaps give a few examples of the problem? Have you looked at the flags for the CJKBigramFilter? You can tell it to make bigrams of different Japanese

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn, I may still be missing your point. Below is an example where the ICUTokenizer splits Now, I'm beginning to wonder if I really understand what those flags on the CJKBigramFilter do. The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them back together into bigrams. I

Re: Indexing large documents

2014-03-19 Thread Tom Burton-West
be appropriate for your use case as Otis suggested. In our use case sometimes this is appropriate, but we are investigating the possibility of other methods of scoring the group based on a more flexible function of the scores of the members (i.e., scoring a book based on a function of the scores of its chapters). Tom Burton

Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
Hello, I'm running the example setup for Solr 4.6.1. In the ../example/solr/ directory, I set up a second core. I wanted to send updates to that core. I looked at .../exampledocs/post.sh and expected to see the URL as: URL= http://localhost:8983/solr/collection1/update However it does

Re: Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
Thanks Hoss, hardcoded default of collection1 is still used for backcompat when there is no defaultCoreName configured by the user. Aha, it's hardcoded if there is nothing set in a config. No wonder I couldn't find it by grepping around the config files. I'm still trying to sort out the old

Re: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Tom Burton-West
of something like that for the INEX book track. I'll see if I can find the code and if it is in any shape to share. Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
...@elyograg.org wrote: On 8/27/2013 4:29 PM, Tom Burton-West wrote: According to the README.txt in solr-4.4.0/solr/example/solr/** collection1, all we have to do is create a collection1/lib directory and put whatever jars we want in there. .. /lib. If it exists, Solr will load any Jars

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
My point in the previous e-mail was that following the instructions in the documentation does not seem to work. The workaround I found was to simply change the name of the collection1/lib directory to collection1/foobar and then include it in solrconfig.xml. <lib dir="./foobar" /> This

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
optional configuration files would also be kept here. data/ This directory is the default location where Solr will keep your ... lib/ On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 9:34 AM, Tom Burton-West wrote: I think I am running

ICUTokenizer class not found with Solr 4.4

2013-08-27 Thread Tom Burton-West
Hello all, According to the README.txt in solr-4.4.0/solr/example/solr/collection1, all we have to do is create a collection1/lib directory and put whatever jars we want in there. .. /lib. If it exists, Solr will load any Jars found in this directory and use them to resolve any

How to set discountOverlaps=true in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
If I am using solr.SchemaSimilarityFactory to allow different similarities for different fields, do I set discountOverlaps=true on the factory or per field? What is the syntax? The below does not seem to work similarity class=solr.BM25SimilarityFactory discountOverlaps=true similarity

Re: How to set discountOverlaps=true in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
Thanks Markus, I set it, but it seems to make no difference in the score or statistics listed in the debugQuery or in the ranking. I'm using a field with CommonGrams and a huge list of common words, so there should be a huge difference in the document length with and without discountOverlaps.

Re: How to set discountOverlaps=true in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
I should have said that I have set it both to true and to false and restarted Solr each time and the rankings and info in the debug query showed no change. Does this have to be set at index time? Tom

Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hello, I am running solr 4.2.1 on 3 shards and have about 365 million documents in the index total. I sent a query asking for 1 million rows at a time, but I keep getting an error claiming that there is an invalid version or data not in javabin format (see below) If I lower the number of rows

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
=10 works for you, consider yourself lucky! That said, there is sometimes talk of supporting streaming, which presumably would allow access to all results, but chunked/paged in some way. -- Jack Krupansky -Original Message- From: Tom Burton-West Sent: Thursday, July 25, 2013 1:39 PM

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Thanks Shawn, I was confused by the error message: Invalid version (expected 2, but 60) or the data in not in 'javabin' format Your explanation makes sense. I didn't think about what the shards have to send back to the head shard. Now that I look in my logs, I can see the posts that the shards
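
The point about what the shards have to send back can be sketched with simple arithmetic. This is a rough model, assuming each shard is asked for its own top (start + rows) entries so that the coordinating node can merge them; the function names are invented and no Solr API is involved.

```python
# Rough model of the first phase of a distributed query, not Solr code.
# Assumption: every shard returns its top (start + rows) ids and sort values.

def entries_per_shard(start: int, rows: int) -> int:
    """Entries one shard must return so the global page can be merged."""
    return start + rows


def entries_merged(start: int, rows: int, num_shards: int) -> int:
    """Total entries the coordinating node receives and merges."""
    return num_shards * entries_per_shard(start, rows)


# Figures from the thread: 3 shards, 1 million rows per request.
print("per shard:", entries_per_shard(start=0, rows=1_000_000))
print("merged:   ", entries_merged(start=0, rows=1_000_000, num_shards=3))
```

Under this model a single request for 1 million rows makes the coordinating node handle 3 million entries, and the cost grows further as start increases.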

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
path=/select params={fl=vol_id&indent=on&start=3400&q=*:*&rows=100} hits=119220943 status=0 QTime=58699 On Thu, Jul 25, 2013 at 6:18 PM, Shawn Heisey s...@elyograg.org wrote: On 7/25/2013 3:09 PM, Tom Burton-West wrote: Thanks Shawn, I was confused by the error message: Invalid version

Re: What does too many merges...stalling in indexwriter log mean?

2013-07-12 Thread Tom Burton-West
. Tom On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey s...@elyograg.org wrote: On 7/11/2013 1:47 PM, Tom Burton-West wrote: We are seeing the message too many merges...stalling in our indexwriter log. Is this something to be concerned about? Does it mean we need to tune something in our

What does too many merges...stalling in indexwriter log mean?

2013-07-11 Thread Tom Burton-West
Hello, We are seeing the message too many merges...stalling in our indexwriter log. Is this something to be concerned about? Does it mean we need to tune something in our indexing configuration? Tom

When not to use NRTCachingDirectory and what to use instead.

2013-07-10 Thread Tom Burton-West
Hello all, The default directory implementation in Solr 4 is the NRTCachingDirectory (in the example solrconfig.xml file, see below). The Javadoc for NRTCachingDirectory ( http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) says: This

Solr 4.x replacement for termsIndexDivisor

2013-05-21 Thread Tom Burton-West
Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again ). In Solr 3.6 and previous we needed to reduce the memory used for storing the in-memory representation of the tii file. We

Re: Slow queries for common terms

2013-03-22 Thread Tom Burton-West
Hi David and Jan, I wrote the blog post, and David, you are right, the problem we had was with phrase queries because our positions lists are so huge. Boolean queries don't need to read the positions lists. I think you need to determine whether you are CPU bound or I/O bound. It is possible

ngrams or truncation for multilingual searching in Solr

2013-02-05 Thread Tom Burton-West
York, NY, USA, 75-82. DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957 Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Hello all, I have a one term query: ocr:aardvark When I look at the explain output, for some matches the queryNorm and fieldWeight are shown and for some matches only the weight is shown with no query norm. (See below) What explains the difference? Shouldn't the queryNorm be applied to each

Re: Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Thanks Hoss, Yes it is a distributed query. Tom On Fri, Jan 25, 2013 at 2:32 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I have a one term query: ocr:aardvark When I look at the explain : output, for some matches the queryNorm and fieldWeight are shown and for : some matches

coord missing from debugQuery explain?

2013-01-08 Thread Tom Burton-West
Hello, I'm trying to understand some Solr relevance issues using debugQuery=on, but I don't see the coord factor listed anywhere in the explain output. My understanding is that the coord factor is not included in either the querynorm or the fieldnorm. What am I missing? Tom

Best practices for Solr highlighter for CJK

2013-01-02 Thread Tom Burton-West
. i.e. ABC = searched as AB BC only AB gets highlighted even if the matching string is ABC. (Where ABC are chinese characters such as 大亚湾 = searched as 大亚 亚湾, but only 大亚 is highlighted rather than 大亚湾) Is there some highlighting parameter that might fix this? Tom Burton-West
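
The bigram behaviour described in that message can be illustrated with a small sketch. It only mimics overlapping character bigrams and their offsets; it is not the CJKBigramFilter or the Solr highlighter, and the function names are invented for illustration.

```python
# Illustration of overlapping character bigrams and their offsets.
# Not the CJKBigramFilter or any Solr highlighter; names are hypothetical.

def bigrams_with_offsets(text: str) -> list[tuple[str, int, int]]:
    """Return (bigram, start_offset, end_offset) for each overlapping pair."""
    return [(text[i:i + 2], i, i + 2) for i in range(len(text) - 1)]


def merged_span(spans: list[tuple[str, int, int]]) -> tuple[int, int]:
    """Smallest single span covering every matched bigram."""
    return (min(s for _, s, _ in spans), max(e for _, _, e in spans))


tokens = bigrams_with_offsets("大亚湾")
print(tokens)               # the two bigrams the query is searched as
print(merged_span(tokens))  # span that would cover the whole matched string
```

The two bigrams overlap on the middle character, so highlighting only the first one covers offsets 0 to 2, while the full match runs from 0 to 3.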

ICUTokenizer labels number as Han character?

2012-12-19 Thread Tom Burton-West
Hello, Don't know if the Solr admin panel is lying, or if this is a weird bug. The string: 1986年 gets analyzed by the ICUTokenizer with 1986 being identified as type:NUM and script:Han. Then the CJKBigram filter identifies 1986 as type:Num and script:Han and 年 as type:Single and script: Common.

configuring per-field similarity in Solr 4: the global similarity does not support it

2012-12-17 Thread Tom Burton-West
Hello, I have Solr 4 configured with several fields using different similarity classes according to: http://wiki.apache.org/solr/SchemaXml#Similarity However, I get this error message: FieldType 'DFR' is configured with a similarity, but the global similarity does not support it: class

How to configure termvectors to not store positions/offsets

2012-12-13 Thread Tom Burton-West
Hello, As I understand it, MoreLikeThis only requires term frequencies, not positions or offsets. So in order to save disk space I would like to store termvectors, but without positions and offsets. Is there documentation somewhere that 1) would confirm that MoreLikeThis only needs term

Re: BM25 model for solr 4?

2012-11-15 Thread Tom Burton-West
Hello Floyd, There is a ton of research literature out there comparing BM25 to vector space. But you have to be careful interpreting it. BM25 originally beat the SMART vector space model in the early TRECs because it did better tf and length normalization. Pivoted Document Length

URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
Hello, I would like to send a request to the FieldAnalysisRequestHandler. The javadoc lists the parameter names such as analysis.field, but sending those as URL parameters does not seem to work: mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly leaving out the analysis doesn't
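
For comparison, a request URL can be assembled as in the sketch below. It assumes the handler reads the parameters analysis.fieldname, analysis.fieldvalue and analysis.query, which is what Solr's AnalysisParams constants define as far as I know; the base URL and the helper name are hypothetical, so check the parameter names against the javadoc for the version in use.

```python
# Sketch of building a field-analysis request URL.
# Assumptions: parameter names analysis.fieldname / analysis.fieldvalue /
# analysis.query, and a hypothetical base URL; verify both for your version.
from typing import Optional
from urllib.parse import urlencode


def field_analysis_url(base_url: str, field_name: str, field_value: str,
                       query: Optional[str] = None) -> str:
    """Build a URL for the /analysis/field handler."""
    params = {
        "analysis.fieldname": field_name,    # schema field whose analyzer is used
        "analysis.fieldvalue": field_value,  # text run through the index analyzer
    }
    if query is not None:
        params["analysis.query"] = query     # text run through the query analyzer
    return f"{base_url}/analysis/field?{urlencode(params)}"


url = field_analysis_url("http://localhost:8983/solr", "title", "fire-fly")
print(url)
```

Using urlencode keeps the "&" separators and escaping correct, which is easy to get wrong when typing the URL by hand.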

Re: URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
analysis.jsp like before? So maybe try using something like burpsuite and just using the analysis UI in your browser to see what requests its sending. On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West tburt...@umich.edu wrote: Hello, I would like to send a request

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Tom Burton-West
Hi Markus, No answers, but I am very interested in what you find out. We currently index all languages in one index, which presents different IDF issues, but are interested in exploring alternatives such as the one you describe. Tom Burton-West http://www.hathitrust.org/blogs/large-scale

Solr 4.0 error message: Unsupported ContentType: Content-type:text/xml

2012-11-02 Thread Tom Burton-West
Hello all, Trying to get Solr 4.0 up and running with a port of our production 3.6 schema and documents. We are getting the following error message in the logs: org.apache.solr.common.SolrException: Unsupported ContentType: Content-type:text/xml Not in: [application/xml, text/csv, text/json,

Re: Solr 4.0 error message: Unsupported ContentType: Content-type:text/xml

2012-11-02 Thread Tom Burton-West
it sounds as if the literal text Content-type: is included in your content type. How exactly are you setting/sending the content type? -- Jack Krupansky -Original Message- From: Tom Burton-West Sent: Friday, November 02, 2012 5:30 PM To: solr-user@lucene.apache.org Subject: Solr 4.0

Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
name=parsedquerytext:fire text:fly/str If a correct dismax query was being sent to Solr the parsedquery would have something like the following: str name=parsedquery(+DisjunctionMaxQuery(((text:fire text:fly))) Tom Burton-West

Re: Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
be defType=dismax Erik On Sep 13, 2012, at 12:22 , Tom Burton-West wrote: Just want to check I am not doing something obviously wrong before I file a bug ticket. In Solr 4.0Beta, in the admin UI in the Query panel, there is a checkbox option to check dismax or edismax

Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Hello all, Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again). In Solr 3.6 and previous we needed to reduce the memory used for storing the in-memory representation of the tii file. We

Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
: these parameters don't make sense for it. On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West tburt...@umich.edu wrote: Hello all, Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again

Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
...@gmail.com wrote: On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West tburt...@umich.edu wrote: Thanks Robert, I'll have to spend some time understanding the default codec for Solr 4.0. Did I miss something in the changes file? http://lucene.apache.org/core/4_0_0-BETA/ see the file formats

Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
I removed the string collection1 from my solr.xml file in solr home and modified my solr.xml file as follows: <cores adminPath="/admin/cores" defaultCoreName="foobar1" host="${host:}" hostPort="${jetty.port:}" zkClientTimeout="${zkClientTimeout:15000}"> <core name="foobarcorename" instanceDir="." /> </cores>

Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
I did not describe the problems correctly. I have 3 solr shards with solr homes .../solrs/4.0/1 .../solrs/4.0/2 and .../solrs/4.0/2solrs/3 For shard 1 I have a solr.xml file with the modifications described in the previous message. For that instance, it appears that the problem is that the

Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
-3753 On Thu, Aug 23, 2012 at 1:04 PM, Tom Burton-West tburt...@umich.edu wrote: I did not describe the problems correctly. I have 3 solr shards with solr homes .../solrs/4.0/1 .../solrs/4.0/2 and .../solrs/4.0/2solrs/3 For shard 1 I have a solr.xml file with the modifications described

Re: Solr 4.0 Beta missing example/conf files?

2012-08-23 Thread Tom Burton-West
, Erik On Aug 22, 2012, at 16:32 , Tom Burton-West wrote: Thanks Markus! Should the README.txt file in solr/example be updated to reflect this? Is that something I need to enter a JIRA issue for? Tom On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma markus.jel

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
thread on Solr3.6 Field collapsing Thanks, Tirthankar -Original Message- From: Tom Burton-West tburt...@umich.edu Date: Tue, 21 Aug 2012 18:39:25 To: solr-user@lucene.apache.orgsolr-user@lucene.apache.org Reply-To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Tirthankar, Can you give me a quick summary of what won't work and why? I couldn't figure it out from looking at your thread. You seem to have a different issue, but maybe I'm missing something here. Tom On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee tchatter...@commvault.com

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance and Tirthankar, We are currently using Solr 3.6. I tried a search across our current 12 shards grouping by book id (record_no in our schema) and it seems to work fine (the query with the actual urls for the shards changed is appended below.) I then searched for the record_no of the

Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Hello, Usually in the example/solr file in Solr distributions there is a populated conf file. However in the distribution I downloaded of solr 4.0.0-BETA, there is no /conf directory. Has this been moved somewhere? Tom ls -l apache-solr-4.0.0-BETA/example/solr total 107 drwxr-sr-x 2 tburtonw

Re: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Thanks Markus! Should the README.txt file in solr/example be updated to reflect this? Is that something I need to enter a JIRA issue for? Tom On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - The example has been moved to collection1/ -Original

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Thanks Tirthankar, So the issue is memory use for sorting. I'm not sure I understand how sorting of grouping fields is involved with the defaults and field collapsing, since the default sorts by relevance not grouping field. On the other hand I don't know much about how field collapsing is

Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-21 Thread Tom Burton-West
users the choice of a list of the most relevant pages, or a list of the books containing the most relevant pages. We have approximately 3 billion pages. Does anyone have experience using field collapsing on this sort of scale? Tom Tom Burton-West Information Retrieval Programmer Digital Library

Re: edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-07-02 Thread Tom Burton-West
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which also lists a couple of other related mailing list posts. On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West tburt...@umich.edu wrote: Hello, My previous e-mail with a CJK example has received no replies. I verified

edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-06-28 Thread Tom Burton-West
, but want to find out if I am missing something here. Details of several queries are appended below. Tom Burton-West edismax query mm=2 query with hyphenated word [fire-fly] <lst name="debug"> <str name="rawquerystring">{!edismax mm=2}fire-fly</str> <str name="querystring">{!edismax mm=2}fire-fly</str> <str name

edismax parser ignores mm parameter when tokenizer splits tokens (i.e. CJK)

2012-06-26 Thread Tom Burton-West
] turns into a Boolean OR query for ( [two] OR [thirds] ). Is there some way to tell the edismax query parser to stick with mm=100%? Appended below is the debugQuery output for these two queries and an excerpt from our schema.xml. Tom Tom Burton-West http://www.hathitrust.org/blogs/large-scale

What is the docs number in Solr explain query results for fieldnorm?

2012-05-25 Thread Tom Burton-West
, maxDocs=17707) 0.625 = fieldNorm(field=ocr, doc=16624) </str> Tom Burton-West <str name="78562575E066497D-518"> 0.42061833 = (MATCH) fieldWeight(ocr:the in 8396), product of: 7.071068 = tf(termFreq(ocr:the)=50) 1.087715 = idf(docFreq=16219, maxDocs=17707) 0.0546875 = fieldNorm(field

boost not showing up in Solr 3.6 debugQueries?

2012-05-17 Thread Tom Burton-West
and this is one of the queries from our log. Tom Burton-West <lst name="debug"> <str name="rawquerystring"> 兵にな^1000 OR hanUnigrams:兵にな</str> <str name="querystring"> 兵にな^1000 OR hanUnigrams:兵にな</str> <str name="parsedquery">((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵</str> <str name="parsedquery_toString">((+ocr:兵に +ocr:にな

Re: Solr RAM Requirements

2010-03-17 Thread Tom Burton-West
. You also might want to take a look at the free memory when you start up Solr and then watch as it fills up as you get more queries (or send cache-warming queries). Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search KaktuChakarabati wrote: My question was mainly about

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example, some of the CJK material ends up with Cyrillic characters. (except we
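
A minimal sketch of the "unlikely mixes of character blocks" idea, under stated assumptions: Python's standard library has no Unicode block lookup, so this uses the first word of each character's Unicode name (for example `CJK`, `CYRILLIC`, `LATIN`) as a rough stand-in for its block. The function names and the list of unlikely pairs are hypothetical and are not taken from the thread.

```python
import unicodedata


def scripts_in_token(token):
    """Return rough script labels for the alphabetic characters in a token.

    The label is the first word of the Unicode character name, which is
    only an approximation of the character's block or script.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def looks_suspicious(token, unlikely_pairs=(("CJK", "CYRILLIC"),)):
    """Flag a token that mixes two scripts assumed to be an unlikely pair."""
    found = scripts_in_token(token)
    return any(a in found and b in found for a, b in unlikely_pairs)


print(looks_suspicious("兵д"))    # CJK ideograph next to a Cyrillic letter
print(looks_suspicious("fire"))   # plain Latin token
```

Which pairs count as unlikely would have to be tuned against the corpus, since some legitimate text mixes scripts.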

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to ngram statistics for the entire corpus. hossman wrote: Since you are dealing with multiple languages, and multiple variant usages of languages

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
We've been thinking about running some kind of a classifier against each book to select books with a high percentage of dirty OCR for some kind of special processing. Haven't quite figured out a multilingual feature set yet other than the punctuation/alphanumeric and character block ideas

Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-19 Thread Tom Burton-West
Hi Glen, I'd love to use LuSql, but our data is not in a db. It's 6-8 TB of files containing OCR (one file per page for about 1.5 billion pages) gzipped on disk, which are gunzipped, concatenated, and converted to Solr documents on-the-fly. We have multiple instances of our Solr document producer
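
The gunzip-and-concatenate step described above might look roughly like the following sketch. It assumes UTF-8 page files and a document with an `id` field and an `ocr` field; the function name, the newline separator, and the exact document shape are assumptions for illustration, not the actual document producer.

```python
import gzip


def build_document(book_id, page_paths):
    """Gunzip each page file in order and join the text into one document.

    Returns a plain dict; the field names are hypothetical placeholders.
    """
    pages = []
    for path in page_paths:
        # "rt" opens the gzip file in text mode and decodes it while reading
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            pages.append(fh.read())
    return {"id": book_id, "ocr": "\n".join(pages)}
```

Reading the pages one at a time keeps only a single book's text in memory, which matters when the source is one file per page.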

Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-18 Thread Tom Burton-West
Thanks Otis, I don't know enough about Hadoop to understand the advantage of using Hadoop in this use case. How would using Hadoop differ from distributing the indexing over 10 shards on 10 machines with Solr? Tom Otis Gospodnetic wrote: Hi Tom, 32MB is very low, 320MB is medium, and

Re: persistent cache

2010-02-15 Thread Tom Burton-West
Hi Tim, Due to our performance needs we optimize the index early in the morning and then run the cache-warming queries once we mount the optimized index on our servers. If you are indexing and serving using the same Solr instance, you shouldn't have to re-run the cache warming queries when you

Re: persistent cache

2010-02-12 Thread Tom Burton-West
overview of the issues is the paper by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of Caching on Search Engines ) Tom Burton-West Digital Library Production Service University of Michigan Library

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West
Thanks Lance and Michael, We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from Solr admin panel appended below) I tried running CheckIndex (with the -ea: switch ) on one of the shards. CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger segment

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West
Thanks Michael, I'm not sure I understand. CheckIndex reported a negative number: -16777214. But in any case we can certainly try running CheckIndex from a patched Lucene. We could also run a patched Lucene on our dev server. Tom Yes, the term count reported by CheckIndex is the total

Re: Thanks Robert!

2010-02-05 Thread Tom Burton-West
+1 And thanks to you both for all your work on CommonGrams! Tom Burton-West Jason Rutherglen-2 wrote: Robert, thanks for redoing all the Solr analyzers to the new API! It helps to have many examples to work from, best practices so to speak.
