Re: Parsing dating during indexing - Year Only
I'm not sure I understand your question ... if you know that you are only ever going to have the 'year' then why not just index the year as an int? A TrieDateField isn't really of any use to you, because normal date type usage (date math, date ranges) is useless when you don't have any real date values (ie: it's ambiguous whether 2007 should match just_the_year:[2006-06-01T00:00:00Z TO 2007-06-01T00:00:00Z]).

If you really need a true date field because *most* of your documents have real dates, but only sometimes do you ingest documents with only the year, and when you ingest documents like this you want to assume some fixed month/day/hour/etc... then you can easily do this with update processors ... consider a chain of:

RegexReplaceProcessorFactory: just_the_year: ^(\d+)$ -> $1-01-01T00:00:00Z
CloneFieldUpdateProcessor: just_the_year -> real_date_field
FirstFieldValueUpdateProcessorFactory: real_date_field (if a doc already had a value in the real field, ignore the new year-only value)

https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html
https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/FirstFieldValueUpdateProcessorFactory.html

: Date: Fri, 19 Jun 2015 13:57:04 -0700 (MST)
: From: levanDev levandev9...@gmail.com
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Parsing dating during indexing - Year Only
:
: Hello,
:
: Example csv doc has column 'just_the_year' and value '2010':
:
: With the Schema API I can tell the indexing process to treat 'just_the_year' as a date field.
:
: I know that I can update the solrconfig.xml to correctly parse formats such as MM/dd/yyyy (which is awesome) but has anyone tried to convert just the year value to a full date (2010-01-01T00:00:00Z) by updating the solrconfig.xml?
:
: I know it's possible to import csv, do the date transformation, export again and have everything work nicely but it would be cool to reduce the number of steps involved and use the powerful date processor.
:
: Thank you,
: Levan
:
: --
: View this message in context: http://lucene.472066.n3.nabble.com/Parsing-dating-during-indexing-Year-Only-tp4213045.html
: Sent from the Solr - User mailing list archive at Nabble.com.

-Hoss
http://www.lucidworks.com/
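For reference, a minimal solrconfig.xml sketch of the chain Hoss describes could look roughly like the following (untested; the chain name is made up, and the field names just_the_year / real_date_field are taken from the message above):

<updateRequestProcessorChain name="year-to-date">
  <!-- rewrite a bare year like "2010" into "2010-01-01T00:00:00Z" -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">just_the_year</str>
    <str name="pattern">^(\d{4})$</str>
    <str name="replacement">$1-01-01T00:00:00Z</str>
    <!-- assumption: literalReplacement must be false for the $1 back-reference to apply -->
    <bool name="literalReplacement">false</bool>
  </processor>
  <!-- copy the (now full) date string into the real date field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">just_the_year</str>
    <str name="dest">real_date_field</str>
  </processor>
  <!-- if the doc already carried a real date, keep it and drop the cloned year-only value -->
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">real_date_field</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain would then be selected per request with update.chain=year-to-date, or made the default on the update handler.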
Parsing dating during indexing - Year Only
Hello,

Example csv doc has column 'just_the_year' and value '2010':

With the Schema API I can tell the indexing process to treat 'just_the_year' as a date field.

I know that I can update the solrconfig.xml to correctly parse formats such as MM/dd/yyyy (which is awesome) but has anyone tried to convert just the year value to a full date (2010-01-01T00:00:00Z) by updating the solrconfig.xml?

I know it's possible to import csv, do the date transformation, export again and have everything work nicely but it would be cool to reduce the number of steps involved and use the powerful date processor.

Thank you,
Levan

--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-dating-during-indexing-Year-Only-tp4213045.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: CollapseQParserPlugin Incorrect Facet Counts
The CollapsingQParserPlugin does not provide facet counts that are the same as the group.facet feature in Grouping. It provides facet counts that behave like group.truncate. The CollapsingQParserPlugin only collapses the result set. The facet counts are then generated for the collapsed result set by the FacetComponent. This has been a hot topic of late.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 3:54 PM, Carlos Maroto charlie.mar...@gmail.com wrote:

Hi,

We are comparing results between Field Collapsing (group* parameters) and CollapseQParserPlugin. We noticed that some facets are returning incorrect counts. Here are the relevant parameters of one of our test queries:

Field Collapsing:
---
q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&group=true&group.field=groupid&group.facet=true&group.ngroups=true

ngroups = 5964
<lst name="searchcolorfacet"> ... <int name="red">11</int> ... </lst>

CollapseQParserPlugin:
---
q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&fq=%7B!collapse%20field=groupid%7D

numFound = 5964 (same)
<lst name="searchcolorfacet"> ... <int name="red">8</int> ... </lst>

When we change the CollapseQParserPlugin query by adding fq=searchcolorfacet:red, the numFound value is 11, effectively showing all 11 hits with that color. The facet count for red now shows the correct value of 11 as well.

Has anyone seen something similar?

Thanks,
Carlos
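A quick way to see the equivalence Joel describes (a sketch built from Carlos's parameters above, untested): running the grouped query with group.truncate=true in place of group.facet=true should reproduce the collapsed counts, because both compute facets over one representative document per group.

q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet
  &group=true&group.field=groupid&group.truncate=true

With that, red would be expected to come back as 8 (like the collapse query), while group.facet=true counts every group that contains at least one matching document, giving 11.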
Re: CollapseQParserPlugin Incorrect Facet Counts
If you see the last comment on: https://issues.apache.org/jira/browse/SOLR-6143 You'll see there is a discussion starting about adding this feature. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 19, 2015 at 4:14 PM, Joel Bernstein joels...@gmail.com wrote: The CollapsingQParserPlugin does not provide facet counts that are them same as the group.facet feature in Grouping. It provides facet counts that behave like group.truncate. The CollapsingQParserPlugin only collapses the result set. The facets counts are then generated for the collapsed result set by the FacetComponent. This has been a hot topic of late. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 19, 2015 at 3:54 PM, Carlos Maroto charlie.mar...@gmail.com wrote: Hi, We are comparing results between Field Collapsing (group* parameters) and CollapseQParserPlugin. We noticed that some facets are returning incorrect counts. Here are the relevant parameters of one of our test queries: Field Collapsing: --- q=red%20dressfacet=truefacet.mincount=1facet.limit=-1facet.field=searchcolorfacetgroup=truegroup.field=groupidgroup.facet=true group.ngroups=true ngroups = 5964 lst name=searchcolorfacet ... int name=red11/int ... /lst CollapseQParserPlugin: --q=red%20dressfacet=truefacet.mincount=1facet.limit=-1facet.field=searchcolorfacetfq=%7B!collapse%20field=groupid%7D numFound = 5964 (same) lst name=searchcolorfacet ... int name=red8/int ... /lst When we change the CollapseQParserPlugin query by adding fq=searchcolorfacet:red, the numFound value is 11, effectively showing all 11 hits with that color. The facet count for red now shows the correct value of 11 as well. Has anyone seeing something similar? Thanks, Carlos
RE: CollapseQParserPlugin Incorrect Facet Counts
Thanks Joel, I don't know why I was unable to find the understanding collapsing email thread via the search I did on the site but I found it in my own email search now. We'll look into our specific scenario and see if we can find a workaround. Thanks! CARLOS MAROTO M +1 626 354 7750 -Original Message- From: Joel Bernstein [mailto:joels...@gmail.com] Sent: Friday, June 19, 2015 1:18 PM To: solr-user@lucene.apache.org Subject: Re: CollapseQParserPluging Incorrect Facet Counts If you see the last comment on: https://issues.apache.org/jira/browse/SOLR-6143 You'll see there is a discussion starting about adding this feature. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 19, 2015 at 4:14 PM, Joel Bernstein joels...@gmail.com wrote: The CollapsingQParserPlugin does not provide facet counts that are them same as the group.facet feature in Grouping. It provides facet counts that behave like group.truncate. The CollapsingQParserPlugin only collapses the result set. The facets counts are then generated for the collapsed result set by the FacetComponent. This has been a hot topic of late. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 19, 2015 at 3:54 PM, Carlos Maroto charlie.mar...@gmail.com wrote: Hi, We are comparing results between Field Collapsing (group* parameters) and CollapseQParserPlugin. We noticed that some facets are returning incorrect counts. Here are the relevant parameters of one of our test queries: Field Collapsing: --- q=red%20dressfacet=truefacet.mincount=1facet.limit=-1facet.field= searchcolorfacetgroup=truegroup.field=groupidgroup.facet=true group.ngroups=true ngroups = 5964 lst name=searchcolorfacet ... int name=red11/int ... /lst CollapseQParserPlugin: --q=red%20dressfacet=truefacet.minc ount=1facet.limit=-1facet.field=searchcolorfacetfq=%7B!collapse%20 field=groupid%7D numFound = 5964 (same) lst name=searchcolorfacet ... int name=red8/int ... /lst When we change the CollapseQParserPlugin query by adding fq=searchcolorfacet:red, the numFound value is 11, effectively showing all 11 hits with that color. The facet count for red now shows the correct value of 11 as well. Has anyone seeing something similar? Thanks, Carlos
Re: Parsing dating during indexing - Year Only
Hi Chris, Thank you for taking the time to write the detailed response. Very helpful. Dealing with interesting formats in the source data and trying to evaluate various options for our business needs. The second scenario you described (where some values in the date field are just the year) will either come up pretty soon for me or will certainly help someone else dealing with that issue currently. Thank you, Levan -- View this message in context: http://lucene.472066.n3.nabble.com/Parsing-date-during-indexing-Year-Only-tp4213045p4213065.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Parsing dating during indexing - Year Only
Hmm, I can see some things you couldn't do with just using a tint field for the year. Or rather, some things that wouldn't be as convenient.

But this might help:
http://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html

or you can also consider a http://wiki.apache.org/solr/ScriptUpdateProcessor

Best,
Erick

On Fri, Jun 19, 2015 at 1:57 PM, levanDev levandev9...@gmail.com wrote:

Hello,

Example csv doc has column 'just_the_year' and value '2010':

With the Schema API I can tell the indexing process to treat 'just_the_year' as a date field.

I know that I can update the solrconfig.xml to correctly parse formats such as MM/dd/yyyy (which is awesome) but has anyone tried to convert just the year value to a full date (2010-01-01T00:00:00Z) by updating the solrconfig.xml?

I know it's possible to import csv, do the date transformation, export again and have everything work nicely but it would be cool to reduce the number of steps involved and use the powerful date processor.

Thank you,
Levan

--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-dating-during-indexing-Year-Only-tp4213045.html
Sent from the Solr - User mailing list archive at Nabble.com.
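As a rough illustration of Erick's first pointer (untested; the exact format list and the behaviour of a bare "yyyy" pattern are assumptions), ParseDateFieldUpdateProcessorFactory can be configured with a year-only format so that a value like 2010 is parsed with the missing month/day/time fields defaulting:

<processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <str name="defaultTimeZone">UTC</str>
  <arr name="format">
    <str>MM/dd/yyyy</str>
    <!-- assumption: a bare year parses, with missing fields defaulting, to yyyy-01-01T00:00:00Z -->
    <str>yyyy</str>
  </arr>
</processor>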
RE: How to do a Data sharding for data in a database table
As stated previously, using Field Collapsing (group parameters) tends to significantly slow down queries. In my experience, search response gets even worst when: - Requesting facets, which more often than not I do in my query formulation - Asking for the facet counts to be on the groups via the group.facet=true parameter (way worst in some of my use cases that had a lot of distinct values for at least one of the facets) - Queries are matching many hits, i.e. individual counts (hundreds of thousands or more in our case) and total groups counts (in the few thousands) Also stated by someone, switching to CollapseQParserPlugin will likely reduce significantly the response time given its different implementation. Using CollapseQParserPlugin means that you: 1- Have to change how the query gets created 2- May need to change how you consume the Solr response (depending on what you are using today) 3- Will not have the total number of individual hits (before collapsing count) because the numFound returned by the CollapseQParserPlugin represents the total number of groups (like groups.ngroups does) 4- You may have an issue with facet value counts not being exact in the CollapseQParserPlugin response With respect to sharding, there are multiple considerations. The most relevant given your need for grouping is to implement custom routing of documents to shards so that all members of a group are indexed in the same shard, if you can. Otherwise your grouping across shards will have some issues (particularly with counts, I believe.) CARLOS MAROTO http://www.searchtechnologies.com/ M +1 626 354 7750 -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Friday, June 19, 2015 12:08 PM To: solr-user@lucene.apache.org Subject: RE: How to do a Data sharding for data in a database table Also, since you are tuning for relative times, you can tune on the smaller index. Surely, you will want to test at scale. But tuning query, analyzer or schema options is usually easier to do on a smaller index. If you get a 3x improvement at small scale, it may only be 2.5x at full scale. E.g. storing the group field as doc values is one option that can help grouping performance in some cases (at least according to this list, I haven't tried it yet). The number of distinct values of the grouping field is important as well. If there are very many, you may want to try CollapsingQParserPlugin. The point being, some of these options may require reindexing! So, again, it is a much easier and faster process to tune on a smaller index. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, June 19, 2015 2:33 PM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table Do be aware that turning on debug=query adds a load. I've seen the debug component take 90% of the query time. (to be fair it usually takes a much smaller percentage). But you'll see a section at the end of the response if you set debug=all with the time each component took so you'll have a sense of the relative time used by each component. Best, Erick On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang wwang...@gmail.com wrote: As for now, the index size is 6.5 M records, and the performance is good enough. I will re-build the index for all the records (14 M) and test it again with debug turned on. 
Thanks On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson erickerick...@gmail.com wrote: First and most obvious thing to try: bq: the Solr was started with maximal 4G for JVM, and index size is 2G Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very loosely coupled to JVM requirements. It's quite possible that you're spending all your time in GC cycles. Consider gathering GC characteristics, see: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ As Charles says, on the face of it the system you describe should handle quite a load, so it feels like things can be tuned and you won't have to resort to sharding. Sharding inevitably imposes some overhead so it's best to go there last. From my perspective, this is, indeed, an XY problem. You're assuming that sharding is your solution. But you really haven't identified the _problem_ other than queries are too slow. Let's nail down the reason queries are taking a second before jumping into sharding. I've just spent too much of my life fixing the wrong thing ;) It would be useful to see a couple of sample queries so we can get a feel for how complex they are. Especially if you append, as Charles mentions, debug=true Best, Erick On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Grouping does tend to be expensive. Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M docs). This is
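On Carlos's point at the top of this message about routing all members of a group to the same shard: with SolrCloud's default compositeId router this is usually done by prefixing the document id with the group key (a sketch; the groupid!docid convention shown here is an assumption, not something from the original thread):

id = group42!doc1
id = group42!doc2
id = group7!doc9

All ids sharing the group42! prefix hash to the same shard, so grouping or collapsing on the group key stays shard-local and the cross-shard count issues Carlos mentions are avoided.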
Re: understanding collapsingQParser with facet vs group.facet
Hi Upayavira Thank you for your explanation onthe difference between traditional grouping and collapsingQParser. I understand more now. On 6/19/2015 7:11 PM, Upayavira wrote: On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote: Hi I read about collapsingQParser returns the facet count the same as group.truncate=true and has this issue with the facet count and the after filter facet count notthe same. Using group.facetdoes not has this issue but it's performance is very badcompared to collapsingQParser. I trying to understand why collapsingQParser behave this way and will need to explain to management. Can someone explain how collapsingQParser calculatethefacet countscompated to group.facet? I'm not familiar with group.facet. But to compare traditional grouping to the collapsingQParser - in traditional grouping, all matching documents remain in the result set, but they are grouped for output purposes. However, the collapsingQParser is actually a query filter. It will reduce the number of matching results. Any faceting that happens will happen on the filtered results. I wonder if you can use this syntax to achieve faceting alongside collapsing: q=whatever fq={!collapse tag=collapse}blah facet.field={!ex=collapse}my_facet_field This way, you get the benefits of the CollapsingQParserPlugin, with full faceting on the uncollapsed resultset. I've no idea how this would perform, but I'd expect it to be better than the grouping option. Upayavira
Re: Auto-suggest in Solr
Ok sure. ngrams: The max number of tokens out of which singles will be make the dictionary. The default value is 2. Increasing this would mean you want more than the previous 2 tokens to be taken into consideration when making the suggestions. I got confused by this, as I could not get the behavior when I use the suggester. Since the default value is 2, it means the search for mp3 p should include only suggestions that contains mp3 ... and not just from the letter p. But I have only been getting suggestions that starts with p only. Even when I try with a bigger ngrams value for longer search, I'm getting the same results as well, that the suggester only consider the last token when giving the suggestions. I still could not achieve anything that consider 2 or more tokens when returning the suggestions. So am I actually following the right direction with this? Regards, Edwin On 19 June 2015 at 18:53, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Actually the documentation is not clear enough. Let's try to understand this suggester. *Building* This suggester build a FST that it will use to provide the autocomplete feature running prefix searches on it . The terms it uses to generate the FST are the tokens produced by the suggestFreeTextAnalyzerFieldType . And this should be correct. So if we have a shingle token filter[1-3] ( we produce unigrams as well) in our analysis to keep it simple , from these original field values : mp3 ipod mp3 player mp3 player ipod player of Real - we produce these list of possible suggestions in our FST : mp3 player ipod real of mp3 ipod mp3 player player ipod player of of real mp3 player ipod player of real From the documentation I read : ngrams: The max number of tokens out of which singles will be make the dictionary. The default value is 2. Increasing this would mean you want more than the previous 2 tokens to be taken into consideration when making the suggestions. This makes me confused, as I was not expecting this param to affect the suggestion dictionary. So I would like a clarification here from our masters :) At this point let's see what happens at query time . *Query Time * As my understanding the ngrams params will consider the last N-1 tokens the user put separated by the space separator. Builds an ngram model from the text sent to {@link * #build} and predicts based on the last grams-1 tokens in * the request sent to {@link #lookup}. This tries to * handle the long tail of suggestions for when the * incoming query is a never before seen query string. Example , grams=3 should consider only the last 2 tokens special mp3 p - mp3 p Then this query is analysed using the suggestFreeTextAnalyzerFieldType . We produce 3 tokens : mp3 p mp3 p And we run the prefix matching on the FST . *Conclusion* My understanding is wrong for sure at some point, as the behaviour I get is different. Can we discuss this , clarify this and eventually put it in the official documentation ? Cheers 2015-06-19 6:40 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I'm implementing an auto-suggest feature in Solr, and I'll like to achieve the follwing: For example, if the user enters mp3, Solr might suggest mp3 player, mp3 nano and mp3 music. When the user enters mp3 p, the suggestion should narrow down to mp3 player. Currently, when I type mp3 p, the suggester is returning words that starts with the letter p only, and I'm getting results like plan, production, etc, and it does not take the mp3 token into consideration. 
I'm using Solr 5.1 and below is my configuration: In solrconfig.xml: searchComponent name=suggest class=solr.SuggestComponent lst name=suggester str name=lookupImplFreeTextLookupFactory/str str name=indexPathsuggester_freetext_dir/str str name=dictionaryImplDocumentDictionaryFactory/str str name=fieldSuggestion/str str name=weightFieldProject/str str name=suggestFreeTextAnalyzerFieldTypesuggestType/str int name=ngrams5/int str name=buildOnStartupfalse/str str name=buildOnCommitfalse/str /lst /searchComponent In schema.xml fieldType name=suggestType class=solr.TextField positionIncrementGap=100 analyzer type=index charFilter class=solr.PatternReplaceCharFilterFactory pattern=[^a-zA-Z0-9] replacement= / tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.ShingleFilterFactory minShingleSize=2 maxShingleSize=6 outputUnigrams=false/ /analyzer analyzer type=query charFilter class=solr.PatternReplaceCharFilterFactory pattern=[^a-zA-Z0-9] replacement= / tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.ShingleFilterFactory minShingleSize=2 maxShingleSize=6 outputUnigrams=true/ /analyzer /fieldType Is there anything that I configured wrongly? Regards, Edwin --
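For completeness, the search component quoted above still needs a request handler before it can be queried; a rough sketch (untested, and it assumes the suggester keeps the default dictionary name since no "name" parameter is set in the config above):

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">default</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Example requests (build the dictionary once, then look up):
http://localhost:8983/solr/<collection>/suggest?suggest.build=true
http://localhost:8983/solr/<collection>/suggest?suggest.q=mp3%20p&wt=json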
Same query, inconsistent result in SolrCloud
Hi!

I'm facing a problem. I'm using SolrCloud 4.10.3, with 2 shards, and each shard has 2 replicas. After indexing data to the collection, I run the same query:

http://localhost:8983/solr/catalog/select?q=a&wt=json&indent=true

Sometimes it returns the right result:

{
  "responseHeader":{
    "status":0,
    "QTime":19,
    "params":{
      "indent":"true",
      "q":"a",
      "wt":"json"}},
  "response":{"numFound":5,"start":0,"maxScore":0.43969032,"docs":[
      {},{},...
  ]}
}

But when I re-run the same query, it returns:

{
  "responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "indent":"true",
      "q":"a",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},
  "highlighting":{}}

Only some short query terms show this kind of problem. Does anyone know what's going on?

Thanks

Regards,
Jerome
Re: understanding collapsingQParser with facet vs group.facet
Hi Joel

By group heads, is it referring to the document that is used to represent each group in the main result section?

Eg. using the below 3 documents and collapsing on field supplier_id:

supplier_id:S1 product_id:P1
supplier_id:S2 product_id:P2
supplier_id:S2 product_id:P3

With collapse on supplier_id, the result in the main section is as follows:

supplier_id:S1 product_id:P1
supplier_id:S2 product_id:P3

The group head of supplier_id:S1 is P1 and that of supplier_id:S2 will be P3? Facets (and even sort) are calculated on P1 and P3?

-Derek

On 6/19/2015 7:05 PM, Joel Bernstein wrote: The CollapsingQParserPlugin currently doesn't calculate facets at all. It simply collapses the document set. The facets are then calculated only on the group heads. Grouping has special faceting code built into it that supports the group.facet functionality. Joel Bernstein http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 6:20 AM, Derek Poh d...@globalsources.com wrote: Hi I read that collapsingQParser returns the facet count the same as group.truncate=true and has this issue of the facet count and the after-filter facet count not being the same. Using group.facet does not have this issue but its performance is very bad compared to collapsingQParser. I am trying to understand why collapsingQParser behaves this way and will need to explain it to management. Can someone explain how collapsingQParser calculates the facet counts compared to group.facet? Thank you, Derek
Re: Help: Problem in customized token filter
Steve, Thank you thank you so much. You guys are awesome. Steve how can i learn more about the lucene indexing process in more detail. e.g. after we send documents for indexing which function calls till the doc actually store in index files. I will be thankful to you. If you guide me here. With Regards Aman Tandon On Fri, Jun 19, 2015 at 10:48 AM, Steve Rowe sar...@gmail.com wrote: Aman, Solr uses the same Token filter instances over and over, calling reset() before sending each document through. Your code sets “exhausted to true and then never sets it back to false, so the next time the token filter instance is used, its “exhausted value is still true, so no input stream tokens are concatenated ever again. Does that make sense? Steve www.lucidworks.com On Jun 19, 2015, at 1:10 AM, Aman Tandon amantandon...@gmail.com wrote: Hi Steve, you never set exhausted to false, and when the filter got reused, *it incorrectly carried state from the previous document.* Thanks for replying, but I am not able to understand this. With Regards Aman Tandon On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe sar...@gmail.com wrote: Hi Aman, The admin UI screenshot you linked to is from an older version of Solr - what version are you using? Lots of extraneous angle brackets and asterisks got into your email and made for a bunch of cleanup work before I could read or edit it. In the future, please put your code somewhere people can easily read it and copy/paste it into an editor: into a github gist or on a paste service, etc. Looks to me like your use of “exhausted” is unnecessary, and is likely the cause of the problem you saw (only one document getting processed): you never set exhausted to false, and when the filter got reused, it incorrectly carried state from the previous document. Here’s a simpler version that’s hopefully more correct and more efficient (2 fewer copies from the StringBuilder to the final token). Note: I didn’t test it: https://gist.github.com/sarowe/9b9a52b683869ced3a17 Steve www.lucidworks.com On Jun 18, 2015, at 11:33 AM, Aman Tandon amantandon...@gmail.com wrote: Please help, what wrong I am doing here. please guide me. With Regards Aman Tandon On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, I created a *token concat filter* to concat all the tokens from token stream. It creates the concatenated token as expected. But when I am posting the xml containing more than 30,000 documents, then only first document is having the data of that field. 
Schema:

<field name="titlex" type="text" indexed="true" stored="false" required="false" omitNorms="false" multiValued="false" />

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" tokenSeparator=""/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="stemmed_synonyms_text_prime_ex_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_text_prime_search.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
  </analyzer>
</fieldType>

Please help me. The code for the filter is as follows, please take a look. Here is a picture of what the filter is doing: http://i.imgur.com/THCsYtG.png?1

The code of the concat filter is:

package com.xyz.analysis.concat;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import
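To make Steve's point above concrete, the usual fix is a reset() override in the filter that clears all per-document state before the instance is reused (a minimal sketch; apart from "exhausted", the field names here are assumptions, and Steve's gist linked above is the authoritative version):

@Override
public void reset() throws IOException {
  super.reset();
  exhausted = false;            // clear the flag so a reused filter instance starts fresh
  concatenated.setLength(0);    // assumption: the StringBuilder that accumulates token text is also cleared
}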
understanding collapsingQParser with facet vs group.facet
Hi

I read that collapsingQParser returns the facet count the same as group.truncate=true and has this issue of the facet count and the after-filter facet count not being the same. Using group.facet does not have this issue but its performance is very bad compared to collapsingQParser.

I am trying to understand why collapsingQParser behaves this way and will need to explain it to management. Can someone explain how collapsingQParser calculates the facet counts compared to group.facet?

Thank you,
Derek
Limit indexed documents.
Hello, I have a few questions about indexing data. Are there any hardware or software limits for indexing data? And is there a maximum number of indexed documents? Thanks for your answers.

--
View this message in context: http://lucene.472066.n3.nabble.com/Limit-indexed-documents-tp4212913.html
Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ: getBeans with multiple document types in response
Hello, I'm trying to parse Solr Responses with SolrJ, but the responses contain mixed types : for example 'song' documents and 'movie' documents with different fields. The getBeans method takes 1 class type as input parameter, this does not allow for mixed document types responses. What would be the best approach to parse the response and to get a list of 'entity' (the super class). I'm about to write another implementation of the DocumentObjectBinder class but I'd like to avoid that. Thanks!! François Catala Software Developer NUANCE COMMUNICATIONS, INC. 1500 University, Suite 557 Montréal QC H3A 3S7 514 904 7800 Officejust say my name or ext. 2345
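One approach that avoids writing a new DocumentObjectBinder is to bind each SolrDocument individually, dispatching on a discriminator field (a sketch; the "doctype" field name and the Song/Movie/Entity classes are assumptions standing in for your own):

DocumentObjectBinder binder = new DocumentObjectBinder();
List<Entity> entities = new ArrayList<>();
for (SolrDocument doc : queryResponse.getResults()) {
    String type = (String) doc.getFirstValue("doctype");    // hypothetical field telling the document types apart
    if ("song".equals(type)) {
        entities.add(binder.getBean(Song.class, doc));       // Song extends Entity
    } else if ("movie".equals(type)) {
        entities.add(binder.getBean(Movie.class, doc));      // Movie extends Entity
    }
}

This keeps the annotation-driven mapping of getBeans() while letting each document choose its own target class.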
Re: Error when submitting PDF to Solr w/text fields using SolrJ
Yeah I'm just gonna say hands down this was a totally bad question. My fault, mea culpa. I'm pretty new to working in an IDE environment and using a stack trace (I just finished my first year of CS at University and now I'm interning). I'm actually kind of embarrassed by how long it took me to realize I wasn't looking at the entire stack trace. Idiot moment of the week for sure. Thanks for the patience guys but when I looked at the entire stack trace it gave me this. Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=text (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[84, 104, 101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46, 83, 46, 32, 68, 101, 112, 97, 114, 116, 109, 101, 110, 116, 32, 111]...', original message: bytes can be at most 32766 in length; got 44360 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667) at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350) at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163) ... 40 more Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 44360 at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284) at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154) at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657) ... 47 more And it took me all of two seconds to realize what had gone wrong. Now I'm just trying to figure out how to index the text content without truncating all the info or filtering it out entirely, thereby messing up my searching capabilities. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212919.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Limit indexed documents.
tomas.kalas kala...@email.cz wrote: Existing some hardware or software limits for indexing data? The only really hard Solr limit is 2 billion X per shard, where X is document count, unique values in a DocValues String field and other things like that. There are some softer limits, after which performance degrades markedly: Number of fields (hundreds are fine, millions are unrealistic), number of shards (avoid going into the thousands). Having a Java heap of hundreds of gigabytes is possible, but requires tweaking to avoid very long garbage collection pauses. I do not know of a byte size limit for shards: Shards of 1-2 TB works without problems on fitting hardware. And is some maximum of indexed documents? While the limit is 2 billion per single shard, SolrCloud does not have this limitation. A soft limit before doing some custom multi-level setup would thus be around 2000 billion documents, divided across 1000 shards. - Toke Eskildsen
RE: How to do a Data sharding for data in a database table
Hi Wenbin, To me, your instance appears well provisioned. Likewise, your analysis of test vs. production performance makes a lot of sense. Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding? To that end, what do you see when you set debugQuery=true? Where does solr spend the time? My guess would be in the grouping and sorting steps, but which? Sometime the schema details matter for performance. Folks on this list can help with that. -Charlie -Original Message- From: Wenbin Wang [mailto:wwang...@gmail.com] Sent: Friday, June 19, 2015 7:55 AM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration. The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be 1 second with more demanding requests. In our production environment, we have 64 cores, and we need to support 300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle 300 concurrent users in production. There is no plan to increase the total number of cores 5 times. In a previous test, a search index around 6M data size was able to handle 5 request per second in each core of my 8-core machine. By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding. However, I am also open to solution that can improve the performance of the index of 13M to 14M size so that I do not need to do a data sharding. On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: You've repeated your original statement. Shawn's observation is that 10M docs is a very small corpus by Solr standards. You either have very demanding document/search combinations or you have a poorly tuned Solr installation. On reasonable hardware I expect 25-50M documents to have sub-second response time. So what we're trying to do is be sure this isn't an XY problem, from Hossman's apache page: Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 So again, how would you characterize your documents? How many fields? What do queries look like? How much physical memory on the machine? How much memory have you allocated to the JVM? 
You might review: http://wiki.apache.org/solr/UsingMailingLists Best, Erick On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wwang...@gmail.com wrote: The query without load is still under 1 second. But under load, response time can be much longer due to the queued up query. We would like to shard the data to something like 6 M / shard, which will still give a under 1 second response time under load. What are some best practice to shard the data? for example, we could shard the data by date range, but that is pretty dynamic, and we could shard data by some other properties, but if the data is not evenly distributed, you may not be able shard it anymore. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data- in-a-database-table-tp4212765p4212803.html Sent from the Solr - User mailing list archive at Nabble.com. * This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
Re: Error when submitting PDF to Solr w/text fields using SolrJ
Silly thing … Maybe the immense token was generating because trying to set string as field type for your text ? Can be ? Can you wipe out the index, set a proper type for your text, and index again ? No worries about the not full stack trace, We learn and do wrong things everyday :) Errare humanum est Cheers 2015-06-19 14:31 GMT+01:00 Paden rumsey...@gmail.com: Yeah I'm just gonna say hands down this was a totally bad question. My fault, mea culpa. I'm pretty new to working in an IDE environment and using a stack trace (I just finished my first year of CS at University and now I'm interning). I'm actually kind of embarrassed by how long it took me to realize I wasn't looking at the entire stack trace. Idiot moment of the week for sure. Thanks for the patience guys but when I looked at the entire stack trace it gave me this. Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=text (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[84, 104, 101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46, 83, 46, 32, 68, 101, 112, 97, 114, 116, 109, 101, 110, 116, 32, 111]...', original message: bytes can be at most 32766 in length; got 44360 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667) at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350) at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163) ... 40 more Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 44360 at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284) at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154) at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657) ... 47 more And it took me all of two seconds to realize what had gone wrong. Now I'm just trying to figure out how to index the text content without truncating all the info or filtering it out entirely, thereby messing up my searching capabilities. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212919.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
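In case it helps the next reader: the MaxBytesLengthExceededException above generally means the field's analysis produced a single term longer than 32766 bytes, which is what happens when a large extracted text is indexed as one un-tokenized term (a string / KeywordTokenizer type). A tokenized field type avoids the per-term limit, e.g. (a sketch using the stock text_general type; the field name "text" is taken from the error message, the rest is an assumption about the schema):

<field name="text" type="text_general" indexed="true" stored="true"/>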
RE: How to do a Data sharding for data in a database table
Grouping does tend to be expensive. Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M docs). This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier). But this approach is unlikely to work for most cases. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Friday, June 19, 2015 9:52 AM To: solr-user@lucene.apache.org Subject: RE: How to do a Data sharding for data in a database table Hi Wenbin, To me, your instance appears well provisioned. Likewise, your analysis of test vs. production performance makes a lot of sense. Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding? To that end, what do you see when you set debugQuery=true? Where does solr spend the time? My guess would be in the grouping and sorting steps, but which? Sometime the schema details matter for performance. Folks on this list can help with that. -Charlie -Original Message- From: Wenbin Wang [mailto:wwang...@gmail.com] Sent: Friday, June 19, 2015 7:55 AM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration. The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be 1 second with more demanding requests. In our production environment, we have 64 cores, and we need to support 300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle 300 concurrent users in production. There is no plan to increase the total number of cores 5 times. In a previous test, a search index around 6M data size was able to handle 5 request per second in each core of my 8-core machine. By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding. However, I am also open to solution that can improve the performance of the index of 13M to 14M size so that I do not need to do a data sharding. On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: You've repeated your original statement. Shawn's observation is that 10M docs is a very small corpus by Solr standards. You either have very demanding document/search combinations or you have a poorly tuned Solr installation. On reasonable hardware I expect 25-50M documents to have sub-second response time. 
So what we're trying to do is be sure this isn't an XY problem, from Hossman's apache page: Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 So again, how would you characterize your documents? How many fields? What do queries look like? How much physical memory on the machine? How much memory have you allocated to the JVM? You might review: http://wiki.apache.org/solr/UsingMailingLists Best, Erick On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wwang...@gmail.com wrote: The query without load is still under 1 second. But under load, response time can be much longer due to the queued up query. We would like to shard the data to something like 6 M / shard, which will still give a under 1 second response time under load. What are some best practice to shard the data? for example, we could shard the data by date range, but that is pretty dynamic, and we could shard data by some other properties, but if the data is not evenly distributed, you may not be able shard it anymore. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-
Re: understanding collapsingQParser with facet vs group.facet
On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote: Hi I read about collapsingQParser returns the facet count the same as group.truncate=true and has this issue with the facet count and the after filter facet count notthe same. Using group.facetdoes not has this issue but it's performance is very badcompared to collapsingQParser. I trying to understand why collapsingQParser behave this way and will need to explain to management. Can someone explain how collapsingQParser calculatethefacet countscompated to group.facet? I'm not familiar with group.facet. But to compare traditional grouping to the collapsingQParser - in traditional grouping, all matching documents remain in the result set, but they are grouped for output purposes. However, the collapsingQParser is actually a query filter. It will reduce the number of matching results. Any faceting that happens will happen on the filtered results. I wonder if you can use this syntax to achieve faceting alongside collapsing: q=whatever fq={!collapse tag=collapse}blah facet.field={!ex=collapse}my_facet_field This way, you get the benefits of the CollapsingQParserPlugin, with full faceting on the uncollapsed resultset. I've no idea how this would perform, but I'd expect it to be better than the grouping option. Upayavira
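Spelled out as a full request, Upayavira's suggestion would look roughly like this (a sketch, untested; groupid stands in for whatever field you collapse on):

q=whatever
&fq={!collapse field=groupid tag=collapse}
&facet=true
&facet.field={!ex=collapse}my_facet_field

Note Joel's caveat in his follow-up message: excluding the collapse filter yields facet counts over the uncollapsed result set, which is not the same thing as group.facet counts.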
Re: Auto-suggest in Solr
Actually the documentation is not clear enough. Let's try to understand this suggester. *Building* This suggester build a FST that it will use to provide the autocomplete feature running prefix searches on it . The terms it uses to generate the FST are the tokens produced by the suggestFreeTextAnalyzerFieldType . And this should be correct. So if we have a shingle token filter[1-3] ( we produce unigrams as well) in our analysis to keep it simple , from these original field values : mp3 ipod mp3 player mp3 player ipod player of Real - we produce these list of possible suggestions in our FST : mp3 player ipod real of mp3 ipod mp3 player player ipod player of of real mp3 player ipod player of real From the documentation I read : ngrams: The max number of tokens out of which singles will be make the dictionary. The default value is 2. Increasing this would mean you want more than the previous 2 tokens to be taken into consideration when making the suggestions. This makes me confused, as I was not expecting this param to affect the suggestion dictionary. So I would like a clarification here from our masters :) At this point let's see what happens at query time . *Query Time * As my understanding the ngrams params will consider the last N-1 tokens the user put separated by the space separator. Builds an ngram model from the text sent to {@link * #build} and predicts based on the last grams-1 tokens in * the request sent to {@link #lookup}. This tries to * handle the long tail of suggestions for when the * incoming query is a never before seen query string. Example , grams=3 should consider only the last 2 tokens special mp3 p - mp3 p Then this query is analysed using the suggestFreeTextAnalyzerFieldType . We produce 3 tokens : mp3 p mp3 p And we run the prefix matching on the FST . *Conclusion* My understanding is wrong for sure at some point, as the behaviour I get is different. Can we discuss this , clarify this and eventually put it in the official documentation ? Cheers 2015-06-19 6:40 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I'm implementing an auto-suggest feature in Solr, and I'll like to achieve the follwing: For example, if the user enters mp3, Solr might suggest mp3 player, mp3 nano and mp3 music. When the user enters mp3 p, the suggestion should narrow down to mp3 player. Currently, when I type mp3 p, the suggester is returning words that starts with the letter p only, and I'm getting results like plan, production, etc, and it does not take the mp3 token into consideration. 
I'm using Solr 5.1 and below is my configuration: In solrconfig.xml: searchComponent name=suggest class=solr.SuggestComponent lst name=suggester str name=lookupImplFreeTextLookupFactory/str str name=indexPathsuggester_freetext_dir/str str name=dictionaryImplDocumentDictionaryFactory/str str name=fieldSuggestion/str str name=weightFieldProject/str str name=suggestFreeTextAnalyzerFieldTypesuggestType/str int name=ngrams5/int str name=buildOnStartupfalse/str str name=buildOnCommitfalse/str /lst /searchComponent In schema.xml fieldType name=suggestType class=solr.TextField positionIncrementGap=100 analyzer type=index charFilter class=solr.PatternReplaceCharFilterFactory pattern=[^a-zA-Z0-9] replacement= / tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.ShingleFilterFactory minShingleSize=2 maxShingleSize=6 outputUnigrams=false/ /analyzer analyzer type=query charFilter class=solr.PatternReplaceCharFilterFactory pattern=[^a-zA-Z0-9] replacement= / tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.ShingleFilterFactory minShingleSize=2 maxShingleSize=6 outputUnigrams=true/ /analyzer /fieldType Is there anything that I configured wrongly? Regards, Edwin -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: understanding collapsingQParser with facet vs group.facet
The CollapsingQParserPlugin currently doesn't calculate facets at all. It simply collapses the document set. The facets are then calculated only on the group heads. Grouping has special faceting code built into it that supports the group.facet functionality. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 19, 2015 at 6:20 AM, Derek Poh d...@globalsources.com wrote: Hi I read about collapsingQParser returns the facet count the same as group.truncate=true and has this issue with the facet count and the after filter facet count notthe same. Using group.facetdoes not has this issue but it's performance is very badcompared to collapsingQParser. I trying to understand why collapsingQParser behave this way and will need to explain to management. Can someone explain how collapsingQParser calculatethefacet countscompated to group.facet? Thank you, Derek
Re: understanding collapsingQParser with facet vs group.facet
Unfortunately this won't give you group.facet results: q=whatever fq={!collapse tag=collapse}blah facet.field={!ex=collapse}my_facet_field This will give you the expanded facet counts as it removes the collapse filter. A good explanation of group.facets is here: http://blog.trifork.com/2012/04/10/faceting-result-grouping/ Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 19, 2015 at 7:11 AM, Upayavira u...@odoko.co.uk wrote: On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote: Hi I read about collapsingQParser returns the facet count the same as group.truncate=true and has this issue with the facet count and the after filter facet count notthe same. Using group.facetdoes not has this issue but it's performance is very badcompared to collapsingQParser. I trying to understand why collapsingQParser behave this way and will need to explain to management. Can someone explain how collapsingQParser calculatethefacet countscompated to group.facet? I'm not familiar with group.facet. But to compare traditional grouping to the collapsingQParser - in traditional grouping, all matching documents remain in the result set, but they are grouped for output purposes. However, the collapsingQParser is actually a query filter. It will reduce the number of matching results. Any faceting that happens will happen on the filtered results. I wonder if you can use this syntax to achieve faceting alongside collapsing: q=whatever fq={!collapse tag=collapse}blah facet.field={!ex=collapse}my_facet_field This way, you get the benefits of the CollapsingQParserPlugin, with full faceting on the uncollapsed resultset. I've no idea how this would perform, but I'd expect it to be better than the grouping option. Upayavira
Error: Could not create instance of 'SolrInputDocument'
We are running PaperThin's CommonSpot CMS in a ColdFusion 10 and MS SQL Server 2008 R2 environment. We're using Apache Solr 4.10.4 rather than ColdFusion's bundled Solr.

We can create (and delete) collections through the CS CMS; they appear in (and disappear from) both the physical file structure as well as the Apache Solr dashboard.

When we try indexing a collection through our CS CMS, it appears that each member is being indexed; however, each member errors out [Error ... see logs] and indexing continues to the next member, only to error out again, etc. Eventually the entire collection is indexed in this fashion, and we receive a message that the collection has been indexed and optimized. Our keyword search fails, returning 0 results.

Our log files show entries for each member indexed:

Error: Could not create instance of 'SolrInputDocument'. ~~ Exception: org.apache.solr.common.SolrInputDocument

We're obviously missing something, but this is our first time using Apache Solr and we aren't sure where things may be broken. Many thanks for any/all recommendations/guidance.

Thanks!
Paul R.
Re: Error when submitting PDF to Solr w/text fields using SolrJ
I definitely agree with Erick, the stack trace you posted is not complete again. This is an example of the same problem you got with a complete, meaningful stack trace : Stacktrace you provided : org.apache.solr.common.SolrException: Exception writing document id 12345 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:870) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1024) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:693) … -- Important stack trace follows !! Caused by: java.lang.IllegalArgumentException: input AttributeSource must not be null at org.apache.lucene.util.AttributeSource.init(AttributeSource.java:94) at org.apache.lucene.analysis.TokenStream.init(TokenStream.java:106) at org.apache.lucene.analysis.TokenFilter.init(TokenFilter.java:33) at org.apache.lucene.analysis.util.FilteringTokenFilter.init(FilteringTokenFilter.java:70) at org.apache.lucene.analysis.core.StopFilter.init(StopFilter.java:60) at org.apache.lucene.analysis.core.StopFilterFactory.create(StopFilterFactory.java:127) at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:67) at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:102) at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180) at org.apache.lucene.document.Field.tokenStream(Field.java:554) at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:597) at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:222) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507) at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164) ... 35 more , If you give us all the stack trace, I am pretty sure we can help . Cheers 2015-06-19 5:31 GMT+01:00 Erick Erickson erickerick...@gmail.com: The stack trace is what gets returned to the client, right? It's often much more informative to see the Solr log output, the error message is often much more helpful there. By the time the exception bubbles up through the various layers vital information is sometimes not returned to the client in the error message. One precaution I would take since you've changed the schema is to _completely_ remove the index. 1 shut down Solr 2 rm -rf coreX/data 3 restart Solr. 4 try it again. Lucene doesn't really care at all whether a field gets indexed one way in one document and another way in the next document and occasionally having fields indexed different ways (string and text) in different documents at the same time confuses things. 
Best, Erick On Thu, Jun 18, 2015 at 10:31 AM, Paden rumsey...@gmail.com wrote: Just rolling out a little bit more information as it is coming. I changed the field type in the schema to text_general and that didn't change a thing. Another thing is that it's consistently submitting/not submitting the same documents. I will run over it one time and it won't index a set of documents. When I clear the index and run the program again it submits/doesn't submit the same documents. And it will index certain PDF's it just won't index others. Which is weird because I printed the strings that are submitted to Solr and the ones that get submitted are really similar to the ones that aren't submitted. I can't post the actual strings for sensitivity reasons. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212757.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: understanding collapsingQParser with facet vs group.facet
The AnalyticsQuery can be used to implement custom faceting modules. This would allow you to calculate facet counts with an algorithm similar to group.facet before the result set is collapsed. If you are in distributed mode you will also need to implement a merge strategy:

http://heliosearch.org/solrs-new-analyticsquery-api/
http://heliosearch.org/solrs-mergestrategy/

Joel Bernstein http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 7:28 AM, Joel Bernstein joels...@gmail.com wrote: Unfortunately this won't give you group.facet results:

q=whatever
fq={!collapse tag=collapse}blah
facet.field={!ex=collapse}my_facet_field

This will give you the expanded facet counts, as it removes the collapse filter. A good explanation of group.facet is here: http://blog.trifork.com/2012/04/10/faceting-result-grouping/

Joel Bernstein http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 7:11 AM, Upayavira u...@odoko.co.uk wrote: On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote: Hi, I read that collapsingQParser returns facet counts the same as group.truncate=true, and has the issue that the facet count and the after-filter facet count are not the same. Using group.facet does not have this issue, but its performance is very bad compared to collapsingQParser. I am trying to understand why collapsingQParser behaves this way, as I will need to explain it to management. Can someone explain how collapsingQParser calculates the facet counts compared to group.facet?

I'm not familiar with group.facet. But to compare traditional grouping to the collapsingQParser - in traditional grouping, all matching documents remain in the result set, but they are grouped for output purposes. However, the collapsingQParser is actually a query filter. It will reduce the number of matching results. Any faceting that happens will happen on the filtered results.

I wonder if you can use this syntax to achieve faceting alongside collapsing:

q=whatever
fq={!collapse tag=collapse}blah
facet.field={!ex=collapse}my_facet_field

This way, you get the benefits of the CollapsingQParserPlugin, with full faceting on the uncollapsed result set. I've no idea how this would perform, but I'd expect it to be better than the grouping option. Upayavira
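To make the tag/exclusion idea above concrete, here is what a full request might look like (shown unencoded for readability; the host, collection name, and field names are just example values):

http://localhost:8983/solr/mycollection/select?q=red dress&fq={!collapse field=groupid tag=collapse}&facet=true&facet.mincount=1&facet.field={!ex=collapse}searchcolorfacet

The fq is tagged "collapse", and the {!ex=collapse} local param on facet.field excludes that filter when the facet counts are computed, so the counts reflect the uncollapsed result set - which, as noted above, is still not the same thing as group.facet.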
Re: How to do a Data sharding for data in a database table
I have enough RAM (30G) and hard disk (1000G), so it is not I/O bound or disk bound. In addition, Solr was started with a maximum of 4G for the JVM, and the index size is 2G. In a typical test, I made sure at least 10G of free RAM was available. I have not tuned any parameter in the configuration; it is the default configuration.

The number of fields for each record is around 10, and the number of results to be returned per page is 30, so the response time should not be affected by network traffic, and it is tested on the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or a date range. The results are also grouped and sorted. The response time of a typical single request is around 1 second. It can be over 1 second with more demanding requests.

In our production environment, we have 64 cores, and we need to support 300 concurrent users, that is, about 300 concurrent requests per second. Each core will have to process about 5 requests per second. The response time under this load will not be 1 second any more. My estimate is that an average response time of 200 ms for a single request would be able to handle 300 concurrent users in production. There is no plan to increase the total number of cores 5 times.

In a previous test, a search index of around 6M records was able to handle 5 requests per second on each core of my 8-core machine. By sharding one single index of 13M records into 2 indexes of 6 or 7M each, I am expecting a much faster response time that can meet the demand of the production environment. That is the motivation for doing data sharding. However, I am also open to solutions that can improve the performance of the 13M-14M index so that I do not need to shard.

On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: You've repeated your original statement. Shawn's observation is that 10M docs is a very small corpus by Solr standards. You either have very demanding document/search combinations or you have a poorly tuned Solr installation. On reasonable hardware I expect 25-50M documents to have sub-second response time.

So what we're trying to do is be sure this isn't an XY problem, from Hossman's apache page: Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341

So again, how would you characterize your documents? How many fields? What do queries look like? How much physical memory on the machine? How much memory have you allocated to the JVM? You might review: http://wiki.apache.org/solr/UsingMailingLists

Best, Erick

On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wwang...@gmail.com wrote: The query without load is still under 1 second. But under load, response time can be much longer due to queued-up queries. We would like to shard the data to something like 6M per shard, which should still give an under-1-second response time under load. What are some best practices for sharding the data? For example, we could shard the data by date range, but that is pretty dynamic, and we could shard by some other properties, but if the data is not evenly distributed, you may not be able to shard it any further.
-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html Sent from the Solr - User mailing list archive at Nabble.com.
Migration from Solr 4.7.1 to SolrCloud 5.1
Hi. I have an old index running on a standalone Solr 4.7.1 and I have to migrate its index to my new SolrCloud 5.1 installation. I'm looking for some way to do this but I'm a little confused. Could you help me please? Thank you very much! Bye
Re: ZooKeeper connection refused
2015-06-17 16:11 GMT+02:00 Shalin Shekhar Mangar shalinman...@gmail.com: Is ZK healthy? Can you try the following from the server on which Solr is running:

echo ruok | nc zk1 2181

Thank you very much Shalin for your answer! My ZK cluster was not healthy because two nodes were dead and only one node was running. I fixed the two nodes and now everything works fine. Thank you very much!
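For reference, a healthy ZooKeeper node answers that four-letter command with imok, so checking each member of a three-node ensemble would look something like the following (the host names zk1/zk2/zk3 are placeholders):

echo ruok | nc zk1 2181
echo ruok | nc zk2 2181
echo ruok | nc zk3 2181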
RE: Solr Logging
Framework way? Maybe try delving into the log4j framework and modify the log4j.properties file. You can generate different log files based upon what class generated the message. Here's an example that I experimented with previously; it generates an update log, and 2 different query logs with slightly different information about each query. Adding a component to each requestHandler dedicated to logging might be the best way, but that might not qualify as a framework way, and I've never tried anything like that, so I don't know how easy it might be.

Just sending the relevant lines from log4j.properties, excluding the lines that are there by default:

# Logger for updates
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=INFO, Updates
#- size rotation with log cleanup.
log4j.appender.Updates=org.apache.log4j.RollingFileAppender
log4j.appender.Updates.MaxFileSize=4MB
log4j.appender.Updates.MaxBackupIndex=9
#- File to log to and log format
log4j.appender.Updates.File=${solr.log}/solr_Updates.log
log4j.appender.Updates.layout=org.apache.log4j.PatternLayout
log4j.appender.Updates.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

# Logger for queries, using SolrDispatchFilter
log4j.logger.org.apache.solr.servlet.SolrDispatchFilter=DEBUG, queryLog1
#- size rotation with log cleanup.
log4j.appender.queryLog1=org.apache.log4j.RollingFileAppender
log4j.appender.queryLog1.MaxFileSize=4MB
log4j.appender.queryLog1.MaxBackupIndex=9
#- File to log to and log format
log4j.appender.queryLog1.File=${solr.log}/solr_queryLog1.log
log4j.appender.queryLog1.layout=org.apache.log4j.PatternLayout
log4j.appender.queryLog1.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

# Logger for queries, using SolrCore
log4j.logger.org.apache.solr.core.SolrCore=INFO, queryLog2
#- size rotation with log cleanup.
log4j.appender.queryLog2=org.apache.log4j.RollingFileAppender
log4j.appender.queryLog2.MaxFileSize=4MB
log4j.appender.queryLog2.MaxBackupIndex=9
#- File to log to and log format
log4j.appender.queryLog2.File=${solr.log}/solr_queryLog2.log
log4j.appender.queryLog2.layout=org.apache.log4j.PatternLayout
log4j.appender.queryLog2.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

-Original Message-
From: rbkumar88 [mailto:rbkuma...@gmail.com]
Sent: Thursday, June 18, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Solr Logging

Hi, I want to log Solr search queries/response times and the Solr indexing log separately, in different sets of log files. Is there any convenient framework/way to do it? Thanks Bharath

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Logging-tp4212730.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to append new data to index i solr?
It does. Absolutely. But it depends on what you put in it. Start from http://wiki.apache.org/solr/UpdateXmlMessages#add.2Freplace_documents

On Fri, Jun 19, 2015 at 7:54 AM, 步青云 mailliup...@qq.com wrote: Hello, I'm a Solr user with a question. I want to append new data to the existing index. Does Solr support appending new data to an index? Thanks for any reply. Best wishes. Jason

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
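As a minimal illustration of that, here is a hedged SolrJ sketch that adds one more document to an index that already contains data and then commits; the core name mycore, the field names, and the localhost URL are assumptions for the example, not details from the thread:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AppendOneDoc {
    public static void main(String[] args) throws Exception {
        // Point at the existing core; documents already in the index are untouched
        // unless a new document reuses the same uniqueKey (then it is replaced).
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "new-doc-1");
        doc.addField("title", "An additional document");
        client.add(doc);   // queues the add on the server
        client.commit();   // makes the new document visible to searches
        client.close();
    }
}

The same thing over HTTP is just another update request (JSON, XML, or CSV) followed by a commit; there is no separate append operation.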
Re: Solr 5.2.1 on Solaris
Please open a JIRA with details of what the issues are, we should try to support this.

On 18 Jun 2015 15:07, Bence Vass bence.v...@inso.tuwien.ac.at wrote: Hello, Is there any documentation on how to start Solr 5.2.1 on Solaris (Solaris 10)? The script (solr start) doesn't work out of the box. Is anyone running Solr 5.x on Solaris? - Thanks
Distributed Search component question
Hi all, I have the following search components that I can't currently get working in distributed mode on Solr 4.10.4:

[standard query component]

[search component-1] (StageID = 2500):
  handleResponses: get a few values from the docs, populate parameters for the stats component, and set some metadata in the ResponseBuilder: rb.rsp.add(metadata, NamedList...)
  distributedProcess:
    rb.doFacets = false;
    if (rb.stage < StageID) {
      if (null == rb.rsp[metadata]) { return StageID; }
    }
    return component-2.StageID

[search component-2] (StageID = 2800):
  distributedProcess:
    rb.doFacets = true;
    formatAndSet some facetParams based on rb.rsp[metadata]
    return ResponseBuilder.STAGE_GET_FIELDS

[standard facet component]

Things seem to work fine between component-1 and component-2; I just can't prevent faceting from running until component-2 sets the proper facet params, and then the facet component sets rb._facetInfo to null. Should I move my logic in component-2 from distributedProcess to handleResponses, modify the ShardRequest, and use rb.addRequest? Any hints are much appreciated. Mihran
Re: Error when submitting PDF to Solr w/text fields using SolrJ
So, the first thing I can say is: if it is true that it "almost killed Solr with 280 files", you are doing something wrong for sure. At least if you are not trying to index 4k full movies xD

Joking apart:
1) You should carefully design your analyser.
2) You should store your fields initially to verify you index what you were supposed to (in number and in content).

Assuming you are a beginner, storing the fields will make it easier for you to check, as they will pop out of the results. Is at least the number of docs indexed correct?

2015-06-19 15:34 GMT+01:00 Paden rumsey...@gmail.com: Yeah, actually changing the field to text_en or text_en_splitting actually made it so my indexer indexed all my files. The only problem is, I don't think it's doing it well. I have two Cores that I'm working with. Both of them have indexed the same set of files. The first core, which I will refer to as Testcore, I used a DIH configuration that indexed the files with their metadata. (It indexed everything fine but it almost killed Solr with 280 files I would hate to see what would happen with say, 10,000 files.). When I query Testcore on some random common word like a it returns like 279 files. A good margin I can accept that. The second core, which I will refer to as Testcore2, I used my own indexer that I created and use SolrJ as the client. It indexes everything. However, when I query on the same word a it only returns 208 of the 281 files. Which is weird cause I'm using the exact same Querying handler for both. So I don't think a comprehensive indexed text is being sent to Solr.

-- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html Sent from the Solr - User mailing list archive at Nabble.com.

-- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Error when submitting PDF to Solr w/text fields using SolrJ
Yeah, actually changing the field to text_en or text_en_splitting actually made it so my indexer indexed all my files. The only problem is, I don't think it's doing it well. I have two Cores that I'm working with. Both of them have indexed the same set of files. The first core, which I will refer to as Testcore, I used a DIH configuration that indexed the files with their metadata. (It indexed everything fine but it almost killed Solr with 280 files I would hate to see what would happen with say, 10,000 files.). When I query Testcore on some random common word like a it returns like 279 files. A good margin I can accept that. The second core, which I will refer to as Testcore2, I used my own indexer that I created and use SolrJ as the client. It indexes everything. However, when I query on the same word a it only returns 208 of the 281 files. Which is weird cause I'm using the exact same Querying handler for both. So I don't think a comprehensive indexed text is being sent to Solr. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Error: Could not create instance of 'SolrInputDocument'
On 6/19/2015 5:40 AM, Paul Revere wrote: Our log files show entries for each member indexed: Error: Could not create instance of 'SolrInputDocument'. ~~ Exception: org.apache.solr.common.SolrInputDocument There will be a *lot* more detail available on this exception. We will need all of it, including all caused by information. It can be dozens of lines and include multiple caused by clauses, each of which will have a stacktrace. Your message indicates that it is Solr 4.10.4 ... hopefully it is unmodified. That information is critical in comparing the exception stacktrace to the source code. There might also be additional information in the logs that is immediately before or after this message. You might need to go to the Solr logfile instead of your application's logfile for more information. http://wiki.apache.org/solr/UsingMailingLists Thanks, Shawn
Re: Migration from Solr 4.7.1 to SolrCloud 5.1
You really have to ask more specific questions here. What are you confused _about_? Have you gone through the tutorial? Read the Solr In Action book? Tried _anything_? Best, Erick On Fri, Jun 19, 2015 at 5:02 AM, shacky shack...@gmail.com wrote: Hi. I have an old index running on a standalone Solr 4.7.1 and I have to migrate its index to my new SolrCloud 5.1 installation. I'm looking for some way to do this but I'm a little confused. Could you help me please? Thank you very much! Bye
Re: How to do a Data sharding for data in a database table
First and most obvious thing to try: bq: the Solr was started with maximal 4G for JVM, and index size is 2G Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very loosely coupled to JVM requirements. It's quite possible that you're spending all your time in GC cycles. Consider gathering GC characteristics, see: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ As Charles says, on the face of it the system you describe should handle quite a load, so it feels like things can be tuned and you won't have to resort to sharding. Sharding inevitably imposes some overhead so it's best to go there last. From my perspective, this is, indeed, an XY problem. You're assuming that sharding is your solution. But you really haven't identified the _problem_ other than queries are too slow. Let's nail down the reason queries are taking a second before jumping into sharding. I've just spent too much of my life fixing the wrong thing ;) It would be useful to see a couple of sample queries so we can get a feel for how complex they are. Especially if you append, as Charles mentions, debug=true Best, Erick On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Grouping does tend to be expensive. Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M docs). This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier). But this approach is unlikely to work for most cases. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Friday, June 19, 2015 9:52 AM To: solr-user@lucene.apache.org Subject: RE: How to do a Data sharding for data in a database table Hi Wenbin, To me, your instance appears well provisioned. Likewise, your analysis of test vs. production performance makes a lot of sense. Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding? To that end, what do you see when you set debugQuery=true? Where does solr spend the time? My guess would be in the grouping and sorting steps, but which? Sometime the schema details matter for performance. Folks on this list can help with that. -Charlie -Original Message- From: Wenbin Wang [mailto:wwang...@gmail.com] Sent: Friday, June 19, 2015 7:55 AM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration. The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be 1 second with more demanding requests. In our production environment, we have 64 cores, and we need to support 300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. 
The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle 300 concurrent users in production. There is no plan to increase the total number of cores 5 times. In a previous test, a search index around 6M data size was able to handle 5 request per second in each core of my 8-core machine. By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding. However, I am also open to solution that can improve the performance of the index of 13M to 14M size so that I do not need to do a data sharding. On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: You've repeated your original statement. Shawn's observation is that 10M docs is a very small corpus by Solr standards. You either have very demanding document/search combinations or you have a poorly tuned Solr installation. On reasonable hardware I expect 25-50M documents to have sub-second response time. So what we're trying to do is be sure this isn't an XY problem, from Hossman's apache page: Your question appears to be an XY Problem ... that is: you are dealing with
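For reference, the "bump your JVM to 8G" suggestion above can be applied at startup with the bin/solr script that ships with Solr 5.x; the 8g value is simply the figure from the thread, not a size tuned for this particular index:

bin/solr start -m 8g

For an older 4.x-style install running under a separate servlet container, the equivalent is passing -Xms8g -Xmx8g to the JVM that launches it.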
Re: Error when submitting PDF to Solr w/text fields using SolrJ
You really, really, really want to get friendly with the admin/analysis page for questions like this one: bq: You're probably right though. I probably have to create a better analyzer really ;). It shows you exactly what each link in your analysis chain does to the input. Perhaps 75% of the questions about "why am I getting the results I'm seeing" are answered there, IMO.

Best, Erick

On Fri, Jun 19, 2015 at 9:38 AM, Paden rumsey...@gmail.com wrote: Yes, the number of indexed documents is correct. But the queries I perform fall short of what they should be. You're probably right though. I probably have to create a better analyzer. And I'm not really worried about the other fields. I've already checked to see if it's storing them correctly and it is. I'm mostly worried about the text fields and how they're being indexed by Solr when submitted. BTW: Because of your comment, I went back and checked my core that used the DIH configuration. I increased the RAM on the Linux virtual machine I'm using and it worked like a dream. Thanks! You might have just helped me finish this project. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212967.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Migration from Solr 4.7.1 to SolrCloud 5.1
2015-06-19 18:00 GMT+02:00 Erick Erickson erickerick...@gmail.com: You really have to ask more specific questions here. What are you confused _about_?

I read that I could migrate using the backup script, so I looked for the backup script in the Solr 4.7.1 source code, but I haven't found anything...
Re: Error when submitting PDF to Solr w/text fields using SolrJ
Yes, the number of indexed documents is correct. But the queries I perform fall short of what they should be. You're probably right though. I probably have to create a better analyzer. And I'm not really worried about the other fields. I've already checked to see if it's storing them correctly and it is. I'm mostly worried about the text fields and how they're being indexed by Solr when submitted. BTW: Because of your comment, I went back and checked my core that used the DIH configuration. I increased the RAM on the Linux virtual machine I'm using and it worked like a dream. Thanks! You might have just helped me finish this project. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212967.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Error when submitting PDF to Solr w/text fields using SolrJ
This may be another forehead-slapper (man, you don't know how often I've injured myself that way). Did you commit at the end of the SolrJ indexing to Testcore2? DIH automatically commits at the end of the run, and depending on how your SolrJ program is written, it may not have. Or just set autoCommit (with openSearcher=true) in your solrconfig file. Or set autoSoftCommit there. In either case, wait until the interval has expired after your indexing has run. Or, for that matter, you can ensure you've committed by using curl or just entering something like /Testcore2/update?commit=true in a URL.

And another one that'll make you cringe is if your SolrJ program looks like:

while (more docs) {
  create a solr doc and add it to my list
  if (list > 100) {
    send list to Solr
    clear list
  }
}
end of program

As the program exits, there'll still be docs in the list that haven't been sent to Solr.

Alessandro's question hints at things like this; the first question is whether all the docs got sent to Solr or not. Second question is whether they're analyzed differently in the two cores. Third question...

Best, Erick

On Fri, Jun 19, 2015 at 8:32 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: So, the first I can say is if that is true : it almost killed Solr with 280 files you are doing something wrong for sure. At least if you are not trying to index 4k full movies xD Joking apart : 1) You should carefully design your analyser. 2) You should store your fields initially to verify you index what you were supposed to ( in number and in content) Assuming you are a beginner storing the fields will make easier for you to check, as they will pop out of the results. is at least the number of docs indexed correct ? 2015-06-19 15:34 GMT+01:00 Paden rumsey...@gmail.com: Yeah, actually changing the field to text_en or text_en_splitting actually made it so my indexer indexed all my files. The only problem is, I don't think it's doing it well. I have two Cores that I'm working with. Both of them have indexed the same set of files. The first core, which I will refer to as Testcore, I used a DIH configuration that indexed the files with their metadata. (It indexed everything fine but it almost killed Solr with 280 files I would hate to see what would happen with say, 10,000 files.). When I query Testcore on some random common word like a it returns like 279 files. A good margin I can accept that. The second core, which I will refer to as Testcore2, I used my own indexer that I created and use SolrJ as the client. It indexes everything. However, when I query on the same word a it only returns 208 of the 281 files. Which is weird cause I'm using the exact same Querying handler for both. So I don't think a comprehensive indexed text is being sent to Solr. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
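As a minimal SolrJ sketch of the batching pattern described above (the core name Testcore2 comes from the thread; the field names, batch size, and helper method are assumptions for illustration), note the flush of the last partial batch and the explicit commit before exiting:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/Testcore2");
        List<SolrInputDocument> batch = new ArrayList<>();
        for (String text : loadExtractedTexts()) {      // placeholder source of extracted text
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("text", text);
            batch.add(doc);
            if (batch.size() >= 100) {                  // send full batches as we go
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {                         // don't drop the trailing partial batch
            client.add(batch);
        }
        client.commit();                                // nothing is searchable until a commit
        client.close();
    }

    // Stand-in for however the application gathers the extracted PDF text.
    private static List<String> loadExtractedTexts() {
        return new ArrayList<>();
    }
}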
Re: CREATE collection bug or feature?
On 6/19/2015 11:15 AM, Jim.Musil wrote: I noticed that when I issue the CREATE collection command to the api, it does not automatically put a replica on every live node connected to zookeeper. So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and create a collection like this:

/admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config

It will only create a core on one of the three nodes. I can make it work if I change replicationFactor to 3. When standing up an entire stack using chef, this all gets a bit clunky. I don't see any option such as ALL that would just create a replica on all nodes regardless of size. I'm guessing this is intentional, but curious about the reasoning.

If you tell it replicationFactor=1, then you get exactly that -- one copy of your index. I personally think that it would be a violation of something known as the principle of least surprise for Solr to automatically create replicas without being asked to. I would assume that if you are writing automated tools to build indexes and the servers hosting those indexes that your automation will be able to calculate a reasonable replicationFactor, or calculate the number of hosts to create based on a provided replicationFactor.

A feature to have Solr itself automatically calculate a replicationFactor based on the number of available hosts and the numShards value provided is not a bad idea. Please create a feature request issue in Jira. One way that this might be done is by setting replicationFactor to auto or maybe a special number, perhaps 0 or -1. https://issues.apache.org/jira/browse/SOLR

Thanks, Shawn
Re: CREATE collection bug or feature?
Jim: This is by design. There's no way to tell Solr to find all the nodes available and put one replica on each. In fact, you're explicitly telling it to create one and only one replica, one and only one shard. That is, your collection will have exactly one low-level core. But you realized that...

As to the reasoning: consider heterogeneous collections all hosted on the same Solr cluster. I have big collections, little collections, some with high QPS rates, some not, etc. Having Solr do things like this automatically would make managing this difficult. Probably the real reason is nobody thought it would be useful in the general case. And I probably concur. Adding a new node to an existing cluster would result in unbalanced clusters, etc.

I suppose a stop-gap would be to query the live_nodes in the cluster and add that to the URL; don't know how much of a pain that would be though.

Best, Erick

On Fri, Jun 19, 2015 at 10:15 AM, Jim.Musil jim.mu...@target.com wrote: I noticed that when I issue the CREATE collection command to the api, it does not automatically put a replica on every live node connected to zookeeper. So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and create a collection like this: /admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config It will only create a core on one of the three nodes. I can make it work if I change replicationFactor to 3. When standing up an entire stack using chef, this all gets a bit clunky. I don't see any option such as ALL that would just create a replica on all nodes regardless of size. I'm guessing this is intentional, but curious about the reasoning. Thanks! Jim
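As a hedged sketch of that stop-gap: the Collections API CREATE command accepts a createNodeSet parameter listing the nodes to place replicas on, so a provisioning tool can read the live nodes first (for example from the /live_nodes path in ZooKeeper, or via the CLUSTERSTATUS action) and pass them explicitly. The host names below are placeholders:

/admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=3&maxShardsPerNode=1&collection.configName=my_config&createNodeSet=host1:8983_solr,host2:8983_solr,host3:8983_solr

Setting replicationFactor to the number of nodes listed in createNodeSet gives one replica per listed node.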
Re: How to do a Data sharding for data in a database table
As for now, the index size is 6.5 M records, and the performance is good enough. I will re-build the index for all the records (14 M) and test it again with debug turned on. Thanks On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson erickerick...@gmail.com wrote: First and most obvious thing to try: bq: the Solr was started with maximal 4G for JVM, and index size is 2G Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very loosely coupled to JVM requirements. It's quite possible that you're spending all your time in GC cycles. Consider gathering GC characteristics, see: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ As Charles says, on the face of it the system you describe should handle quite a load, so it feels like things can be tuned and you won't have to resort to sharding. Sharding inevitably imposes some overhead so it's best to go there last. From my perspective, this is, indeed, an XY problem. You're assuming that sharding is your solution. But you really haven't identified the _problem_ other than queries are too slow. Let's nail down the reason queries are taking a second before jumping into sharding. I've just spent too much of my life fixing the wrong thing ;) It would be useful to see a couple of sample queries so we can get a feel for how complex they are. Especially if you append, as Charles mentions, debug=true Best, Erick On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Grouping does tend to be expensive. Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M docs). This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier). But this approach is unlikely to work for most cases. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Friday, June 19, 2015 9:52 AM To: solr-user@lucene.apache.org Subject: RE: How to do a Data sharding for data in a database table Hi Wenbin, To me, your instance appears well provisioned. Likewise, your analysis of test vs. production performance makes a lot of sense. Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding? To that end, what do you see when you set debugQuery=true? Where does solr spend the time? My guess would be in the grouping and sorting steps, but which? Sometime the schema details matter for performance. Folks on this list can help with that. -Charlie -Original Message- From: Wenbin Wang [mailto:wwang...@gmail.com] Sent: Friday, June 19, 2015 7:55 AM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration. The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. 
It can be 1 second with more demanding requests. In our production environment, we have 64 cores, and we need to support 300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle 300 concurrent users in production. There is no plan to increase the total number of cores 5 times. In a previous test, a search index around 6M data size was able to handle 5 request per second in each core of my 8-core machine. By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing data sharding. However, I am also open to solution that can improve the performance of the index of 13M to 14M size so that I do not need to do a data sharding. On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: You've repeated your original statement. Shawn's observation is that 10M docs is a very small corpus by Solr standards. You either have very demanding document/search
CREATE collection bug or feature?
I noticed that when I issue the CREATE collection command to the api, it does not automatically put a replica on every live node connected to zookeeper. So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and create a collection like this:

/admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config

It will only create a core on one of the three nodes. I can make it work if I change replicationFactor to 3. When standing up an entire stack using chef, this all gets a bit clunky. I don't see any option such as ALL that would just create a replica on all nodes regardless of size. I'm guessing this is intentional, but curious about the reasoning. Thanks! Jim
RE: Extended Dismax Query Parser with AND as default operator
Dirk,

There are 3 open JIRAs related to this behavior:
https://issues.apache.org/jira/browse/SOLR-3739
https://issues.apache.org/jira/browse/SOLR-3740
https://issues.apache.org/jira/browse/SOLR-3741

We worked around it by adding the explicit + signs if the query matched the problematic patterns. A pain, I know.

-Original Message-
From: Dirk Buchhorn [mailto:dirk.buchh...@finkundpartner.de]
Sent: Thursday, June 18, 2015 3:31 AM
To: solr-user@lucene.apache.org
Subject: Extended Dismax Query Parser with AND as default operator

Hello, I have a question about the extended dismax query parser. If the default operator is changed to AND (q.op=AND) then the search results seem to be incorrect. I will explain it with some examples. For this test I use Solr v5.1 and the tika core from the example directory.

== Preparation ==

Add the following lines to the schema.xml file:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

Change the field text to stored=true. Remove the multiValued attribute from the title and text fields (we don't need multi-valued fields in our test).

Add test data (use curl or fiddler):
Url: http://localhost:8983/solr/tika/update/json?commit=true
Header: Content-type: application/json

[
  {"id":"1", "title":"green", "author":"Jon", "text":"blue"},
  {"id":"2", "title":"green", "author":"Jon Jessie", "text":"red"},
  {"id":"3", "title":"yellow", "author":"Jessie", "text":"blue"},
  {"id":"4", "title":"green", "author":"Jessie", "text":"blue"},
  {"id":"5", "title":"blue", "author":"Jon", "text":"yellow"},
  {"id":"6", "title":"red", "author":"Jon", "text":"green"}
]

== Test ==

The following parameters are always set:
default operator is AND: q.op=AND
use the extended dismax query parser: defType=edismax
set the default query fields to title and text: qf=title text
sort: id asc

=== #1 test ===
q=red green
response: {"numFound":2,"start":0,"docs":[
  {"id":"2","title":"green","author":"Jon Jessie","text":"red"},
  {"id":"6","title":"red","author":"Jon","text":"green"}]}
parsedquery_toString: +(((text:green | title:green) (text:red | title:red))~2)
This test works as expected.

=== #2 test ===
We use a group: q=(red green)
Same response as test one.
parsedquery_toString: +(((text:green | title:green) (text:red | title:red))~2)
This test works as expected.

=== #3 test ===
q=green red author:Jessie
response: {"numFound":1,"start":0,"docs":[{"id":"2","title":"green","author":"Jon Jessie","text":"red"}]}
parsedquery_toString: +(((text:green | title:green) (text:red | title:red) author:jessie)~3)
This test works as expected.

=== #4 test ===
q=(green red) author:Jessie
response: {"numFound":2,"start":0,"docs":[
  {"id":"2","title":"green","author":"Jon Jessie","text":"red"},
  {"id":"4","title":"green","author":"Jessie","text":"blue"}]}
parsedquery_toString: +((((text:green | title:green) (text:red | title:red)) author:jessie)~2)
The same result as the 3rd test was expected. Why is AND not applied to the query group?

=== #5 test ===
q=(+green +red) author:Jessie
response: {"numFound":4,"start":0,"docs":[
  {"id":"2","title":"green","author":"Jon Jessie","text":"red"},
  {"id":"3","title":"yellow","author":"Jessie","text":"blue"},
  {"id":"4","title":"green","author":"Jessie","text":"blue"},
  {"id":"6","title":"red","author":"Jon","text":"green"}]}
parsedquery_toString: +((+(text:green | title:green) +(text:red | title:red)) author:jessie)
Now AND is used for the group, but the author clause is combined with OR. Why?
=== #6 test ===
q=(+green +red) +author:Jessie
response: {"numFound":3,"start":0,"docs":[
  {"id":"2","title":"green","author":"Jon Jessie","text":"red"},
  {"id":"3","title":"yellow","author":"Jessie","text":"blue"},
  {"id":"4","title":"green","author":"Jessie","text":"blue"}]}
parsedquery_toString: +((+(text:green | title:green) +(text:red | title:red)) +author:jessie)
Still not the expected result.

=== #7 test ===
q=+(+green +red) +author:Jessie
response: {"numFound":1,"start":0,"docs":[{"id":"2","title":"green","author":"Jon Jessie","text":"red"}]}
parsedquery_toString: +(+(+(text:green | title:green) +(text:red | title:red)) +author:jessie)
Now the result is correct. But if all operators must be given explicitly, then q.op=AND is useless.

=== #8 test ===
q=green author:(Jon Jessie)
Found four results; one was expected. The query must be changed to '+green +author:(+Jon +Jessie)' to get the expected result.

Is this a bug in the extended dismax parser, or what is the reason for not consistently applying q.op=AND to the query expression?

Kind regards
Dirk Buchhorn
Re: CREATE collection bug or feature?
Thanks as always for the great answers! Jim

On 6/19/15, 11:57 AM, Erick Erickson erickerick...@gmail.com wrote: Jim: This is by design. There's no way to tell Solr to find all the nodes available and put one replica on each. In fact, you're explicitly telling it to create one and only one replica, one and only one shard. That is, your collection will have exactly one low-level core. But you realized that... As to the reasoning. Consider heterogeneous collections all hosted on the same Solr cluster. I have big collections, little collections, some with high QPS rates, some not. etc. Having Solr do things like this automatically would make managing this difficult. Probably the real reason is nobody thought it would be useful in the general case. And I probably concur. Adding a new node to an existing cluster would result in unbalanced clusters etc. I suppose a stop-gap would be to query the live_nodes in the cluster and add that to the URL, don't know how much of a pain that would be though. Best, Erick On Fri, Jun 19, 2015 at 10:15 AM, Jim.Musil jim.mu...@target.com wrote: I noticed that when I issue the CREATE collection command to the api, it does not automatically put a replica on every live node connected to zookeeper. So, for example, if I have 3 solr nodes connected to a zookeeper ensemble and create a collection like this: /admin/collections?action=CREATE&name=my_collection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_config It will only create a core on one of the three nodes. I can make it work if I change replicationFactor to 3. When standing up an entire stack using chef, this all gets a bit clunky. I don't see any option such as ALL that would just create a replica on all nodes regardless of size. I'm guessing this is intentional, but curious about the reasoning. Thanks! Jim
CollapseQParserPluging Incorrect Facet Counts
Hi, We are comparing results between Field Collapsing (group* parameters) and CollapseQParserPlugin. We noticed that some facets are returning incorrect counts. Here are the relevant parameters of one of our test queries:

Field Collapsing:
q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&group=true&group.field=groupid&group.facet=true&group.ngroups=true
ngroups = 5964
<lst name="searchcolorfacet"> ... <int name="red">11</int> ... </lst>

CollapseQParserPlugin:
q=red%20dress&facet=true&facet.mincount=1&facet.limit=-1&facet.field=searchcolorfacet&fq=%7B!collapse%20field=groupid%7D
numFound = 5964 (same)
<lst name="searchcolorfacet"> ... <int name="red">8</int> ... </lst>

When we change the CollapseQParserPlugin query by adding fq=searchcolorfacet:red, the numFound value is 11, effectively showing all 11 hits with that color. The facet count for red now shows the correct value of 11 as well. Has anyone seen something similar? Thanks, Carlos
RE: How to do a Data sharding for data in a database table
Also, since you are tuning for relative times, you can tune on the smaller index. Surely, you will want to test at scale. But tuning query, analyzer or schema options is usually easier to do on a smaller index. If you get a 3x improvement at small scale, it may only be 2.5x at full scale. E.g. storing the group field as doc values is one option that can help grouping performance in some cases (at least according to this list, I haven't tried it yet). The number of distinct values of the grouping field is important as well. If there are very many, you may want to try CollapsingQParserPlugin. The point being, some of these options may require reindexing! So, again, it is a much easier and faster process to tune on a smaller index. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, June 19, 2015 2:33 PM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table Do be aware that turning on debug=query adds a load. I've seen the debug component take 90% of the query time. (to be fair it usually takes a much smaller percentage). But you'll see a section at the end of the response if you set debug=all with the time each component took so you'll have a sense of the relative time used by each component. Best, Erick On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang wwang...@gmail.com wrote: As for now, the index size is 6.5 M records, and the performance is good enough. I will re-build the index for all the records (14 M) and test it again with debug turned on. Thanks On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson erickerick...@gmail.com wrote: First and most obvious thing to try: bq: the Solr was started with maximal 4G for JVM, and index size is 2G Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very loosely coupled to JVM requirements. It's quite possible that you're spending all your time in GC cycles. Consider gathering GC characteristics, see: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ As Charles says, on the face of it the system you describe should handle quite a load, so it feels like things can be tuned and you won't have to resort to sharding. Sharding inevitably imposes some overhead so it's best to go there last. From my perspective, this is, indeed, an XY problem. You're assuming that sharding is your solution. But you really haven't identified the _problem_ other than queries are too slow. Let's nail down the reason queries are taking a second before jumping into sharding. I've just spent too much of my life fixing the wrong thing ;) It would be useful to see a couple of sample queries so we can get a feel for how complex they are. Especially if you append, as Charles mentions, debug=true Best, Erick On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Grouping does tend to be expensive. Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M docs). This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier). But this approach is unlikely to work for most cases. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Friday, June 19, 2015 9:52 AM To: solr-user@lucene.apache.org Subject: RE: How to do a Data sharding for data in a database table Hi Wenbin, To me, your instance appears well provisioned. Likewise, your analysis of test vs. 
production performance makes a lot of sense. Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding? To that end, what do you see when you set debugQuery=true? Where does solr spend the time? My guess would be in the grouping and sorting steps, but which? Sometime the schema details matter for performance. Folks on this list can help with that. -Charlie -Original Message- From: Wenbin Wang [mailto:wwang...@gmail.com] Sent: Friday, June 19, 2015 7:55 AM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration. The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list
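Regarding the docValues suggestion earlier in this message: it is a schema.xml change on the field used for grouping or collapsing, and as noted it requires reindexing. A minimal sketch, assuming a hypothetical string field named groupid:

<field name="groupid" type="string" indexed="true" stored="true" docValues="true"/>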
Re: How to do a Data sharding for data in a database table
Do be aware that turning on debug=query adds a load. I've seen the debug component take 90% of the query time. (to be fair it usually takes a much smaller percentage). But you'll see a section at the end of the response if you set debug=all with the time each component took so you'll have a sense of the relative time used by each component. Best, Erick On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang wwang...@gmail.com wrote: As for now, the index size is 6.5 M records, and the performance is good enough. I will re-build the index for all the records (14 M) and test it again with debug turned on. Thanks On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson erickerick...@gmail.com wrote: First and most obvious thing to try: bq: the Solr was started with maximal 4G for JVM, and index size is 2G Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very loosely coupled to JVM requirements. It's quite possible that you're spending all your time in GC cycles. Consider gathering GC characteristics, see: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ As Charles says, on the face of it the system you describe should handle quite a load, so it feels like things can be tuned and you won't have to resort to sharding. Sharding inevitably imposes some overhead so it's best to go there last. From my perspective, this is, indeed, an XY problem. You're assuming that sharding is your solution. But you really haven't identified the _problem_ other than queries are too slow. Let's nail down the reason queries are taking a second before jumping into sharding. I've just spent too much of my life fixing the wrong thing ;) It would be useful to see a couple of sample queries so we can get a feel for how complex they are. Especially if you append, as Charles mentions, debug=true Best, Erick On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Grouping does tend to be expensive. Our regular queries typically return in 10-15ms while the grouping queries take 60-80ms in a test environment ( 1M docs). This is ok for us, since we wrote our app to take the grouping queries out of the critical path (async query in parallel with two primary queries and some work in middle tier). But this approach is unlikely to work for most cases. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Friday, June 19, 2015 9:52 AM To: solr-user@lucene.apache.org Subject: RE: How to do a Data sharding for data in a database table Hi Wenbin, To me, your instance appears well provisioned. Likewise, your analysis of test vs. production performance makes a lot of sense. Perhaps your time would be well spent tuning the query performance for your app before resorting to sharding? To that end, what do you see when you set debugQuery=true? Where does solr spend the time? My guess would be in the grouping and sorting steps, but which? Sometime the schema details matter for performance. Folks on this list can help with that. -Charlie -Original Message- From: Wenbin Wang [mailto:wwang...@gmail.com] Sent: Friday, June 19, 2015 7:55 AM To: solr-user@lucene.apache.org Subject: Re: How to do a Data sharding for data in a database table I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or computer disk bound. In addition, the Solr was started with maximal 4G for JVM, and index size is 2G. In a typical test, I made sure enough free RAM of 10G was available. I have not tuned any parameter in the configuration, it is default configuration. 
The number of fields for each record is around 10, and the number of results to be returned per page is 30. So the response time should not be affected by network traffic, and it is tested in the same machine. The query has a list of 4 search parameters, and each parameter takes a list of values or date range. The results will also be grouped and sorted. The response time of a typical single request is around 1 second. It can be 1 second with more demanding requests. In our production environment, we have 64 cores, and we need to support 300 concurrent users, that is about 300 concurrent request per second. Each core will have to process about 5 request per second. The response time under this load will not be 1 second any more. My estimate is that an average of 200 ms response time of a single request would be able to handle 300 concurrent users in production. There is no plan to increase the total number of cores 5 times. In a previous test, a search index around 6M data size was able to handle 5 request per second in each core of my 8-core machine. By doing data sharding from one single index of 13M to 2 indexes of 6 or 7 M/each, I am expecting much faster response time that can meet the demand of production environment. That is the motivation of doing