Distributed Search component question
Hi all,

I have the following search components that I can't get working in distributed mode on Solr 4.10.4:

[standard query component]

[search component-1] (StageID = 2500):

    handleResponses:
        // read a few values from the docs, populate parameters for the
        // stats component, and stash some metadata in the ResponseBuilder
        rb.rsp.add("metadata", NamedList...)

    distributedProcess:
        rb.doFacets = false;
        if (rb.stage < StageID) {
            if (null == rb.rsp[metadata]) {
                return StageID;
            }
        }
        return component-2.StageID;

[search component-2] (StageID = 2800):

    distributedProcess:
        rb.doFacets = true;
        // format and set some facet params based on rb.rsp[metadata]
        return ResponseBuilder.STAGE_GET_FIELDS;

[standard facet component]

Things seem to work fine between component-1 and component-2; I just can't prevent facets from running until component-2 has set the proper facet params, and then the facet component sets rb._facetInfo to null. Should I move my logic in component-2 from distributedProcess to handleResponses, modify the ShardRequest, and call rb.addRequest?

Any hints are much appreciated.

Mihran
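For what it's worth, here is a minimal sketch of the stage-gating pattern described above, written against the Solr 4.x SearchComponent API. The class name, stage constants, and the "metadata" response key are illustrative assumptions, not the actual code from this thread:

    import java.io.IOException;

    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class StageGatingComponent extends SearchComponent {

        private static final int MY_STAGE = 2500;   // component-1's stage
        private static final int NEXT_STAGE = 2800; // component-2's stage

        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            // no-op in this sketch
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // non-distributed (single-shard) work would go here
        }

        @Override
        public int distributedProcess(ResponseBuilder rb) throws IOException {
            rb.doFacets = false; // hold faceting off until the metadata is ready
            if (rb.stage < MY_STAGE
                    && rb.rsp.getValues().get("metadata") == null) {
                return MY_STAGE; // run our stage so handleResponses can set it
            }
            return NEXT_STAGE;   // metadata is in place; hand off to component-2
        }

        @Override
        public String getDescription() {
            return "stage-gating sketch";
        }

        @Override
        public String getSource() {
            return null;
        }
    }

The coordinator advances stage by stage, calling distributedProcess on every component and taking the minimum of the returned stages as the next one, so returning MY_STAGE here is what pulls the request into component-1's stage before later stages run.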
PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour
I must be missing something obvious. I have a simple regex that removes the space-hyphen-space pattern. The unit test below works fine, but when I plug it into the schema and query, the regex does not match, since the input has already been split on whitespace (see further below). My understanding is that the charFilter operates on the raw input string and then hands it to the whitespace tokenizer, which seems to be the case in the unit test, but I am not sure why I get an already-split token stream at query time.

    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, tokenizer);
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new PatternReplaceCharFilter(
                Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+"),
                " ", reader);
        }
    };
    final TokenStream tokens = analyzer.tokenStream("", new StringReader("a - b"));
    tokens.reset();
    final CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
    while (tokens.incrementToken()) {
        System.out.println("=== " + new String(Arrays.copyOf(termAtt.buffer(), termAtt.length())));
    }

I end up with:

    === a
    === b

Now I define the same in my schema:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
               multiValued="true" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\s+"
                    replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <field name="myfield" type="text" indexed="true" stored="false" multiValued="true"/>

When I query, the input arrives at PatternReplaceCharFilter's processPattern method already split into tokens (e.g. a, -, b), so the regex does not match:

    CharSequence processPattern(CharSequence input) ...

even though the charFilter is defined before the tokenizer. Here is the query:

    SolrQuery solrQuery = new SolrQuery("a - b");
    solrQuery.setRequestHandler("/select");
    solrQuery.set("defType", "edismax");
    solrQuery.set("qf", "myfield");
    solrQuery.set(CommonParams.ROWS, 0);
    solrQuery.set(CommonParams.DEBUG, true);
    solrQuery.set(CommonParams.DEBUG_QUERY, true);
    QueryResponse response = solrSvr.query(solrQuery);
    System.out.println("parsedQtoString " + response.getDebugMap().get("parsedquery_toString"));
    System.out.println("parsedQ " + response.getDebugMap().get("parsedquery"));

Output is:

    parsedQtoString +((myfield:a) (myfield:-) (myfield:b))
    parsedQ (+(DisjunctionMaxQuery((myfield:a)) DisjunctionMaxQuery((myfield:-)) DisjunctionMaxQuery((myfield:b))))/no_coord
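A quick standalone check (my own sketch, assuming Lucene 4.x analyzers-common on the classpath) that drives PatternReplaceCharFilter directly as a Reader, confirming that the charFilter itself does see the raw character stream before any tokenization:

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

    public class CharFilterCheck {
        public static void main(String[] args) throws Exception {
            Pattern p = Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+");
            // replace " - " (and its unicode-hyphen variants) with a single space
            Reader filtered = new PatternReplaceCharFilter(p, " ", new StringReader("a - b"));
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = filtered.read()) != -1) {
                out.append((char) c);
            }
            System.out.println(out); // prints "a b" -- the hyphen is gone
        }
    }

So the filter behaves as expected in isolation; the question is what splits the query string before the analyzer ever runs.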
Re: Relevancy : Keyword stuffing
Thank you Markus and Chris for the pointers. For SweetSpotSimilarity I am thinking that a set of closed ranges exposed via the similarity config may be easier to maintain as the data changes than making adjustments to fit a function. Another piece of info that would have been handy is the average position of each term, plus the positions of its first few occurrences; that would allow boosting term occurrences that appear earlier in the doc more heavily. In my case the extra keywords are towards the end of the doc, but that info does not seem to be propagated into the scorer.

Thanks again,
Mihran

On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> You should start by checking out the SweetSpotSimilarity .. it was heavily
> designed around the idea of dealing with things like excessively verbose
> titles and keyword stuffing in summary text ... so you can configure your
> expectation of what a normal-length doc is, and docs will be penalized for
> being longer than that. Similarly, you can say what a 'reasonable' tf is,
> and docs that exceed it won't get an added boost (which, in conjunction
> with the lengthNorm penalty, penalizes docs that stuff keywords).
>
> https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
>
> -Hoss
> http://www.lucidworks.com/
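Not from the thread, but for anyone landing here: a hypothetical schema.xml snippet along the lines Hoss describes. The parameter names and example values below are taken from my reading of the SweetSpotSimilarityFactory javadoc and should be tuned against your own corpus:

    <!-- Global <similarity/> in schema.xml; example values only. -->
    <similarity class="solr.SweetSpotSimilarityFactory">
      <!-- docs whose field length falls inside the "sweet spot" get the
           full length norm; longer docs are increasingly penalized -->
      <int name="lengthNormMin">3</int>
      <int name="lengthNormMax">5</int>
      <float name="lengthNormSteepness">0.5</float>
      <!-- hyperbolic tf is bounded above, so stuffing more occurrences
           of a term yields rapidly diminishing returns -->
      <float name="hyperbolicTfMin">3.3</float>
      <float name="hyperbolicTfMax">7.7</float>
      <double name="hyperbolicTfBase">2.718281828459045</double>
      <float name="hyperbolicTfOffset">10.0</float>
    </similarity>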
Relevancy : Keyword stuffing
Hi all, I have a use case where the data is generated by SEO-minded authors, and more often than not they perfectly guess the synonym expansions for the document titles, skewing results in their favor. At the moment I don't have an offline processing infrastructure to detect these (and I can't punish these docs either... I just have to level the playing field). I am experimenting with taking the max of the term scores, cutting off scores after a certain number of terms, etc., but I would appreciate any hints from anyone with experience dealing with a similar use case in Solr.

Much appreciated,
Mihran
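One way to prototype the "cut off scores" idea is a tiny custom Similarity. This is my own sketch against the Lucene 4.x TF/IDF DefaultSimilarity (Solr 4.x's default), not something from the thread, and MAX_TF is an arbitrary assumed cutoff:

    import org.apache.lucene.search.similarities.DefaultSimilarity;

    // Caps the raw term frequency so occurrences beyond MAX_TF add no
    // further score; repeated (stuffed) terms stop paying off.
    public class CappedTfSimilarity extends DefaultSimilarity {

        private static final float MAX_TF = 5f; // assumed cutoff, tune per corpus

        @Override
        public float tf(float freq) {
            // DefaultSimilarity.tf returns sqrt(freq); clamp freq first
            return super.tf(Math.min(freq, MAX_TF));
        }
    }

This can be referenced from schema.xml via a <similarity class="..."/> element; SweetSpotSimilarity's hyperbolic tf (see the follow-up in this thread) is a smoother version of the same cap.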
boosting by geodist - GC Overhead Limit exceeded
I am running Solr 4.10.2 with geofilt (~20% of docs have 30+ lat/lon points) and everything works hunky-dory. Then I added a bf with geodist along the lines of:

    recip(geodist(),5,20,5)

After a few hours of running I end up with an OOM: GC overhead limit exceeded. I've seen https://issues.apache.org/jira/browse/LUCENE-4698 and a few other relevant tickets; wanted to check if anyone has any successful remedies.

Many thanks,
Mihran

My GC params on an Amazon XL instance:

    -server -Xmx8g -Xms8g -XX:+HeapDumpOnOutOfMemoryError \
    -XX:NewRatio=3 \
    -XX:SurvivorRatio=4 \
    -XX:TargetSurvivorRatio=90 \
    -XX:MaxTenuringThreshold=8 \
    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
    -XX:+CMSScavengeBeforeRemark \
    -XX:PretenureSizeThreshold=64m \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:CMSInitiatingOccupancyFraction=50 \
    -XX:CMSMaxAbortablePrecleanTime=6000 \
    -XX:+CMSParallelRemarkEnabled \
    -XX:+ParallelRefProcEnabled

Screenshot from Eclipse MAT: [image: Inline image 1]
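For readers unfamiliar with the boost function: recip(x,m,a,b) computes a/(m*x+b), so recip(geodist(),5,20,5) = 20/(5*d+5) = 4/(d+1), where d is the distance in kilometers that geodist() returns. That gives a boost of 4 for a document at the query point, 2 at 1 km, and a tail toward 0 as distance grows.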