Distributed Search component question

2015-06-19 Thread Mihran Shahinian
Hi all,
I have the following search components, and at the moment I don't have a way
to get them working in distributed mode on Solr 4.10.4.

[standard query component]
[search component-1] (StageID = 2500):
 handleResponses: get a few values from the docs, populate parameters for the
stats component, and set some metadata in the ResponseBuilder:
  rb.rsp.add(metadata, NamedList...)

distributedProcess:
   rb.doFacets = false;
   if (rb.stage < StageID) {
     if (null == rb.rsp[metadata]) {
       return StageID;
     }
   }
   return component-2.StageID

[search component-2] (StageID = 2800):
distributedProcess:
   rb.doFacets = true;
   formatAndSet some facet params based on rb.rsp[metadata]
   return ResponseBuilder.STAGE_GET_FIELDS

[standard facet component]:


Things seem to work fine between component-1 and component-2; I just can't
prevent facets from running before component-2 sets the proper facet params.
And then the facet component sets rb._facetInfo to null. Should I move my
logic in component-2 from distributedProcess to handleResponses, modify the
ShardRequest, and call rb.addRequest?
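To make the idea concrete, here is a rough, unverified sketch of what I mean by
component-2 adding its own facet shard request once component-1's metadata is
available (the stage value, the "metadata" key, and the param wiring are
illustrative, not working code):

import java.io.IOException;

import org.apache.solr.common.params.FacetParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.handler.component.ShardRequest;

public class Component2 extends SearchComponent {
    private static final int MY_STAGE = 2800;   // illustrative stage id

    @Override
    public int distributedProcess(ResponseBuilder rb) throws IOException {
        if (rb.stage < MY_STAGE) {
            rb.doFacets = false;                // keep FacetComponent quiet for now
            return MY_STAGE;
        }
        if (rb.stage == MY_STAGE) {
            // metadata put into the response earlier by component-1's handleResponses
            NamedList md = (NamedList) rb.rsp.getValues().get("metadata");
            if (md != null) {
                ShardRequest sreq = new ShardRequest();
                sreq.purpose = ShardRequest.PURPOSE_GET_FACETS;
                sreq.params = new ModifiableSolrParams(rb.req.getParams());
                sreq.params.set(FacetParams.FACET, true);
                // ... derive and set the real facet params from md here ...
                rb.doFacets = true;
                rb.addRequest(this, sreq);
            }
            return ResponseBuilder.STAGE_GET_FIELDS;
        }
        return ResponseBuilder.STAGE_DONE;
    }

    @Override public void prepare(ResponseBuilder rb) throws IOException {}
    @Override public void process(ResponseBuilder rb) throws IOException {}
    @Override public String getDescription() { return "component-2 sketch"; }
    @Override public String getSource() { return null; }
}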

Any hints are much appreciated.
Mihran


PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour

2015-05-11 Thread Mihran Shahinian
I must be missing something obvious. I have a simple regex that removes a
space-hyphen-space pattern.

The unit test below works fine, but when I plug it into the schema and query,
the regex does not match, since the input is already split on spaces (further
below). My understanding is that the charFilter operates on the raw input
string and then passes it to the whitespace tokenizer, which is what happens
in the unit test, but I am not sure why I get an already-split token stream
at query time.

import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
                                                     Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader,
                                                MockTokenizer.WHITESPACE,
                                                false);
        return new TokenStreamComponents(tokenizer, tokenizer);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new PatternReplaceCharFilter(
            Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+"),
            " ",
            reader);
    }
};

final TokenStream tokens = analyzer.tokenStream("", new StringReader("a - b"));
tokens.reset();
final CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
    System.out.println("=== " + new String(Arrays.copyOf(termAtt.buffer(),
                                                          termAtt.length())));
}

I end up with:
=== a
=== b


Now I define the same in my schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           multiValued="true" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\s+"
                replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="myfield" type="text" indexed="true" stored="false"
       multiValued="true"/>

When I query, the input already arrives in PatternReplaceCharFilter's
processPattern method split into pieces (e.g. a, -, b), so the regex does not
match, even though the charFilter is defined before the tokenizer:
CharSequence processPattern(CharSequence input) ...
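As a sanity check (hypothetical request; the path and parameters assume the
stock FieldAnalysisRequestHandler from the example solrconfig, and
"collection1" is a placeholder core name), running the whole string through
the field type's query analyzer via the analysis handler shows whether the
charFilter ever sees the unsplit input:

http://localhost:8983/solr/collection1/analysis/field
    ?analysis.fieldtype=text
    &analysis.query=a - b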




Here is the query
SolrQuery solrQuery = new SolrQuery("a - b");
solrQuery.setRequestHandler("/select");
solrQuery.set("defType", "edismax");
solrQuery.set("qf", "myfield");
solrQuery.set(CommonParams.ROWS, 0);
solrQuery.set(CommonParams.DEBUG, true);
solrQuery.set(CommonParams.DEBUG_QUERY, true);
QueryResponse response = solrSvr.query(solrQuery);

System.out.println("parsedQtoString " +
                   response.getDebugMap().get("parsedquery_toString"));
System.out.println("parsedQ " +
                   response.getDebugMap().get("parsedquery"));

Output is
parsedQtoString +((myfield:a) (myfield:-) (myfield:b))
parsedQ (+(DisjunctionMaxQuery((myfield:a))
DisjunctionMaxQuery((myfield:-)) DisjunctionMaxQuery((myfield:b/no_coord


Re: Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Thank you Markus and Chris for the pointers.
For SweetSpotSimilarity, I am thinking that a set of closed ranges exposed via
the similarity config may be easier to maintain as the data changes than
making adjustments to fit a function. Another piece of info that would have
been handy is the average position of each term, plus the positions of its
first few occurrences. That would allow boosting term occurrences that appear
earlier in the doc more heavily. In my case the extra keywords are towards the
end of the doc, but that info does not seem to be propagated into the scorer.
Thanks again,
Mihran



On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 You should start by checking out the SweetSpotSimilarity .. it was
 heavily designed around the idea of dealing with things like excessively
 verbose titles and keyword stuffing in summary text ... so you can
 configure your expectation for what a normal-length doc is, and docs
 will be penalized for being longer than that.  Similarly, you can say what
 a 'reasonable' tf is, and docs that exceed that won't get added boost
 (which, in conjunction with the lengthNorm penalty, penalizes docs that
 stuff keywords).


 https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html


 https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg

 https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg


 -Hoss
 http://www.lucidworks.com/
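For anyone following along, the knobs Hoss mentions are set per field type in
schema.xml; a minimal sketch (the field type name and all values are
illustrative, not tuned recommendations):

<fieldType name="text_ss" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <!-- illustrative sweet-spot ranges -->
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- docs between 3 and 5 terms get the full lengthNorm; docs outside that range are penalized -->
    <int name="lengthNormMin">3</int>
    <int name="lengthNormMax">5</int>
    <float name="lengthNormSteepness">0.5</float>
    <!-- tf values up to 6 are flattened to the base value instead of growing -->
    <float name="baselineTfMin">6.0</float>
    <float name="baselineTfBase">1.5</float>
  </similarity>
</fieldType>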



Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Hi all,
I have a use case where the data is generated by SEO-minded authors, and more
often than not they perfectly guess the synonym expansions for the document
titles, skewing results in their favor.
At the moment I don't have offline processing infrastructure to detect these
(I can't punish these docs either... I just have to level the playing field).
I am experimenting with taking the max of the term scores, cutting off scores
after a certain number of terms, etc., but would appreciate any hints if
anyone has experience dealing with a similar use case in Solr.
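For illustration, the "cutting off scores" idea could look roughly like this
as a custom Similarity (a sketch only, not a drop-in solution; the class name
and cutoff value are made up, and it assumes the Lucene 4.x DefaultSimilarity):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class CappedTfSimilarity extends DefaultSimilarity {
    private static final float MAX_TF = 3f;   // occurrences beyond this stop adding score

    @Override
    public float tf(float freq) {
        // cap the term frequency before the usual sqrt scaling
        return (float) Math.sqrt(Math.min(freq, MAX_TF));
    }
}

If I read the schema docs right, a plain Similarity class like this can be
referenced directly from a <similarity/> element on the field type.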

Much appreciated,
Mihran


boosting by geodist - GC Overhead Limit exceeded

2015-01-21 Thread Mihran Shahinian
I am running Solr 4.10.2 with geofilt (~20% of docs have 30+ lat/lon points)
and everything works hunky-dory. Then I added a bf with geodist along the
lines of:
recip(geodist(),5,20,5)
After a few hours of running I end up with OOM: GC overhead limit exceeded.
I've seen https://issues.apache.org/jira/browse/LUCENE-4698 and a few other
relevant tickets. Wanted to check if anyone has any successful remedies.
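For context, the request is roughly of this shape (the field name, point, and
distance are placeholders):

q=...
&sfield=location
&pt=45.15,-93.85
&fq={!geofilt d=50}
&bf=recip(geodist(),5,20,5)

geodist() with no arguments picks up sfield/pt from the request, so the same
spatial field drives both the filter and the boost.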

Many thanks,
Mihran

My GC params on an Amazon XL instance:
-server -Xmx8g -Xms8g
-XX:+HeapDumpOnOutOfMemoryError \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled

Screenshot from Eclipse MAT (inline image not included)