Distributed Search component question

2015-06-19 Thread Mihran Shahinian
Hi all,
I have the following search components, and at the moment I don't have a way
to get them working in distributed mode on Solr 4.10.4.

[standard query component]
[search component-1] (StageID = 2500):
 handleResponses: get a few values from the docs, populate parameters for the
stats component, and set some metadata in the ResponseBuilder:
  rb.rsp.add(metadata, NamedList...)

distributedProcess:
   rb.doFacets = false;
   if (rb.stage < StageID) {
     if (null == rb.rsp[metadata]) {
       return StageID;
     }
   }
   return component-2.StageID

[search component-2] (StageID = 2800):
distributedProcess:
   rb.doFacets = true;
   formatAndSet some facet params based on rb.rsp[metadata]
   return ResponseBuilder.STAGE_GET_FIELDS

[standard facet component]:


Things seem to work fine between component-1 and component-2; I just can't
prevent facets from running before component-2 sets the proper facet params.
And then the facet component sets rb._facetInfo to null. Should I move my
logic in component-2 from distributedProcess to handleResponses, modify the
ShardRequest, and call rb.addRequest?
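To make the idea concrete, here is a rough, unverified sketch of what I mean by
component-2 adding its own facet shard request once component-1's metadata is
available (the stage value, the "metadata" key, and the param wiring are
illustrative, not working code):

import java.io.IOException;

import org.apache.solr.common.params.FacetParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.handler.component.ShardRequest;

public class Component2 extends SearchComponent {
    private static final int MY_STAGE = 2800;   // illustrative stage id

    @Override
    public int distributedProcess(ResponseBuilder rb) throws IOException {
        if (rb.stage < MY_STAGE) {
            rb.doFacets = false;                // keep FacetComponent quiet for now
            return MY_STAGE;
        }
        if (rb.stage == MY_STAGE) {
            // metadata put into the response earlier by component-1's handleResponses
            NamedList md = (NamedList) rb.rsp.getValues().get("metadata");
            if (md != null) {
                ShardRequest sreq = new ShardRequest();
                sreq.purpose = ShardRequest.PURPOSE_GET_FACETS;
                sreq.params = new ModifiableSolrParams(rb.req.getParams());
                sreq.params.set(FacetParams.FACET, true);
                // ... derive and set the real facet params from md here ...
                rb.doFacets = true;
                rb.addRequest(this, sreq);
            }
            return ResponseBuilder.STAGE_GET_FIELDS;
        }
        return ResponseBuilder.STAGE_DONE;
    }

    @Override public void prepare(ResponseBuilder rb) throws IOException {}
    @Override public void process(ResponseBuilder rb) throws IOException {}
    @Override public String getDescription() { return "component-2 sketch"; }
    @Override public String getSource() { return null; }
}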

Any hints are much appreciated.
Mihran


PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour

2015-05-11 Thread Mihran Shahinian
I must be missing something obvious. I have a simple regex that removes a
space-hyphen-space pattern.

The unit test below works fine, but when I plug it into the schema and query,
the regex does not match, since the input is already split on spaces (further
below). My understanding is that the charFilter operates on the raw input
string and then passes it to the whitespace tokenizer, which is what happens
in the unit test, but I am not sure why I get an already-split token stream
at query time.

import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
                                                     Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader,
                                                MockTokenizer.WHITESPACE,
                                                false);
        return new TokenStreamComponents(tokenizer, tokenizer);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new PatternReplaceCharFilter(
            Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+"),
            " ",
            reader);
    }
};

final TokenStream tokens = analyzer.tokenStream("", new StringReader("a - b"));
tokens.reset();
final CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
    System.out.println("=== " + new String(Arrays.copyOf(termAtt.buffer(),
                                                          termAtt.length())));
}

I end up with:
=== a
=== b


Now I define the same in my schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           multiValued="true" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\s+"
                replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="myfield" type="text" indexed="true" stored="false"
       multiValued="true"/>

When I query, the input already arrives in PatternReplaceCharFilter's
processPattern method split into pieces (e.g. a, -, b), so the regex does not
match, even though the charFilter is defined before the tokenizer:
CharSequence processPattern(CharSequence input) ...
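As a sanity check (hypothetical request; the path and parameters assume the
stock FieldAnalysisRequestHandler from the example solrconfig, and
"collection1" is a placeholder core name), running the whole string through
the field type's query analyzer via the analysis handler shows whether the
charFilter ever sees the unsplit input:

http://localhost:8983/solr/collection1/analysis/field
    ?analysis.fieldtype=text
    &analysis.query=a - b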




Here is the query
SolrQuery solrQuery = new SolrQuery("a - b");
solrQuery.setRequestHandler("/select");
solrQuery.set("defType", "edismax");
solrQuery.set("qf", "myfield");
solrQuery.set(CommonParams.ROWS, 0);
solrQuery.set(CommonParams.DEBUG, true);
solrQuery.set(CommonParams.DEBUG_QUERY, true);
QueryResponse response = solrSvr.query(solrQuery);

System.out.println("parsedQtoString " +
                   response.getDebugMap().get("parsedquery_toString"));
System.out.println("parsedQ " +
                   response.getDebugMap().get("parsedquery"));

Output is
parsedQtoString +((myfield:a) (myfield:-) (myfield:b))
parsedQ (+(DisjunctionMaxQuery((myfield:a))
DisjunctionMaxQuery((myfield:-)) DisjunctionMaxQuery((myfield:b/no_coord


Re: Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Thank you Markus and Chris for the pointers.
For SweetSpotSimilarity, I am thinking that a set of closed ranges exposed via
the similarity config may be easier to maintain as the data changes than
making adjustments to fit a function. Another piece of info that would have
been handy is the average position of each term, plus the positions of its
first few occurrences. That would allow boosting term occurrences that appear
earlier in the doc more heavily. In my case the extra keywords are towards the
end of the doc, but that info does not seem to be propagated into the scorer.
Thanks again,
Mihran



On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 You should start by checking out the SweetSpotSimilarity .. it was
 heavily designed around the idea of dealing with things like excessively
 verbose titles and keyword stuffing in summary text ... so you can
 configure your expectation for what a normal-length doc is, and docs
 will be penalized for being longer than that.  Similarly, you can say what
 a 'reasonable' tf is, and docs that exceed that won't get added boost
 (which, in conjunction with the lengthNorm penalty, penalizes docs that
 stuff keywords).


 https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html


 https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg

 https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg


 -Hoss
 http://www.lucidworks.com/
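For anyone following along, the knobs Hoss mentions are set per field type in
schema.xml; a minimal sketch (the field type name and all values are
illustrative, not tuned recommendations):

<fieldType name="text_ss" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <!-- illustrative sweet-spot ranges -->
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- docs between 3 and 5 terms get the full lengthNorm; docs outside that range are penalized -->
    <int name="lengthNormMin">3</int>
    <int name="lengthNormMax">5</int>
    <float name="lengthNormSteepness">0.5</float>
    <!-- tf values up to 6 are flattened to the base value instead of growing -->
    <float name="baselineTfMin">6.0</float>
    <float name="baselineTfBase">1.5</float>
  </similarity>
</fieldType>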



Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Hi all,
I have a use case where the data is generated by SEO-minded authors, and more
often than not they perfectly guess the synonym expansions for the document
titles, skewing results in their favor.
At the moment I don't have offline processing infrastructure to detect these
(I can't punish these docs either... I just have to level the playing field).
I am experimenting with taking the max of the term scores, cutting off scores
after a certain number of terms, etc., but would appreciate any hints if
anyone has experience dealing with a similar use case in Solr.
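For illustration, the "cutting off scores" idea could look roughly like this
as a custom Similarity (a sketch only, not a drop-in solution; the class name
and cutoff value are made up, and it assumes the Lucene 4.x DefaultSimilarity):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class CappedTfSimilarity extends DefaultSimilarity {
    private static final float MAX_TF = 3f;   // occurrences beyond this stop adding score

    @Override
    public float tf(float freq) {
        // cap the term frequency before the usual sqrt scaling
        return (float) Math.sqrt(Math.min(freq, MAX_TF));
    }
}

If I read the schema docs right, a plain Similarity class like this can be
referenced directly from a <similarity/> element on the field type.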

Much appreciated,
Mihran


boosting by geodist - GC Overhead Limit exceeded

2015-01-21 Thread Mihran Shahinian
I am running Solr 4.10.2 with geofilt (~20% of docs have 30+ lat/lon points)
and everything works hunky-dory. Then I added a bf with geodist along the
lines of:
recip(geodist(),5,20,5)
After a few hours of running I end up with OOM: GC overhead limit exceeded.
I've seen https://issues.apache.org/jira/browse/LUCENE-4698 and a few other
relevant tickets. Wanted to check if anyone has any successful remedies.
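For context, the request is roughly of this shape (the field name, point, and
distance are placeholders):

q=...
&sfield=location
&pt=45.15,-93.85
&fq={!geofilt d=50}
&bf=recip(geodist(),5,20,5)

geodist() with no arguments picks up sfield/pt from the request, so the same
spatial field drives both the filter and the boost.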

Many thanks,
Mihran

My GC params on an Amazon XL instance:
-server -Xmx8g -Xms8g
-XX:+HeapDumpOnOutOfMemoryError \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled

Screenshot from Eclipse MAT (inline image not included)