Boosting StandardQuery scores with a "subquery"?

2012-03-22 Thread Sean O'Connor

Hi all,
I'm having some trouble wrapping my head around boosting 
StandardQueries. It looks like the query(subquery, default) function 
is what I want, but the examples seem to focus on just returning a 
score (e.g. the product of popularity and the score of the subquery). 
I assume my difficulty stems from the fact that I'd like to retrieve 
highlighting from one query, but influence score and 'relevance' with 
a different (sub)query.


Example:
q=content:(roi "return on investment" "return investment"~5)
fq=extension:(pdf doc)
boost=keywords:(financial investment profit loss) title:(financial 
investment profit loss) url:(investment investor relations phoenix)


So what I would like is to highlight the items in the query (e.g. 
'roi' 'return on investment'...) while _not_ highlighting the boosting 
terms (e.g. financial, investment, profit, loss). However, those 
documents with matches for the boost query would be ranked higher than 
those not matching.


Is there some existing way to do this? I'd like to keep the power 
of the standard queries (i.e. not dismax) and still get results that 
don't match the boost query (i.e. not use a filter query), while an 
arbitrary subquery influences the score of the main query and 
highlighting covers only the main query. Obvious, right? :-) Any 
thoughts or pointers are most welcome.
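
One possible way to express the request above, offered as an untested sketch rather than a confirmed answer: wrap the main query in Solr's {!boost} query parser so that a function of the subquery multiplies the score, and point the highlighter at the main query alone with hl.q. The parameter names qq and bq are arbitrary names chosen here for illustration, BoostedHighlightRequest is a hypothetical helper class, and whether hl.q is available depends on the Solr version in use. The code only assembles the request parameters; it does not contact a Solr server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds request parameters only, never talks to Solr.
final class BoostedHighlightRequest {

    static Map<String, String> params(String mainQuery, String boostQuery, String filterQuery) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        // Multiply the main query's score by 1 + (score of the boost query, or 0
        // when the document does not match it), so non-matching documents are
        // kept and simply receive no boost.
        p.put("q", "{!boost b=sum(1,query($bq,0)) v=$qq}");
        p.put("qq", mainQuery);   // assumed parameter name for the main query
        p.put("bq", boostQuery);  // assumed parameter name for the boosting subquery
        p.put("fq", filterQuery);
        p.put("hl", "true");
        p.put("hl.fl", "content");
        // Highlight terms from the main query only (assumes hl.q is supported).
        p.put("hl.q", mainQuery);
        return p;
    }

    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always present
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> p = params(
            "content:(roi \"return on investment\" \"return investment\"~5)",
            "keywords:(financial investment profit loss) "
                + "title:(financial investment profit loss) "
                + "url:(investment investor relations phoenix)",
            "extension:(pdf doc)");
        System.out.println(toQueryString(p));
    }
}
```

Using sum(1, query($bq,0)) instead of query($bq,1) is a deliberate choice in this sketch: a raw subquery score below 1 would otherwise lower the score of a matching document rather than raise it.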

Thanks,

Sean





Re: Next Word - Any Suggestions?

2010-12-15 Thread Sean O'Connor

Hi Christopher,
One option comes to mind: shingles?

I have not done anything with them yet, but that is on my radar for 
sometime about a month out. Speaking unencumbered by experience or 
substantial understanding, my guess is that shingles would be great for 
you if you can select shingles with something like a terms prefix.


AFAIU: Shingling[1] basically takes a number of terms/words, and 
combines them into a single token. You could set the (max)shingle size 
to 2, and then find some way to use the terms component on the shingled 
field with a prefix, potentially:

http://wiki.apache.org/solr/TermsComponent
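
To make the idea concrete, here is a plain-Java sketch (no Lucene or Solr classes) of what a two-word shingle field plus a prefix lookup would compute for the five sample sentences in the question quoted below. The class NextWordShingles and its tokenizer (lowercase, split on non-letters) are assumptions made for illustration; a real setup would rely on a shingle filter in the field's analyzer and a TermsComponent request with a prefix. Note that a terms lookup reports how many documents contain each shingle, which matches these counts only because every sample document contains "fox" once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "shingle, then filter by prefix".
final class NextWordShingles {

    // Assumed tokenizer: lowercase and split on anything that is not a letter.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Join each pair of adjacent tokens into one two-word shingle.
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Count the second word of every shingle whose first word is the target.
    static Map<String, Integer> nextWords(List<String> docs, String target) {
        String prefix = target + " ";
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String doc : docs) {
            for (String shingle : shingles(tokens(doc))) {
                if (shingle.startsWith(prefix)) {
                    counts.merge(shingle.substring(prefix.length()), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The quick brown fox jumped over the fence.",
            "The sly fox skipped over the fence.",
            "The fat fox skipped his afternoon class.",
            "A brown duck and red fox, crashed the party.",
            "Charles Brown! Fox! Crashed my damn car.");
        System.out.println(nextWords(docs, "fox")); // {crashed=2, jumped=1, skipped=2}
    }
}
```

The reverse direction (the word before "fox") would match on the second word of each shingle instead of the first, which a plain prefix lookup does not cover.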

I'm interested in what you find out, so please post back if you 
find something outside the mailing list.

Thanks,

Sean


[1] see something like: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=%28shingle%29, 
but the Solr 1.4 Enterprise Search Server book is well worth the money, 
and I believe there is an ebook version for $10-20.


On 10/26/2010 08:26 AM, Christopher Ball wrote:

Am about to implement a custom query that is sort of a mash-up of Facets,
Highlighting, and SpanQuery - but thought I'd see if anyone has done
anything similar.

In simple words, I need to facet on the next word given a target word.

For example, if my index only had the following 5 documents (each consisting
of a single sentence):

Doc 1 - The quick brown fox jumped over the fence.
Doc 2 - The sly fox skipped over the fence.
Doc 3 - The fat fox skipped his afternoon class.
Doc 4 - A brown duck and red fox, crashed the party.
Doc 5 - Charles Brown! Fox! Crashed my damn car.

The query should give the frequency of the distinct terms after the word
"fox":

skipped - 2
crashed - 2
jumped - 1

Long-term, do the opposite - frequency of the distinct terms before the word
"fox":

brown - 2
sly - 1
fat - 1
red - 1

My guess is that either the FastVectorHighlighter or SpanQuery would be a
reasonable starting point. I was hoping to take advantage of Vectors as I am
storing termVectors, termPositions, and termOffsets for the field in
question.

Grateful for any thoughts . . . reference implementations . . . words of
encouragement . . . free beer - whatever you can offer.

Gracias,

Christopher








SpanQuery basics in Solr QueryComponent(?)

2010-11-10 Thread Sean O'Connor

Hi all,
I seem to be lost in the new flex indexing api. In the older api I 
was able to extend QueryComponent with my custom component, parse a 
restricted-syntax user query into a SpanQuery, and then grab an 
IndexReader. From there I worked with the spanquery's spans. For a bit 
of reference my old QueryComponent code looks something like:


    @Override
    public void process(ResponseBuilder rb) throws IOException {
        SolrQueryRequest req = rb.req;
        SolrQueryResponse rsp = rb.rsp;
        SDRQParser qparser = (SDRQParser) rb.getQparser();

        SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
        // custom parser returns SpanQuery; "stq" below is that SpanQuery
        // (the line that obtains it from qparser is not shown in this excerpt)

        IndexReader reader = req.getSearcher().getReader();
        Spans spans = stq.getSpans(reader);
        // work with spans here...
    }

With the new (1.5?) api, I got the warning about wrapping 
IndexReader with SlowMultiReaderWrapper, so I changed my approach above 
to something like:


    SolrIndexReader fullReader = req.getSearcher().getReader();
    IndexReader reader = SlowMultiReaderWrapper.wrap(fullReader);
    // need help avoiding this...?


I then got an NPE on what seems to be EmptyTerms.toString(). For 
kicks, I noticed that EmptyTerms did not override its parent 
(TermSpans) toString() method, which seemed to be the cause of the 
problem. Overriding it fixed the NPE, and now I get results (so I 
will look at filing a bug report unless someone mentions otherwise).


Any hints on how I can/should 'properly' work with spans in solr? 
Also, are there any introductory documents to the MultiFields and 
sub-indexes stuff? Particularly how to implement MultiFields as a better 
approach to SlowMultiReaderWrapper (thanks for the warnings about 
performance). I cannot seem to find the relevant beginner material to 
avoid using the SMRW. The material I do find seems to require that you 
pass in a 'found' document, or perhaps walk through all subReaders?
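
As a rough illustration of the "walk through all subReaders" idea, the sketch below models only the bookkeeping, in plain Java: each segment reports hits with segment-local document numbers, and the caller adds that segment's starting offset (its doc base) to recover index-wide document numbers. Segment, maxDoc, and localHits are hypothetical stand-ins invented for this example, not Lucene or Solr classes, and the real reader API calls are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of per-segment iteration; no Lucene types involved.
final class PerSegmentWalk {

    // Hypothetical stand-in for one sub-reader and the hits found inside it.
    static final class Segment {
        final int maxDoc;       // number of documents in this segment
        final int[] localHits;  // matching doc numbers, relative to the segment

        Segment(int maxDoc, int... localHits) {
            this.maxDoc = maxDoc;
            this.localHits = localHits;
        }
    }

    // Visit segments in order, rebasing local doc numbers to index-wide ones.
    static List<Integer> globalHits(List<Segment> segments) {
        List<Integer> out = new ArrayList<Integer>();
        int docBase = 0;
        for (Segment segment : segments) {
            for (int local : segment.localHits) {
                out.add(docBase + local);
            }
            docBase += segment.maxDoc; // the next segment starts after this one
        }
        return out;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment(3, 0, 2),
            new Segment(2, 1),
            new Segment(4, 0, 3));
        System.out.println(globalHits(segments)); // [0, 2, 4, 5, 8]
    }
}
```

The point of the model is that nothing needs to be merged into one "slow" view: work is done segment by segment, and only the document numbers are translated.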


And finally: should I be looking at some existing Solr code to 
guide me? I am having trouble finding the highlighter code which I 
believe uses spans (WeightedSpanTerm??). Is there already code to 
convert user queries to span queries?

Thanks,

Sean






Re: Next Word - Any Suggestions?

2010-11-10 Thread Sean O'Connor

Hi Christopher,
I am working my way through trying to implement SpanQueries in Solr 
(svn trunk). From my lack of progress, I am skeptical that I can help 
much, but I would be happy to try.


I imagine you have already found (either before your message, or 
after posting it) Grant's overview of Lucene, SpanQuery, and 
WindowTermVectorMapper:

http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/

  I'd be interested in hearing about your progress.
Good luck

Sean











Best way to gather span/token positions from query? (mis-posted to dev list...)

2009-04-30 Thread Sean O'Connor

Hello,
   I'm trying to find a decent approach for getting token positions out
of (or is that into?) Solr query results. Is the best approach to extend
a QueryComponent and/or HighlightComponent? I'm new to Solr, and still
on fairly shaky ground, so any pointers or suggestions are quite welcome.

   As a little BACKGROUND:
   I am trying to migrate a custom Lucene-only content analysis
project to Solr. The 'old' system programmatically runs a few thousand
predefined queries against a corpus, and then analyzes the results. The
Lucene score is good, but the actual position of the hits is also quite
important.

   My previous system did a simple query parsing to create SpanQuerys,
and then used a modified dumpSpans() to get the token position from the
spans. Now I am trying to find how to use solr's goodness (and
MemoryIndex approach?) to get the span positions in a more logical
manner. I think the answer is in the highlighter, but I'm getting a
little twisted around, and could use a pointer.
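
For readers unfamiliar with what those span positions look like, here is a simplified plain-Java model of the data a dumpSpans-style loop reports: for each match, a start token position and an end position one past the last matched token. SpanPositions, termSpans, and orderedNear are hypothetical names, the pre-split token array is an assumption, and the slop rule used here for two single terms (the number of tokens allowed between them) is a simplification of what Lucene's span queries do.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of span positions; not Lucene's implementation.
final class SpanPositions {

    // One span per occurrence of the term: {start, end}, end is exclusive.
    static List<int[]> termSpans(String[] tokens, String term) {
        List<int[]> spans = new ArrayList<int[]>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                spans.add(new int[] {i, i + 1});
            }
        }
        return spans;
    }

    // "first" followed by "second" with at most "slop" tokens in between.
    static List<int[]> orderedNear(String[] tokens, String first, String second, int slop) {
        List<int[]> spans = new ArrayList<int[]>();
        for (int[] a : termSpans(tokens, first)) {
            for (int[] b : termSpans(tokens, second)) {
                int gap = b[0] - a[1]; // tokens strictly between the two terms
                if (gap >= 0 && gap <= slop) {
                    spans.add(new int[] {a[0], b[1]});
                }
            }
        }
        return spans;
    }

    // Render spans as half-open intervals, e.g. "[0,3) [4,7)".
    static String format(List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int[] s : spans) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('[').append(s[0]).append(',').append(s[1]).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "return on investment beats return without investment".split(" ");
        System.out.println(format(termSpans(tokens, "investment")));                // [2,3) [6,7)
        System.out.println(format(orderedNear(tokens, "return", "investment", 1))); // [0,3) [4,7)
    }
}
```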

   I am using a recent Solr nightly snapshot, grails, Aduna Aperture,
and Intellij (if any of that matters). Also, I posted this to the dev
list, incorrectly I believe; apologies for the cross posting.
Thanks,

Sean