[
https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744117#action_12744117
]
Preetam Rao commented on SOLR-633:
--
Hi, sorry for such a long delay.
Let me take the example of a real estate site on which I tried to implement
free text search using a DisMax query.
When I say sub-phrase, I mean adjacent terms appearing in a bigger phrase.
The index has the fields below, shown with an example record.
Let's say there are about 4 million records.
city - New York
state - NY
beds (Multi valued or synonyms)- 3 beds, beds 3
baths (Multi valued or synonyms) - 4 baths, baths 4
description - newly built with swimming pool, new furniture, car parking etc
sales type - new home
Let's say the user enters a query like "homes in new york for price 400k with
3 beds 4 baths with swimming pool car parking".
I played with dismax for a few days, trying out various boosts and factors.
The phrase options of dismax are not very useful here because they require all
terms of the phrase to appear in a given field (that is what it appeared like).
Words like "new" appearing multiple times in the description field, or city
names like "york", seemed to cause some variations.
The nature of the problem here is that sub-phrases like "new york", "3 beds",
"price 400k", and "car parking" become very important and must be matched in
different fields without overlapping across fields.
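To make the problem concrete, here is a minimal Python sketch (not Solr code)
that checks the example record against the example query. The helper names
tokens, contains_phrase and longest_sub_phrase are made up for illustration,
and only one value of each multi-valued field is used. It shows that the whole
query never occurs as a phrase in any single field, while each field still
holds a useful sub-phrase.

```python
def tokens(text):
    """Lowercase and split on whitespace, treating commas as separators."""
    return text.lower().replace(",", " ").split()


def contains_phrase(field_tokens, phrase_tokens):
    """True if phrase_tokens occur as adjacent terms inside field_tokens."""
    n = len(phrase_tokens)
    return any(field_tokens[i:i + n] == phrase_tokens
               for i in range(len(field_tokens) - n + 1))


def longest_sub_phrase(query_tokens, field_tokens):
    """Longest run of adjacent query terms that also occurs in the field."""
    best = []
    for i in range(len(query_tokens)):
        for j in range(i + 1, len(query_tokens) + 1):
            candidate = query_tokens[i:j]
            if len(candidate) > len(best) and contains_phrase(field_tokens, candidate):
                best = candidate
    return best


# Example record from this comment (one value per multi-valued field).
record = {
    "city": "New York",
    "state": "NY",
    "beds": "3 beds",
    "baths": "4 baths",
    "description": "newly built with swimming pool, new furniture, car parking etc",
    "sales type": "new home",
}
query = tokens("homes in new york for price 400k with 3 beds 4 baths "
               "with swimming pool car parking")

# Whole-query phrase match per field, and the longest sub-phrase per field.
whole = {f: contains_phrase(tokens(v), query) for f, v in record.items()}
best = {f: " ".join(longest_sub_phrase(query, tokens(v))) for f, v in record.items()}

for field in record:
    print(field, "| whole phrase:", whole[field], "| longest sub-phrase:", repr(best[field]))
```

Note that the sales type field only matches the single word "new", which is
the kind of stray match that seemed to disturb the ranking.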
This can best be solved by a SubPhraseQuery that is used by a DisMax-like
query to combine multiple fields.
Hence this is what I proposed:
SubPhraseQuery:
- Scores based on the longest sub-phrases matched. Also gives a factor to
boost based on match length. For example, a 4-word match gets a score of 16
versus 9 for a 3-word match.
- Gives an option to score only one match per field. For example, the phrase
"new home" gets scored only once even if it occurs N times in the description
field.
- Option to score only the longest match. For example, an occurrence of
"swimming pool" and some other "pool" scores only "swimming pool".
- As usual, the ability to ignore IDF, norms and any other factors, and just
use the phrase match.
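The scoring rules above can be sketched in a few lines of Python. This is only
an illustration of the idea, not the proposed Lucene query: the function names
and the options max_score_only and ignore_duplicates are hypothetical, and
treating the boost as an exponent on match length is an assumption chosen so
that a 4-word match scores 16 and a 3-word match scores 9.

```python
def find_sub_phrases(query, field):
    """Return every maximal run of adjacent query terms found in the field.

    A run is reported once per place it occurs, so duplicates are possible.
    """
    matches = []
    for i in range(len(query)):
        for j in range(len(field)):
            if query[i] != field[j]:
                continue
            # Skip positions that merely continue a longer match already found.
            if i > 0 and j > 0 and query[i - 1] == field[j - 1]:
                continue
            k = 0
            while (i + k < len(query) and j + k < len(field)
                   and query[i + k] == field[j + k]):
                k += 1
            matches.append(tuple(query[i:i + k]))
    return matches


def score_field(query, field, phrase_boost=2.0,
                max_score_only=False, ignore_duplicates=False):
    """Score a field by sub-phrase length only (no tf, idf or norms).

    Assumption: each match scores len(match) ** phrase_boost.
    """
    matches = find_sub_phrases(query, field)
    if ignore_duplicates:
        matches = list(dict.fromkeys(matches))  # keep one copy of each phrase
    scores = [len(m) ** phrase_boost for m in matches]
    if not scores:
        return 0.0
    return max(scores) if max_score_only else sum(scores)


query = "homes with swimming pool car parking".split()
description = "newly built with swimming pool new furniture car parking".split()

# "with swimming pool" (3 words -> 9) plus "car parking" (2 words -> 4).
print(score_field(query, description))
# Only the longest match counts.
print(score_field(query, description, max_score_only=True))
```

With ignore_duplicates=True, a field such as "new home near new home park"
scores the phrase "new home" once rather than twice.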
And a DisMax-like query that uses the above:
- Each field can be configured with above query.
- An option to ignore matches in other fields when some field matches.
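As a rough sketch of how the per-field scores might be combined, the Python
below follows the usual DisMax rule (best field score plus a tie-breaker times
the sum of the others). The function name combine_dismax and the ignore_others
flag are hypothetical, and reading "ignore matches in other fields" as "keep
only the best field" is just one possible interpretation.

```python
def combine_dismax(field_scores, tie=0.0, ignore_others=False):
    """Combine per-field scores in a DisMax-like way.

    field_scores maps a field name to its sub-phrase score.
    """
    scores = [s for s in field_scores.values() if s > 0]
    if not scores:
        return 0.0
    best = max(scores)
    if ignore_others:
        # One reading of the proposal: once a field matches best, drop the rest.
        return best
    return best + tie * (sum(scores) - best)


# Hypothetical per-field scores for one document.
field_scores = {"city": 4.0, "beds": 4.0, "description": 13.0, "state": 0.0}

print(combine_dismax(field_scores, tie=0.1))
print(combine_dismax(field_scores, tie=0.1, ignore_others=True))
```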
I feel this kind of use case will be encountered when form searches are
migrated to free text search, since we are trying to use Solr's free text
search on structured data where different fields have different meanings.
Dismax is probably meant for that use case, and I spent a few days fine tuning
it for the above example. However, I felt that I had to play a lot with
various factors, it looked like a lot of trial and error, and I still was not
sure what the end results would look like. I felt that I needed more control
over individual fields and over how a sub-phrase match would be scored in
those fields.
Let me know your thoughts or alternatives and I will be glad to look at them.
QParser for use with user-entered query which recognizes subphrases as well
as allowing some other customizations on per field basis
Key: SOLR-633
URL: https://issues.apache.org/jira/browse/SOLR-633
Project: Solr
Issue Type: New Feature
Components: search
Affects Versions: 1.4
Environment: All
Reporter: Preetam Rao
Priority: Minor
Fix For: 1.5
Create a request handler (actually a QParser) for use with user-entered
queries, with the following features:
a) Take a user query string and try to match it against multiple fields,
while recognizing sub-phrase matches.
b) For each field, give the below parameters:
1) phraseBoost - the factor which decides how good an n-token sub-phrase
match is compared to an (n-1)-token sub-phrase match.
2) maxScoreOnly - If there are multiple sub-phrase matches, pick only the
highest.
3) ignoreDuplicates - If the same sub-phrase query matches multiple times,
pick only one.
4) disableOtherScoreFactors - Ignore tf, query norm, idf and any other
parameters which are not relevant.
c) Try to provide all the parameters similar to dismax. Reuse or extend
dismax.
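One way the per-field parameters in (b) could be represented is sketched below
in Python. The four option names come from the description above; everything
else is an assumption for illustration, including the class name, the default
values, and the "f.<field>.<param>" request-parameter naming, which simply
borrows the style Solr uses for other per-field overrides. No such parameters
exist in Solr.

```python
from dataclasses import dataclass


@dataclass
class SubPhraseFieldOptions:
    """Per-field options from the description; defaults are assumptions."""
    phraseBoost: float = 2.0
    maxScoreOnly: bool = False
    ignoreDuplicates: bool = False
    disableOtherScoreFactors: bool = False


def parse_field_options(params):
    """Build per-field options from hypothetical 'f.<field>.<param>' entries."""
    options = {}
    for key, value in params.items():
        parts = key.split(".")
        if len(parts) != 3 or parts[0] != "f":
            continue  # not a per-field parameter
        _, field, name = parts
        field_options = options.setdefault(field, SubPhraseFieldOptions())
        if name == "phraseBoost":
            field_options.phraseBoost = float(value)
        elif name in ("maxScoreOnly", "ignoreDuplicates", "disableOtherScoreFactors"):
            setattr(field_options, name, value.lower() == "true")
    return options


# Hypothetical request parameters for two fields.
params = {
    "q": "homes in new york with 3 beds",
    "f.description.phraseBoost": "2",
    "f.description.ignoreDuplicates": "true",
    "f.city.maxScoreOnly": "true",
}
options = parse_field_options(params)
print(options["description"])
print(options["city"])
```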
Other suggestions and feedback appreciated :-)