[jira] Updated: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)

Paul Cowan (JIRA) Tue, 16 Dec 2008 19:13:09 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Paul Cowan updated LUCENE-1494:
-------------------------------

    Attachment: LUCENE-1494-positionincrement.patch

Patch for part 2). This follows my original idea suggested above; Chris 
Hostetter suggested another approach:

"couldn't this be solved by an Analyzer that counts the token per fieldname and 
implements getPositionIncrementGap as..
        int result = SOME_BIG_NUM - tokensSeenMap.get(fieldname);
        tokensSeenMap.put(fieldname, 0);
        return result;"

but I think the attached patch has much lower overhead. While it touches the 
Analyzer base class (and so is potentially high-impact), the way it's 
implemented won't affect existing implementations: it gently deprecates the 
old way while still letting implementing subclasses work as they did before. 
So I think this is low impact. Interested to see what people think.
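
For comparison, here is a minimal sketch of the counting idea quoted above. It 
is a plain-Java simulation with no Lucene dependency: the class name, the 
tokenSeen() helper, the value 100 for SOME_BIG_NUM, and the indexing model 
(the gap is added to the position of the last token of the previous value, as 
in the issue description) are all assumptions made for illustration, not 
Lucene's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation only; none of these names are Lucene classes.
class TokenCountingGapSketch {
    static final int SOME_BIG_NUM = 100; // assumed value, for illustration

    // Tokens seen per field since the gap was last requested.
    private final Map<String, Integer> tokensSeenMap = new HashMap<String, Integer>();

    // Hypothetical stand-in for the counting an analyzer's token stream would do.
    void tokenSeen(String fieldname) {
        Integer seen = tokensSeenMap.get(fieldname);
        tokensSeenMap.put(fieldname, seen == null ? 1 : seen + 1);
    }

    // Same arithmetic as the quoted snippet, with a null guard added.
    int getPositionIncrementGap(String fieldname) {
        Integer seen = tokensSeenMap.get(fieldname);
        int result = SOME_BIG_NUM - (seen == null ? 0 : seen);
        tokensSeenMap.put(fieldname, 0);
        return result;
    }

    // Assumed model: the first value starts at 0, and each later value starts
    // at (position of the previous value's last token) + gap.
    List<Integer> valueStartPositions(String fieldname, int[] tokensPerValue) {
        List<Integer> starts = new ArrayList<Integer>();
        int lastPos = -1;
        for (int i = 0; i < tokensPerValue.length; i++) {
            int start = (i == 0) ? 0 : lastPos + getPositionIncrementGap(fieldname);
            starts.add(start);
            for (int t = 0; t < tokensPerValue[i]; t++) {
                tokenSeen(fieldname);
            }
            lastPos = start + tokensPerValue[i] - 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        TokenCountingGapSketch sketch = new TokenCountingGapSketch();
        // One token per value versus a varying number of tokens per value.
        System.out.println(sketch.valueStartPositions("addresscountry", new int[] {1, 1, 1}));
        System.out.println(sketch.valueStartPositions("address", new int[] {4, 2, 3}));
    }
}
```

Under this model both fields start their values at the same positions, 
whatever the token count per value, which is the property the query needs. 
The cost is the per-field map lookup and update for every token.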

> Additional features for searching for value across multiple fields 
> (many-to-one style)
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1494-multifield.patch, 
> LUCENE-1494-positionincrement.patch
>
>
> This issue is to cover the changes required to do a search across multiple 
> fields with the same name in a fashion similar to a many-to-one database. 
> Below is my post on java-dev on the topic, which details the changes we need:
> ---
> We have an interesting situation where we are effectively indexing two 
> 'entities' in our system, which share a one-to-many relationship (imagine 
> 'User' and 'Delivery Address' for demonstration purposes). At the moment, we 
> index one Lucene Document per 'many' end, duplicating the 'one' end data, 
> like so:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     userid: 1
>     userfirstname: fred
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> (note: 2 Documents indexed for user 1). This is somewhat annoying for us, 
> because when we search in Lucene the results we want back (conceptually) are 
> at the 'user' level, so we have to collapse the results by distinct user id, 
> etc. etc (let alone that it blows out the size of our index enormously). So 
> why do we do it? It would make more sense to use multiple fields:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> But imagine the search "+addresscountry:au +addressphone:5678". We'd like 
> this to match ONLY Mary, but of course it matches Fred also because he 
> matches both those terms (just for different addresses).
> There are two aspects to the approach, which we've (more or less) got 
> working, but I'd like to run them past the group and see if it's worth 
> trying to get them into Lucene proper (if so, I'll create a JIRA issue for them)
> 1) Use a modified SpanNearQuery. If we assume that country and phone will 
> each always be one token, we can rely on the fact that the positions of 'au' 
> and '5678' in Fred's document will be different.
>    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> The slop of 0 means that we'll only return those where the two terms are in 
> the same position in their respective fields. This works brilliantly, BUT 
> requires a change to SpanNearQuery's constructor (which checks that all the 
> clauses are against the same field). Are people amenable to perhaps adding 
> another constructor to SNQ which doesn't do the check, or subclassing it to 
> do the same (give it a protected non-checking constructor for the subclass to 
> call)?
> 2) It gets slightly more complicated in the case of variable-length terms. 
> For example, imagine if we had an 'address' field ('123 Smith St') which will 
> result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here, of 
> course. One thing we've toyed with is the idea of using 
> getPositionIncrementGap -- if we knew that 'address' would be, at most, 20 
> tokens, we might use a position increment gap of 100, and make the slop 
> factor 50; this works fine for the simple case (yay!), but with a great many 
> addresses-per-user starts to get more complicated, as the gap counts from the 
> last term (so the position sequence for a single value field might be 0, 100, 
> 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106, 
> 206, 207... so it's going to get out of sync). The simplest option here seems 
> to be changing (or supplementing)
>    public int getPositionIncrementGap(String fieldname)
> to
>    public int getPositionIncrementGap(String fieldname, int currentPos)
> so that we can override that to round up to the nearest 100 (or whatever) 
> based on currentPos. The default implementation could just delegate to the 
> existing single-argument getPositionIncrementGap(fieldname).
> ---
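
The matching behaviour that item 1) above is after can be sketched without 
Lucene at all. The code below is an illustration of the intended semantics 
only, not of how SpanNearQuery works internally; the class and method names 
are hypothetical, and it assumes each field value is exactly one token, so a 
value's position equals its index in the list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not how SpanNearQuery is implemented.
class SamePositionMatchSketch {

    // Builds a "document": field name -> values in indexing order.
    // Assumption: each value is exactly one token, so position == list index.
    static Map<String, List<String>> doc(List<String> countries, List<String> phones) {
        Map<String, List<String>> d = new HashMap<String, List<String>>();
        d.put("addresscountry", countries);
        d.put("addressphone", phones);
        return d;
    }

    // "+addresscountry:au +addressphone:5678": both terms anywhere in the document.
    static boolean booleanAnd(Map<String, List<String>> d, String country, String phone) {
        return d.get("addresscountry").contains(country)
            && d.get("addressphone").contains(phone);
    }

    // Slop-0 idea: both terms must sit at the same position in their own fields.
    static boolean samePosition(Map<String, List<String>> d, String country, String phone) {
        List<String> countries = d.get("addresscountry");
        List<String> phones = d.get("addressphone");
        int n = Math.min(countries.size(), phones.size());
        for (int pos = 0; pos < n; pos++) {
            if (countries.get(pos).equals(country) && phones.get(pos).equals(phone)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fred =
            doc(Arrays.asList("au", "nz"), Arrays.asList("1234", "5678"));
        Map<String, List<String>> mary =
            doc(Arrays.asList("au"), Arrays.asList("5678"));
        System.out.println("fred, boolean AND:   " + booleanAnd(fred, "au", "5678"));
        System.out.println("fred, same position: " + samePosition(fred, "au", "5678"));
        System.out.println("mary, same position: " + samePosition(mary, "au", "5678"));
    }
}
```

With the example data from the post, the boolean AND matches Fred as well as 
Mary, while the same-position check matches only Mary.
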
> Patches (x2) to follow shortly
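
The drift described in item 2) above, and the effect of rounding up based on 
currentPos, can be reproduced with a small simulation. This is a sketch under 
stated assumptions rather than the contents of the attached patch: it uses no 
Lucene classes, the names are hypothetical, currentPos is taken to be the 
position of the last token indexed so far, the gap is added to that position 
(as in the sequences quoted in the post), and no value is longer than the 
block size of 100.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation only; not the attached patch and not Lucene code.
class RoundedGapSketch {
    static final int BLOCK = 100; // assumed block size, from the example in the post

    // Existing style: a fixed gap regardless of where we are.
    static int fixedGap(String fieldname) {
        return BLOCK;
    }

    // Proposed two-argument style: pick the gap so that the next value
    // starts on the next multiple of BLOCK after currentPos.
    static int roundedGap(String fieldname, int currentPos) {
        return BLOCK - (currentPos % BLOCK);
    }

    // Assumed model: the first value starts at 0; each later value starts at
    // (position of the last token indexed) + gap.
    static List<Integer> positions(String fieldname, int[] tokensPerValue, boolean rounded) {
        List<Integer> result = new ArrayList<Integer>();
        int lastPos = -1;
        for (int i = 0; i < tokensPerValue.length; i++) {
            int start;
            if (i == 0) {
                start = 0;
            } else if (rounded) {
                start = lastPos + roundedGap(fieldname, lastPos);
            } else {
                start = lastPos + fixedGap(fieldname);
            }
            for (int t = 0; t < tokensPerValue[i]; t++) {
                result.add(start + t);
            }
            lastPos = start + tokensPerValue[i] - 1;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] address = {4, 4, 2}; // tokens per address value
        int[] country = {1, 1, 1}; // one token per country value
        System.out.println("fixed,   address: " + positions("address", address, false));
        System.out.println("fixed,   country: " + positions("addresscountry", country, false));
        System.out.println("rounded, address: " + positions("address", address, true));
        System.out.println("rounded, country: " + positions("addresscountry", country, true));
    }
}
```

With the fixed gap the address values start at 0, 103 and 206 while the 
country values start at 0, 100 and 200; with the rounded gap both fields 
start every value at 0, 100 and 200, so a single slop value works for any 
number of addresses per user.
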

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


