RE: Wildcard searches????

Digy Fri, 05 Feb 2010 14:25:57 -0800

http://en.wikipedia.org/wiki/Crossposting


-----Original Message-----
From: Niclas Rothman [mailto:n...@lechill.com] 
Sent: Saturday, February 06, 2010 12:12 AM
To: gene...@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: RE: Wildcard searches????

Hi Fuad and thanks for your reply!

The first post I know now was a wrong approach, I should not have the wildcard 
included in my index. 

However, I can't do as you suggest, to have the full user agent in the index, 
that’s the whole idea actually. 

The reason can be explained like this, device manufactures are literally 
spitting out new devices and updates all the time which generates new user 
agents that are very similar, perhaps only a small version number differs. 
So what I need is to have a "generalization" of the user agent in  my index, to 
only have the start of the useragent without including the versions numbers. 
This way my index are all the time "up to date" even if users with new version 
numbers access my search service, which in my app isn’t significant but instead 
causing my problems.... 

Example:

I have 2 Indexed documents where the documents useragent field are partial:
<doc>
        <id>1</id>
        <useragents>
        Firefox
            Mozilla/4.0+SonyEricsson
        </useragents>
</doc>
<doc>
        <id>2</id>
        <useragents>
        Firefox
            Mozilla/4.0+SonyEricsson
        </useragents>
</doc>

User A searches my app with an user agent as: 
        
        
Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0

The search app will display both document 1 and 2, because his user agent 
starts exactly has the user agent pattern in my document.


User B searches my app with an user agent as (Please note that this user agent 
differs in the near end from Users A (JP9.5.1 instead of JP8.4.1)): 
        
        
Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0

The search app will also display both document 1 and 2, because his user agent 
starts exactly has the user agent pattern in my document. 
Even if the version number of the java platform differs between user A and  B. 

If we now have a different index with FULL user agents, only User A would have 
documents returned, none of the documents user agents matched Users B user 
agent because of the "silly" version number!!

<doc>
        <id>1</id>
        <useragents>
        Firefox
            
Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
        </useragents>
</doc>
<doc>
        <id>2</id>
        <useragents>
        Firefox
            
Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
        </useragents>
</doc>

Can you see my problem?
So the basic thing is if I somehow can do a query saying that at match should 
take place if a document useragent starts with the value of the users 
useragent. 

In theory, having a startsWith "function / locig are easy enough to implement 
in C# / T-SQL,  but how on earth should I do this in SolR / Lucene?????

Regards

Niclas














-----Original Message-----
From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: 05 February 2010 22:49
To: gene...@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: RE: Wildcard searches????

Niclas,

I looked at your initial post, you are creating document with field "abc*"
- nothing related to "wildcard query"!

Of course, query [useragents:abcdefghijklm] will return no results, and 
[q=useragents:abc] no results, but [q=useragents:abc*] will return something.

text_nav is specific SOLR type for _leading_ wildcard queries; you don't need 
it (you don't need _leading_ wildcard queries).

On indexing time, instead of
<doc>
<useragents>
                Firefox*
                Mozilla/4.0*
</useragents>
</doc>


You should index
<doc>
<useragents>
        
Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
</useragents>
</doc>

And also, you need to choose properly SOLR type; for instance, textTight or 
textgen, or even non-tokenized string!


And, query [q=useragents:moz*] will return this document (even if this field is 
nontokenized).


-Fuad


P.S. Don't use * when you create Lucene document; use it as part of query.




> -----Original Message-----
> From: Niclas Rothman [mailto:n...@lechill.com]
> Sent: February-05-10 4:44 PM
> To: gene...@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Ted im using SOLR, but I cant figure out what type of fieldtype I should
> use to get a query like this to work:
> 
> 
> q=useragents: abcdefghijklm
> 
> 
> where I have in my index one document with value "abc" in field
> "useragents"
> 
> That query results in 0 hits.
> 
> If I issue this I get 1 hit of course (exact mathch)
> 
> q=useragents: Mozilla
> 
> 
> My document definition in SOLR looks like:
> 
> <fields>
>     <field name="id" type="tint" indexed="true" stored="true"
> required="true" />
>     <field name="useragents" type="text_rev" indexed="true"
> stored="true" required="false" multiValued="true" />
> </fields>
> 
> Any clue?
> 
> Nic
> 
> 
> 
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: 05 February 2010 21:18
> To: gene...@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> This is quite close.  You will have to break down the user agent that is
> your query into the same kinds of pieces as you did for your index.
> Lucene
> will only do exact matching of terms during searching (wildcard queries
> are
> handled by exploding the term into all possible variants).
> 
> Regarding the field type, you will probably have to customize that a
> fair
> bit to make +'s be separators and such.  If you use SOLR to index and
> query
> your data, then it will make sure that your separation into tokens is
> compatible unless you are using shortened forms like you mention here.
> 
> On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <n...@lechill.com>
> wrote:
> 
> > Hi again Ted and many thanks for your efforts.
> > Ok, just to be sure that we fully understand each other:
> >
> > In my index I will store partial useragents without any wildcards *,
> e.g.
> >
> > Fire    (for Firefox)
> > Inte    (Internet Explorer)
> > Moz     (Mozill)
> >
> >
> > When I during runtime search my index for Media objects that are
> compatible
> > with a useragent,
> > e.g:
> >
> >
> >
> "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> >
> > Hopefully lucene / solr will serve me with all Media objects that
> partially
> > math my full user agent string and also perhaps some mismatches. To be
> > absolutely sure that I only show Media objects that are compatible, I
> will
> > have to loop through the resultset in my program to do a final test
> and
> > exclude any mismatches.
> >
> > Is this what you are saying Ted, that I cant do the whole process in
> Solr /
> > Lucene, that I need to do the final test in my program (C#)?
> >
> > Also, Im using Solr 1.4, what fieldtype would you recommend to use for
> the
> > useragent ( tokenized)
> >
> > Okey, lets see what you have to say about this.
> > Please bear with me, im all new to lucene and solr!!
> >
> > Regards
> > Niclas
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > Sent: 05 February 2010 20:43
> > To: gene...@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > Yes.  I think you have it.
> >
> > To explain in a bit more detail, I think that you should store a
> tokenized
> > form of the user agents and should query using a tokenized form of
> your
> > user
> > agent.  This will retrieve documents that have partial matches to the
> user
> > agent of interest.  Many of these matches, however, may not meet the
> > requirements of the wildcard expression in the documents.  As such,
> you
> > will
> > need to look at each retrieved document to retrieve the wild
> expression
> > from
> > each one in turn to test if the original (untokenized) query satisfies
> the
> > wildcard.
> >
> > If your wildcards are all of a positive nature as your example is,
> then
> > this
> > should work pretty well.
> >
> > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <n...@lechill.com>
> wrote:
> >
> > > Hi Ted and thanks for all your efforts.
> > > Listen im a little bit lost here trying to understand what you are
> trying
> > > to tell me :-)
> > >
> > > 1. I Store my useragents in a field that is tokenized.
> > > 2. Then when I search, you are saying that I should "scan" down the
> > matches
> > > via a SOLR function, or what?
> > > Are you referring to these functions in SOLR?
> > >
> > > http://wiki.apache.org/solr/FunctionQuery
> > >
> > >
> > > Sorry for not grasping immmediatley!
> > >
> > > Regards Niclas
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > > Sent: 05 February 2010 17:44
> > > To: gene...@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Tokenize your user agent strings, then store the tokenized form
> > separately
> > > from the wild card.  At retrieval time, scan down the matches and
> apply
> > the
> > > wildcard from each document to your original query.  The SOLR
> function
> > > query
> > > might be useful for this as would be a custom hit collector.
> > >
> > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <n...@lechill.com>
> wrote:
> > >
> > > > Hi there, i facing a problem and would like to ask the community
> for
> > some
> > > > help.
> > > >
> > > > In my index I store browser  useragent values as "wildcarded" /
> > partial,
> > > >  which should be understood that an indexed document
> > > > should only be shown to end users if his browsers useragent
> matches a
> > > > wildcared usereragent in my document.
> > > >
> > > > So what I have Is actually a "reversed" matching, the wildcards
> are in
> > my
> > > > document and NOT in my actual query.
> > > > Does anyone know if this "setup" Is possible, e.g. to execute a
> query
> > in
> > > > style with:
> > > >
> > > > useragents:
> > > >
> > >
> > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > >
> > > > In this example I would have a hit because Mozilla/4.0* matches
> the
> > > > useragent.
> > > >
> > > > <doc>
> > > > <useragents>
> > > >                Firefox*
> > > >                Mozilla/4.0*
> > > > </useragents>
> > > > </doc>
> > > >
> > > >
> > > > Regards
> > > > Niclas
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
> 
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: Wildcard searches????

Reply via email to