Hi Remi,

Your use-case is more-or-less exactly what I wrote luwak for: 
https://github.com/flaxsearch/luwak.  You register your queries with a Monitor 
object, and then match documents against them.  The monitor analyzes the 
documents that are passed in and tries to filter out queries that it can detect 
won't match ahead of time, which is particularly useful if some of your queries 
are complex and expensive to run.
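
As a rough illustration of that idea, here is a toy sketch in Python. It is 
not luwak's API (luwak is a Java library): the ToyMonitor class and its 
register/match methods are made-up names, queries are assumed to be plain 
phrases, and the prefilter is a deliberately simple stand-in that only 
considers queries whose first term occurs in the document. The sample 
phrases are the ones from Remi's example further down the thread.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyMonitor:
    """Hypothetical stored-query monitor; an illustration, not luwak's API."""

    def __init__(self):
        self.queries = {}                       # query id -> phrase tokens
        self.by_first_term = defaultdict(set)   # term -> ids of phrases starting with it

    def register(self, query_id, phrase):
        tokens = tokenize(phrase)
        if not tokens:
            raise ValueError("empty phrase")
        self.queries[query_id] = tokens
        self.by_first_term[tokens[0]].add(query_id)

    def match(self, document):
        doc_tokens = tokenize(document)
        # Prefilter: a phrase cannot match unless its first term is in the document.
        candidates = set()
        for term in set(doc_tokens):
            candidates |= self.by_first_term.get(term, set())
        # Only the surviving candidates are actually run against the document.
        matched = []
        for query_id in sorted(candidates):
            phrase = self.queries[query_id]
            n = len(phrase)
            if any(doc_tokens[i:i + n] == phrase
                   for i in range(len(doc_tokens) - n + 1)):
                matched.append(query_id)
        return matched


monitor = ToyMonitor()
monitor.register(1, "Mad Max")
monitor.register(2, "George Miller")
monitor.register(3, "global market")
monitor.register(4, "Solr development")

text = ("Mad Max is a 1979 Australian film directed by George Miller "
        "that opened the global market.")
print(monitor.match(text))  # [1, 2, 3]
```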

We've found that luwak performs better than the percolator out of the box 
(http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/),
but depending on how many queries you have and how complex they are, you may 
find the percolator a lot easier to set up: it comes bundled as part of 
Elasticsearch, while luwak is just a Java library and will require some 
coding to get it up and running.

Alan Woodward
www.flax.co.uk


On 3 Oct 2015, at 23:05, remi tassing wrote:

> @Jack: After reading the documentation, I think the percolator is what I'm
> after. The filtering possibility is extremely appealing as well. I'll have
> a closer look and experiment a bit.
> 
> @Erick: Yes, that's right, though notification is not really needed in my
> case. It should be doable as you said... the percolator could be a good
> reference.
> 
> Thank you all guys!
> On Oct 3, 2015 6:08 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> 
>> OK, finally the light dawns. You're doing something akin to "alerts":
>> that is, store a bunch of queries, then when a new document comes
>> in, find out if any of the queries would match the doc and send
>> out alerts to each user who has entered a query like that. Your
>> situation may not be doing exactly that, but some kind of alerting
>> mechanism would work, right?
>> 
>> There are several approaches; Googling "solr alerts" will
>> turn up a few. Lucidworks, Flax and others have built
>> tools (some commercial) for this.
>> 
>> One way to approach this is to store the queries "somewhere",
>> perhaps in a DB, perhaps in their own Solr collection, and write
>> a custom component that takes an incoming document and puts
>> it in a MemoryIndex, runs the queries against it and sends
>> the alerts. This requires some lower-level programming, but is
>> quite do-able.
>> 
>> Best,
>> Erick
>> 
>> On Sat, Oct 3, 2015 at 7:02 AM, Gili Nachum <gilinac...@gmail.com> wrote:
>>> Check if MLT (more like this) could fit your requirements.
>>> https://wiki.apache.org/solr/MoreLikeThis
>>> 
>>> If your requirements are more specific I think your client program should
>>> tokenize the target document then construct one or more queries like:
>>> "token token2" OR "token2 token3" OR ...
>>> 
>>> I'm not sure how you get the list of tokens, perhaps using the same API
>>> that the analyze admin page uses (haven't checked).
>>> On Oct 3, 2015 09:32, "remi tassing" <tassingr...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> @Erick: Yes, I'm using the admin UI, and yes, I quickly noticed that
>>>> keywordTokenizer couldn't work.
>>>> @All: Sorry for not explaining properly. I'm aware of the phrase query
>>>> and a little bit of the N-Gram.
>>>> 
>>>> So to simplify my problem, the documents indexed are:
>>>> id:1, content:Mad Max
>>>> id:2, content:George Miller
>>>> id:3, content:global market
>>>> id:4, content:Solr development
>>>> 
>>>> Now the query is the content of the wiki page at
>>>> https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29
>>>> 
>>>> the results id:1, id:2, id:3 should be returned but not id:4. Today I'm
>>>> able to do this with something similar to grep (Aho-Corasick) but the list
>>>> is growing bigger and bigger. I thought Solr/Lucene could tackle this more
>>>> efficiently and also add other capabilities like filtering ...
>>>> 
>>>> Maybe there is another tool more suitable for the job?
>>>> 
>>>> Remi
>>>> 
>>>> 
>>>> On Fri, Oct 2, 2015 at 10:07 PM, Andrea Roggerone <
>>>> andrearoggerone.o...@gmail.com> wrote:
>>>> 
>>>>> Hi, the phrase query format would be:
>>>>> "Mad Max"~2
>>>>> The * has been added by the mail aggregator around the chars in bold for
>>>>> some reason. That wasn't a wildcard.
>>>>> 
>>>>> On Friday, October 2, 2015, Roman Chyla <roman.ch...@gmail.com> wrote:
>>>>> 
>>>>>> I'd like to offer another option:
>>>>>> 
>>>>>> you say you want to match long query into a document - but maybe you
>>>>>> won't know whether to pick "Mad Max" or "Max is" (not mentioning the
>>>>>> performance hit of "*mad max*" search - or is it not the case
>>>>>> anymore?). Take a look at the NGram tokenizer (say size of 2; or
>>>>>> bigger). What it does, it splits the input into overlapping segments
>>>>>> of 'X' words (words, not characters - however, characters work too -
>>>>>> just pick bigger N)
>>>>>> 
>>>>>> mad max
>>>>>> max 1979
>>>>>> 1979 australian
>>>>>> 
>>>>>> i'd recommend placing stopfilter before the ngram
>>>>>> 
>>>>>> - then for the long query string of "Hey Mad Max is 1979...." you
>>>>>> would search "hey mad" OR "mad max" OR "max 1979"... (perhaps the query
>>>>>> tokenizer could be convinced to do the search for you automatically). And
>>>>>> voila, the more overlapping segments there are, the higher the search
>>>>>> result ranks.
>>>>>> 
>>>>>> hth,
>>>>>> 
>>>>>> roman
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Oct 2, 2015 at 12:03 PM, Erick Erickson
>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>> The admin/analysis page is your friend here, find it and use it ;)
>>>>>>> Note you have to select a core on the admin UI screen before you can
>>>>>>> see the choice.
>>>>>>> 
>>>>>>> Because apart from the other comments, KeywordTokenizer is a red flag.
>>>>>>> It does NOT break anything up into tokens, so if your doc contains:
>>>>>>> Mad Max is a 1979 Australian
>>>>>>> as the whole field, the _only_ match you'll ever get is if you search
>>>>>>> exactly
>>>>>>> "Mad Max is a 1979 Australian"
>>>>>>> Not Mad, not mad, not Max, exactly all 6 words separated by exactly one
>>>>>>> space.
>>>>>>> 
>>>>>>> Andrea's suggestion is the one you want, but be sure you use one of
>>>>>>> the tokenizing analysis chains, perhaps start with text_en (in the
>>>>>>> stock distro). Be sure to completely remove your node/data directory
>>>>>>> (as in rm -rf data) after you make the change.
>>>>>>> 
>>>>>>> And really, explore the admin/analysis page; it's where a LOT of these
>>>>>>> kinds of problems find solutions ;)
>>>>>>> 
>>>>>>> Best,
>>>>>>> Erick
>>>>>>> 
>>>>>>> On Fri, Oct 2, 2015 at 7:57 AM, Ravi Solr <ravis...@gmail.com> wrote:
>>>>>>>> Hello Remi,
>>>>>>>> I am assuming the field where you store the data is analyzed.
>>>>>>>> The field definition might help us answer your question better. If you
>>>>>>>> are using the edismax handler for your search requests, I believe you can
>>>>>>>> achieve your goal by setting your "mm" to 100% and the phrase slop "ps"
>>>>>>>> and query slop "qs" parameters to zero. I think that will force exact
>>>>>>>> matches.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Ravi Kiran Bhaskar
>>>>>>>> 
>>>>>>>> On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone
>>>>>>>> <andrearoggerone.o...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Remi,
>>>>>>>>> The question is not really clear, could you explain a little bit better
>>>>>>>>> what you need? Reading your email I understand that you want to get
>>>>>>>>> documents containing all the search terms typed. For instance if you
>>>>>>>>> search for "Mad Max", you want to get documents containing both Mad and
>>>>>>>>> Max. If that's your need, you can use a phrase query like:
>>>>>>>>> 
>>>>>>>>> *"*Mad Max*"~2*
>>>>>>>>> 
>>>>>>>>> where enclosing your keywords between double quotes means that you
>>>>>>>>> want to get both Mad and Max and the optional parameter ~2 is an
>>>>>>>>> example of *slop*.
>>>>>>>>> If you need more info you can look for *Phrase Query* in
>>>>>>>>> https://wiki.apache.org/solr/SolrRelevancyFAQ
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing <tassingr...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> I have medium-low experience on Solr and I have a question I couldn't
>>>>>>>>>> quite solve yet.
>>>>>>>>>> 
>>>>>>>>>> Typically we have quite short query strings (a couple of words) and the
>>>>>>>>>> search is done through a set of bigger documents. What if the logic is
>>>>>>>>>> turned a little bit around? I have a document and I need to find out what
>>>>>>>>>> strings appear in the document. A string here could be a person name
>>>>>>>>>> (including space for example) or a location...which are indexed in Solr.
>>>>>>>>>> 
>>>>>>>>>> A concrete example, we take this text from wikipedia (Mad Max):
>>>>>>>>>> "*Mad Max is a 1979 Australian dystopian action film directed by George
>>>>>>>>>> Miller <https://en.wikipedia.org/wiki/George_Miller_%28director%29>.
>>>>>>>>>> Written by Miller and James McCausland from a story by Miller and
>>>>>>>>>> producer Byron Kennedy <https://en.wikipedia.org/wiki/Byron_Kennedy>, it
>>>>>>>>>> tells a story of societal breakdown
>>>>>>>>>> <https://en.wikipedia.org/wiki/Societal_collapse>, murder, and vengeance
>>>>>>>>>> <https://en.wikipedia.org/wiki/Revenge>. The film, starring the
>>>>>>>>>> then-little-known Mel Gibson <https://en.wikipedia.org/wiki/Mel_Gibson>,
>>>>>>>>>> was released internationally in 1980. It became a top-grossing Australian
>>>>>>>>>> film, while holding the record in the Guinness Book of Records
>>>>>>>>>> <https://en.wikipedia.org/wiki/Guinness_Book_of_Records> for decades as
>>>>>>>>>> the most profitable film ever created,[1]
>>>>>>>>>> <https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29#cite_note-1> and
>>>>>>>>>> has been credited for further opening the global market to Australian New
>>>>>>>>>> Wave <https://en.wikipedia.org/wiki/Australian_New_Wave> films.*
>>>>>>>>>> <https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29#cite_note-2>
>>>>>>>>>> <https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29#cite_note-3>"
>>>>>>>>>> 
>>>>>>>>>> I would like it to match "Mad Max" but not "Mad" or "Max" separately,
>>>>>>>>>> and "George Miller", "global market" ...
>>>>>>>>>> 
>>>>>>>>>> I've tried the keywordTokenizer but it didn't work. I suppose it's ok for
>>>>>>>>>> the index time but not query time (in this specific case)
>>>>>>>>>> 
>>>>>>>>>> I had a look at Luwak but it's not what I'm looking for
>>>>>>>>>> (http://www.flax.co.uk/blog/2013/12/06/introducing-luwak-a-library-for-high-performance-stored-queries/)
>>>>>>>>>> 
>>>>>>>>>> The typical name search doesn't seem to work either,
>>>>>>>>>> https://dzone.com/articles/tips-name-search-solr
>>>>>>>>>> 
>>>>>>>>>> I was thinking this problem must have already been solved...or?
>>>>>>>>>> 
>>>>>>>>>> Remi
>>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
