Thanks again. For the moment I think it won't be a problem. I have ~500 documents. Regards,
Francisco

On Fri, Sep 11, 2015 at 6:08 PM, simon <mtnes...@gmail.com> wrote:

> +1 on Sujit's recommendation: we have a similar use case (detecting drug
> names / disease entities / MeSH terms) and have been using the
> SolrTextTagger with great success.
>
> We run a separate Solr instance as a tagging service and add the detected
> tags as metadata fields to a document before it is ingested into our main
> Solr collection.
>
> How many documents/product leaflets do you have? The tagger is very fast
> at the Solr level, but I'm seeing quite a bit of HTTP overhead.
>
> best
>
> -Simon
>
> On Fri, Sep 11, 2015 at 1:39 PM, Sujit Pal <sujit....@comcast.net> wrote:
>
> > Hi Francisco,
> >
> > >> I have many drug product leaflets, each corresponding to one product.
> > >> On the other hand, we have a medical dictionary with about 10^5 terms.
> > >> I want to detect all the occurrences of those terms for any leaflet
> > >> document.
> >
> > Take a look at SolrTextTagger for this use case:
> > https://github.com/OpenSextant/SolrTextTagger
> >
> > 10^5 entries are not that large; I am using it for much larger
> > dictionaries at the moment, with very good results.
> >
> > It's a project built (at least originally) by David Smiley, who is also
> > quite active in this group.
> >
> > -sujit
> >
> > On Fri, Sep 11, 2015 at 7:29 AM, Alexandre Rafalovitch
> > <arafa...@gmail.com> wrote:
> >
> > > Assuming the medical dictionary is constant, I would do a copyField of
> > > the text into a separate field and have that separate field use
> > > KeepWordFilterFactory:
> > >
> > > http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html
> > >
> > > with words coming from the dictionary (normalized).
> > >
> > > That way the new field will ONLY have your dictionary terms from the
> > > text. Then you can facet against that field or anything else, or even
> > > search, and just be a lot more efficient.
> > > The main issue would be a gigantic filter, which may mean speed and/or
> > > memory issues. Solr has some ways to deal with such large set matches
> > > by compiling them into a state machine (used for auto-complete), but I
> > > don't know if that's exposed for your purpose.
> > >
> > > But it could make a fun custom filter to build.
> > >
> > > Regards,
> > >    Alex.
> > > ----
> > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > http://www.solr-start.com/
> > >
> > > On 10 September 2015 at 22:21, Francisco Andrés Fernández
> > > <fra...@gmail.com> wrote:
> > >
> > > > Yes.
> > > > I have many drug product leaflets, each corresponding to one
> > > > product. On the other hand, we have a medical dictionary with about
> > > > 10^5 terms.
> > > > I want to detect all the occurrences of those terms for any leaflet
> > > > document.
> > > > Could you give me a clue about the best way to perform it?
> > > > Perhaps the best way is (as Walter suggests) to do all the queries
> > > > every time, as needed.
> > > > Regards,
> > > >
> > > > Francisco
> > > >
> > > > On Thu, Sep 10, 2015 at 11:14 AM, Alexandre Rafalovitch
> > > > <arafa...@gmail.com> wrote:
> > > >
> > > >> Can you tell us a bit more about the business case? Not the current
> > > >> technical one. It is entirely possible Solr can solve the
> > > >> higher-level problem out of the box without you doing manual term
> > > >> comparisons, in which case your problem scope is not quite right.
> > > >>
> > > >> Regards,
> > > >>    Alex.
> > > >> ----
> > > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > >> http://www.solr-start.com/
> > > >>
> > > >> On 10 September 2015 at 09:58, Francisco Andrés Fernández
> > > >> <fra...@gmail.com> wrote:
> > > >>
> > > >> > Hi all, I'm new to Solr.
> > > >> > I want to detect all occurrences of terms existing in a thesaurus
> > > >> > in one or more documents.
> > > >> > What's the best strategy for doing it?
> > > >> > Doing a query for each term doesn't seem to be the best way.
> > > >> > Many thanks,
> > > >> >
> > > >> > Francisco
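[Editor's note] Alexandre's copyField + KeepWordFilterFactory suggestion could be sketched in schema.xml roughly as follows. This is a sketch only: the field names (text_all, dict_matches, dict_terms) and the keepwords.txt file are invented for illustration, not anything from the thread.

```xml
<!-- Sketch: keepwords.txt holds the normalized dictionary, one term per line. -->
<fieldType name="dict_terms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalize tokens the same way the dictionary entries were normalized. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep ONLY tokens that appear in the dictionary; drop everything else. -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<field name="text_all"     type="text_general" indexed="true" stored="true"/>
<field name="dict_matches" type="dict_terms"   indexed="true" stored="false"/>
<copyField source="text_all" dest="dict_matches"/>
```

With such a field in place, faceting on it (e.g. `facet.field=dict_matches`) would list which dictionary terms occur in the matching documents. One caveat: KeepWordFilterFactory operates on single tokens, so multi-word dictionary entries would not match as phrases without extra work (shingling, or the tagger approach Sujit and Simon describe).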
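[Editor's note] The tagging-service setup Simon describes is normally driven over HTTP by POSTing a document's raw text to the SolrTextTagger request handler. A hypothetical call might look like the following; the core name (tagger), the returned fields, and leaflet.txt are assumptions for illustration:

```
# Hypothetical: POST one leaflet's plain text to a SolrTextTagger /tag handler.
curl -s 'http://localhost:8983/solr/tagger/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name&wt=json' \
  -H 'Content-Type: text/plain' \
  --data-binary @leaflet.txt
```

The response lists the matched dictionary entries with their character offsets in the posted text, which is the metadata Simon attaches to each document before it is ingested into the main collection. Since each document costs one HTTP round trip, the per-request overhead he mentions is expected; reusing connections or batching documents on the client side may reduce it.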