Hi Erick,

can you use NLP for query-time faceting? How?

Moreover, can you use it to find keyword patterns?

Cheers,

Philippe
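A note on the Named Entity Recognition suggestion quoted further down in this thread: the idea there is to extract entities from each document, store them as metadata, and facet on that metadata field. The sketch below only illustrates that flow. It swaps the real NER tool for a plain word-list lookup, and the names `KNOWN_ENTITIES`, `extract_entities`, `enrich` and the `entities` field are all hypothetical, not part of Solr or Stanford NLP.

```python
import re

# Hypothetical gazetteer standing in for a real NER tool such as Stanford NLP.
KNOWN_ENTITIES = ["Churchill", "Gandhi", "Einstein", "Charlie Chaplin"]


def extract_entities(text, gazetteer=KNOWN_ENTITIES):
    """Return the gazetteer names found in text as whole words, in gazetteer order."""
    found = []
    for name in gazetteer:
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(name)
    return found


def enrich(doc):
    """Return a copy of doc with a multi-valued 'entities' field (assumed field name)."""
    enriched = dict(doc)
    enriched["entities"] = extract_entities(doc["content"])
    return enriched


doc = enrich({
    "url": "/path/to/the/pdf.pdf",
    "content": "blablabla Gandhi blablabla Churchill blablabla",
})
print(doc["entities"])
```

A document enriched this way could then be indexed and faceted on `entities` instead of on the full `content` field, which keeps the number of unique facet values small.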
> On 23 March 2015 at 18:44, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Be a little careful here about memory. Faceting on high-cardinality
> fields is a very good way to encounter OOM and/or performance
> problems.
>
> But you're right, in Solr faceting is a query-time construct, it needs
> nothing at index time. The NLP stuff can help narrow down the number
> of unique values in the field you're faceting on.
>
> Best,
> Erick
>
>> On Mon, Mar 23, 2015 at 9:41 AM, <phi...@free.fr> wrote:
>> I just want a list of recurring words (for now.)
>>
>> I removed the manually-created facets from solrconfig.xml and SOLR
>> "automagically" created a facet list for me.
>>
>> But thanks for your suggestions.
>>
>>
>> ----- Original message -----
>> From: "Charlie Hull" <char...@flax.co.uk>
>> To: solr-user@lucene.apache.org
>> Sent: Monday, 23 March 2015 17:26:18
>> Subject: Re: Creating facets based on the content field
>>
>>> On 23/03/2015 16:08, phi...@free.fr wrote:
>>> Let's say that one pdf has the following contents:
>>
>> Aren't you thinking of Named Entity Recognition? We've used Stanford NLP
>> for this in the past and it's quite good at People, Places and
>> Organisations out of the box (needs tuning for other classes of
>> entities). You can then add these entities as metadata to your document
>> objects and index them so you can facet on them appropriately.
>>
>> Cheers
>>
>> Charlie
>>>
>>> "[thousands of characters] blablabla Churchill blablabla [thousands of text
>>> characters]"
>>>
>>> ... and another PDF contains:
>>>
>>> "[thousands of characters] blablabla Gandhi [thousands of characters]
>>> Churchill blablabla [thousands of text characters]"
>>>
>>> As you can see, these two PDFs contain keywords that are potential
>>> candidates for facets (e.g. Churchill, Gandhi, ...), but I have no
>>> way of knowing that when adding facets to the solrconfig.xml file, unless I
>>> read all the PDFs (which will take me years) and compile a list of
>>> often-occurring words and names.
>>>
>>> The fallback solution is therefore to guess the keywords, which are likely
>>> to appear in the PDFs; e.g.:
>>>
>>> <str name="facet.query">Aircraft</str>
>>> <str name="facet.query">Armistice</str>
>>> <str name="facet.query">Austria</str>
>>> <str name="facet.query">Bolshevik</str>
>>> <str name="facet.query">Britain</str>
>>> <str name="facet.query">British</str>
>>> <str name="facet.query">Charlie Chaplin</str>
>>> <str name="facet.query">Clemenceau</str>
>>> <str name="facet.query">Einstein</str>
>>> ...
>>>
>>>
>>> However, how can I be sure that these facets will be useful to the other
>>> 'core' users? For instance, let's say that one
>>> user is more interested in Gandhi than Einstein: the "Einstein" facet is
>>> therefore useless to him and a "Gandhi" facet is missing from
>>> solrconfig.xml.
>>>
>>> Is there a way to dynamically generate a list of facets based on words
>>> contained in the content field?
>>>
>>> Cheers,
>>>
>>> Philippe
>>>
>>>
>>> ----- Original message -----
>>> From: "Erik Hatcher" <erik.hatc...@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Monday, 23 March 2015 16:30:49
>>> Subject: Re: Creating facets based on the content field
>>>
>>> Philippe - can you provide a concrete example of what you mean by creating
>>> facets on field’s content? Or maybe rather, what’s missing from doing
>>> &facet.field=content currently?
>>>
>>> Erik
>>>
>>>
>>>> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
>>>>
>>>> Hello,
>>>>
>>>> let's say that you have indexed hundreds of PDFs using the following curl
>>>> command:
>>>>
>>>> curl -Ss -X POST
>>>> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>>>>
>>>> The PDF's contents are now stored in core0's "content" field.
>>>>
>>>> I wonder how you create facets based on the field's contents, if you don't
>>>> know in advance what it contains (unless you have compiled a list of
>>>> frequently-occurring words in the PDFs, after reading them.)
>>>>
>>>> Many thanks.
>>>>
>>>> Philippe
>>
>>
>> --
>> Charlie Hull
>> Flax - Open Source Enterprise Search
>>
>> tel/fax: +44 (0)8700 118334
>> mobile: +44 (0)7767 825828
>> web: www.flax.co.uk
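For anyone reading this thread later, the query-time approach discussed above (`facet.field=content`, with the memory caveat in mind) might look roughly like the sketch below. It only builds the request URL and reshapes a facet response; it does not contact a server. The host and core come from the curl command in the thread, the helper names `facet_url` and `pair_counts` are made up for illustration, and the `facet.limit` and `facet.mincount` values are arbitrary assumptions chosen to cap the size of the returned list.

```python
from urllib.parse import urlencode


def facet_url(base, field, limit=20, mincount=2):
    """Build a select URL that returns only facet counts for one field."""
    params = [
        ("q", "*:*"),                  # match every document
        ("rows", 0),                   # no documents wanted, only facet counts
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),        # assumed cap on the number of terms returned
        ("facet.mincount", mincount),  # assumed threshold: skip terms seen only once
    ]
    return base + "/select?" + urlencode(params)


def pair_counts(flat):
    """Turn Solr's default flat JSON layout [term, count, term, count, ...] into pairs."""
    return list(zip(flat[0::2], flat[1::2]))


url = facet_url("http://mysolr:8990/solr/core0", "content")
print(url)

# Made-up sample shaped like facet_counts.facet_fields.content in a JSON response.
print(pair_counts(["churchill", 2, "gandhi", 1]))
```

The terms that come back are whatever the field's analyzer produced at index time, so on a tokenized text field they are typically single lowercased tokens rather than names such as "Charlie Chaplin".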