Re: Creating facets based on the content field

Philippe de Rochambeau Mon, 23 Mar 2015 14:21:18 -0700

Hi Erick,
can you use NLP for query-time faceting? If so, how?
Moreover, can you use it to find keyword patterns?
Cheers,
Philippe



> On 23 March 2015 at 18:44, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Be a little careful here about memory. Faceting on high-cardinality
> fields is a very good way to encounter OOM and/or performance
> problems.
> 
> But you're right: in Solr, faceting is a query-time construct and needs
> nothing at index time. The NLP stuff can help narrow down the number
> of unique values in the field you're faceting on.
> 
> Best,
> Erick
> 
>> On Mon, Mar 23, 2015 at 9:41 AM,  <phi...@free.fr> wrote:
>> I just want a list of recurring words (for now.)
>> 
>> I removed the manually created facets from solrconfig.xml and Solr
>> "automagically" created a facet list for me.
>> 
>> But thanks for your suggestions.
>> 
>> 
>> 
>> ----- Original message -----
>> From: "Charlie Hull" <char...@flax.co.uk>
>> To: solr-user@lucene.apache.org
>> Sent: Monday, 23 March 2015 17:26:18
>> Subject: Re: Creating facets based on the content field
>> 
>>> On 23/03/2015 16:08, phi...@free.fr wrote:
>>> Let's say that one pdf has the following contents:
>> 
>> Aren't you thinking of Named Entity Recognition? We've used Stanford NLP
>> for this in the past and it's quite good at People, Places and
>> Organisations out of the box (needs tuning for other classes of
>> entities). You can then add these entities as metadata to your document
>> objects and index them so you can facet on them appropriately.
>> 
>> Cheers
>> 
>> Charlie
>>> 
>>> "[thousands of characters] blablabla Churchill blablabla [thousands of text 
>>> characters]"
>>> 
>>> ... and another PDF contains:
>>> 
>>> "[thousands of characters] blablabla Gandhi [thousands of characters] 
>>> Churchill blablabla [thousands of text characters]"
>>> 
>>> As you can see, these two PDFs contain keywords that are potential
>>> candidates for facets (e.g. Churchill, Gandhi, ...), but I have no
>>> way of knowing that when adding facets to the solrconfig.xml file, unless I 
>>> read all the PDFs (which will take me years) and compile a list of 
>>> often-occurring words and names.
>>> 
>>> The fallback solution is therefore to guess the keywords that are likely
>>> to appear in the PDFs; e.g.:
>>> 
>>>                                 <str name="facet.query">Aircraft</str>
>>>                                 <str name="facet.query">Armistice</str>
>>>                                 <str name="facet.query">Austria</str>
>>>                                 <str name="facet.query">Bolshevik</str>
>>>                                 <str name="facet.query">Britain</str>
>>>                                 <str name="facet.query">British</str>
>>>                                 <str name="facet.query">Charlie Chaplin</str>
>>>                                 <str name="facet.query">Clemenceau</str>
>>>                                 <str name="facet.query">Einstein</str>
>>> ...
>>> 
>>> 
>>> However, how can I be sure that these facets will be useful to the other
>>> users of the core? For instance, let's say that one user is more
>>> interested in Gandhi than Einstein: the "Einstein" facet is therefore
>>> useless to him, and a "Gandhi" facet is missing from solrconfig.xml.
>>> 
>>> Is there a way to dynamically generate a list of facets based on words 
>>> contained in the content field?
>>> 
>>> Cheers,
>>> 
>>> Philippe
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ----- Original message -----
>>> From: "Erik Hatcher" <erik.hatc...@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Monday, 23 March 2015 16:30:49
>>> Subject: Re: Creating facets based on the content field
>>> 
>>> Philippe - can you provide a concrete example of what you mean by creating
>>> facets on a field's content? Or rather, what's missing from doing
>>> &facet.field=content currently?
>>> 
>>>     Erik
>>> 
>>> 
>>> 
>>> 
>>>> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> let's say that you have indexed hundreds of PDFs using the following curl
>>>> command:
>>>> 
>>>> curl -Ss -X POST \
>>>> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>>>> 
>>>> The PDF's contents are now stored in core0's "content" field.
>>>> 
>>>> I wonder how you create facets based on the field's contents, if you don't 
>>>> know in advance what it contains (unless you have compiled a list of 
>>>> frequently-occurring words in the PDFs, after reading them.)
>>>> 
>>>> Many thanks.
>>>> 
>>>> Philippe
>> 
>> 
>> --
>> Charlie Hull
>> Flax - Open Source Enterprise Search
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.flax.co.uk
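
Charlie's suggestion above (run Named Entity Recognition at index time, then store the entities as metadata you can facet on) might look roughly like the sketch below. It does not call Stanford NLP; it assumes some NER step has already produced (text, label) pairs. The field names "people", "places" and "organisations" and the label-to-field mapping are hypothetical, and the entity list is made-up sample data.

```python
import json

# Hypothetical mapping from NER labels to multi-valued Solr fields.
LABEL_TO_FIELD = {
    "PERSON": "people",
    "LOCATION": "places",
    "ORGANIZATION": "organisations",
}


def entities_to_fields(entities):
    """Group (text, label) pairs into field -> list of unique values."""
    fields = {}
    for text, label in entities:
        field = LABEL_TO_FIELD.get(label)
        if field is None:
            continue  # ignore entity classes we do not facet on
        values = fields.setdefault(field, [])
        if text not in values:
            values.append(text)
    return fields


def build_solr_doc(doc_id, content, entities):
    """Build a JSON-ready document carrying the text plus entity metadata."""
    doc = {"id": doc_id, "content": content}
    doc.update(entities_to_fields(entities))
    return doc


# Made-up output of an NER step, for illustration only.
entities = [
    ("Gandhi", "PERSON"),
    ("Churchill", "PERSON"),
    ("Churchill", "PERSON"),
    ("Britain", "LOCATION"),
]
doc = build_solr_doc("/path/to/the/pdf.pdf", "[extracted text]", entities)
print(json.dumps([doc], indent=2))
```

Faceting on a field such as "people" keeps the number of unique values far lower than faceting on every word of "content", which is the point Erick makes about memory.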

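Since faceting is a query-time construct, the list of recurring words can be requested per query instead of being hard-coded as facet.query entries in solrconfig.xml. The sketch below only builds the request URL and reshapes a response fragment; it sends nothing over the network. The host and core name are taken from the curl command in the thread, the limit and mincount values are arbitrary choices, and the sample counts are made up.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, field="content", limit=20, mincount=2):
    """Build a Solr select URL that returns facet counts for one field."""
    params = [
        ("q", "*:*"),          # match every document
        ("rows", 0),           # no documents needed, only the facet counts
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),        # cap the number of values returned
        ("facet.mincount", mincount),  # skip terms below this count
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def pair_facet_counts(flat):
    """Turn Solr's default flat [term, count, term, count, ...] list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))


url = facet_query_url("http://mysolr:8990/solr/core0")
print(url)

# Made-up fragment shaped like facet_counts.facet_fields.content.
sample = ["churchill", 2, "gandhi", 1]
print(pair_facet_counts(sample))
```

The terms that come back are the indexed tokens, so their exact form depends on the field's analyzer, and Erick's warning about high-cardinality fields applies to a large "content" field.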