I left a small mismatch in the field type: the fields I am trying to clean
are all "text_general" (class solr.TextField).
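
One thing that may be worth checking: as far as I understand, Solr only
applies an update chain that is either marked default="true" or selected
per request with the update.chain parameter, and a custom chain normally
ends with solr.RunUpdateProcessorFactory so the documents actually get
indexed. The following is only a sketch under those assumptions, not a
verified fix; the chain name "strip-html" is made up for illustration.

```xml
<!-- solrconfig.xml: hypothetical chain name "strip-html" (an assumption, not from the original config) -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strip HTML markup from every field whose type class is solr.TextField -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="typeClass">solr.TextField</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Final step: hands the document to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With a named chain like this, the indexing URL would select it explicitly,
for example ".../solr/mycore/update?commit=true&update.chain=strip-html".
Marking the chain default="true" instead would avoid the extra parameter,
but it could conflict with any other chain already declared as default.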

On Tue, Feb 20, 2024 at 09:38, Gino Rodrigues <[email protected]> wrote:

> Hello everyone,
>
> I am trying to clean source fields from HTML markup before indexing, using
> an Update Request Processor.
>
> But no variation I try seems to work, and HTML markup is still being
> indexed.
>
> Would anyone have an idea about it?
>
> Thanks in advance!
>
> *indexing command*
> curl -X POST -H "Content-Type: application/csv" \
>      --data-binary @myfile.csv \
>      "http://localhost:8983/solr/mycore/update?commit=true"
>
> *managed-schema.xml*
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer name="standard"/>
>     <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
>     <filter name="lowercase"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer name="standard"/>
>     <filter words="stopwords.txt" ignoreCase="true" name="stop"/>
>     <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true"
>             expand="true"/>
>     <filter name="lowercase"/>
>   </analyzer>
> </fieldType>
> <field name="body" type="text_pt" indexed="true" stored="true"/>
> <copyField source="body" dest="catchall"/>
>
> *solrconfig.xml*
> <updateRequestProcessorChain>
>   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>     <str name="typeClass">solr.TextField</str>
>   </processor>
> </updateRequestProcessorChain>
>
> References
>
> https://solr.apache.org/guide/solr/9_4/configuration-guide/update-request-processors.html
>
> https://solr.apache.org/docs/9_4_1/core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
>
> https://solr.apache.org/docs/9_4_1/core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
>
