One addition: you can also strip HTML in DIH using the HTMLStripTransformer:
https://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer
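A minimal sketch of what that could look like in a data-config.xml, assuming a FileListEntityProcessor feeding a TikaEntityProcessor. The entity names, the `bin` data source name, and the `content` target field are illustrative assumptions, not taken from the poster's actual config; only the base directory comes from the original question.

```xml
<!-- data-config.xml (sketch): entity and column names are illustrative -->
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <!-- Outer entity lists the files; baseDir is taken from the original question -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\foo\resource" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- Inner entity extracts text with Tika, then strips any HTML markup -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin"
              transformer="HTMLStripTransformer">
        <field column="text" name="content" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Several transformers can be listed comma-separated in the `transformer` attribute (for example `RegexTransformer,HTMLStripTransformer`), and they are applied in the order given.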
That way you can probably live without an UpdateRequestProcessorChain.

On Tue, Mar 12, 2019 at 10:24 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Would it be possible to share the DIH config file?
>
> I am not sure I understand all your points correctly.
>
> Ad 1) Is this about a value in a field? Then use the RegexTransformer:
> https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
> Alternatively, use a RegexReplaceProcessorFactory in solrconfig.xml or a
> ScriptTransformer in DIH. E.g. a RegexReplaceProcessorFactory (
> https://lucene.apache.org/solr/7_3_0//solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html)
> in a custom processing chain in solrconfig.xml:
>
> <updateRequestProcessorChain name="regex_replace">
>   <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">\n|\t|\r</str>
>     <str name="replacement"></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> and attach it to your DIH in solrconfig.xml:
>
> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">data-config.xml</str>
>     <str name="update.chain">regex_replace</str>
>   </lst>
> </requestHandler>
>
> Ad 2) Was this HTML part of the original document, or is it "HTML"
> generated by Tika?
> In the first case, you can use an HTMLStripFieldUpdateProcessorFactory,
> which should be configured in solrconfig.xml:
> https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
> You need to create an update processor chain:
> https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain
>
> <updateRequestProcessorChain name="remove_html">
>   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>     <str name="fieldName">myfield</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> and attach it to your DIH in solrconfig.xml:
>
> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">data-config.xml</str>
>     <str name="update.chain">remove_html</str>
>   </lst>
> </requestHandler>
>
> In the second case (Tika attaches XML elements), specify
> extractFormat="text" for Tika in DIH:
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>
> Ad 3) See 1).
>
> Note: You can only attach one chain per DIH, so you need to put all the
> processors that you want to apply into one chain. The transformers are
> independent of the processors and are configured in the DIH.
>
> On Tue, Mar 12, 2019 at 7:47 PM wclarke <wcla...@widernet.org> wrote:
>
>> I have a previous post that looks like this:
>>
>> I am pulling a large amount of data from a local source,
>> D:\foo\resource\. I am using Tika through a DIH to index the multiple
>> file formats with text and metadata. I have almost all the information
>> being pulled that I want; however, I am having a couple of issues:
>>
>> 1. I need to run a regex replace of the D:\foo\resource\ to be http://,
>> which is part of what I want to use XPath for.
>> I have the regex written, but not the replacement, and I am not sure
>> where it needs to be located in my data-config.xml file.
>>
>> 2. I want to strip HTML where necessary, also using XPath.
>>
>> 3. I need to remove \n, \t, \r, and any other extra crap I am getting in
>> the text field to just get to the text content of the document, whatever
>> MIME type that might be, so that it can be searchable.
>>
>> I am running it through the Solr admin data import as opposed to the
>> post.jar (I have tried both). And this is running on Windows and cannot
>> be run on Linux, as we have no one who can support it. I am posting my
>> tika-data-config.xml (not tikaconfig); I named it this way so as not to
>> be confused with our db-config for our catalog pull.
>>
>> Thanks in advance for any help. And I will upload any additional files
>> that might be helpful upon request - I don't want to overload the post.
>>
>> We are a small non-profit without a great deal of money; however, if
>> there is someone who could finish writing it, we would be willing to pay
>> a little something for time. We really need this done ASAP!
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
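Since only one chain can be attached to the DIH handler, the two processor examples quoted above would have to be merged. A rough sketch of such a combined chain for solrconfig.xml follows. The chain name `clean_content` is made up for illustration, and the `content` field name is carried over from the regex example; adjust both to the real schema.

```xml
<!-- solrconfig.xml (sketch): one chain holding both processors -->
<updateRequestProcessorChain name="clean_content">
  <!-- Strip HTML markup from the field first -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
  </processor>
  <!-- Then remove newlines, tabs, and carriage returns -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">\n|\t|\r</str>
    <str name="replacement"></str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Attach the single combined chain to the DIH handler -->
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">clean_content</str>
  </lst>
</requestHandler>
```

Processors run in the order they are listed, so the HTML is stripped before the whitespace characters are removed. Note that an empty replacement joins the words on either side of a removed line break, which may or may not be what is wanted for search.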