Would it be possible to share the DIH config file?

I am not sure if I get all your points correctly.

Ad 1) Is this about a value in a field? Then use the RegexTransformer:
https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
Alternatively, use a RegexReplaceProcessorFactory in solrconfig.xml or a
ScriptTransformer in DIH. E.g. a RegexReplaceProcessorFactory (
https://lucene.apache.org/solr/7_3_0//solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html)
in a custom processing chain in solrconfig.xml:
<updateRequestProcessorChain name="regex_replace">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">\n|\t|\r</str>
    <str name="replacement"></str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

and attach it to your DIH request handler in solrconfig.xml:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">regex_replace</str>
  </lst>
</requestHandler>
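
For the path replacement itself, the RegexTransformer route could look
roughly like the sketch below in the DIH config. It assumes the files are
listed with a FileListEntityProcessor (which exposes fileAbsolutePath); the
entity name "files" and the target field "url" are hypothetical and need to
match your own config and schema. It only swaps the prefix, so backslashes
in the rest of the path would still need a second replacement.

```xml
<!-- Sketch only: entity name "files" and field "url" are assumptions -->
<entity name="files" processor="FileListEntityProcessor"
        baseDir="D:\foo\resource" fileName=".*" recursive="true"
        dataSource="null" transformer="RegexTransformer">
  <!-- regex is a Java regex, so \\ matches one literal backslash -->
  <field column="url" sourceColName="fileAbsolutePath"
         regex="^D:\\foo\\resource\\" replaceWith="http://" />
</entity>
```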




Ad 2) Was this HTML part of the original document, or is it "HTML" generated
by Tika? In the first case you can use an
HTMLStripFieldUpdateProcessorFactory, which is configured in
solrconfig.xml:
https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
You need to create an update processor chain:
https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain


<updateRequestProcessorChain name="remove_html">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">myfield</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

and attach it to your DIH request handler in solrconfig.xml:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">remove_html</str>
  </lst>
</requestHandler>

In the second case (Tika attaches XML elements), specify
extractFormat="text" for Tika (with the TikaEntityProcessor in DIH the
corresponding entity attribute should be format="text"):
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
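
A minimal sketch of how plain-text extraction might be set on the Tika
entity in the DIH config, assuming a TikaEntityProcessor nested inside a
FileListEntityProcessor. The entity names, the data source name "bin" and
the target field "content" are hypothetical; adjust them to your own config
and schema.

```xml
<dataConfig>
  <!-- Binary data source that hands the file stream to Tika -->
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <!-- Outer entity only lists files; names here are assumptions -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\foo\resource" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- format="text" asks Tika for plain text instead of XML/HTML -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="text" name="content" />
      </entity>
    </entity>
  </document>
</dataConfig>
```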

Ad 3) See 1).

Note: You can only attach one chain per DIH, so you need to put all the
processors that you want to apply into that one chain. The transformers are
independent of the processors and are configured in the DIH config.
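
As a rough sketch, a single chain that combines the HTML stripping and the
regex replacement could look like this. The chain name "clean_text" and the
field name "content" are assumptions; the chain would then be referenced via
update.chain in the /dataimport handler defaults.

```xml
<!-- Sketch only: chain name and field name are assumptions -->
<updateRequestProcessorChain name="clean_text">
  <!-- Processors run in the order listed: strip HTML first -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
  </processor>
  <!-- Then remove newlines, tabs and carriage returns -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">\n|\t|\r</str>
    <str name="replacement"></str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <!-- RunUpdateProcessorFactory has to come last so the document is indexed -->
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```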



On Tue, Mar 12, 2019 at 7:47 PM wclarke <wcla...@widernet.org> wrote:

> I have a previous post that looks like this:
>
> I am pulling a large amount of data from a local source D:\foo\resource\. I
> am using tika through a DIH to index the multiple file formats with text and
> metadata.  I have almost all the information being pulled that I want,
> however, I am having a couple of issues:
>
> 1. I need to run a regex replace of the D:\foo\resource\ to be http://,
> which is part of what I want to use XPath for.  I have the regex written,
> but not the replacement and I am not sure of where it needs to be located in
> my data-config.xml file.
>
> 2. I want to strip html where necessary also using XPath.
>
> 3. I need to remove \n, \t, \r, and any other extra crap I am getting in the
> text field to just get to the text content of the document, whatever mime
> type that might be so that it can be searchable.
>
> I am running it through the solr admin data import as opposed to the
> post.jar (I have tried both).  And this is running on Windows and cannot be
> run on Linux as we have no one who can support it.  I am posting my
> tika-data-config.xml (not tikaconfig) I named it this way so as not to be
> confused with our db-config for our catalog pull.
>
> Thanks in advance for any help.  And I will upload any additional files that
> might be helpful upon request - I don't want to overload the post.
>
> We are a small non-profit without a great deal of money, however, if there
> is someone who could finish writing it we would be willing to pay a little
> something for time.  We really need this done ASAP!
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
