I posted this question previously; here it is again:

I am pulling a large amount of data from a local source, D:\foo\resource\.  I
am using Tika through the DataImportHandler (DIH) to index multiple file
formats with their text and metadata.  Almost all of the information I want
is being pulled, but three issues remain:

1. I need to run a regex replace that turns the D:\foo\resource\ prefix into
http://, which is part of what I want to use XPath for.  I have the regex
written, but not the replacement, and I am not sure where it belongs in my
data-config.xml file.

2. I want to strip HTML where necessary, also using XPath.

3. I need to remove \n, \t, \r, and any other extraneous characters I am
getting in the text field, so that only the text content of the document is
left and is searchable, whatever its MIME type.
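
To make the three requirements concrete, here is a rough Python sketch of the
result I am after.  It is only an illustration of the desired output, not DIH
configuration: the function names are made up, the tag stripping is naive, and
flipping the remaining backslashes to forward slashes is my own assumption
about what the final URL should look like.

```python
import html
import re

# Issue 1: the local prefix to be replaced (matched case-insensitively).
LOCAL_PREFIX = re.compile(r"^D:\\foo\\resource\\", re.IGNORECASE)
# Issue 2: a naive pattern for anything that looks like an HTML tag.
TAG = re.compile(r"<[^>]+>")
# Issue 3: any run of whitespace, including \n, \t and \r.
WHITESPACE = re.compile(r"\s+")


def path_to_url(path):
    """Swap the local prefix for http:// and flip remaining backslashes."""
    return LOCAL_PREFIX.sub("http://", path).replace("\\", "/")


def strip_html(text):
    """Replace tags with a space, then decode entities such as &amp;."""
    return html.unescape(TAG.sub(" ", text))


def clean_text(text):
    """Collapse every whitespace run into a single space and trim the ends."""
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(path_to_url("D:\\foo\\resource\\docs\\a.pdf"))
    print(clean_text(strip_html("<p>Hello\n\tworld</p>\r\n")))
```

What I cannot work out is how to get the equivalent of these three steps into
the DIH configuration.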

I am running it through the Solr admin data import rather than post.jar (I
have tried both).  This is running on Windows and cannot be moved to Linux,
as we have no one who can support it.  I am posting my tika-data-config.xml
(not tikaconfig); I named it this way so that it is not confused with the
db-config for our catalog pull.

Thanks in advance for any help.  And I will upload any additional files that
might be helpful upon request - I don't want to overload the post.

We are a small non-profit without a great deal of money; however, if someone
could finish writing it, we would be willing to pay a little something for
their time.  We really need this done ASAP!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
