Since the formatting issues with extracting content from PDF and DOC files
have been fixed in Tika 0.9, I want to integrate it into my existing Solr
project. Please let me know the steps.
All I have is the downloaded Solr 3.3.0 folder, and I am currently using only
the bundled Jetty server. This version
I am using Solr 3.3.0 with the bundled Jetty server. When I upload MS Word
documents or PDF files, the text is not formatted properly.
1. There are no line breaks between sentences. The text is extracted as a
single line or string.
2. Wherever there are boxes in Word documents, some weird char
Thanks for your suggestion. Do I just need to replace these jars?
How do I build again? I am just using start.jar as of now.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Tika-0-9-integration-in-Solr-3-3-0-tp3267799p3268030.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
Thanks for your suggestion, Mike. Attached is the MS Word file.
What happens is that I get a single line of text, but I want it to keep its
original formatting so that I can display it in highlighting.
Thanks
http://lucene.472066.n3.nabble.com/file/n3269071/2011-01-23-7-22-09_sample.doc
Thanks again, Tom, for trying to help me out, but it didn't work on my side.
What I did:
poi-3.8-beta3-20110606
pdfbox-app-1.6.0
tika-core-0.9
I replaced all the jars above as you said.
The following jars were also necessary, as it was giving many errors:
poi-scratchpad-3.8-beta3-20110606
poi-oo
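When jars are swapped by hand like this, one common source of errors is an old and a new version of the same library sitting side by side on the classpath. Here is a minimal sketch for spotting that, assuming the extraction jars live in contrib/extraction/lib (that path is an assumption, adjust it to your installation) and that the version starts at the first hyphen followed by a digit. The function names are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Splits e.g. "poi-scratchpad-3.8-beta3-20110606.jar" into artifact and version.
# Assumption: the version starts at the first "-" that is followed by a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d.*)\.jar$")


def group_jars(names):
    """Map artifact name -> sorted list of versions found among jar file names."""
    found = defaultdict(set)
    for name in names:
        match = JAR_NAME.match(name)
        if match:  # names without a version (e.g. start.jar) are skipped
            found[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(versions) for artifact, versions in found.items()}


def find_conflicts(names):
    """Return only the artifacts that are present in more than one version."""
    return {a: v for a, v in group_jars(names).items() if len(v) > 1}


if __name__ == "__main__":
    # Assumed location of the extraction jars; change it for your setup.
    lib_dir = Path("contrib/extraction/lib")
    names = [p.name for p in lib_dir.glob("*.jar")] if lib_dir.is_dir() else []
    for artifact, versions in sorted(find_conflicts(names).items()):
        print(f"{artifact}: {', '.join(versions)}")
```

If this prints an artifact with two versions, removing the older jar would be the first thing to try before looking at the individual errors.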
First of all, thanks again, Mike, for helping me out.
Yes, I have seen that some text does get stripped out sometimes. Any idea
why this could be happening?
I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
to 0.9? If so, how?
Also, I am storing only this text, which I am
Currently I am using Solr 3.3.0 to index rich documents such as MS Word
files and PDFs.
I want to show the whole indexed text as a preview for each document that
matches a search.
For example, if I search for the word "marketing" and it is found in
documents A
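For a whole-text preview like the one described above, one option is to ask Solr's highlighter not to cut the field into fragments. This is a minimal sketch that only builds the request URL. The base URL, the field name "content" and the helper name are assumptions for illustration, and the field would have to be stored for highlighting to return anything.

```python
from urllib.parse import urlencode


def preview_query_url(base_url, term, field):
    """Build a Solr select URL that asks for `field` highlighted in full."""
    params = [
        ("q", f"{field}:{term}"),  # search the (assumed) text field
        ("fl", "id"),              # the preview itself comes from the highlighting section
        ("hl", "true"),            # switch highlighting on
        ("hl.fl", field),          # field to highlight
        ("hl.fragsize", "0"),      # 0 = use the whole field value, no fragmenting
    ]
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed URL of the bundled Jetty example server.
    print(preview_query_url("http://localhost:8983/solr", "marketing", "content"))
```

For long documents, hl.maxAnalyzedChars may also need raising, since the highlighter only looks at a limited number of characters by default.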
Thanks very much for your suggestion.
But as for the XHTML output, I believe it is a one-time process that happens
during extraction. That means I again have to store/index that XHTML
output text for later use. Is this correct, or am I missing
something?
Regards
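If the XHTML is indeed only available at extraction time, one way to avoid storing the markup itself is to reduce it to plain text with one line per paragraph and store that instead. This is a rough sketch, assuming you already have the XHTML as a string and that paragraphs arrive as p elements in the XHTML namespace. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's {uri}tag notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_lines(xhtml):
    """Return the text of each non-empty <p> element, one entry per paragraph."""
    root = ET.fromstring(xhtml)
    lines = []
    for para in root.iter(f"{XHTML_NS}p"):
        # Join nested text (bold, links, ...) and collapse runs of whitespace.
        text = " ".join("".join(para.itertext()).split())
        if text:
            lines.append(text)
    return lines


def xhtml_to_text(xhtml):
    """Plain text with a line break between paragraphs, ready to be stored."""
    return "\n".join(xhtml_to_lines(xhtml))
```

The resulting string keeps the paragraph breaks, so the stored field can be shown as a preview without the text running together in a single line.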
Please let me know how I can get rid of this exception.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Tika-0-9-integration-in-Solr-3-3-0-tp3267799p3274463.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.