Tika 0.9 integration in Solr 3.3.0

2011-08-19 Thread nirnaydewan
Since the formatting issues with extracting content from PDF and DOC files have been fixed in Tika 0.9, I want to integrate it into my existing Solr project. Please let me know the steps. All I have is the downloaded Solr 3.3.0 folder, and I am currently using only the bundled Jetty server. This version

Issue in text extraction in Solr / Tika

2011-08-19 Thread nirnaydewan
I am using Solr 3.3.0 with the bundled Jetty server. When I upload MS Word documents or PDF files, the text is not formatted properly. 1. There are no line breaks between sentences; the text is extracted as a single line or string. 2. Wherever there are boxes in Word documents, some weird char

Re: Tika 0.9 integration in Solr 3.3.0

2011-08-19 Thread nirnaydewan
Thanks for your suggestion. Do I just need to replace these jars? How do I build again? I am just using start.jar as of now. -- View this message in context: http://lucene.472066.n3.nabble.com/Tika-0-9-integration-in-Solr-3-3-0-tp3267799p3268030.html Sent from the Apache Tika - Development m

Re: Issue in text extraction in Solr / Tika

2011-08-19 Thread nirnaydewan
Thanks for your suggestion, Mike. Attached is the MS Word file. What happens is that I get a single line of text, but I want it to keep its original formatting so that I can display it in highlighting. Thanks http://lucene.472066.n3.nabble.com/file/n3269071/2011-01-23-7-22-09_sample.doc 2011-01-23-7-22-

Re: Tika 0.9 integration in Solr 3.3.0

2011-08-19 Thread nirnaydewan
Thanks again, Tom, for trying to help me out, but it didn't work on my side. What I did: poi-3.8-beta3-20110606, pdfbox-app-1.6.0, tika-core-0.9. I replaced all the jars above as you said. The following jars were also necessary, as it was giving many errors: poi-scratchpad-3.8-beta3-20110606 poi-oo

Re: Issue in text extraction in Solr / Tika

2011-08-19 Thread nirnaydewan
First of all, thanks again, Mike, for helping me out. Yes, I have seen that some text does get stripped out sometimes. Any idea why this could be happening? I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move to 0.9? If so, how? Also, I am storing this text only which I am

Preview of Rich Documents

2011-08-20 Thread nirnaydewan
Currently I am using Solr 3.3.0 to index rich documents such as MS Word and PDF files. I want to show the whole indexed text as a preview after a search finds a match in specific documents. For example, if I search for the word "marketing" and it is found in documents A

Re: Preview of Rich Documents

2011-08-22 Thread nirnaydewan
Thanks very much for your suggestion. But I believe the XHTML output is a one-time product of the extraction step. That means I have to store/index that XHTML output as well for later use. Is this correct, or am I missing something? Regards -- View this message in contex
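
For reference, a rough sketch of the kind of request under discussion here, not a confirmed answer to the question. It only builds the URL for an extract-only call to Solr's extracting handler, so that the XHTML could be fetched once and stored by the caller. The base URL and the helper name extract_only_url are made up for illustration; extractOnly and extractFormat are the handler parameters as I understand them from the Solr Cell documentation, and their availability in 3.3.0 should be checked.

```python
from urllib.parse import urlencode


def extract_only_url(solr_base, fmt="xml"):
    """Build a URL asking the extracting handler to return content without indexing it.

    Assumptions: the handler is mounted at /update/extract, and it accepts
    extractOnly / extractFormat ("xml" for XHTML, "text" for plain text).
    """
    if fmt not in ("xml", "text"):
        raise ValueError("fmt must be 'xml' or 'text'")
    params = {"extractOnly": "true", "extractFormat": fmt}
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance; adjust to the real host and port.
    print(extract_only_url("http://localhost:8983/solr"))
```

The document itself would still have to be POSTed to that URL, and the returned XHTML saved in a stored field (or elsewhere) if it is needed later for a preview.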

Re: Tika 0.9 integration in Solr 3.3.0

2011-08-22 Thread nirnaydewan
Please let me know how I can get rid of this exception. -- View this message in context: http://lucene.472066.n3.nabble.com/Tika-0-9-integration-in-Solr-3-3-0-tp3267799p3274463.html Sent from the Apache Tika - Development mailing list archive at Nabble.com.