Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

Gary Taylor Fri, 20 May 2011 08:15:38 -0700

Hello again. Unfortunately, I'm still getting nowhere with this. I have checked out the 3.1 source and applied Jayendra's patches (see below), and it still appears that the contents of the files in the zipfile are not being indexed, only the filenames of those contained files.


I'm using a simple CURL invocation to test this:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "commit=true" -F "file=@solr1.zip"

solr1.zip contains two simple txt files (doc1.txt and doc2.txt). I'm expecting the contents of those txt files to be extracted from the zip and indexed, but this isn't happening - or at least, I don't get the desired result when I do a query afterwards. I do get a match if I search for either "doc1.txt" or "doc2.txt", but not if I search for a word that appears in their contents.
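
For anyone wanting to reproduce this test case, here is a minimal sketch, assuming Python 3 with only the standard library. It builds an archive shaped like solr1.zip and assembles the same extract URL used in the CURL invocation. The sample file contents and the helper names (build_zip, zip_entries, extract_url) are hypothetical, since the original message does not say what the txt files contain. Nothing is sent to Solr here; the sketch only confirms that the archive really holds the text you intend to search for.

```python
import io
import zipfile
from urllib.parse import urlencode

# Hypothetical sample text: the real contents of doc1.txt and doc2.txt
# are not given in the message.
SAMPLE_FILES = {
    "doc1.txt": "alpha content for the first document\n",
    "doc2.txt": "bravo content for the second document\n",
}


def build_zip(files):
    """Return the bytes of a zip archive built from a name -> text mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def zip_entries(data):
    """Map each entry name in the archive to its decoded text."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


def extract_url(base, params):
    """Append the query parameters to the extract handler URL."""
    return base + "?" + urlencode(params)


archive = build_zip(SAMPLE_FILES)
url = extract_url(
    "http://localhost:8983/solr/core0/update/extract",
    {"literal.docid": "74", "fmap.content": "text", "literal.type": "5"},
)

print(url)
for name, text in zip_entries(archive).items():
    print(name, "->", text.strip())
```

Writing `archive` to a file named solr1.zip and posting it with the CURL invocation above gives a known input: a search for "alpha" or "bravo" in the "text" field should only match if the contents of the archived files were actually extracted.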

If I index one of the txt files (instead of the zipfile), I can query the content OK, so I'm assuming my query is sensible and matches the field specified on the CURL string (i.e. "text"). I'm also happy that the Solr Cell content extraction is working because I can successfully index PDF, Word, etc. files.

In a fit of desperation I have added log.info statements into the files referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those in the log when I submit the zipfile with CURL, so I know I'm running those patched files in the build.


If anyone can shed any light on what's happening here, I'd be very grateful.

Thanks and kind regards,
Gary.


On 11/04/2011 11:12, Gary Taylor wrote:

Jayendra,
Thanks for the info - been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try and take a look at the patches and re-test soon.
Joey - glad you brought this issue up again. I haven't progressed any further with it. I've not yet moved to Solr 3.1 but it's on my to-do list, as is testing out the patches referenced by Jayendra. I'll post my findings on this thread - if you manage to test the patches before me, let me know how you get on.
Thanks and kind regards,
Gary.


On 11/04/2011 05:02, Jayendra Patil wrote:
The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra
On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:
Hi Gary,
I have been experiencing the same problem... Unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this problem with Solr 1.4.1 or 3.1.0?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "application/octet-stream" -F "myfile=@data.zip"
No problem extracting single rich text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.



--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
