
Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-23 Thread Gary Taylor


Jayendra,

I cleared out my local repository, and replayed all of my steps from 
Friday and now it works.  The only difference (or the only one that's 
obvious to me) was that I applied the patch before doing a full 
compile/test/dist.  But I assumed that given I was seeing my new log 
entries (from ExtractingDocumentLoader.java) I was running the correct 
code anyway.


However, I'm very pleased that it's working now - I get the full 
contents of the zipped files indexed and not just the file names.


Thank you again for your assistance, and the patch!

Kind regards,
Gary.
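
For anyone replaying the same steps, the order described in this message (clean checkout, patch applied first, then a full compile/test/dist) could look roughly like the sketch below. It only prints the commands and runs nothing. The branch URL is the one quoted in this thread; the patch file path, the checkout directory name, the `cd solr` step and the `-p0` strip level (which suits patches made with `svn diff` from the checkout root) are assumptions, not details confirmed by the thread.

```shell
#!/bin/sh
# Dry-run sketch: prints the rebuild steps in order, executes nothing.
BRANCH_URL="http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/"
WORK_DIR="lucene_solr_3_1"                # hypothetical checkout directory
PATCH_FILE="/path/to/SOLR-2416.patch"     # hypothetical location of the JIRA attachment

print_rebuild_steps() {
    # 1. Start from a clean checkout of the 3.1 branch.
    printf 'svn checkout %s %s\n' "$BRANCH_URL" "$WORK_DIR"
    printf 'cd %s\n' "$WORK_DIR"
    # 2. Apply the patch BEFORE building anything (-p0 is an assumption).
    printf 'patch -p0 < %s\n' "$PATCH_FILE"
    # 3. Full build from the solr directory (assumed layout), using the
    #    compile/test/dist targets named in the message.
    printf 'cd solr\n'
    printf 'ant compile test dist\n'
}

print_rebuild_steps
```

Printing rather than executing keeps the sketch safe to paste; the printed lines can then be run one at a time.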



Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Gary Taylor

Hello again. Unfortunately, I'm still getting nowhere with this. I
have checked-out the 3.1 source and applied Jayendra's patches (see
below) and it still appears that the contents of the files in the
zipfile are not being indexed, only the filenames of those contained files.

I'm using a simple CURL invocation to test this:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" \
  -F commit=true -F file=@solr1.zip

solr1.zip contains two simple txt files (doc1.txt and doc2.txt). I'm
expecting the contents of those txt files to be extracted from the zip
and indexed, but this isn't happening - or at least, I don't get the
desired result when I do a query afterwards. I do get a match if I
search for either doc1.txt or doc2.txt, but not if I search for a
word that appears in their contents.
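
The two searches described here can be sketched as a pair of query URLs. This is a minimal illustration, assuming the standard `/select` handler and `q` parameter; the core URL and the `text` field are taken from the curl command in this message, and the word `fox` is taken from the doc1.txt contents shown elsewhere in the thread. The helper name is hypothetical.

```shell
#!/bin/sh
# Build a select URL that searches one field for one term.
# build_query_url is a hypothetical helper, not part of Solr.
build_query_url() {
    core_url=$1    # e.g. http://localhost:8983/solr/core0
    field=$2       # field the extracted content was mapped to
    term=$3        # single word to search for
    printf '%s/select?q=%s:%s\n' "$core_url" "$field" "$term"
}

CORE_URL="http://localhost:8983/solr/core0"

# Search by a contained file name (this matched in the thread).
by_name=$(build_query_url "$CORE_URL" text doc1.txt)
# Search by a word from inside doc1.txt (this did not match before the fix).
by_content=$(build_query_url "$CORE_URL" text fox)

# Print the commands instead of running them.
printf 'curl "%s"\n' "$by_name"
printf 'curl "%s"\n' "$by_content"
```

If the first query returns the document and the second does not, only the file names were indexed.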

If I index one of the txt files (instead of the zipfile), I can query
the content OK, so I'm assuming my query is sensible and matches the
field specified on the CURL string (ie. text). I'm also happy that
the Solr Cell content extraction is working because I can successfully
index PDF, Word, etc. files.

In a fit of desperation I have added log.info statements into the files
referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see
those in the log when I submit the zipfile with CURL, so I know I'm
running those patched files in the build.

If anyone can shed any light on what's happening here, I'd be very grateful.

Thanks and kind regards,
Gary.

On 11/04/2011 11:12, Gary Taylor wrote:

Jayendra,

Thanks for the info - been keeping an eye on this list in case this
topic cropped up again. It's currently a background task for me, so
I'll try and take a look at the patches and re-test soon.

Joey - glad you brought this issue up again. I haven't progressed any
further with it. I've not yet moved to Solr 3.1 but it's on my to-do
list, as is testing out the patches referenced by Jayendra. I'll post
my findings on this thread - if you manage to test the patches before
me, let me know how you get on.

Thanks and kind regards,
Gary.


--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Jayendra Patil

Hi Gary,

I tried the patch on the 3.1 source code (@
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
as well and it worked fine.
@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
with the Solr Cell module.

You may want to verify the contents from the results by enabling the
stored attribute on the text field.

e.g. URL:
curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated
artifact you can test on.

Regards,
Jayendra


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Gary Taylor

Jayendra,

Thanks for the info - been keeping an eye on this list in case this
topic cropped up again. It's currently a background task for me, so
I'll try and take a look at the patches and re-test soon.

Thanks and kind regards,
Gary.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Joey Hanzel

Awesome. Thanks Jayendra.  I hadn't caught these patches yet.

I applied SOLR-2416 patch to the solr-3.1 release tag. This resolved the
problem of archive files not being unpacked and indexed with Solr CELL.
Thanks for the FYI.
https://issues.apache.org/jira/browse/SOLR-2416


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-10 Thread Joey Hanzel

Hi Gary,

I have been experiencing the same problem... Unable to extract content from
archive file formats.  I just tried again with a clean install of Solr 3.1.0
(using Tika 0.8) and continue to experience the same results.  Did you have
any success with this problem with Solr 1.4.1 or 3.1.0 ?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
  -H application/octet-stream -F myfile=@data.zip

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-10 Thread Jayendra Patil

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-31 Thread Gary Taylor

Can anyone shed any light on this, and whether it could be a config 
issue?  I'm now using the latest SVN trunk, which includes the Tika 0.8 
jars.


When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) 
to the ExtractingRequestHandler, I get the following log entry 
(formatted for ease of reading) :


SolrInputDocument[
    {
    ignored_meta=ignored_meta(1.0)={
        [stream_source_info, file, stream_content_type,
        application/octet-stream, stream_size, 260, stream_name, solr1.zip,
        Content-Type, application/zip]
        },
    ignored_=ignored_(1.0)={
        [package-entry, package-entry]
        },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
    ignored_stream_size=ignored_stream_size(1.0)={260},
    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
    ignored_content_type=ignored_content_type(1.0)={application/zip},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={  doc2.txt    doc1.txt    }
    }
]

So, the data coming back from Tika when parsing a ZIP file does not 
include the file contents, only the names of the files contained 
therein.  I've tried forcing stream.type=application/zip in the CURL 
string, but that makes no difference.  If I specify an invalid 
stream.type then I get an exception response, so I know it's being used.


When I send one of those txt files individually to the 
ExtractingRequestHandler, I get:


SolrInputDocument[
    {
    ignored_meta=ignored_meta(1.0)={
        [stream_source_info, file, stream_content_type, text/plain,
        stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
        },
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
    ignored_stream_size=ignored_stream_size(1.0)={30},
    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={The quick brown fox  }
    }
]

and we see the file contents in the text field.

I'm using the following requestHandler definition in solrconfig.xml:

<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

Is there any further debug or diagnostic I can get out of Tika to help 
me work out why it's only returning the file names and not the file 
contents when parsing a ZIP file?


Thanks and kind regards,
Gary.



On 25/01/2011 16:48, Jayendra Patil wrote:

Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with sample url and works fine -
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"


You would probably need to drill down to the Tika Jars and
the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.

Regards,
Jayendra

Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor


Hi,

I posted a question in November last year about indexing content from 
multiple binary files into a single Solr document and Jayendra responded 
with a simple solution to zip them up and send that single file to Solr.


I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't 
currently allow this to work and only the file names of the zipped files 
are indexed (and not their contents).


I've tried downloading and building the latest Tika (0.8) and replacing 
the tika-parsers and tika-core JARs in 
<solr-root>\contrib\extraction\lib but this still isn't indexing the 
file contents, and now doesn't even index the file names!


Is there a version of Tika that works with the Solr 1.4.1 released 
distribution which does index the contents of the zipped files?


Thanks and kind regards,
Gary

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen


On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by "content" I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the old 
0.4 version of Tika, but you need a newer Tika version (0.8) in order to 
fetch the main content as well. So try the newest Solr version from trunk.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor


Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk 
code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no longer supplied as part of the 
build, so I get an exception because it can't find that class.  I've checked 
the CHANGES.txt and found the following in the change list to 1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does anyone 
have an example fieldType stanza (for schema.xml) for stripping out HTML?


Thanks and kind regards,
Gary.




Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF, etc.), as well as
text files, but it won't index the content of files inside a zip.

As an example, I have two txt files - doc1.txt and doc2.txt. If I index
either of them individually using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@doc1.txt"

and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F "file=@solr1.zip"

and commit, the file names are indexed, but not their contents.
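
To make the difference between "file names" and "contents" concrete, 
here is a small Python sketch. It is not Tika or Solr code and says 
nothing about how either of them processes archives. It only uses the 
standard zipfile module to build an archive like solr1.zip in memory and 
then reads back the entry names alone, and the entry names together with 
their text. The helper names and the sample text are made up for 
illustration.

```python
import io
import zipfile


def build_zip(files):
    """Return the bytes of a zip archive holding the given {name: text} files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def entry_names(zip_bytes):
    """List only the names of the entries, without reading their data."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


def entry_contents(zip_bytes):
    """Read every entry and return {name: decoded text}."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
        }


if __name__ == "__main__":
    # Sample text is invented; the real doc1.txt and doc2.txt are not shown here.
    data = build_zip({
        "doc1.txt": "sample text for doc1",
        "doc2.txt": "sample text for doc2",
    })
    print(entry_names(data))
    print(entry_contents(data))
```

An index built only from the first result would match searches for 
doc1.txt but not for words inside the files, which is the behaviour 
described above.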

I have checked that Tika can correctly process the zip file when used
standalone with the tika-app jar - it outputs both the filenames and
contents. Should I be able to index the contents of files stored in a
zip by using extract?

Thanks and kind regards,
Gary.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Jayendra Patil

Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with a sample URL and it works fine:
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
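
Since the separators between query parameters are easy to lose when a 
URL like this is copied around, here is a minimal Python sketch that 
builds it from a dictionary instead. It assumes the four parameters are 
stream.file, literal.id, literal.title and commit, joined with "&" as in 
any query string. The helper name extract_url is made up for 
illustration and is not a Solr API.

```python
from urllib.parse import urlencode


def extract_url(base, params):
    """Join base and params into one URL, separating parameters with '&'."""
    # safe="/:" keeps the Windows-style path readable instead of
    # percent-encoding its slashes and colon.
    return base + "?" + urlencode(params, safe="/:")


if __name__ == "__main__":
    url = extract_url(
        "http://localhost:8080/solr/core0/update/extract",
        {
            "stream.file": "C:/temp/extract/777045.zip",
            "literal.id": "777045",
            "literal.title": "Test",
            "commit": "true",
        },
    )
    print(url)
```

When the resulting URL is passed to curl on a command line, it should be 
wrapped in double quotes, because "&" has a special meaning in most 
shells.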

You would probably need to drill down to the Tika JARs and
the apache-solr-cell-4.0-dev.jar used for rich document indexing.

Regards,
Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor g...@inovem.com wrote: