[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232611#comment-13232611 ]

Yonik Seeley commented on SOLR-2424:
------------------------------------

Yes, I just confirmed that the command given in the description no longer results in text w/o spaces.

> extracted text from tika has no spaces
> --------------------------------------
>
>                 Key: SOLR-2424
>                 URL: https://issues.apache.org/jira/browse/SOLR-2424
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1
>            Reporter: Yonik Seeley
>             Fix For: 3.5
>
>         Attachments: ET2000 Service Manual.pdf
>
>
> Try this:
>   curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" -F "tutorial=@tutorial.pdf"
> And you get text output w/o spaces:
>   "ThisdocumentcoversthebasicsofrunningSolru"...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231754#comment-13231754 ]

Jan Høydahl commented on SOLR-2424:
-----------------------------------

Can someone verify whether this issue is already fixed by the upgrade to Tika 0.9/1.0?
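A quick way to check for this symptom without reading the output by eye is to measure how much of the extracted text is whitespace. The following is only a minimal sketch: the function names, the 0.05 threshold, and the spaced sample string are illustrative assumptions and do not come from the issue. It is a heuristic that only makes sense for languages that separate words with spaces.

```python
def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in text that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_unspaced(text: str, min_ratio: float = 0.05) -> bool:
    """Heuristic: flag text whose whitespace share is below min_ratio.

    The 0.05 default is an assumed, illustrative threshold.
    """
    return whitespace_ratio(text) < min_ratio


# Sample from the issue description, and a hypothetical correctly spaced version.
broken = "ThisdocumentcoversthebasicsofrunningSolru"
fixed = "This document covers the basics of running Solr"

print(looks_unspaced(broken))  # True: the sample contains no whitespace at all
print(looks_unspaced(fixed))   # False: roughly 15% of the characters are spaces
```

The text passed to `looks_unspaced` would be whatever the extract request returns for the document; how to pull it out of the response is left open here.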
[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040969#comment-13040969 ]

Liam O'Boyle commented on SOLR-2424:
------------------------------------

Hi, sorry for the slow response; I don't seem to be receiving notifications of updates.

You are correct: I used the Tika 0.9 command-line tool, which worked correctly. When I tried the 0.8 version, the same problem occurred as described in this ticket, so it appears that the bug is in Tika and is already resolved in the 0.9 release.

I'll try to update the version of Tika used in my installation, although doing so has caused more problems than it has solved when I've tried it in the past.
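The comparison described above (same document, two tika-app versions) could be scripted roughly as follows. This is a sketch under assumptions: the jar file names and the document path are hypothetical, and it assumes the tika-app jar accepts the `--text` option for plain-text output, which should be checked against the help output of the version in use.

```python
import subprocess


def tika_text_command(jar_path: str, document_path: str) -> list[str]:
    """Build the command line for plain-text extraction with a tika-app jar.

    Assumes the jar supports the --text option.
    """
    return ["java", "-jar", jar_path, "--text", document_path]


def extract_text(jar_path: str, document_path: str) -> str:
    """Run tika-app and return its standard output as text.

    Requires a Java runtime and the jar to be present locally.
    """
    result = subprocess.run(
        tika_text_command(jar_path, document_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example usage with hypothetical paths (not executed here):
#   for jar in ("tika-app-0.8.jar", "tika-app-0.9.jar"):
#       print(jar, repr(extract_text(jar, "tutorial.pdf")[:80]))
```

Printing the first few dozen characters from each version side by side is usually enough to see whether the words run together.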
[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034886#comment-13034886 ]

Andrzej Bialecki commented on SOLR-2424:
----------------------------------------

Liam, what version of the command-line Tika app did you use for this test? Was it the exact same version as the one in Solr?
[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031011#comment-13031011 ]

Liam O'Boyle commented on SOLR-2424:
------------------------------------

I am experiencing the same problem with another PDF, this one apparently created by "Adobe Acrobat 8.1 Combine Files" (or so says the metadata that Tika extracts). Running the tika app jar instead correctly spaces all of the same terms.

Metadata snippet follows, if it's of any help; the document in question was provided by a client so I cannot pass it on.

  "ET2000 Service Manual.pdf_metadata":[
    "xmpTPg:NPages",["14"],
    "Creation-Date",["2011-02-25T04:07:28Z"],
    "title",["et2000 cover"],
    "stream_source_info",["tutorial"],
    "created",["Fri Feb 25 15:07:28 EST 2011"],
    "stream_content_type",["application/octet-stream"],
    "stream_size",["9295420"],
    "Last-Modified",["2011-02-25T04:07:28Z"],
    "producer",["Adobe Acrobat 8.1"],
    "stream_name",["ET2000 Service Manual.pdf"],
    "Content-Type",["application/pdf"],
    "creator",["Adobe Acrobat 8.1 Combine Files"]
  ]
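The metadata in the snippet above is a flat array that alternates field names with lists of values. For anyone wanting to look up fields such as `creator` or `producer` programmatically, a minimal sketch of turning that layout into a dictionary might look like this. The helper name `flat_pairs_to_dict` is hypothetical, and the JSON below is an abridged copy of the snippet wrapped in braces so that it parses on its own.

```python
import json


def flat_pairs_to_dict(flat):
    """Convert [name1, values1, name2, values2, ...] into {name: values}.

    If a name occurs more than once, the last occurrence wins.
    """
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of entries")
    return dict(zip(flat[0::2], flat[1::2]))


# Abridged copy of the metadata snippet from the comment.
snippet = """
{"ET2000 Service Manual.pdf_metadata": [
  "xmpTPg:NPages", ["14"],
  "producer", ["Adobe Acrobat 8.1"],
  "Content-Type", ["application/pdf"],
  "creator", ["Adobe Acrobat 8.1 Combine Files"]
]}
"""

response = json.loads(snippet)
metadata = flat_pairs_to_dict(response["ET2000 Service Manual.pdf_metadata"])
print(metadata["creator"][0])  # Adobe Acrobat 8.1 Combine Files
```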
[jira] Commented: (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006475#comment-13006475 ]

Eric Pugh commented on SOLR-2424:
---------------------------------

I went and updated and gave it a stab. I still get the class name repeated:

  "foo_txt":[ "page", "page", "page", "page", "page", "page", "page", "page", "page", "page", "page", " Thisdocument"..

But the underlying content in the tag is not included, except as a big block at the end. Are you seeing the same thing?
[jira] Commented: (SOLR-2424) extracted text from tika has no spaces
[ https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006102#comment-13006102 ]

Yonik Seeley commented on SOLR-2424:
------------------------------------

Note: this does not affect Solr 1.4.

I also tried it with a different PDF and didn't see the issue, so this could just be a bug with Tika 0.8 and Forrest-generated PDFs.