[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031011#comment-13031011
 ] 

Liam O'Boyle commented on SOLR-2424:
------------------------------------

I am experiencing the same problem with another PDF, this one apparently 
created by "Adobe Acrobat 8.1 Combine Files" (or so says the metadata that 
Tika extracts).

Running the Tika app jar directly on the same file, by contrast, extracts 
the same terms with correct spacing.
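
For anyone wanting to make the same comparison, here is a rough sketch. The 
helper name longest_token is hypothetical, and the jar file name and the 
form field name "file" are assumptions to adjust to the local setup. The idea 
is only that text extracted without spaces collapses into a few very long 
"words", so the longest token length is a quick symptom check.

```shell
# Hypothetical helper: print the length of the longest whitespace-delimited
# token read from stdin. Run-together text yields an unusually large number.
longest_token() {
  tr -s '[:space:]' '\n' |
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Usage sketch (jar name and field name are assumptions, not from the report):
#   java -jar tika-app.jar --text "ET2000 Service Manual.pdf" | longest_token
#   curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" \
#        -F "file=@ET2000 Service Manual.pdf" | longest_token
```

The Solr response is JSON, so its number also counts metadata tokens; it is 
only meant as a coarse side-by-side check, not an exact measure.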

Metadata snippet follows, if it's of any help; the document in question was 
provided by a client so I cannot pass it on.

"ET2000 Service Manual.pdf_metadata":[
    "xmpTPg:NPages",["14"],
    "Creation-Date",["2011-02-25T04:07:28Z"],
    "title",["et2000 cover"],
    "stream_source_info",["tutorial"],
    "created",["Fri Feb 25 15:07:28 EST 2011"],
    "stream_content_type",["application/octet-stream"],
    "stream_size",["9295420"],
    "Last-Modified",["2011-02-25T04:07:28Z"],
    "producer",["Adobe Acrobat 8.1"],
    "stream_name",["ET2000 Service Manual.pdf"],
    "Content-Type",["application/pdf"],
    "creator",["Adobe Acrobat 8.1 Combine Files"]
]

> extracted text from tika has no spaces
> --------------------------------------
>
>                 Key: SOLR-2424
>                 URL: https://issues.apache.org/jira/browse/SOLR-2424
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1
>            Reporter: Yonik Seeley
>
> Try this:
> curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" \
>   -F "tutorial=@tutorial.pdf"
> And you get text output w/o spaces: 
> "ThisdocumentcoversthebasicsofrunningSolru"...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

