[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces

2012-03-19 Thread Yonik Seeley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232611#comment-13232611
 ] 

Yonik Seeley commented on SOLR-2424:


Yes, I just confirmed that the command given in the description no longer 
results in text w/o spaces.

> extracted text from tika has no spaces
> --
>
> Key: SOLR-2424
> URL: https://issues.apache.org/jira/browse/SOLR-2424
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 3.1
>Reporter: Yonik Seeley
> Fix For: 3.5
>
> Attachments: ET2000 Service Manual.pdf
>
>
> Try this:
> curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" -F "tutorial=@tutorial.pdf"
> And you get text output w/o spaces: 
> "ThisdocumentcoversthebasicsofrunningSolru"...
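A rough way to spot this symptom automatically is to check how much whitespace the extracted text contains. The following is only a sketch: the function name `looks_unspaced` and both thresholds are assumptions chosen for illustration, not part of Solr or Tika.

```python
def looks_unspaced(text: str, min_ratio: float = 0.05, min_length: int = 20) -> bool:
    """Return True if text is long enough to judge and has very little whitespace.

    min_ratio and min_length are arbitrary illustrative thresholds.
    """
    if len(text) < min_length:
        # Too short to say anything meaningful.
        return False
    whitespace = sum(1 for ch in text if ch.isspace())
    return whitespace / len(text) < min_ratio


# The sample string reported in this issue, and a normally spaced sentence.
broken = "ThisdocumentcoversthebasicsofrunningSolru"
spaced = "This document covers the basics of running Solr"

print(looks_unspaced(broken))
print(looks_unspaced(spaced))
```

A check like this could be run over the extractOnly output before indexing; the threshold would likely need tuning for languages that do not separate words with spaces.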

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces

2012-03-16 Thread Jan Høydahl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231754#comment-13231754
 ] 

Jan Høydahl commented on SOLR-2424:
---

Can someone verify whether this issue is already fixed by the upgrade to 
Tika 0.9/1.0?




[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces

2011-05-29 Thread Liam O'Boyle (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040969#comment-13040969
 ] 

Liam O'Boyle commented on SOLR-2424:


Hi, sorry for the slow response; I don't seem to be receiving notifications of 
updates.

You are correct: I used the Tika 0.9 command-line tool, which worked correctly. 
When I tried the 0.8 version, the same problem occurred as described in this 
ticket, so it appears that the bug is in Tika and is already resolved in the 
0.9 release.

I'll try to update the version of Tika used in my installation, although 
upgrading has caused more problems than it has solved when I've tried it in 
the past.




[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces

2011-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034886#comment-13034886
 ] 

Andrzej Bialecki  commented on SOLR-2424:
-

Liam, what version of the command-line Tika app did you use for this test? Was 
it the exact same version as the one in Solr?




[jira] [Commented] (SOLR-2424) extracted text from tika has no spaces

2011-05-09 Thread Liam O'Boyle (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031011#comment-13031011
 ] 

Liam O'Boyle commented on SOLR-2424:


I am experiencing the same problem with another PDF, this one apparently 
created by "Adobe Acrobat 8.1 Combine Files" (or so says the metadata that Tika 
extracts).

Running the Tika app jar on the same file instead spaces all of the same terms 
correctly.

Metadata snippet follows, in case it's of any help; the document in question 
was provided by a client, so I cannot pass it on.

"ET2000 Service Manual.pdf_metadata":[
"xmpTPg:NPages",["14"],
"Creation-Date",["2011-02-25T04:07:28Z"],
"title",["et2000 cover"],
"stream_source_info",["tutorial"],
"created",["Fri Feb 25 15:07:28 EST 2011"],
"stream_content_type",["application/octet-stream"],
"stream_size",["9295420"],
"Last-Modified",["2011-02-25T04:07:28Z"],
"producer",["Adobe Acrobat 8.1"],
"stream_name",["ET2000 Service Manual.pdf"],
"Content-Type",["application/pdf"],
"creator",["Adobe Acrobat 8.1 Combine Files"]
]
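
The metadata above is printed as a flat list that alternates a name with a list of values. A minimal sketch of turning such a list into a lookup table follows; the helper name `pairs_to_dict` is hypothetical, and the sample input is just three of the pairs from the snippet above.

```python
import json


def pairs_to_dict(flat):
    """Turn [name1, values1, name2, values2, ...] into {name1: values1, ...}.

    If a name appears more than once, the last occurrence wins.
    """
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items (name/value pairs)")
    return dict(zip(flat[0::2], flat[1::2]))


# Three of the name/value pairs from the metadata snippet, as a JSON array.
snippet = (
    '["producer", ["Adobe Acrobat 8.1"], '
    '"creator", ["Adobe Acrobat 8.1 Combine Files"], '
    '"Content-Type", ["application/pdf"]]'
)

metadata = pairs_to_dict(json.loads(snippet))
print(metadata["creator"][0])
```

Looking up "creator" or "producer" this way makes it easier to group problem PDFs by the tool that generated them.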




[jira] Commented: (SOLR-2424) extracted text from tika has no spaces

2011-03-14 Thread Eric Pugh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006475#comment-13006475
 ] 

Eric Pugh commented on SOLR-2424:
-

I updated and gave it a stab; I still get the class name repeated:

"foo_txt":[
  "page",
  "page",
  "page",
  "page",
  "page",
  "page",
  "page",
  "page",
  "page",
  "page",
  "page",
  "   Thisdocument"..


But without the underlying content in the  tag being included except as a 
big block at the end.  Are you seeing the same thing?




[jira] Commented: (SOLR-2424) extracted text from tika has no spaces

2011-03-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006102#comment-13006102
 ] 

Yonik Seeley commented on SOLR-2424:


Note: this does not affect Solr 1.4.
I also tried it with a different PDF and didn't see the issue, so this could 
just be a bug with Tika 0.8 and Forrest-generated PDFs.
