date:20150313


[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361268#comment-14361268
 ] 

Tyler Palsulich commented on TIKA-891:
--

I took a stab at adding the {{@POST}} annotation to each {{@PUT}} resource 
method, but it didn't work. It turns out that a single JAX-RS resource method 
cannot carry two HTTP method annotations. See [this 
blog|http://marxsoftware.blogspot.com/2010/02/playing-with-jerseyjax-rs-method.html#post-body-8986464064706467213]
 for an experiment. I couldn't find anything about this in the [official 
documentation|http://docs.oracle.com/cd/E19798-01/821-1841/giepu/index.html], 
but the blog and my own experience seem like reason enough.

So, now the real question: Since we can't support {{@PUT}} and {{@POST}}, do we 
want to leave everything the way it is, or switch everything to {{@POST}}?

I'm leaning toward leaving the resources the way they are. But, {{@POST}}, as 
discussed above, isn't necessarily wrong...
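
For illustration only: the usual way to accept both verbs without stacking two annotations on one method is to give each verb its own thin method and let both delegate to shared logic. The sketch below is hypothetical. The {{PUT}} and {{POST}} annotations are local stand-ins for the javax.ws.rs ones so that it compiles without a JAX-RS library, and {{EchoResource}} and {{process}} are made-up names, not tika-server code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class PutAndPostSketch {

    // Stand-ins for javax.ws.rs.PUT and javax.ws.rs.POST (assumption: used
    // here only so the sketch compiles without JAX-RS on the classpath).
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface PUT {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface POST {}

    /** Hypothetical resource: one method per verb, one shared implementation. */
    static final class EchoResource {

        @PUT
        public String put(String body) {
            return process(body);
        }

        @POST
        public String post(String body) {
            return process(body);
        }

        // Shared logic, so the two entry points cannot drift apart.
        private String process(String body) {
            return body.trim();
        }
    }
}
```

Whether the extra method per resource is worth it is a separate question from whether it is possible.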

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.9


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1106) CLAVIN Integration

[
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-1106:

Component/s: (was: general)
parser

CLAVIN Integration
--

Key: TIKA-1106
URL: https://issues.apache.org/jira/browse/TIKA-1106
Project: Tika
Issue Type: Wish
Components: parser
Affects Versions: 1.3
Environment: All
Reporter: Adam Estrada
Assignee: Chris A. Mattmann
Priority: Minor
Labels: entity, geospatial, new-parser
Fix For: 1.8

I've been evaluating CLAVIN as a way to extract location information from
unstructured text. It seems like meshing it with Tika in some way would make
a lot of sense. From CLAVIN website...
{quote}
CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source
software package for document geotagging and geoparsing that employs
context-based geographic entity resolution. It combines a variety of open
source tools with natural language processing techniques to extract location
names from unstructured text documents and resolve them against gazetteer
records. Importantly, CLAVIN does not simply look up location names;
rather, it uses intelligent heuristics in an attempt to identify precisely
which Springfield (for example) was intended by the author, based on the
context of the document. CLAVIN also employs fuzzy search to handle
incorrectly-spelled location names, and it recognizes alternative names
(e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic
entity. By enriching text documents with structured geo data, CLAVIN enables
hierarchical geospatial search and advanced geospatial analytics on
unstructured data.
{quote}
There was only one other instance of the word clavin mentioned in the ASF
jira site so I thought it was definitely worth posting here.
https://github.com/Berico-Technologies/CLAVIN


[jira] [Updated] (TIKA-1106) CLAVIN Integration

[
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-1106:

Issue Type: New Feature (was: Wish)

CLAVIN Integration
--

Key: TIKA-1106
URL: https://issues.apache.org/jira/browse/TIKA-1106
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.3
Environment: All
Reporter: Adam Estrada
Assignee: Chris A. Mattmann
Priority: Minor
Labels: entity, geospatial, new-parser
Fix For: 1.8


[jira] [Assigned] (TIKA-1106) CLAVIN Integration

[
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann reassigned TIKA-1106:
---

Assignee: Chris A. Mattmann

CLAVIN Integration
--

Key: TIKA-1106
URL: https://issues.apache.org/jira/browse/TIKA-1106
Project: Tika
Issue Type: Wish
Components: general
Affects Versions: 1.3
Environment: All
Reporter: Adam Estrada
Assignee: Chris A. Mattmann
Priority: Minor
Labels: entity, geospatial, new-parser
Fix For: 1.8


[jira] [Commented] (TIKA-1106) CLAVIN Integration

[
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361500#comment-14361500
]

Chris A. Mattmann commented on TIKA-1106:
-

Have multiple students and folks working on this right now. Going to also take
a look. [~aashish24] please also look.

CLAVIN Integration
--


[jira] [Closed] (TIKA-942) HTTP Accept header evaluator

[
https://issues.apache.org/jira/browse/TIKA-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tyler Palsulich closed TIKA-942.

Resolution: Won't Fix

Closing as Won't Fix, since the JAX-RS implementation automatically handles
this.

HTTP Accept header evaluator

Key: TIKA-942
URL: https://issues.apache.org/jira/browse/TIKA-942
Project: Tika
Issue Type: New Feature
Components: mime
Reporter: Jukka Zitting
Labels: HTTP

The HTTP Accept header
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) provides a flexible
mechanism for an HTTP client to express its preferences for different
response media types. Unfortunately processing Accept headers on the server
side is quite complicated because of the somewhat complicated syntax and the
possibility of media type inheritance relationships (can I respond with
application/xml if the client requests text/plain?).
The media type registry in Tika is perfect for resolving such cases, so I'd
like to introduce a new {{String resolveHttpAccept(String accept, String...
types)}} method in the Tika facade. The method would take the value of an
HTTP accept header and evaluate it against the given media types supported by
a server, using the configured media type registry for type inheritance
information. The method would then return the best match from among the given
media types, or {{application/octet-stream}} if none of the listed types
would be accepted by the client.
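
A rough sketch of what such a method could look like is given here, with loud caveats: it is not the proposed Tika implementation, the class name {{AcceptResolver}} is made up, and it deliberately ignores the media type inheritance that the registry would provide. It only handles q-values and the {{*/*}} and {{type/*}} wildcards, and it treats a missing Accept header as {{*/*}}.

```java
import java.util.Locale;

/** Simplified Accept-header matcher; ignores media type inheritance. */
final class AcceptResolver {

    static final String FALLBACK = "application/octet-stream";

    /**
     * Returns the supported type with the highest q-value in the Accept
     * header, or FALLBACK if none of the supported types is acceptable.
     * Ties go to the type listed first in {@code types}.
     */
    static String resolveHttpAccept(String accept, String... types) {
        if (accept == null || accept.trim().isEmpty()) {
            accept = "*/*"; // assumption: no header means anything goes
        }
        String best = FALLBACK;
        double bestQ = 0.0;
        for (String type : types) {
            String wanted = type.toLowerCase(Locale.ROOT);
            int bestSpecificity = -1;
            double q = 0.0;
            // The most specific matching range decides this type's q-value.
            for (String element : accept.split(",")) {
                String[] parts = element.trim().split(";");
                String range = parts[0].trim().toLowerCase(Locale.ROOT);
                int s = specificity(range, wanted);
                if (s > bestSpecificity) {
                    bestSpecificity = s;
                    q = quality(parts);
                }
            }
            if (q > bestQ) {
                best = type;
                bestQ = q;
            }
        }
        return best;
    }

    /** 2 = exact match, 1 = "type/*", 0 = wildcard for everything, -1 = no match. */
    private static int specificity(String range, String type) {
        if (range.equals(type)) return 2;
        if (range.equals("*/*")) return 0;
        if (range.endsWith("/*")
                && type.startsWith(range.substring(0, range.length() - 1))) {
            return 1;
        }
        return -1;
    }

    /** Reads the q parameter; defaults to 1.0 when absent. */
    private static double quality(String[] parts) {
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim().toLowerCase(Locale.ROOT);
            if (p.startsWith("q=")) {
                try {
                    return Double.parseDouble(p.substring(2).trim());
                } catch (NumberFormatException e) {
                    return 0.0; // treat a malformed q-value as not acceptable
                }
            }
        }
        return 1.0;
    }
}
```

The part this sketch leaves out is exactly where the registry would come in: deciding whether, say, application/xml is an acceptable answer to a request for text/plain.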


[jira] [Resolved] (TIKA-651) Unescaped attribute value generated


 [ 
https://issues.apache.org/jira/browse/TIKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-651.
--
Resolution: Fixed

Marking as fixed per the above comment. And, it's been a week and a half with 
no objection.

 Unescaped attribute value generated
 ---

 Key: TIKA-651
 URL: https://issues.apache.org/jira/browse/TIKA-651
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
Reporter: Raimund Merkert
Assignee: Jukka Zitting
 Attachments: XHTMLSerializer.java


 I've converted a word document that contains hyperlinks with a complex query 
 component. The  character is not escaped and mozilla complains about that 
 when I write out the XHTML via a content handler that I wrote.
 It's not clear to me whether or not my contenthandler should assume 
 attributes are properly escaped or not.




[jira] [Commented] (TIKA-1095) Only gibberish extracted from this PDF


[ 
https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361451#comment-14361451
 ] 

Tyler Palsulich commented on TIKA-1095:
---

Just commented on PDFBOX-2451. Still have this issue with PDFBox 
1.8.9-SNAPSHOT, so we still have it in Tika.

 Only gibberish extracted from this PDF
 --

 Key: TIKA-1095
 URL: https://issues.apache.org/jira/browse/TIKA-1095
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Probably any
Reporter: Bas van Meurs
  Labels: pdfbox
 Attachments: ALG 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks 
 bestuur d d  10 februari 2010.pdf, test.txt


 java -jar /usr/share/tika/tika-app-1.3.jar -t 
 /home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 
 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks bestuur d d  10 februari 
 2010.pdf  /tmp/test.txt
 This produces all gibberish.




[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser


[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361461#comment-14361461
 ] 

Tyler Palsulich commented on TIKA-1098:
---

Tika still can't parse this file. I tried with PDFBox 1.8.9 SNAPSHOT, but hit 
the following exception:
{code}
➜  trunk  java -jar ~/Downloads/pdfbox.jar ExtractText ~/Downloads/test.pdf
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2390 is wrong. Fall back to reading stream until 'endstream'.
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference
ExtractText failed with the following exception:
java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 364863
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1362)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:249)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:356)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1264)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:641)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1129)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
{code}

Does anyone recognize this error? Or, should I open a new issue with PDFBox?

 not able to parse pdfs/docs/ppts using 1.1 tika parser
 

 Key: TIKA-1098
 URL: https://issues.apache.org/jira/browse/TIKA-1098
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: linux redhat
Reporter: Qian Diao
 Attachments: url_1763_approx-alg-notes.pdf


 Hi,
 I got some parsing problems when using Tika 1.1 for the attached pdf file.
 my code (Test.java):
 import java.io.File;
 import java.io.InputStream;
 import java.io.FileInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.html.BoilerpipeContentHandler;
 import org.apache.tika.sax.BodyContentHandler;
 import org.apache.tika.parser.html.HtmlParser;
 import de.l3s.boilerpipe.extractors.ArticleExtractor;
 public class Test {
     private static final String validBoilerpipeFilenameRegEx =
             ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
     public String parseFile(File inFile) {
         if (inFile == null || !inFile.isFile() || !inFile.canRead())
             return null;

         InputStream is = null;
         String outputText = "";
         try {
             // Open input stream
             is = new FileInputStream(inFile);
             // Prepare parser; -1 disables the handler's write limit
             BodyContentHandler contenthandler = new BodyContentHandler(-1);
             Metadata metadata = new Metadata();
             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
             ParseContext pc = new ParseContext();
             // Call parse with boilerpipe if valid boilerpipe extension;
             // otherwise, call regular parse.
             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                 Parser parser = new AutoDetectParser();
                 parser.parse(is, contenthandler, metadata, pc);
             }
             else {
                 Parser parser = new HtmlParser();
                 BoilerpipeContentHandler bh = new
                         BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                 parser.parse(is, bh, metadata, pc);
             }
             // Prepare text for write
             outputText = contenthandler.toString();
         } catch (Exception e) {
             System.out.println(e);
             return null;
         } finally {
             try {
                 if (is != null)
                     is.close();
             } catch (Exception e) {
                 // ignore failures while closing the stream
             }
         }

         return outputText;
     }
 }
 =output

[jira] [Updated] (TIKA-1106) CLAVIN Integration


 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1106:
--
Labels: entity geospatial new-parser  (was: entity geospatial)

 CLAVIN Integration
 --

 Key: TIKA-1106
 URL: https://issues.apache.org/jira/browse/TIKA-1106
 Project: Tika
  Issue Type: Wish
  Components: general
Affects Versions: 1.3
 Environment: All
Reporter: Adam Estrada
Priority: Minor
  Labels: entity, geospatial, new-parser
 Fix For: 1.8






[jira] [Commented] (TIKA-1069) Tika TXTParser requires (writeLimit + 1) because of paragrah wrapping


[ 
https://issues.apache.org/jira/browse/TIKA-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361344#comment-14361344
 ] 

Tyler Palsulich commented on TIKA-1069:
---

Reproduced this with 1.8-SNAPSHOT. If a text file has {{writeLimit - 1}} 
characters, it's fine. But, if it has exactly {{writeLimit}} characters, an 
exception is thrown which says the document contained more than {{writeLimit}} 
characters, which isn't true. Thoughts?

 Tika TXTParser requires (writeLimit + 1) because of paragrah wrapping
 -

 Key: TIKA-1069
 URL: https://issues.apache.org/jira/browse/TIKA-1069
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Horia Chiorean

 When using a configured {{writeLimit}} with {{BodyContentHandler}}, together 
 with the {{TXTParser}}, the latter always throws a 
 {{WriteLimitReachedException}} from the {{ignorableWhitespace}} method, when 
 parsing a text which has the {{writeLimit}} length.
 From what I can tell so far, this is caused by the fact that the 
 {{TXTParser}} wraps the text in a p element.
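
 The off-by-one described above can be reproduced in a toy model. This is only an illustration under stated assumptions, not Tika's code: {{LimitedSink}}, {{LimitReachedException}} and {{fits}} are hypothetical names, and the model simply assumes that the paragraph wrapper emits one newline of ignorable whitespace after the body and that this newline is counted against the limit.

```java
/** Toy model of a write-limited handler; not Tika's actual implementation. */
final class WriteLimitDemo {

    /** Thrown when a write would exceed the configured limit. */
    static final class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("document contained more than " + limit + " characters");
        }
    }

    /** Counts characters and refuses to go past writeLimit. */
    static final class LimitedSink {
        private final int writeLimit;
        private final boolean countIgnorableWhitespace;
        private int written = 0;

        LimitedSink(int writeLimit, boolean countIgnorableWhitespace) {
            this.writeLimit = writeLimit;
            this.countIgnorableWhitespace = countIgnorableWhitespace;
        }

        void characters(String text) {
            write(text.length());
        }

        void ignorableWhitespace(String whitespace) {
            if (countIgnorableWhitespace) {
                write(whitespace.length());
            }
        }

        private void write(int length) {
            if (written + length > writeLimit) {
                throw new LimitReachedException(writeLimit);
            }
            written += length;
        }
    }

    /** Emits the body followed by one trailing newline, as a wrapper might. */
    static boolean fits(String text, int writeLimit, boolean countIgnorableWhitespace) {
        LimitedSink sink = new LimitedSink(writeLimit, countIgnorableWhitespace);
        try {
            sink.characters(text);
            sink.ignorableWhitespace("\n");
            return true;
        } catch (LimitReachedException e) {
            return false;
        }
    }
}
```

 In this model a body of {{writeLimit - 1}} characters fits and a body of exactly {{writeLimit}} characters does not, unless the trailing whitespace is left out of the count.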




[jira] [Resolved] (TIKA-1071) Some spaces are omitted when merging two lines of a paragraph


 [ 
https://issues.apache.org/jira/browse/TIKA-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1071.
---
Resolution: Fixed

Closing as fixed:

{code}
tika http://www.amf-france.org/documents/general/8091_1.pdf | grep commeinclu
{code}

has no output.

 Some spaces are omitted when merging two lines of a paragraph
 -

 Key: TIKA-1071
 URL: https://issues.apache.org/jira/browse/TIKA-1071
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Guillaume Vauvert
 Attachments: RG_AMF_Livre_IV_8091_1.pdf


 Tika 1.3 sometimes appends two successive lines without inserting a space, 
 while Tika 1.2 does not make this error.
 Document :
 http://www.amf-france.org/documents/general/8091_1.pdf
 With Tika 1.3 :
 Page 2, Paragraph:4° La référence aux « membres du conseil d'administration 
 ou directoire de la SICAV » doit s'entendre commeincluant, le cas échéant, le 
 président de la société par actions simplifiée ou celui ou ceux de ses 
 dirigeants que lesstatuts désignent pour exercer les attributions du conseil 
 d'administration conformément aux dispositions de l'articleL. 227-1 du code 
 de commerce. 
 commeincluant should be comme incluant
 With Tika 1.2 :
 Page 2, Paragraph:4° La référence aux « membres du conseil d'administration 
 ou directoire de la SICAV » doit s'entendre comme incluant, le cas échéant, 
 le président de la société par actions simplifiée ou celui ou ceux de ses 
 dirigeants que les statuts désignent pour exercer les attributions du conseil 
 d'administration conformément aux dispositions de l'article L. 227-1 du code 
 de commerce.  




Re: Parser test resources

2015-03-13 Thread Tyler Palsulich

Good points. Maybe it's a good idea to keep the new files organized, like
chm, but leave the old ones where they are? The test-documents directory
has 460 entries right now.

Tyler

On Wed, Mar 11, 2015 at 8:43 AM, Nick Burch apa...@gagravarr.org wrote:

 On Tue, 10 Mar 2015, Tyler Palsulich wrote:

 Or, do enough parsers have overlapping test resource dependencies where
 it makes sense to have them _all_ under one directory?


 I believe that most of the test files get used for both detection and
 parsing unit tests

  It would be nice to easily know which files are used for which tests.


 5 lines of perl should give you that, or fewer if you don't want to be
 able to understand the perl... ;-)

 Many, but not all of the test files are of the form testfiletype.ext
 or testfiletype_special type/description.ext, which I find makes it
 fairly easy to spot what files go with what. Not all though. Would fixing
 the few files not in that format help, or hinder do you think?

 Nick

[jira] [Commented] (TIKA-1088) Unsupported AutoCAD drawing version: AC1009


[ 
https://issues.apache.org/jira/browse/TIKA-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361432#comment-14361432
 ] 

Tyler Palsulich commented on TIKA-1088:
---

Related to TIKA-1045, support for another AutoCAD version.

 Unsupported AutoCAD drawing version: AC1009
 ---

 Key: TIKA-1088
 URL: https://issues.apache.org/jira/browse/TIKA-1088
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Hardik Upadhyay
  Labels: new-parser
 Attachments: 227051.dwg


 Tika parser versions 1.2 and 1.3 fail to parse DWG files of version AC1009.




[jira] [Commented] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder


[ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361277#comment-14361277
 ] 

Tyler Palsulich commented on TIKA-1059:
---

Is this feature still worth implementing?

 Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
 --

 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.8


 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
 {{InterruptedException}} and ignore it.
 The methods should either call {{interrupt()}} on the current thread or 
 re-throw the exception, possibly wrapped in a {{TikaException}}.
 See TIKA-775 for a previous discussion.
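
 The two options named above might look roughly like this. It is a sketch with hypothetical names ({{BlockingCall}}, {{WrappedException}}) rather than the actual {{ExternalParser}} code; {{WrappedException}} stands in for a checked wrapper such as {{TikaException}}.

```java
/** Two ways to avoid silently swallowing InterruptedException. */
final class InterruptHandling {

    /** Any blocking operation, e.g. waiting for an external process. */
    interface BlockingCall {
        void run() throws InterruptedException;
    }

    /** Hypothetical stand-in for a checked wrapper such as TikaException. */
    static final class WrappedException extends Exception {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Option 1: re-assert the interrupt flag so callers can still see it. */
    static void runRestoringFlag(BlockingCall call) {
        try {
            call.run();
        } catch (InterruptedException e) {
            // Catching the exception cleared the flag; set it again.
            Thread.currentThread().interrupt();
        }
    }

    /** Option 2: re-throw, wrapped in a checked exception. */
    static void runRethrowing(BlockingCall call) throws WrappedException {
        try {
            call.run();
        } catch (InterruptedException e) {
            throw new WrappedException("interrupted while waiting", e);
        }
    }
}
```

 Either way, the caller learns that the thread was asked to stop, which is the information the current catch-and-ignore behaviour loses.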




[jira] [Commented] (TIKA-1063) OpenDocument basic style support

2015-03-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361334#comment-14361334
 ] 

Hudson commented on TIKA-1063:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #547 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/547/])
TIKA-1063. Add basic ODF style support, contributed by Axel Dörfler. 
(tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=107)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java


 OpenDocument basic style support
 

 Key: TIKA-1063
 URL: https://issues.apache.org/jira/browse/TIKA-1063
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Axel Dörfler
Priority: Minor
 Attachments: testStyles.odt, tika-opendocument-styles.patch


 I've added basic support for list and text styles. Paragraph styles are 
 omitted on purpose -- one could use the style names as class names, though.
 Only bold, italic, and underlined text is supported.
 Lists now differentiate between ordered and unordered lists.
 Test case included. I've also changed the ODFParserTest to make a bit more 
 use of the methods of its super class.




[jira] [Commented] (TIKA-1063) OpenDocument basic style support

2015-03-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361422#comment-14361422
 ] 

Hudson commented on TIKA-1063:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #548 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/548/])
TIKA-1063. Add ODF style test resource file. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=revrev=110)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testStyles.odt


 OpenDocument basic style support
 

 Key: TIKA-1063
 URL: https://issues.apache.org/jira/browse/TIKA-1063
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Axel Dörfler
Priority: Minor
 Attachments: testStyles.odt, tika-opendocument-styles.patch






[jira] [Resolved] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries


 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1079.
---
Resolution: Fixed

There is no more exception with Tika 1.8-SNAPSHOT. So, I'm marking this as 
fixed.

 Word document hits AIOOBE in SummaryExtractor.parseSummaries
 

 Key: TIKA-1079
 URL: https://issues.apache.org/jira/browse/TIKA-1079
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc


 I'm not yet sure if this is a corrupted document (though, MS Word opens it 
 just fine) or a bug in POI ... but I hit this exc when running it through 
 TikaCLI:
 {noformat}
 java.lang.ArrayIndexOutOfBoundsException: -1
   at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161)
   at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
   at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
   at org.apache.poi.hpsf.Property.init(Property.java:164)
   at org.apache.poi.hpsf.Section.init(Section.java:277)
   at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
   at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246)
   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 {noformat}




[jira] [Commented] (TIKA-1077) JAXRS Server meta-resource should be able to return metadata in JSON or XML


[ 
https://issues.apache.org/jira/browse/TIKA-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361423#comment-14361423
 ] 

Tyler Palsulich commented on TIKA-1077:
---

Related to TIKA-944.

 JAXRS Server meta-resource should be able to return metadata in JSON or XML
 ---

 Key: TIKA-1077
 URL: https://issues.apache.org/jira/browse/TIKA-1077
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.2, 1.3
 Environment: (all)
Reporter: Aaron Weber
Priority: Minor
  Labels: JAXRS, json, metadata, service, xml

 The Tika JAXRS /meta resource currently returns the metadata (properties) of 
 the PUT file as plain text. Based on the HTTP request's Accept header, this 
 resource should return the metadata in the current format (the default), as 
 JSON, or as XML.
 The resource should set the response header to match.




[jira] [Created] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

Tim Allison created TIKA-1575:
-

 Summary: Upgrade to PDFBox 1.8.9 when available
 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


The PDFBox community is about to release 1.8.9.  Let's use this issue to track 
discussions before the release and to track Tika's upgrade to PDFBox 1.8.9




[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available


 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx

[~tilman], thank you, again, for pinging me on the impending release of PDFBox 
1.8.9.  And, also thanks to you, I've turned on the AccessChecker, so you 
shouldn't see any content from files that don't allow extraction.

I ran the most recent eval code against all files that end in a pdf extension 
in govdocs1.

I've included in the xlsx file all files with some kind of an exception or with 
any difference in attachment counts, metadata value counts, lang id or content.

I've also included an example of a static dump of reports from the comparison 
database.  More work remains on that...

I haven't had a chance to join in your earlier comments from our work on the 
1.8.8 release.  Many apologies!

My quick impression:
1) no differences in attachments
2) no differences in metadata values
3) 1.8.9 fixed 3 null pointer exceptions, no new exceptions
4) Content wise:
  a) with 1.8.9 we're getting less form field info (looks like internal 
field names? More digging is required...)
  b) might be actual modest regressions with 
147/147012.pdf
223/223704.pdf


 Upgrade to PDFBox 1.8.9 when available
 --

 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Attachments: PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip


 The PDFBox community is about to release 1.8.9.  Let's use this issue to 
 track discussions before the release and to track Tika's upgrade to PDFBox 
 1.8.9




[jira] [Comment Edited] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available


[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360492#comment-14360492
 ] 

Tim Allison edited comment on TIKA-1575 at 3/13/15 3:47 PM:


Form clutter...This was embedded inside 776568.

With PDFBox 1.8.8, we extracted the keys for the subform (but there was no 
meaningful content in this doc):
{noformat}Briefings\n\nNo\n\n   NWSI 10-814  November 10, 2008\n\n
19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: 
\n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: 
\n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: 
\n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: 
\n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: 
\n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: 
\n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: 
\n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: 
\n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: 
\n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: 
\n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: 
\n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: 
\n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: 
\n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: 
\n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: 
\n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: 
\n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: 
\n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: 
\n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: 
\n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: 
\n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: 
\n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: 
\n\tCheckBox6[11]: \n\n\n\n\n,{noformat}

In 1.8.9, there's just this:
{noformat}
Briefings\n\nNo\n\n   NWSI 10-814  November 10, 2008\n\n 19\n\n\n\tform1[0]: 
\n\n\n\n
{noformat}


There's no difference with PDFBox app's ExtractText between 1.8.8 and 1.8.9 on 
this file.




 Upgrade to PDFBox 1.8.9 when available
 --

 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Attachments: 10-814_Appendix B_v3.pdf, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip


 The

[jira] [Commented] (TIKA-682) Creative Suite formats support

2015-03-13 Thread Christopher Dedels (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360487#comment-14360487
 ] 

Christopher Dedels commented on TIKA-682:
-

Yes, that magic matches all the InDesign files I looked at.  Thanks for the 
info!

 Creative Suite formats support
 --

 Key: TIKA-682
 URL: https://issues.apache.org/jira/browse/TIKA-682
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.8
Reporter: Vivian Li
  Labels: new-parser
 Attachments: Untitled-1.indd, myfile.psd, myfile.xmp


 Is it possible to support Creative Suite formats, such as PSD, InDesign, 
 etc.? 




[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available