[jira] [Created] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character

2014-09-25 Thread JIRA
Theodor Sjöstedt created TIKA-1428:
--

 Summary: Microsoft Word 97 - 2003 (.doc) footnote references are 
Unicode Replacement Character
 Key: TIKA-1428
 URL: https://issues.apache.org/jira/browse/TIKA-1428
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.6, 1.4
Reporter: Theodor Sjöstedt
Priority: Minor


Footnotes from {{.doc}} documents are extracted, but the references to the 
footnotes are replaced by the Unicode Replacement Character (�).

I have tried this in 1.4 and 1.6.

In 1.4, both the reference in the text and the reference at the footnote are replaced.
In 1.6, the reference in the text has disappeared completely.
See attached image for original document, 1.4 Formatted text, and 1.6 Formatted 
text.
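
A regression test for this report could check the extracted text for U+FFFD. The following is a minimal sketch under stated assumptions: it works on text that has already been extracted (by Tika or anything else), and the class and method names are hypothetical and not part of Tika.

```java
// Sketch only: ReplacementCharCheck is a hypothetical helper, not a Tika class.
// It counts Unicode Replacement Characters (U+FFFD) in already-extracted text.
final class ReplacementCharCheck {
    static final char REPLACEMENT = '\uFFFD';

    static int countReplacementChars(String extracted) {
        int count = 0;
        for (int i = 0; i < extracted.length(); i++) {
            if (extracted.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample resembling the 1.4 behaviour described above:
        // one replaced reference in the body and one at the footnote.
        String sample = "Body text\uFFFD continues.\n\uFFFD The footnote text.";
        System.out.println(countReplacementChars(sample));
    }
}
```

A test case attached to the issue could assert that this count is zero for the text extracted from a sample .doc file with footnotes.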



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character

2014-09-25 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Theodor Sjöstedt updated TIKA-1428:
---
Attachment: TIKA-doc-footnotes-issue.png

Original document on the left.
Tika 1.4 in the center.
Tika 1.6 on the right.



[jira] [Commented] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character

2014-09-25 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147880#comment-14147880
 ] 

Hong-Thai Nguyen commented on TIKA-1428:


Thanks [~theoettheo], any chance of getting a patch with a test case for this 
problem?



[jira] [Updated] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1330:
--
Attachment: TIKA-1330v1-patch.zip

This is the first version of tika-batch.  Much cleanup remains.

This first patch is intended to start the framework and offer concrete classes 
for filesystem (FS) handling: a single input directory and a single output 
directory.

I've included the patch against trunk, example log4j XML files, an example 
batch-config file, and two sh scripts to kick off the two different processes.

Any and all feedback is welcomed!

 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.
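
One way to picture the "robust against hangs" requirement is a per-document timeout. The following is a minimal sketch using only java.util.concurrent; it is not taken from the attached patch, and the class name TimeoutGuard and its method are hypothetical. An in-JVM guard like this cannot reliably recover from an OutOfMemoryError or stop a thread that ignores interruption, so running the work in a separate process is the safer choice when that matters.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: TimeoutGuard is a hypothetical helper, not part of tika-batch.
final class TimeoutGuard {

    /** Runs the task and returns its result, or fallback if it fails or exceeds timeoutMillis. */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption; a truly hung task may ignore it
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap each document's parse in a Callable and log the file name whenever the fallback value comes back.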





[jira] [Comment Edited] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121454#comment-14121454
 ] 

Tim Allison edited comment on TIKA-1330 at 9/25/14 4:18 PM:


Started documentation on the 
[wiki|https://wiki.apache.org/tika/TikaBatchOverview].  Any and all feedback is 
welcomed.

Will post patch to rb (if possible) or to this issue some time next week.



was (Author: talli...@mitre.org):
Started documentation on the [wiki|https://wiki.apache.org/tika/TikaBatch].  
Any and all feedback is welcomed.

Will post patch to rb (if possible) or to this issue some time next week.




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147922#comment-14147922
 ] 

Tim Allison commented on TIKA-1330:
---

[~tilman], I leave it as an exercise to implement a FileResourceConsumer that 
uses pure PDFBox. ;) 

Seriously, though, I plan to add something like that in the tika examples 
module (at some point down the road), and all feedback is welcome.



Re: Tika at ApacheCon Europe - 2 months time!

2014-09-25 Thread David Meikle
Hey Nick,

On 22 Sep 2014, at 23:21, Nick Burch n...@apache.org wrote:

 It's only 2 months to go until ApacheCon Europe in Budapest. I'm 
 simultaneously excited by all the great Tika stuff going on, and worried by 
 how many talks I need to finish writing...
 
 As usual for an ApacheCon, we have a number of talks about Tika going on, and 
 almost certainly a hackathon and/or meetup one evening. There are also lots of 
 related talks, covering technologies that Tika builds on, and ones you 
 can use Tika with. For a full schedule, see:
 http://events.linuxfoundation.org/events/apachecon-europe/program/schedule

All signed up for ApacheCon EU this year, so looking forward to the talks and 
up for a Tika hackathon.

See you then,
Dave

[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-09-25 Thread Vineet Ghatge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148398#comment-14148398
 ] 

Vineet Ghatge commented on TIKA-1423:
-

Pulling the data and the JAR file, and trying to set up an environment to 
reproduce the failing scenario.

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.7

 Attachments: GribParser.java, gdas1.forecmwf.2014062612.grib2


 The Arctic dataset contains a MIME format called GRIB (General 
 Regularly-distributed Information in Binary form, 
 http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data format 
 used in meteorology to store historical and forecast weather data. There are 
 2 different types of the format: GRIB 0 and GRIB 2. 
 The focus will be on GRIB 2, which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '' (ASCII Characters)
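
 As an illustration of the Indicator Section listed above, the following is a minimal sketch of how a detector might recognise a GRIB record and read its edition number. It assumes the layout from the WMO GRIB specification, which this thread does not state: the record starts with the ASCII characters "GRIB" and octet 8 holds the edition number. The class name GribIndicator is hypothetical and is not the attached GribParser.java.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: GribIndicator is a hypothetical helper, not the attached GribParser.
final class GribIndicator {

    /** Returns the GRIB edition number, or -1 if the bytes do not start with a GRIB indicator. */
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1; // too short to hold the magic bytes plus the edition octet
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        return header[7] & 0xFF; // octet 8 (index 7), read as an unsigned value
    }

    public static void main(String[] args) {
        // Made-up 16-byte header with edition 2; only octets 1-4 and 8 are meaningful here.
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(edition(sample));
    }
}
```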





[jira] [Commented] (TIKA-1415) PowerPoint2003 embedded with word. The embedded file can not be detected.

2014-09-25 Thread sunxingzhe (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148528#comment-14148528
 ] 

sunxingzhe commented on TIKA-1415:
--

The attachment contains the corrected result; please confirm.
Attachment file: HSLFExtractor.java_diff.html

 PowerPoint2003 embedded with word. The embedded file can not be detected.
 -

 Key: TIKA-1415
 URL: https://issues.apache.org/jira/browse/TIKA-1415
 Project: Tika
  Issue Type: Bug
  Components: detector, parser
Affects Versions: 1.5
 Environment: window7
Reporter: sunxingzhe
  Labels: Tika, poi
 Attachments: HSLFExtractor.java_diff.html, PowerPointParserTest.java, 
 test.java, word2003.ppt, word2007.ppt


 A Word 2003 or Word 2007 document is inserted into a PowerPoint 2003 file as an embedded file.
 The embedded file's type cannot be detected.
 The embedded file's content cannot be parsed.





[jira] [Comment Edited] (TIKA-1415) PowerPoint2003 embedded with word. The embedded file can not be detected.

2014-09-25 Thread sunxingzhe (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148528#comment-14148528
 ] 

sunxingzhe edited comment on TIKA-1415 at 9/26/14 2:44 AM:
---

The attachment contains the modification result; please confirm.
Attachment file: HSLFExtractor.java_diff.html


was (Author: sunxingzhe359):
The attachment contains the corrected result; please confirm.
Attachment file: HSLFExtractor.java_diff.html



Apache Tika - JSON?

2014-09-25 Thread Vineet Ghatge Hemantkumar
Hello all,

I was wondering if there is any built-in parser to help with conversion from
XHTML to JSON.

My research showed that there is one named org.apache.io.json, which has just
one method implemented. Also, I tried the GJSON library to do this, but it does
not seem to work with Tika. Any suggestions will be appreciated.

Regards,
Vineet
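
Since Tika parsers report their XHTML output as SAX events to an org.xml.sax.ContentHandler, one possible approach is a small custom handler that builds JSON directly from those events. The following is a minimal sketch under stated assumptions: the class name XhtmlToJson and the JSON shape (a "tag" name plus a "children" array, attributes ignored) are made up for illustration, and the convert method uses the JDK's own SAX parser on an XHTML string rather than calling Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: XhtmlToJson and its JSON layout are hypothetical, not a Tika API.
final class XhtmlToJson extends DefaultHandler {
    private final StringBuilder json = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private boolean needComma = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        if (needComma) json.append(',');
        // Namespace-aware sources fill localName; the default JDK parser fills qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        json.append("{\"tag\":\"").append(escape(name)).append("\",\"children\":[");
        needComma = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        json.append("]}");
        needComma = true;
    }

    // Emits buffered text as a JSON string; whitespace-only text is dropped.
    private void flushText() {
        String s = text.toString().trim();
        text.setLength(0);
        if (s.isEmpty()) return;
        if (needComma) json.append(',');
        json.append('"').append(escape(s)).append('"');
        needComma = true;
    }

    // Escapes quotes, backslashes and control characters for a JSON string.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') out.append('\\').append(c);
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }

    static String convert(String xhtml) {
        try {
            XhtmlToJson handler = new XhtmlToJson();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.json.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }
}
```

With Tika, an instance of such a handler could be passed as the ContentHandler argument of a parser's parse call instead of going through convert; a JSON library could also replace the hand-written escaping.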