date:20141023

[GitHub] tika pull request: TIKA-1446

2014-10-23 Thread thaichat04

GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/20

TIKA-1446

TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20


commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca
Author: Chris Mattmann mattm...@apache.org
Date:   2014-07-28T00:45:03Z

[maven-release-plugin]  copy for tag 1.6

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 
13f79535-47bb-0310-9956-ffa450edef68

commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a
Author: David Meikle dmei...@apache.org
Date:   2014-07-31T18:29:32Z

TIKA-1381 - Added Lingo24Translator implementation

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 
13f79535-47bb-0310-9956-ffa450edef68

commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9
Author: Nick Burch n...@apache.org
Date:   2014-08-04T15:41:54Z

Create a branch for 1.6, to backport the POI upgrade to

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 
13f79535-47bb-0310-9956-ffa450edef68

commit e2d10e633d38c52b0f490a09043fb43176d26fbe
Author: Nick Burch n...@apache.org
Date:   2014-08-04T15:54:55Z

Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), 
ready for inclusion in rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 
13f79535-47bb-0310-9956-ffa450edef68

commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c
Author: Tim Allison talli...@apache.org
Date:   2014-08-04T16:51:40Z

TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) 
files

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 
13f79535-47bb-0310-9956-ffa450edef68

commit 68f9a11926946bdea29ab757a8275149d8d057e9
Author: Nick Burch n...@apache.org
Date:   2014-08-04T21:27:41Z

Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to 
match that in Apache POI, upgraded in TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 
13f79535-47bb-0310-9956-ffa450edef68

commit ee988d4daa5b451a51b799b0ec790b88ca7fc111
Author: Tim Allison talli...@apache.org
Date:   2014-08-05T13:03:05Z

TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 
13f79535-47bb-0310-9956-ffa450edef68

commit 9d27e1379fba530def45b470a92ce5052078021c
Author: Tim Allison talli...@apache.org
Date:   2014-08-05T18:17:39Z

TIKA-1380; fix for null ole.getLabel()

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 
13f79535-47bb-0310-9956-ffa450edef68

commit 2ee02d85aa703e65607a707ee171c166017916ab
Author: Nick Burch n...@apache.org
Date:   2014-08-20T14:16:06Z

Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the 
POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no 
longer required by anything now we are on Java 1.6 TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 
13f79535-47bb-0310-9956-ffa450edef68

commit a3eac367cd560c20da4231f45eb18d638d4f91a1
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:36:36Z

Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2.

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 
13f79535-47bb-0310-9956-ffa450edef68

commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:44:11Z

[maven-release-plugin] prepare release 1.6-rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 
13f79535-47bb-0310-9956-ffa450edef68

commit 5f9845759fb7839298ac5ee3abb11667035faac3
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:44:17Z

[maven-release-plugin] prepare for next development iteration

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 
13f79535-47bb-0310-9956-ffa450edef68




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181483#comment-14181483
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/20

TIKA-1446

TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20




 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority:

[jira] [Created] (TIKA-1455) Upgrade GSON dependency

2014-10-23 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1455:
-

 Summary: Upgrade GSON dependency
 Key: TIKA-1455
 URL: https://issues.apache.org/jira/browse/TIKA-1455
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (TIKA-1455) Upgrade GSON dependency

2014-10-23 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1455.
---
Resolution: Fixed

r1633850


[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181518#comment-14181518
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

Github user thaichat04 closed the pull request at:

https://github.com/apache/tika/pull/20


 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree, the 
 align tree, and the length tree; some of the trees were also built incorrectly.
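
For context, here is a minimal sketch of how an LZX aligned-offset block typically derives a match offset, following the publicly documented LZX format as implemented by open-source decoders such as libmspack. It is not Tika's code and not the reporter's fix. The class name, the method name, and the two callbacks (a raw bit reader and an aligned-tree symbol decoder) are hypothetical stand-ins. The main tree is assumed to have already yielded the position slot; the length tree only contributes to the match length and is not shown.

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch of match-offset decoding in an LZX aligned-offset block. */
final class AlignedOffsetSketch {
    // Per-slot tables from the LZX format: number of extra bits and base position.
    static final int[] EXTRA_BITS = new int[51];
    static final int[] POSITION_BASE = new int[51];

    static {
        for (int i = 0, j = 0; i < 51; i += 2) {
            EXTRA_BITS[i] = j;
            if (i + 1 < 51) {
                EXTRA_BITS[i + 1] = j;
            }
            if (i != 0 && j < 17) {
                j++;
            }
        }
        for (int i = 0, j = 0; i < 51; i++) {
            POSITION_BASE[i] = j;
            j += 1 << EXTRA_BITS[i];
        }
    }

    /**
     * @param slot              position slot decoded from the main tree (3 or higher)
     * @param readBits          reads the requested number of raw bits from the stream
     * @param readAlignedSymbol decodes one symbol (0..7) with the aligned tree
     */
    static int decodeMatchOffset(int slot, IntUnaryOperator readBits,
                                 IntSupplier readAlignedSymbol) {
        if (slot < 3) {
            // Slots 0..2 refer to the three most recent offsets instead.
            throw new IllegalArgumentException("slots 0-2 are repeated offsets");
        }
        int extra = EXTRA_BITS[slot];
        int offset = POSITION_BASE[slot] - 2;
        if (extra > 3) {
            // High bits are read verbatim; the low 3 bits come from the aligned tree.
            offset += readBits.applyAsInt(extra - 3) << 3;
            offset += readAlignedSymbol.getAsInt();
        } else if (extra == 3) {
            // Exactly 3 extra bits: all of them come from the aligned tree.
            offset += readAlignedSymbol.getAsInt();
        } else if (extra > 0) {
            // Fewer than 3 extra bits: the aligned tree is not used at all.
            offset += readBits.applyAsInt(extra);
        }
        return offset;
    }

    public static void main(String[] args) {
        // Fake stream: every raw read returns 1, every aligned symbol is 5.
        System.out.println(decodeMatchOffset(10, n -> 1, () -> 5));
    }
}
```

Mixing up which tree supplies which bits, for example reading all extra bits verbatim in an aligned block, yields wrong offsets, so the decoder copies from the wrong place in the window. That would be consistent with the garbled or empty output described above.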




[GitHub] tika pull request: CHM Parser Improvement

2014-10-23 Thread thaichat04

GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/21

CHM Parser Improvement

This pull request improves the Tika CHM parser.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thaichat04/tika TIKA-1446

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/21.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21


commit ac354e4fe22daf60326d240190c5da32cded6443
Author: hong-thai.nguyen hong-thai.ngu...@polyspot.com
Date:   2014-10-23T16:12:10Z

TIKA-1446 - Apply fix of [~binhawking] and some cleanup





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181530#comment-14181530
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Thanks a lot [~binhawking]. I've had a quick look at your fix. Indeed, there 
are quite a lot of changes. After some cleanup and a few minor fixes, I broke 
the CHM tests.

We really appreciate your contribution, and we should continue and finalize it. 
I've created a new pull request based on a branch with your fix plus my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446


[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser

2014-10-23 Thread JIRA


[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181630#comment-14181630
 ] 

Andreas Lehmkühler commented on TIKA-1098:
--

I've finally solved PDFBOX-1273. The fix will be part of the upcoming version 
1.8.8 and 2.0.0.

Thanks for your patience :-)

 not able to parse pdfs/docs/ppts using 1.1 tika parser
 

 Key: TIKA-1098
 URL: https://issues.apache.org/jira/browse/TIKA-1098
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: linux redhat
Reporter: Qian Diao
 Attachments: url_1763_approx-alg-notes.pdf


 Hi,
 I got some parsing problems when using Tika 1.1 on the attached PDF file.
 My code (Test.java):
 import java.io.File;
 import java.io.InputStream;
 import java.io.FileInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.html.BoilerpipeContentHandler;
 import org.apache.tika.sax.BodyContentHandler;
 import org.apache.tika.parser.html.HtmlParser;
 import de.l3s.boilerpipe.extractors.ArticleExtractor;

 public class Test {
     private static final String validBoilerpipeFilenameRegEx =
             ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

     public String parseFile(File inFile) {
         if (inFile == null || !inFile.isFile() || !inFile.canRead())
             return null;

         InputStream is = null;
         String outputText = "";
         try {
             // Open input stream
             is = new FileInputStream(inFile);
             // Prepare parser (-1 disables the write limit)
             BodyContentHandler contenthandler = new BodyContentHandler(-1);
             Metadata metadata = new Metadata();
             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
             ParseContext pc = new ParseContext();
             // Call parse with boilerpipe if valid boilerpipe extension;
             // otherwise, call regular parse.
             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                 Parser parser = new AutoDetectParser();
                 parser.parse(is, contenthandler, metadata, pc);
             } else {
                 Parser parser = new HtmlParser();
                 BoilerpipeContentHandler bh =
                         new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                 parser.parse(is, bh, metadata, pc);
             }
             // Prepare text for write
             outputText = contenthandler.toString();
         } catch (Exception e) {
             System.out.println(e);
             return null;
         } finally {
             try {
                 if (is != null)
                     is.close();
             } catch (Exception e) {}
         }

         return outputText;
     }
 }
 Output:
 org.apache.tika.exception.TikaException: Unable to extract PDF content
 url_1763_approx-alg-notes.pdf
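
To see which branch of parseFile the attached PDF takes, the file-name check can be exercised on its own. The following is a minimal sketch using only the JDK. The class and method names are hypothetical, and the regex is copied from the report on the assumption that the string quotes were lost in the archive.

```java
/** Hypothetical stand-alone version of the file-name check from the report. */
final class BoilerpipeDispatchSketch {
    // Same pattern as in the report; String.matches requires the whole name to match.
    static final String VALID_BOILERPIPE_FILENAME_REGEX =
            ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

    /** Returns true when the HtmlParser/boilerpipe branch would be chosen. */
    static boolean usesBoilerpipe(String fileName) {
        return fileName.matches(VALID_BOILERPIPE_FILENAME_REGEX);
    }

    public static void main(String[] args) {
        String[] names = {"index.html", "url_1763_approx-alg-notes.pdf", "INDEX.HTML"};
        for (String name : names) {
            System.out.println(name + " -> boilerpipe: " + usesBoilerpipe(name));
        }
    }
}
```

Under this check the attached PDF does not match, so it goes through the AutoDetectParser branch, which is where the reported TikaException arises. The match is case-sensitive, so an upper-case extension such as .HTML would also fall through to AutoDetectParser.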




[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181779#comment-14181779
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extraction permission, but its line in the Excel file 
does have tokens, which is, uh, surprising.

With the old parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        // Try to open the document with the empty password
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( "" );
        document.openProtection( sdm );
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.




[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181799#comment-14181799
 ] 

Tim Allison commented on TIKA-1442:
---

If it is any consolation, the Cyrillic is totally hosed. :)

I'm hoping to get a basic file server set up (thanks to Rackspace) so that I 
can create hyperlinks for the source doc and for the extracted text/metadata so 
that you don't have to go hunting through the directory structure, and so that 
you can see what's extracted without running the app yourself.

That is probably a few weeks off though.


[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181779#comment-14181779
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 10/23/14 7:31 PM:
-

Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extraction permission, but its line in the Excel file 
does have tokens, which is, uh, surprising.

With the old parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        // Try to open the document with the empty password
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( "" );
        document.openProtection( sdm );
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.


Same for 892/892859.pdf



[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181813#comment-14181813
 ] 

Tilman Hausherr commented on TIKA-1442:
---

The directory structure isn't a problem for me; I've downloaded all the PDF files 
into a flat local directory. Currently I'm still checking the files by hand, 
but I'll probably write a small script to extract and render with the different 
versions.

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip

I'm done now; the result is two new issues, PDFBOX-2448 and PDFBOX-2449. 
However PDFBOX-2448 isn't relevant to 1.8.8.

Many changes are positive ones: files that no longer throw an exception, or 
files that have better text extraction.



[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182047#comment-14182047
 ] 

Tilman Hausherr commented on TIKA-1442:
---

A few files have less metadata than before:
019/019837.pdf
138/138155.pdf
221/221001.pdf
224/224644.pdf
308/308233.pdf
469/469387.pdf
490/490345.pdf
490/490344.pdf
597/597244.pdf
643/643910.pdf

Could you tell me what you get in Tika for the first one?


[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Vineet Ghatge (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182194#comment-14182194
 ] 

Vineet Ghatge commented on TIKA-1423:
-

I consumed the parser to get data in HTML format, and it works. I have attached 
the output to this issue. There is one problem: with the netcdfAll-4.5 jar on 
the classpath, these warnings keep being displayed:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/netcdfAll-4.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/tika-app-1.7-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/slf4j-simple-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]

I tried changing Tika's pom.xml, but that did not work either. I am now trying 
to remedy this based on http://www.slf4j.org/codes.html#multiple_bindings and 
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/JarDependencies.html
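
One way to confirm which jars contribute a binding is to ask the class loader for every copy of the binder class named in the warnings. The following is a minimal sketch using only the JDK. The class and method names are hypothetical; the resource path is the one shown in the SLF4J output above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that lists every classpath location of a resource. */
final class Slf4jBindingLocator {
    // Resource path taken from the SLF4J warnings quoted above.
    static final String BINDER_RESOURCE = "org/slf4j/impl/StaticLoggerBinder.class";

    /** Returns all URLs under which the given resource is visible to the class loader. */
    static List<URL> locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // More than one printed URL means more than one binding is on the classpath.
        for (URL url : locate(BINDER_RESOURCE)) {
            System.out.println(url);
        }
    }
}
```

Run with the same classpath as the failing command, this should print one URL per jar that bundles a binding. Each such jar is a candidate for removal from the classpath or for a dependency exclusion.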

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.7

 Attachments: GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2


 The Arctic dataset contains a MIME format called GRIB (General 
 Regularly-distributed Information in Binary form): 
 http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format 
 used in meteorology to store historical and forecast weather data. There are 
 2 different types of the format: GRIB 0 and GRIB 2. 
 The focus will be on GRIB 2, which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '7777' (ASCII Characters)




[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Vineet Ghatge (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Ghatge updated TIKA-1423:

Attachment: fileName.html

Output in HTML


[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182208#comment-14182208
 ] 

Lewis John McGibbney commented on TIKA-1423:


p.s. do you have a patch against Tika trunk so that we can test? Thanks

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.7

 Attachments: GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, fileName.html, 
 gdas1.forecmwf.2014062612.grib2


 Arctic dataset contains a MIME format called GRIB -  General 
 Regularlydistributed information in Binary form 
 http://en.wikipedia.org/wiki/GRIB . GRIB is a well known data format which is 
 a concise data format used in meteorology to store historical and 
 weather data. There are 2 different types of the format  GRIB 0, GRIB 2.  
 The focus will be on GRIB 2 which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '7777' (ASCII Characters)
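
The framing described above (a record opens with the Indicator Section and closes with the ASCII end marker) can be sketched as a simple sanity check. This is only an illustration, assuming the Indicator Section begins with the four ASCII characters "GRIB" and the record ends with "7777". The class and method names are hypothetical and are not taken from the attached GribParser.java, and the sketch does not decode any of the sections in between.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or the attached GribParser.java.
public class GribFraming {
    // Assumed start marker of the Indicator Section (section 0).
    private static final byte[] START = "GRIB".getBytes(StandardCharsets.US_ASCII);
    // End marker of the final section, as listed in the issue description.
    private static final byte[] END = "7777".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the bytes start with "GRIB" and end with "7777".
    // Only the outer framing is checked; the sections in between are ignored.
    public static boolean looksLikeGribRecord(byte[] record) {
        if (record == null || record.length < START.length + END.length) {
            return false;
        }
        byte[] head = Arrays.copyOfRange(record, 0, START.length);
        byte[] tail = Arrays.copyOfRange(record, record.length - END.length, record.length);
        return Arrays.equals(head, START) && Arrays.equals(tail, END);
    }

    public static void main(String[] args) {
        byte[] framed = "GRIB....payload....7777".getBytes(StandardCharsets.US_ASCII);
        byte[] other = "not a grib record".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeGribRecord(framed)); // true
        System.out.println(looksLikeGribRecord(other));  // false
    }
}
```

A check like this could only serve as a cheap pre-filter before handing the stream to a real decoder such as the netCDF library mentioned in this thread; it says nothing about whether the record contents are valid.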




[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182204#comment-14182204
 ] 

Lewis John McGibbney commented on TIKA-1423:


Output looks fantastic. Can you please run 
{code}
mvn dependency:analyze-report
{code}
and see if you can resolve the slf4j-simple conflict between tika-app/pom.xml 
and tika-parsers/pom.xml when you add the netCDF library?
It is probably worth trying to exclude the logging dependency from the netCDF 
dependency, similar to what is done here:
https://github.com/apache/gora/blob/master/gora-accumulo/pom.xml#L144
HTH, great work.
Lewis



[jira] [Assigned] (TIKA-443) Geographic Information Parser

2014-10-23 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-443:
--

Assignee: Chris A. Mattmann

 Geographic Information Parser
 -

 Key: TIKA-443
 URL: https://issues.apache.org/jira/browse/TIKA-443
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Arturo Beltran
Assignee: Chris A. Mattmann
 Attachments: getFDOMetadata.xml


 I'm working on the automatic description of geospatial resources, and I think 
 it might be interesting to incorporate new parsers into Tika in order to 
 manage and describe some geo-formats. These geo-formats include files, 
 services and databases.
 If anyone is interested in this issue or wants to collaborate, do not hesitate 
 to contact me. Any help is welcome.




[jira] [Commented] (TIKA-443) Geographic Information Parser

2014-10-23 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182462#comment-14182462
 ] 

Chris A. Mattmann commented on TIKA-443:


Guys, I wonder if we should (now 4 years later) standardize on Apache SIS 
(http://sis.apache.org/) and incorporate its support for parsing ISO 19115 
metadata. It seems to have the same types of properties that FDO metadata XML 
has. 

I'm going to take a whirl at creating a GeoParser that extracts information 
from ISO 19115 XML files. [~desruisseaux] FYI [~adamestrada] FYI.


Re: import (re)ordering?

2014-10-23 Thread Mattmann, Chris A (3980)

Hey Tim,

No big objections from me, but it will dilute things so glad we
have it noted if it happens.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 1:59 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: import (re)ordering?

All,
  I have IntelliJ set to order imports by javax, java, then other.  I
think this is the most common pattern in Tika.  Is it OK if I make these
(meaningless/formatting) changes when I commit other changes?
  Thank you.

   Best,

  Tim

[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-23 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182476#comment-14182476
 ] 

Chris A. Mattmann commented on TIKA-1451:
-

great work Tim!

 Add Recursive Metadata Parser Wrapper output to tika-app and gui
 

 Key: TIKA-1451
 URL: https://issues.apache.org/jira/browse/TIKA-1451
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.7

 Attachments: integrate_recursive_metadata_wrapper.patch


 It would be helpful to expose the output of the recursive metadata parser 
 wrapper in the gui and in the command line for tika-app.



