[jira] [Resolved] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1447.

Resolution: Fixed

 CHM parser: wrong directory list
 

 Key: TIKA-1447
 URL: https://issues.apache.org/jira/browse/TIKA-1447
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical

 The CHM parser returns a wrong directory list for a CHM file (e.g. testChm2.chm in 
 tika-parser's test-resources):
 1. Duplicate entries (mostly from PMGI chunks, which should have been 
 ignored).
 2. Invalid entries (usually with an unreadable entry name).
 3. Missing entries (sometimes as in TIKA-1176).
 I have fixed it (to some degree) by using the PMGL header to find the directory chunks 
 and their respective meaningful parts.
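
A rough sketch of the idea described above, not the actual patch. It assumes the chunk layout given in unofficial CHM format notes: fixed-size directory chunks, a 4-byte ASCII signature ("PMGL" for listing chunks, "PMGI" for index chunks), then a little-endian 32-bit count of free/quickref bytes at the end of the chunk, with a 0x14-byte PMGL header. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: locates the entry area of each PMGL chunk. */
final class DirChunkScan {
    /** Assumed size of the PMGL chunk header, per unofficial format notes. */
    static final int HEADER_LEN = 0x14;

    /**
     * Returns one {entriesOffset, entriesLength} pair per PMGL chunk.
     * PMGI chunks (and anything else) are skipped, so their entries
     * cannot show up as duplicates in the listing.
     */
    static List<int[]> listingChunks(byte[] dir, int chunkSize) {
        if (chunkSize <= HEADER_LEN) {
            throw new IllegalArgumentException("chunk size too small");
        }
        List<int[]> result = new ArrayList<>();
        for (int off = 0; off + chunkSize <= dir.length; off += chunkSize) {
            String sig = new String(dir, off, 4, StandardCharsets.US_ASCII);
            if (!"PMGL".equals(sig)) {
                continue; // index chunk or garbage: ignore
            }
            // Second header field: bytes of free/quickref space at the chunk end.
            int free = ByteBuffer.wrap(dir, off + 4, 4)
                    .order(ByteOrder.LITTLE_ENDIAN).getInt();
            int len = chunkSize - HEADER_LEN - free;
            if (free < 0 || len < 0) {
                continue; // malformed header: skip rather than read junk names
            }
            result.add(new int[] {off + HEADER_LEN, len});
        }
        return result;
    }
}
```

Restricting the entry parser to the returned spans is what would keep it from reading the trailing free space as entry names, which is one plausible source of the unreadable entries mentioned above.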



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1446.

Resolution: Fixed

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree/align 
 tree/length tree; some of the trees were also built incorrectly.





[jira] [Updated] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1447:
---
Fix Version/s: 1.7

 CHM parser: wrong directory list
 

 Key: TIKA-1447
 URL: https://issues.apache.org/jira/browse/TIKA-1447
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7


 The CHM parser returns a wrong directory list for a CHM file (e.g. testChm2.chm in 
 tika-parser's test-resources):
 1. Duplicate entries (mostly from PMGI chunks, which should have been 
 ignored).
 2. Invalid entries (usually with an unreadable entry name).
 3. Missing entries (sometimes as in TIKA-1176).
 I have fixed it (to some degree) by using the PMGL header to find the directory chunks 
 and their respective meaningful parts.





[jira] [Resolved] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1448.

Resolution: Fixed

 CHM parser : defect in file extraction
 --

 Key: TIKA-1448
 URL: https://issues.apache.org/jira/browse/TIKA-1448
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Bin Hawking
 Fix For: 1.7


 In the ChmBlockInfo class:
     chmBlockInfo.setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock)
             % (int) clcd.getResetInterval());
 always sets 0.
 According to the LZX algorithm, it should be:
     chmBlockInfo.setIniBlock(chmBlockInfo.startBlock
             - chmBlockInfo.startBlock % (int) clcd.getResetInterval());
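
To make the arithmetic concrete, here is a minimal sketch with made-up numbers. The class and method names are hypothetical and only mirror the two expressions quoted above; none of this is Tika code. As far as I understand the intent, the corrected expression rounds the start block down to the nearest multiple of the reset interval.

```java
/** Worked arithmetic for the two iniBlock expressions; all values are illustrative. */
final class IniBlockDemo {

    /** The reported expression: (start - start) % interval, which is 0 for any input. */
    static int buggyIniBlock(int startBlock, int resetInterval) {
        return (startBlock - startBlock) % resetInterval;
    }

    /** The proposed expression: start - (start % interval). */
    static int fixedIniBlock(int startBlock, int resetInterval) {
        // '%' binds tighter than '-', so this is start minus the remainder.
        return startBlock - startBlock % resetInterval;
    }

    public static void main(String[] args) {
        int resetInterval = 2; // hypothetical reset interval, counted in blocks
        for (int startBlock = 0; startBlock < 6; startBlock++) {
            System.out.printf("start=%d buggy=%d fixed=%d%n",
                    startBlock,
                    buggyIniBlock(startBlock, resetInterval),
                    fixedIniBlock(startBlock, resetInterval));
        }
    }
}
```

With a reset interval of 2, a start block of 5 gives 4 under the corrected expression, while the original expression gives 0 no matter which block the file starts in.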





[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1430.

Resolution: Fixed

 CHM parser gets faulty text (fix found)
 ---

 Key: TIKA-1430
 URL: https://issues.apache.org/jira/browse/TIKA-1430
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
 Environment: Windows 7; JDK 7 or 8
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7


 I get partially wrong text out of CHM files, including the CHM files in 
 tika-parsers/src/test/resources/test-documents/testChm*.chm
 I tried 1.6 and 1.5; both are equally bad. I wonder why no one has complained before. 
 I checked the source code. The cause is obvious:
 When Tika decompresses the LZX data, the first block is handled correctly, but for the 
 2nd and later blocks Tika uses the previous block's content as the compressed data. 
 See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
 
 if (prevBlock != null
         && prevBlock.getState().getBlockLength() > prevBlock
                 .getState().getBlockRemaining())
     setChmSection(new ChmSection(prevBlock.getContent()));
     //   NOTE: the dataSegment to be decompressed is not kept
 else
     setChmSection(new ChmSection(dataSegment));
 
 My fix:
 1. Add a prevcontent member variable to the ChmSection class, so that 
 dataSegment and prevBlock.getContent() are both kept in it.
 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), 
 pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
 With this change, I tried some CHM files and got correct-looking text. 
 BTW, the unit test should be tougher: in this case a small piece of text (the 
 first block) is decompressed correctly even though later blocks are not.
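
Why both pieces of data have to be kept can be illustrated with a toy LZ77-style back-reference copy. This is deliberately not LZX and not the Tika code. The class and method names are hypothetical. It only shows that a match in a later block may point back into the previous block's output, while the tokens themselves still come from the new compressed segment.

```java
import java.util.Arrays;

/** Toy LZ77-style match copy; illustrative only, not the LZX algorithm. */
final class HistoryWindowSketch {

    /**
     * Appends `length` bytes to `out`, copying from `distance` bytes back in the
     * combined window (previous block output followed by current output).
     */
    static byte[] appendMatch(byte[] prevContent, byte[] out, int distance, int length) {
        int pos = prevContent.length + out.length;
        if (distance <= 0 || distance > pos || length < 0) {
            throw new IllegalArgumentException("match reaches outside the window");
        }
        byte[] window = new byte[pos + length];
        System.arraycopy(prevContent, 0, window, 0, prevContent.length);
        System.arraycopy(out, 0, window, prevContent.length, out.length);
        for (int i = 0; i < length; i++) {
            // Byte-by-byte copy so that overlapping matches behave as in LZ77.
            window[pos + i] = window[pos + i - distance];
        }
        // Only the current block's output is returned; the history stays separate.
        return Arrays.copyOfRange(window, prevContent.length, window.length);
    }
}
```

If the previous content were substituted for the compressed input, as in the snippet quoted from ChmLzxBlock, the decoder would have history but nothing new to decode, which would fit the symptom of only the first block coming out right.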





[jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1430:
---
Fix Version/s: 1.7

 CHM parser gets faulty text (fix found)
 ---

 Key: TIKA-1430
 URL: https://issues.apache.org/jira/browse/TIKA-1430
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
 Environment: Windows 7; JDK 7 or 8
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7


 I get partially wrong text out of CHM files, including the CHM files in 
 tika-parsers/src/test/resources/test-documents/testChm*.chm
 I tried 1.6 and 1.5; both are equally bad. I wonder why no one has complained before. 
 I checked the source code. The cause is obvious:
 When Tika decompresses the LZX data, the first block is handled correctly, but for the 
 2nd and later blocks Tika uses the previous block's content as the compressed data. 
 See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
 
 if (prevBlock != null
         && prevBlock.getState().getBlockLength() > prevBlock
                 .getState().getBlockRemaining())
     setChmSection(new ChmSection(prevBlock.getContent()));
     //   NOTE: the dataSegment to be decompressed is not kept
 else
     setChmSection(new ChmSection(dataSegment));
 
 My fix:
 1. Add a prevcontent member variable to the ChmSection class, so that 
 dataSegment and prevBlock.getContent() are both kept in it.
 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), 
 pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
 With this change, I tried some CHM files and got correct-looking text. 
 BTW, the unit test should be tougher: in this case a small piece of text (the 
 first block) is decompressed correctly even though later blocks are not.





[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1446:
---
Fix Version/s: 1.7

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7

 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree/align 
 tree/length tree; some of the trees were also built incorrectly.





[jira] [Updated] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1448:
---
Fix Version/s: 1.7

 CHM parser : defect in file extraction
 --

 Key: TIKA-1448
 URL: https://issues.apache.org/jira/browse/TIKA-1448
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Bin Hawking
 Fix For: 1.7


 In the ChmBlockInfo class:
     chmBlockInfo.setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock)
             % (int) clcd.getResetInterval());
 always sets 0.
 According to the LZX algorithm, it should be:
     chmBlockInfo.setIniBlock(chmBlockInfo.startBlock
             - chmBlockInfo.startBlock % (int) clcd.getResetInterval());





[jira] [Updated] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-672:
--
Fix Version/s: 1.7

 Proper error handling in the CHM parser
 ---

 Key: TIKA-672
 URL: https://issues.apache.org/jira/browse/TIKA-672
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
 Fix For: 1.7


 The new CHM parser (TIKA-245) swallows exceptions and uses System.err and 
 System.out prints to report problems in many places. We should change that to 
 properly throw exceptions as follows:
 - IOExceptions when the document stream can not be read
 - TikaExceptions when the stream can not be parsed
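
A minimal sketch of the split proposed above, assuming nothing about the real parser internals. SketchTikaException is a stand-in for org.apache.tika.exception.TikaException so that the example compiles on its own, and readSignature is a hypothetical helper. The only format fact it relies on is that a CHM file starts with the ASCII signature "ITSF".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Illustrates "IOException for read failures, parse exception for bad content". */
final class ChmErrorHandlingSketch {

    /** Stand-in for TikaException, declared here only to keep the sketch self-contained. */
    static final class SketchTikaException extends Exception {
        SketchTikaException(String message) {
            super(message);
        }
    }

    /** Reads the 4-byte file signature instead of printing problems to System.err. */
    static String readSignature(InputStream in) throws IOException, SketchTikaException {
        byte[] sig = new byte[4];
        int n = 0;
        while (n < sig.length) {
            // A failing read throws IOException, which is passed on to the caller.
            int r = in.read(sig, n, sig.length - n);
            if (r < 0) {
                break; // stream ended early
            }
            n += r;
        }
        String found = new String(sig, 0, n, StandardCharsets.US_ASCII);
        if (!"ITSF".equals(found)) {
            // The stream was readable but is not valid CHM: report a parse failure.
            throw new SketchTikaException("Not a CHM stream: unexpected signature");
        }
        return found;
    }
}
```

The caller can then tell an unreadable stream from an unparseable one by exception type, which is not possible when problems are only printed.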





[jira] [Resolved] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-672.
---
Resolution: Fixed

Checked: there are no more System.err/System.out calls inside the CHM parser.

 Proper error handling in the CHM parser
 ---

 Key: TIKA-672
 URL: https://issues.apache.org/jira/browse/TIKA-672
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
 Fix For: 1.7


 The new CHM parser (TIKA-245) swallows exceptions and uses System.err and 
 System.out prints to report problems in many places. We should change that to 
 properly throw exceptions as follows:
 - IOExceptions when the document stream can not be read
 - TikaExceptions when the stream can not be parsed





Re: Subsets of tika parsers redux

2014-11-24 Thread Sergey Beryozkin

Hi Nick

Was good talking to you and thanks for initiating this thread.

It is an interesting idea, one that can lead to introducing 
finer-grained bundles but also providing a mechanism for the 
(auto-)generation of the import metadata required by each of the parser 
modules. Besides, introducing several smaller bundles that would group 
most popular formats is a good one on its own IMHO.


My doubt here is how many of those bundles we'd need to create and whether it 
will make it easy for users to get a task like "get a parser for 
format A only" or "get parsers for formats A and B only" done.


Are we talking about introducing a parser module per every supported 
format, and having tika-parsers depend on all of those modules, with 
every parser module becoming a bundle (a jar plus an entry in the 
manifest) ?


Thanks, Sergey


On 23/11/14 17:12, Nick Burch wrote:

Hi All

During ApacheCon, I had a chance to chat with Sergey about the subset
of Tika Parsers issue that bubbles up from time to time. It seemed to
work well, and I think we both now have a better idea of the other's
needs and concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere,
we have some users who are confused already by the split between
tika-core and tika-parsers. Anything that fragments further is going to
cause more issues for that kind of user.

On the other hand, there are potential users out there who want just a
handful of parsers, in a simple and easy and small way, who don't know a
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of
those are using OSGi, but not all.

One suggested solution is to just document what dependencies of
tika-parsers can be excluded at the maven level to disable certain
parsers + shrink the resulting dependency tree. However, that requires
manual updates, manual checking, and like our examples on the website
risk getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the
website into svn, with unit tests, and having the website pull those
from svn on the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the full
bundle as now, but also have ones for pdf, microsoft office,
images etc. OSGi users (eg CXF users) could then opt to depend on
pdf+image if they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit
tests ensure that the desired parsers still work on their smaller
bundle. These unit tests also ensure that unwanted parsers don't work,
thus flagging up if extra dependencies have snuck through.
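
One way the "unwanted parsers don't work" check could be sketched, assuming a plain classpath rather than a real OSGi container. The helper and the missing class name used below are hypothetical. A real bundle test would have to ask the bundle's own class loader instead.

```java
/** Hypothetical helper for a "subset bundle" unit test. */
final class BundleContentCheck {

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, BundleContentCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // A class that must be present, and one that must not have snuck in.
        System.out.println(isLoadable("java.lang.String"));
        System.out.println(isLoadable("org.example.hypothetical.UnwantedParser"));
    }
}
```

A test for a pdf-only bundle could assert that the PDF parser's classes are loadable and that a class from, say, the Office parsers' dependencies is not.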

Finally, we pull out the includes/excludes information that went into
the bundle, and display that for non-OSGi users. A non-OSGi person
wanting tika with pdf only could then look at what the tika-pdf-bundle
does and doesn't use, and from that know what maven level dependencies
to keep and which to exclude.


This new plan would mean having to tweak our build to support multiple
bundles, and potentially tweaking our bundles so that you could load
tika-pdf + tika-image and have those two play nicely together. It'd also
need some new unit tests, and the work to figure out what to
include/exclude for each of our handful of common cases. It should,
however, deliver a way for OSGi and non-OSGi people to get just a subset
if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone
want to help? :)

Nick




[jira] [Commented] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-24 Thread Milan Zivkovic (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222927#comment-14222927
 ] 

Milan Zivkovic commented on TIKA-1473:
--

Hi, 
Indeed I was using the FileInputStream, but if I wrap it with the 
TikaInputStream I get the same problem. 
{code}
public static void main( final String[] args ) throws IOException, TikaException {
    final String path = path_to_file;
    final Metadata metadata = new Metadata();
    InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) );
    is = TikaInputStream.get( is );
    final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH );
    System.out.println( someText );
}
{code}

 Apache Tika is not working for .docx documents 
 ---

 Key: TIKA-1473
 URL: https://issues.apache.org/jira/browse/TIKA-1473
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
Reporter: Franco Catto
Priority: Blocker

 I am using Apache Tika 1.6 to read different document files. 
 It is reading pdf and old format doc files but when I try to read docx file, 
 it gives me following exception:
 org.apache.tika.exception.TikaException: Failed to close temporary resources 
 at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) 
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) 
 ...
 The resource can not be closed because it is still being used by the Java 
 Process, certainly the OOXML parser.





[jira] [Comment Edited] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-24 Thread Milan Zivkovic (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222927#comment-14222927
 ] 

Milan Zivkovic edited comment on TIKA-1473 at 11/24/14 11:56 AM:
-

Hi, 
Indeed I was using the FileInputStream, but if I wrap it with the 
TikaInputStream I get the same problem. 
{code}
public static void main( final String[] args ) throws IOException, TikaException {
    final String path = path_to_file;
    final Metadata metadata = new Metadata();
    final InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) );
    final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH );
    System.out.println( someText );
}
{code}


was (Author: mzivkovic):
Hi, 
Indeed I was using the FileInputStream, but if I wrap it with the 
TikaInputStream I get the same problem. 
{code}
public static void main( final String[] args ) throws IOException, TikaException {
    final String path = path_to_file;
    final Metadata metadata = new Metadata();
    InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) );
    is = TikaInputStream.get( is );
    final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH );
    System.out.println( someText );
}
{code}

 Apache Tika is not working for .docx documents 
 ---

 Key: TIKA-1473
 URL: https://issues.apache.org/jira/browse/TIKA-1473
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
Reporter: Franco Catto
Priority: Blocker

 I am using Apache Tika 1.6 to read different document files. 
 It is reading pdf and old format doc files but when I try to read docx file, 
 it gives me following exception:
 org.apache.tika.exception.TikaException: Failed to close temporary resources 
 at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) 
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) 
 ...
 The resource can not be closed because it is still being used by the Java 
 Process, certainly the OOXML parser.





[jira] [Commented] (TIKA-1332) Create eval code

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222947#comment-14222947
 ] 

Tim Allison commented on TIKA-1332:
---

In a personal communication, I asked [~sergey_beryozkin] for recommendations 
for handling static content in the JAX-RS framework.  For the UI component of 
the eval code (how the user interacts with the results of the eval), is 
there an easy equivalent in JAX-RS that lets the user browse a 
directory of files and click on desired files for download as easily as one can 
with Jetty's ResourceHandler?

With permission, I'm posting/summarizing [~sergey_beryozkin]'s responses.  If 
anyone else has a recommendation that leverages the JAX-RS framework for dynamic 
data and still uses something as easy as Jetty's ResourceHandler for static 
content, please let us know.

Option 1: 
Handcode a JAX-RS handler that mimics Jetty's ResourceHandler
 That can be done easily enough with JAX-RS if you'd like to explore
 this path, something like this I guess:

{noformat}
 import java.io.File;
 import java.net.URI;
 import java.util.LinkedList;
 import java.util.List;
 import javax.ws.rs.GET;
 import javax.ws.rs.Path;
 import javax.ws.rs.PathParam;
 import javax.ws.rs.Produces;
 import javax.ws.rs.core.Context;
 import javax.ws.rs.core.Response;
 import javax.ws.rs.core.UriInfo;

 @Path("eval")
 public class TikaEvaluation {
   @Context
   private UriInfo ui;

   @GET
   @Path("list")
   @Produces("text/html")
   public Response getListOfResultURIs() {
     List<URI> uris = new LinkedList<URI>();
     for (File f : getResultFiles()) {
       uris.add(ui.getAbsolutePathBuilder().path(f.getName()).build());
     }
     // uris now holds a list of links to the individual files.
     // Next we need to decide how to convert that to HTML:
     // one option is to return the list as is and redirect that to a JSP,
     // another is to build a basic HTML string right here in the method,
     // another is to register a MessageBodyWriter that will convert the
     // list into HTML.
     // The individual links will be managed by the getFile() method.

     return Response.ok(uris).build();
   }

   @GET
   @Path("list/{name}")
   @Produces({"application/json", "multipart/mixed"})
   public Response getFile(@PathParam("name") String name) {
     ...
   }
 }

{noformat}

Option 2:
Run Jetty's ResourceHandler from the same embedded Jetty server that is hosting 
the JAX-RS code.
 This link would probably be the best one: [link| 
 https://git-wip-us.apache.org/repos/asf?p=cxf.git;a=blob_plain;f=distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Server.java;hb=HEAD]

 The Tika JAX-RS server actually runs on top of Jetty right now too, but in
 this case we have a direct Jetty server setup.

 The server registers a CXF servlet and Jetty handlers too. The CXF servlet
 can also redirect to default handlers, such as a default handler for serving
 static content. This is not needed if the result files are accessible
 over a URI that does not overlap with the CXF servlet URI pattern.
 In fact, I wonder whether the Tika JAX-RS style of registration may also do?
 If you register a CXF endpoint at /eval and the results are accessible
 over /results then it should work? Unless the Jetty ContentHandler is not
 installed by default - then the linked-to code would definitely do :-)

 The only possible downside here is that, as far as consistent URI 
 space management is concerned, we'd have one part of it (the static 
 resources) controlled natively by Jetty and the rest by JAX-RS, so it 
 can be trickier to provide support for searching the results or 
 enforcing common security rules (when/if needed).
 That said, maybe it is not a real concern; it can always be removed 
 in the future if needed.


Other options?


 Create eval code
 --

 Key: TIKA-1332
 URL: https://issues.apache.org/jira/browse/TIKA-1332
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison

 For this issue, we can start with code to gather statistics on each run (# of 
 exceptions per file type, most common exceptions per file type, number of 
 metadata items, total text extracted, etc).  We should also be able to 
 compare one run against another.  Going forward, there's plenty of room to 
 improve.
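
A small sketch of the first statistic mentioned above (exceptions per file type) and of comparing two runs. The input shape is an assumption made for illustration: each result is a {fileType, exceptionName} pair, with a null exception name meaning the file parsed cleanly. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical tally of parse failures per file type, plus a run-to-run diff. */
final class EvalStatsSketch {

    /** Counts results with a non-null exception name, grouped by file type. */
    static Map<String, Integer> exceptionsPerType(List<String[]> run) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] result : run) {
            String fileType = result[0];
            String exception = result[1];
            if (exception != null) {
                counts.merge(fileType, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Per-type change in exception count from one run to the next; zero changes are omitted. */
    static Map<String, Integer> delta(Map<String, Integer> before, Map<String, Integer> after) {
        Set<String> types = new TreeSet<>(before.keySet());
        types.addAll(after.keySet());
        Map<String, Integer> changes = new TreeMap<>();
        for (String type : types) {
            int diff = after.getOrDefault(type, 0) - before.getOrDefault(type, 0);
            if (diff != 0) {
                changes.put(type, diff);
            }
        }
        return changes;
    }
}
```

A positive entry in the delta flags a file type that regressed between runs, and a negative one flags an improvement.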





Re: Subsets of tika parsers redux

2014-11-24 Thread Mattmann, Chris A (3980)
Hey Nick,

This sounds like a great plan to me, good job to you
and Sergey. As for helping, I'll try my best, but I'm not
an OSGi guru :)

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Nick Burch n...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Sunday, November 23, 2014 at 6:12 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Subsets of tika parsers redux

Hi All

During ApacheCon, I had a chance to chat with Sergey about the subset of
Tika Parsers issue that bubbles up from time to time. It seemed to work
well, and I think we both now have a better idea of the other's needs and
concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere, we
have some users who are confused already by the split between tika-core
and tika-parsers. Anything that fragments further is going to cause more
issues for that kind of user.

On the other hand, there are potential users out there who want just a
handful of parsers, in a simple and easy and small way, who don't know a
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those
are using OSGi, but not all.

One suggested solution is to just document what dependencies of
tika-parsers can be excluded at the maven level to disable certain parsers
+ shrink the resulting dependency tree. However, that requires manual
updates, manual checking, and like our examples on the website risk
getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the website
into svn, with unit tests, and having the website pull those from svn on
the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the full
bundle as now, but also have ones for pdf, microsoft office, images
etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if
they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit
tests ensure that the desired parsers still work on their smaller bundle.
These unit tests also ensure that unwanted parsers don't work, thus
flagging up if extra dependencies have snuck through.

Finally, we pull out the includes/excludes information that went into the
bundle, and display that for non-OSGi users. A non-OSGi person wanting
tika with pdf only could then look at what the tika-pdf-bundle does and
doesn't use, and from that know what maven level dependencies to keep and
which to exclude.


This new plan would mean having to tweak our build to support multiple
bundles, and potentially tweaking our bundles so that you could load
tika-pdf + tika-image and have those two play nicely together. It'd also
need some new unit tests, and the work to figure out what to
include/exclude for each of our handful of common cases. It should,
however, deliver a way for OSGi and non-OSGi people to get just a subset
if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone want
to help? :)

Nick



[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 8:02 PM:


[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDFBOX-2520.


was (Author: lehmi):
[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDRFBOX-2520.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 8:09 PM:


[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDFBOX-2520.

P.S.: I don't know who'll be faster, Jenkins or you. So you might have to wait 
until the latest binaries are available if you don't use your own compiled 
version ;-)


was (Author: lehmi):
[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDFBOX-2520.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460
 ] 

Tim Allison commented on TIKA-1442:
---

Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460
 ] 

Tim Allison edited comment on TIKA-1442 at 11/24/14 8:34 PM:
-

Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

[~lehmi], pdfbox-1.8.8-20141124.203308-145.jar ?


was (Author: talli...@mitre.org):
Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.



[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469
 ] 

Andreas Lehmkühler commented on TIKA-1442:
--

Yes, build 145 should include the latest changes



[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460
 ] 

Tim Allison edited comment on TIKA-1442 at 11/24/14 8:39 PM:
-

Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

[~lehmi], https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?

Is there an easy way to tell whether Maven is picking up the correct version from 
my <version>1.8.8-SNAPSHOT</version> definition?
Currently, I'm getting: pdfbox-1.8.8-20141124.203308-145.jar ?
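
One way to sanity-check which snapshot Maven resolved is to read the timestamp and build number out of the jar name. Maven's timestamped snapshots are named artifactId-baseVersion-yyyyMMdd.HHmmss-buildNumber.jar, so the name above decodes to build 145 of 1.8.8. The sketch below is only an illustration in Java; the class name SnapshotName and its parse helper are hypothetical and are not part of Maven, Tika, or PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Maven API): splits a timestamped snapshot jar name
// of the form artifactId-baseVersion-yyyyMMdd.HHmmss-buildNumber.jar.
final class SnapshotName {
    private static final Pattern NAME = Pattern.compile(
            "^(.+)-(\\d+(?:\\.\\d+)*)-(\\d{8}\\.\\d{6})-(\\d+)\\.jar$");

    /** Returns {artifactId, baseVersion, timestamp, buildNumber}, or null if the name does not match. */
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        String[] parts = parse("pdfbox-1.8.8-20141124.203308-145.jar");
        System.out.println("artifact=" + parts[0] + " version=" + parts[1]
                + " timestamp=" + parts[2] + " build=" + parts[3]);
    }
}
```

Comparing the build number with the newest entry in the snapshot repository listing shows whether the local copy is current; running Maven with -U forces a fresh check for updated snapshots.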


was (Author: talli...@mitre.org):
Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

[~lehmi], pdfbox-1.8.8-20141124.203308-145.jar ?



[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 9:03 PM:


Yes, build 
[145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar]
 of 1.8.8 should include the latest changes

{quote}
https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?
{quote}
That one results in the latest 2.0.0 SNAPSHOT build number 
[724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar]


was (Author: lehmi):
Yes, build 145 should include the latest changes



[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 9:04 PM:


[~talli...@apache.org] Yes, build 
[145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar]
 of 1.8.8 should include the latest changes

{quote}
https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?
{quote}
That one results in the latest 2.0.0 SNAPSHOT build number 
[724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar]


was (Author: lehmi):
Yes, build 
[145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar]
 of 1.8.8 should include the latest changes

{quote}
https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?
{quote}
That one results in the latest 2.0.0 SNAPSHOT build number 
[724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar]



[jira] [Commented] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-24 Thread Darya Arbuzova (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224167#comment-14224167
 ] 

Darya Arbuzova commented on TIKA-1481:
--

Thank you, Sergey!

I was trying to find a better place to ask my questions, but I can't send 
anything to u...@tika.apache.org (the address given on the mailing lists page: 
http://tika.apache.org/mail-lists.html) and didn't come across any Q&A-like 
pages. What do you mean by «users list»?
Sorry for posting in the wrong place.

Best regards,
Darya Arbuzova

 TikaJAXRS get metadata calls give different results
 ---

 Key: TIKA-1481
 URL: https://issues.apache.org/jira/browse/TIKA-1481
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.6
 Environment: Windows 8, JDK 1.8
Reporter: Darya Arbuzova
Priority: Minor
 Attachments: sample.csv


 Hello!
 I'm trying to use Tika in server mode.
 I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
 I have tried to get file metadata in 2 different ways (as explained here: 
 http://wiki.apache.org/tika/TikaJAXRS ):
 {{ curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
 {{Content-Encoding,windows-1252}}
 {{Content-Type,text/plain; charset=windows-1252}}
 and
 {{ curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
 {{Content-Encoding,ISO-8859-1}}
 {{Content-Type,text/plain; charset=ISO-8859-1}}
 How come they give different results in encoding if I call the same 
 {{http://localhost:9998/meta}}?
 What other differences could appear, and which is the preferable way to 
 get metadata?
 Many thanks!
 Best regards,
 Darya Arbuzova
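
One difference between the two calls lies on the curl side: -T uploads the file's bytes unchanged, whereas -d @file is documented to strip carriage returns and newlines before sending (--data-binary @file keeps them). The server therefore receives two different byte streams, which may be enough to change the detected encoding; that link is an assumption here and is not confirmed in this thread. A minimal Java sketch of the stripping, with the hypothetical helper stripLineBreaks standing in for curl's behavior:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not curl or Tika code: mimics how `curl -d @file`
// drops CR and LF bytes from the request body.
final class CurlDataBody {
    static byte[] stripLineBreaks(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        for (byte b : raw) {
            if (b != '\r' && b != '\n') {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Sample CSV content invented for the illustration.
        byte[] raw = "id,name\r\n1,caf\u00e9\r\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] sent = stripLineBreaks(raw);
        System.out.println("curl -T sends " + raw.length
                + " bytes; curl -d sends " + sent.length);
    }
}
```

If the goal is to send the file exactly as stored, -T or --data-binary @sample.csv are the safer choices.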


