[jira] [Commented] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2012-03-30 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242408#comment-13242408
 ] 

Jukka Zitting commented on TIKA-888:


bq. The question is: The parser is still listed in META-INF, so when a Java 5 
user tries to parse a NetCDF file, he gets a ClassNotFoundException from the 
NetCDF parser. What's the best way to handle that?

By default Tika should just ignore the {{ClassNotFoundException}} in such a 
case, so there should be no harm in having the NetCDF parser included in the 
services file.

If that doesn't work (i.e. you get an error when starting Tika), we should fix 
Tika to catch the problem.
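
A minimal sketch of the "just ignore it" behaviour described above. TolerantLoader and loadAll are hypothetical names used only for illustration, not Tika's actual service-loading code; the sketch assumes each listed class has a no-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not Tika's real service loader. */
final class TolerantLoader {

    /**
     * Instantiates each named class and silently skips any class that
     * cannot be found, linked, or constructed on the running JVM.
     */
    static List<Object> loadAll(List<String> classNames, ClassLoader loader) {
        List<Object> instances = new ArrayList<Object>();
        for (String name : classNames) {
            try {
                Class<?> klass = Class.forName(name, true, loader);
                instances.add(klass.getDeclaredConstructor().newInstance());
            } catch (ClassNotFoundException e) {
                // Listed in the services file but missing at runtime: skip.
            } catch (LinkageError e) {
                // Covers NoClassDefFoundError and UnsupportedClassVersionError,
                // e.g. a Java 6 class file on an older runtime: skip.
            } catch (ReflectiveOperationException e) {
                // No usable no-argument constructor: skip.
            }
        }
        return instances;
    }
}
```

With this approach a parser that cannot be loaded simply does not appear in the resulting list, so the remaining parsers keep working.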

> NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, 
> although TIKA is Java 1.5
> --
>
> Key: TIKA-888
> URL: https://issues.apache.org/jira/browse/TIKA-888
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.0
>Reporter: Uwe Schindler
>Assignee: Chris A. Mattmann
>
> Lucene/Solr developers ran this tool before releasing Lucene/Solr 3.6 (Solr 
> 3.6 is still required to run on Java 1.5, see SOLR-3295): 
> http://code.google.com/p/versioncheck/
> {noformat}
> Major.Minor Version : 50.0 JAVA compatibility : Java 1.6 
> platform: 45.3-50.0
> Number of classes : 60
> Classes are: 
> c:\Work\lucene-solr\.\solr\contrib\extraction\lib\netcdf-4.2-min.jar [:] 
> ucar/unidata/geoloc/Bearing.class
> ...
> {noformat}
> TIKA should use a 1.5 version of this class and especially do some Java 5 
> tests before releasing (as its build dependencies say, its minimum is Java 5). 
> I tried to compile and run TIKA tests with Java 1.5 -> crash (Invalid class 
> file format).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-878) Reuse computed Map inside CompositeParser

2012-03-19 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232975#comment-13232975
 ] 

Jukka Zitting commented on TIKA-878:


Do you have a benchmark that shows this to be a notable performance bottleneck 
(compared to the time it takes to parse a typical document)? If not, I suggest 
we keep the functionality as is to allow dynamic loading and unloading of 
parser plugins.

> Reuse computed Map inside CompositeParser
> 
>
> Key: TIKA-878
> URL: https://issues.apache.org/jira/browse/TIKA-878
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0
>Reporter: Luis Filipe Nassif
>Priority: Minor
>
> CompositeParser.parse() method calls getParser(metadata), that in turn calls 
> getParsers(context) for each file to construct a Map from 
> the parsers List. It would be faster to compute the Map 
> only once and store the result for later use.





[jira] [Commented] (TIKA-866) Invalid configuration file causes OutOfMemoryException

2012-02-17 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210429#comment-13210429
 ] 

Jukka Zitting commented on TIKA-866:


Actually, scrap the above rationale. The DefaultParser is OK for inclusion in a 
configuration file (that's actually what it was designed for, see TIKA-527), 
it's just AutoDetectParser that wouldn't work well with that mechanism. The 
infinite loop triggered by DefaultParser was rather a result of an unnecessary 
getDefaultConfig() call in MediaTypeRegistry.getDefaultRegistry().

I replaced that call and restored the ability to use DefaultParser in 
configuration in revision 1245692. And as discussed above, I also improved the 
config code to use the default parser or detector loading mechanism when no 
explicit parser or detector entries are present in a configuration file. A 
missing mimetypes entry was already being handled by loading the default 
settings, which was the original cause of the OOM as explained above.

> Invalid configuration file causes OutOfMemoryException
> --
>
> Key: TIKA-866
> URL: https://issues.apache.org/jira/browse/TIKA-866
> Project: Tika
>  Issue Type: Bug
>  Components: config
>Affects Versions: 1.0
>Reporter: Stephan Mühlstrasser
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.1
>
> Attachments: ConfigFile.java
>
>
> I tried to override a built-in parser according to the method described in 
> issue TIKA-527. During testing this approach I used an incomplete 
> configuration file (as far as I learned from a discussion on the mailing list 
> also mimetypes and a detector should be specified):
> $ cat tika-config.xml
> 
> 
> 
> 
> 
> Using this configuration file causes an OutOfMemoryException:
> $ java -Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.Arrays.copyOfRange(Arrays.java:3209)
> at java.lang.String.<init>(String.java:216)
> at java.lang.StringBuilder.toString(StringBuilder.java:430)
> at org.apache.tika.mime.MediaType.toString(MediaType.java:237)
> at org.apache.tika.detect.MagicDetector.<init>(MagicDetector.java:142)
> at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:254)
> at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:202)
> at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:186)
> at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:152)
> at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:124)
> at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:107)
> at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:63)
> at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:91)
> at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:147)
> at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:455)
> at org.apache.tika.config.TikaConfig.typesFromDomElement(TikaConfig.java:273)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:161)
> at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
> at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
> at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
> at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at java.lang.Class.newInstance0(Class.java:355)
> at java.lang.Class.newInstance(Class.java:308)
> at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:288)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:162)
> at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
> at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
> at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
> at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>  
> Expected behavior: If the configuration file is not valid, an appropriate 
> exception should be produced.


[jira] [Commented] (TIKA-864) Metadata.formatDate should use ThreadLocal

2012-02-17 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210240#comment-13210240
 ] 

Jukka Zitting commented on TIKA-864:


Like in TIKA-865, is this a real measurable performance bottleneck? If not, I 
suggest we keep the code as is.

> Metadata.formatDate should use ThreadLocal
> --
>
> Key: TIKA-864
> URL: https://issues.apache.org/jira/browse/TIKA-864
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Andrzej Bialecki 
>
> Currently this is a synchronized method that uses a single instance of 
> DateFormat. Instead it could use a pool of ThreadLocal DateFormat instances 
> and avoid the sync blocking.
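
For reference, the ThreadLocal approach proposed in the issue could look roughly like the sketch below. ThreadLocalDateFormatter is a hypothetical class, and the ISO 8601 pattern and UTC time zone are assumptions made for the example, not necessarily what Metadata.formatDate uses.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical illustration of a per-thread DateFormat. */
final class ThreadLocalDateFormatter {

    // SimpleDateFormat is not thread-safe, so each thread gets its own copy
    // instead of all threads sharing one instance behind a lock.
    private static final ThreadLocal<DateFormat> FORMAT = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            // Pattern and time zone are assumptions for this sketch.
            DateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return format;
        }
    };

    /** Formats the date without any synchronization. */
    static String formatDate(Date date) {
        return FORMAT.get().format(date);
    }
}
```

The trade-off is one extra DateFormat instance per thread in exchange for removing the lock, which only pays off if the lock is actually contended.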





[jira] [Commented] (TIKA-865) MimeTypes.forName should avoid method-level synchronization

2012-02-17 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210237#comment-13210237
 ] 

Jukka Zitting commented on TIKA-865:


I'd keep the synchronization on "this", as it also covers the add() method call 
that modifies more than just the types map.

More broadly, is there a real measurable performance bottleneck here? If not, 
I'd rather optimize for code clarity and simplicity.


> MimeTypes.forName should avoid method-level synchronization
> ---
>
> Key: TIKA-865
> URL: https://issues.apache.org/jira/browse/TIKA-865
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.1
>Reporter: Andrzej Bialecki 
>
> MimeTypes.forName can avoid method-level sync, instead it could push down the 
> sync block only if a modification of Map types is 
> required. In most cases this should be enough to free the common case (i.e. 
> when a media type is already registered) from synchronization.
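
A rough sketch of the pushed-down synchronization described in the issue, which still locks on "this" for modifications as suggested in the comment. TypeRegistry, Type and the additions counter are hypothetical stand-ins, not the real MimeTypes code; the concurrent map is an assumption needed to make the lock-free read safe.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for a registry like MimeTypes. */
final class TypeRegistry {

    static final class Type {
        final String name;
        Type(String name) { this.name = name; }
    }

    private final ConcurrentMap<String, Type> types = new ConcurrentHashMap<String, Type>();
    private int additions; // stands in for the other state that add() modifies

    Type forName(String name) {
        Type type = types.get(name);        // common case: no lock taken
        if (type == null) {
            synchronized (this) {
                type = types.get(name);     // re-check under the lock
                if (type == null) {
                    type = new Type(name);
                    add(type);
                }
            }
        }
        return type;
    }

    /** Modifications stay synchronized on "this". */
    synchronized void add(Type type) {
        types.put(type.name, type);
        additions++;
    }

    synchronized int additions() {
        return additions;
    }
}
```

This is more code than a single synchronized method, which is the clarity cost weighed in the comment above.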





[jira] [Commented] (TIKA-860) Make ZIP bomb detection configureable

2012-02-10 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205453#comment-13205453
 ] 

Jukka Zitting commented on TIKA-860:


bq. Couldn't SecureContentHandler simply get the ParseContext?

That would introduce a circular dependency between the o.a.t.parser and 
o.a.t.sax packages (currently the parser -> sax dependency is one-way). I'd 
rather keep the SecureContentHandler class concerned only with IO streams and 
SAX events.

> Make ZIP bomb detection configureable
> -
>
> Key: TIKA-860
> URL: https://issues.apache.org/jira/browse/TIKA-860
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0
>Reporter: Uwe Schindler
>
> The detection of ZIP bombs is nice and the original issue says it's 
> configureable, but I found no solution how to change ParseContext of the 
> AutoDetectParser to e.g. allow deeper nesting levels. The 
> SecureContentHandler instantiation is hardcoded and there is no point of 
> intervention.
> In my case a simple ZIP of an Eclipse project: 
> http://store.pangaea.de/Publications/AltaweelM_2011/Salinization.zip 
> triggered the bomb detection, but it is of course no bomb. It's just because 
> the JAR/WAR files in this project themselves contain other JAR files and class 
> files :-) This overflows the nesting level of 10 - maybe even the TIKA OSGI 
> bundle triggers the bomb detection (not tested).
> In my case I would like to raise the nesting level, but there is no solution. 
> My change was to simply filter away JAR files (as they contain no metadata we 
> are interested in for our own development; we already removed e.g. CLASS file 
> parsers from our TIKA config so we have a very simple parser structure only 
> allowing pdf, office documents, txt files,...) by using a custom 
> DocumentSelector in my ParseContext.





[jira] [Commented] (TIKA-860) Make ZIP bomb detection configureable

2012-02-10 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205440#comment-13205440
 ] 

Jukka Zitting commented on TIKA-860:


The mentioned configurability is currently only at the SecureContentHandler 
level (see the setter methods). The higher level AutoDetectParser uses just the 
default settings for now.

I guess the simplest solution to this is just to increase the default 
maxPackageEntryDepth setting along the lines of TIKA-741. In fact I think we 
should in any case do that since this issue identifies a false positive case 
that should be fixed.

If we additionally want to make this configurable on a case-by-case basis, I 
think the best approach would be to add something like a 
SecureContentHandlerFactory interface or class that can be passed through the 
ParseContext and that the AutoDetectParser would use to create the 
SecureContentHandler instance used for Zip bomb prevention.
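
To make the factory idea concrete, here is a self-contained sketch. Every type in it is a hypothetical stand-in: Context mimics a class-keyed ParseContext, Limits stands for the settings a SecureContentHandler would carry, and SecureHandlerFactory is the proposed (not existing) extension point. The default depth of 10 is taken from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a handler factory passed through a context. */
final class HandlerFactorySketch {

    /** Stand-in for the limits configured on a SecureContentHandler. */
    static final class Limits {
        final int maxPackageEntryDepth;
        Limits(int maxPackageEntryDepth) {
            this.maxPackageEntryDepth = maxPackageEntryDepth;
        }
    }

    /** Proposed extension point; the name is an assumption. */
    interface SecureHandlerFactory {
        Limits create();
    }

    /** Stand-in for a ParseContext: objects keyed by their interface class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<Class<?>, Object>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key));
        }
    }

    static final int DEFAULT_DEPTH = 10;

    /** What a parser could do: use the factory if present, else defaults. */
    static Limits limitsFor(Context context) {
        SecureHandlerFactory factory = context.get(SecureHandlerFactory.class);
        return factory != null ? factory.create() : new Limits(DEFAULT_DEPTH);
    }
}
```

A caller who needs deeper nesting would register a factory in the context before parsing; callers who register nothing keep the default behaviour.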





[jira] [Commented] (TIKA-853) java.io.IOException with TikaGUI and testMP4.m4a

2012-02-02 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198767#comment-13198767
 ] 

Jukka Zitting commented on TIKA-853:


Do you have a virus scanner running? I've seen quite a few cases where a virus 
scanner or some other similar file scanning tool (desktop search, etc.) 
prevents files from being removed. If that's the case here, I guess it's best 
to open a separate issue for the problem.

We might be able to work around that for example by detecting a case where a 
temporary file can't be removed for such a reason, and retrying the delete 
after a small delay or using File.deleteOnExit().
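
A minimal sketch of that workaround, assuming a fixed number of attempts and a fixed delay. TempFileCleaner and deleteWithRetry are hypothetical names, not part of TemporaryResources.

```java
import java.io.File;

/** Hypothetical helper illustrating retry-then-deleteOnExit. */
final class TempFileCleaner {

    /**
     * Tries to delete the file a few times, pausing between attempts.
     * Returns true once the file is gone; otherwise schedules deletion
     * at JVM exit and returns false.
     */
    static boolean deleteWithRetry(File file, int attempts, long delayMillis) {
        for (int i = 0; i < attempts; i++) {
            if (file.delete() || !file.exists()) {
                return true;
            }
            try {
                Thread.sleep(delayMillis); // give a scanner time to release the file
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        file.deleteOnExit(); // last resort: let the JVM clean up on shutdown
        return false;
    }
}
```

Whether a false return should still be reported as an exception is a separate design decision that this sketch leaves open.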

> java.io.IOException with TikaGUI and testMP4.m4a
> 
>
> Key: TIKA-853
> URL: https://issues.apache.org/jira/browse/TIKA-853
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser
>Affects Versions: 1.1
> Environment: Windows 7
>Reporter: John Mastarone
>
> Using the latest build: when attempting to drop the new testMP4.m4a file into 
> the Tika GUI, a TikaException / IOException occurs:  
> org.apache.tika.exception.TikaException: Failed to close temporary resources
>   at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127)
>   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>   at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>   at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>   at javax.swing.TransferHandler.importData(Unknown Source)
>   at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
> 
> Caused by: java.io.IOException: Could not delete temporary file C:\Users\john\AppData\Local\Temp\apache-tika-693752014807275949.tmp
>   at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70)
>   at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121)
>   at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150)
>   ... 40 more
> I know that the parser for this file is new and its external source parser 
> has some potential bugs, but this exception does not occur when using Tika 
> CLI to detect / parse the test file.  





[jira] [Commented] (TIKA-843) Support for Date without a Time Component

2012-01-20 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189897#comment-13189897
 ] 

Jukka Zitting commented on TIKA-843:


FWIW, I've found it most reliable to convert a date with no time component to 
noon UTC of that day when a Java date is needed. Then it'll most likely fall 
within the correct day regardless of the local timezone.
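
As an illustration of the noon-UTC convention, the sketch below builds a java.util.Date from a year, month and day. DateOnly and atNoonUtc are hypothetical names, not part of Tika's metadata API.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper: maps a date without a time to noon UTC. */
final class DateOnly {

    /** The month is 1-based here (8 = August). */
    static Date atNoonUtc(int year, int month, int day) {
        Calendar calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.US);
        calendar.clear(); // reset all fields, including milliseconds
        calendar.set(year, month - 1, day, 12, 0, 0); // Calendar months are 0-based
        return calendar.getTime();
    }
}
```

For 2011:08:31 this gives 12:00 UTC, which is still 31 August in both Tokyo (21:00) and Los Angeles (05:00). Zones more than 12 hours ahead of UTC would still see the following day, hence "most likely".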

> Support for Date without a Time Component
> -
>
> Key: TIKA-843
> URL: https://issues.apache.org/jira/browse/TIKA-843
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 1.0
>Reporter: Ray Gauss II
>Priority: Minor
> Fix For: 1.1
>
> Attachments: date-format-patch.diff
>
>
> Should be able to support parsing of dates without a time component, i.e. 
> 2011:08:31





[jira] [Commented] (TIKA-833) POI Daily beta6 as of 12/27 breaks ExcelParserTest.testExcelParserFormatting()

2011-12-29 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177248#comment-13177248
 ] 

Jukka Zitting commented on TIKA-833:


Thanks, Jeremy!

> POI Daily beta6 as of 12/27 breaks ExcelParserTest.testExcelParserFormatting()
> --
>
> Key: TIKA-833
> URL: https://issues.apache.org/jira/browse/TIKA-833
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1
>Reporter: Jeremy Anderson
>  Labels: excel, poi, test
>
> Attn Nick:
> Changes made to POI(v1221126) for POI-52349 cause the 
> ExcelParserTest.testExcelParserFormatting() junit test case to go out to 
> lunch within POI's DataFormatter.FractionFormat().  Specifically within the 
> nested for loops at line 1000.  These nested loops both have extremely high 
> values that cause the test case to hang for an extremely long duration during 
> the new OfficeParser().parse(input, handler, metadata, context); call at line 
> 73.  This appears to happen due to the Custom Date value in the test file not 
> being identified as a date.
> I'm not sure if the proper resolution lies in modifying TIKA's test case or 
> if a new bug needs to be opened on POI's side, but I wanted to bring this 
> issue to light.
> Regards,
> Jeremy  





[jira] [Commented] (TIKA-833) POI Daily beta6 as of 12/27 breaks ExcelParserTest.testExcelParserFormatting()

2011-12-29 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177183#comment-13177183
 ] 

Jukka Zitting commented on TIKA-833:


It's good that we monitor changes in POI and make sure that Tika works properly 
with the latest versions, but until the changes in POI have been released and 
the relevant dependency in Tika trunk updated, this is not a bug in Tika and 
IMHO shouldn't be tracked as one.

The POI bug #52389 is more appropriate, and rather than resolving this issue as 
fixed for Tika 1.1 I think we should resolve this as Invalid or Duplicate (of 
POI bug #52389) to keep the 1.1 release notes clean.

> POI Daily beta6 as of 12/27 breaks ExcelParserTest.testExcelParserFormatting()
> --
>
> Key: TIKA-833
> URL: https://issues.apache.org/jira/browse/TIKA-833
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1
>Reporter: Jeremy Anderson
>  Labels: excel, poi, test
> Fix For: 1.1
>
>
> Attn Nick:
> Changes made to POI(v1221126) for POI-52349 cause the 
> ExcelParserTest.testExcelParserFormatting() junit test case to go out to 
> lunch within POI's DataFormatter.FractionFormat().  Specifically within the 
> nested for loops at line 1000.  These nested loops both have extremely high 
> values that cause the test case to hang for an extremely long duration during 
> the new OfficeParser().parse(input, handler, metadata, context); call at line 
> 73.  This appears to happen due to the Custom Date value in the test file not 
> being identified as a date.
> I'm not sure if the proper resolution lies in modifying TIKA's test case or 
> if a new bug needs to be opened on POI's side, but I wanted to bring this 
> issue to light.
> Regards,
> Jeremy  





[jira] [Commented] (TIKA-830) Tika.parseToString() causes ForkParser to try to serialize itself

2011-12-29 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177088#comment-13177088
 ] 

Jukka Zitting commented on TIKA-830:


Excellent, thanks Nick!

> Tika.parseToString() causes ForkParser to try to serialize itself
> -
>
> Key: TIKA-830
> URL: https://issues.apache.org/jira/browse/TIKA-830
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: Jerome Lacoste
>Priority: Blocker
> Attachments: 
> 0005-TIKA-830-Tike.parseToString-caused-ForkParser-to-try.patch, 
> 0006-TIKA-830-refactor-tests-for-clarity.patch
>
>






[jira] [Commented] (TIKA-830) Tika.parseToString() causes ForkParser to try to serialize itself

2011-12-28 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176702#comment-13176702
 ] 

Jukka Zitting commented on TIKA-830:


The problem here is the basic assumption that the Tika facade class makes about 
how the configured parser will use the instance passed in the ParseContext.

By default (and before we added the constructor that allows a custom parser to 
be given) the Tika facade will construct and use an AutoDetectParser based on 
all the available and/or configured format-specific parsers. Format-specific 
parsers that support embedded documents expect the ParseContext to contain a 
parser instance that they can delegate parsing tasks to, so to support parsing 
of embedded documents the Tika facade passes the configured parser instance 
through the ParseContext.

The ForkParser on the other hand assumes that anything in the ParseContext is 
serializable so that it can be sent to the forked JVM process for use from 
there. Passing a ForkParser instance to the forked JVM through the 
ParseContext could easily trigger a recursion of new JVM forks being created, 
which is why the ForkParser by design is not serializable.

I agree with Nick that the resulting error message could certainly be better, 
but I don't think it's a good idea to change the basic design of either 
ForkParser or the Tika facade class in this respect.

If we want the Tika facade class to support forked parsing, I think it would be 
better to add a separate flag for that to explicitly make the facade class 
create and use a ForkParser instance based on the configured normal Parser 
instance. However, the ForkParser is a pretty complex tool that practically 
always needs custom configuration (java command, memory limits, class loader, 
etc.), which is why I don't think we should expose it through the Tika facade 
that's mostly designed for simpler use cases.

PS. Instead of the instanceof check we now have in ForkParser (thanks for that, 
BTW!), it might be a better idea to check for errors from trying to serialize 
the ParseContext. That'll capture a much wider range of cases where a 
ForkParser instance or some other non-serializable resource is being passed to 
a forked JVM.
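
A small sketch of such a serialization check, using only java.io. SerializationCheck and findNonSerializable are hypothetical names, and how ForkParser would report the result is left open.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

/** Hypothetical helper: detects non-serializable objects up front. */
final class SerializationCheck {

    /**
     * Returns null if the value serializes cleanly, otherwise the message
     * of the NotSerializableException (normally the offending class name).
     */
    static String findNonSerializable(Object value) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            try {
                out.writeObject(value); // walks the whole object graph
            } finally {
                out.close();
            }
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            // Not expected when writing to an in-memory buffer.
            throw new IllegalStateException(e);
        }
    }
}
```

Because writeObject walks the whole object graph, this also catches a non-serializable object nested inside an otherwise serializable one, which an instanceof check on the top-level object would miss.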





[jira] [Commented] (TIKA-830) Tika.parseToString() causes ForkParser to try to serialize itself

2011-12-23 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175622#comment-13175622
 ] 

Jukka Zitting commented on TIKA-830:


I'm not sure if we should try to support passing a ForkParser as an argument to 
the Tika constructor. The ForkParser expects the "real" parser instance to be 
passed in the ParseContext, which isn't currently supported by the Tika facade.

Instead of adding workarounds like this, I'd rather consider whether we want to 
extend the Tika facade to explicitly support forked parsing as a special mode. 
Personally I'd rather recommend people who need the ForkParser functionality to 
use the class directly instead of through the Tika facade (that's intended to 
be just a simple entry point for basic Tika functionality).





[jira] [Commented] (TIKA-832) ForkParser is unfriendly to code that prints things to its output

2011-12-23 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175591#comment-13175591
 ] 

Jukka Zitting commented on TIKA-832:


bq. I can write it if you want.

That would be great, thanks!

> ForkParser is unfriendly to code that prints things to its output
> -
>
> Key: TIKA-832
> URL: https://issues.apache.org/jira/browse/TIKA-832
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.0
>Reporter: Jerome Lacoste
>Priority: Minor
> Attachments: 
> TIKA-832_ForkClient_wait_a_bit_and_empty_the_initial_buffers.patch, 
> TIKA-832_ForkClient_wait_a_bit_when_asked_to_empty_the_initial_buffers.patch
>
>
> When given a java command that causes java to write something to the output, 
> like a debugging instruction, Tika fails.
> I attach 2 patches that solve the issue in different ways. Both use the same 
> unit test.
> But I don't know if this is worth the complexity. At least to start a 
> discussion.





[jira] [Commented] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

2011-12-19 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172335#comment-13172335
 ] 

Jukka Zitting commented on TIKA-810:


In revision 1220781 I updated the parser code in PDFBox to match the latest 
changes in Tika.

See discussions on dev@tika and dev@pdfbox on how and where to maintain the 
code going forward.

> Upgrade to PDFbox 1.7.0 as available
> 
>
> Key: TIKA-810
> URL: https://issues.apache.org/jira/browse/TIKA-810
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0
>Reporter: Jeremy Anderson
>Priority: Minor
> Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once 
> it's available, and the daily build before then





[jira] [Commented] (TIKA-801) ContentHandlerDecorator outputs invalid element

2011-12-09 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166148#comment-13166148
 ] 

Jukka Zitting commented on TIKA-801:


bq. patch attached

Looks good, +1.

> ContentHandlerDecorator outputs invalid element
> ---
>
> Key: TIKA-801
> URL: https://issues.apache.org/jira/browse/TIKA-801
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.0, 1.1
>Reporter: Andrzej Bialecki 
>Assignee: Michael McCandless
> Attachments: FW  Testing.msg, TIKA-801.patch
>
>
> * Start Tika GUI
> * try opening test-outlook.msg (from tika-parsers test resources)
> * the following exception is thrown:
> {code}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@12e14ebc
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:245)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
>   at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2028)
>   at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2351)
>   at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>   at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>   at javax.swing.AbstractButton.doClick(AbstractButton.java:389)
>   at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
>   at com.apple.laf.AquaMenuItemUI.doClick(AquaMenuItemUI.java:137)
>   at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
>   at java.awt.Component.processMouseEvent(Component.java:6373)
>   at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>   at java.awt.Component.processEvent(Component.java:6138)
>   at java.awt.Container.processEvent(Container.java:2085)
>   at java.awt.Component.dispatchEventImpl(Component.java:4735)
>   at java.awt.Container.dispatchEventImpl(Container.java:2143)
>   at java.awt.Component.dispatchEvent(Component.java:4565)
>   at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4621)
>   at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4282)
>   at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4212)
>   at java.awt.Container.dispatchEventImpl(Container.java:2129)
>   at java.awt.Window.dispatchEventImpl(Window.java:2478)
>   at java.awt.Component.dispatchEvent(Component.java:4565)
>   at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:679)
>   at java.awt.EventQueue.access$000(EventQueue.java:85)
>   at java.awt.EventQueue$1.run(EventQueue.java:638)
>   at java.awt.EventQueue$1.run(EventQueue.java:636)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
>   at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
>   at java.awt.EventQueue$2.run(EventQueue.java:652)
>   at java.awt.EventQueue$2.run(EventQueue.java:650)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
>   at java.awt.EventQueue.dispatchEvent(EventQueue.java:649)
>   at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:296)
>   at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211)
>   at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188)
>   at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
> Caused by: java.lang.NullPointerException
>   at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
>   at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
>   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
>   at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentH

[jira] [Commented] (TIKA-801) ContentHandlerDecorator outputs invalid element

2011-12-08 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165291#comment-13165291
 ] 

Jukka Zitting commented on TIKA-801:


See the org.apache.tika.sax.EmbeddedContentHandler class. It's explicitly 
designed for cases like this.
The ParsingEmbeddedDocumentExtractor class has an example of how to use 
EmbeddedContentHandler.
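
A minimal sketch of the idea, not Tika's actual source. It assumes the relevant behavior of EmbeddedContentHandler is to drop the startDocument and endDocument events of an embedded document while passing everything else through to the outer handler. The class names EmbeddedEventsExample and SuppressDocumentEvents are hypothetical, and only three content events are forwarded to keep the example short.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only; the class names here are hypothetical, not Tika API. */
final class EmbeddedEventsExample {

    /** Forwards content events to a delegate but hides document boundaries. */
    static final class SuppressDocumentEvents extends DefaultHandler {
        private final ContentHandler delegate;

        SuppressDocumentEvents(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startDocument() {
            // Swallowed: the outer document has already been started.
        }

        @Override
        public void endDocument() {
            // Swallowed: the outer parser ends the document exactly once.
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            delegate.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            delegate.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            delegate.characters(ch, start, length);
        }
    }
}
```

With a wrapper like this, an embedded parser can call endDocument as often as it likes without the outer serializer seeing a premature end of document.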

> ContentHandlerDecorator outputs invalid element
> ---
>
> Key: TIKA-801
> URL: https://issues.apache.org/jira/browse/TIKA-801
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.0, 1.1
>Reporter: Andrzej Bialecki 
>Assignee: Michael McCandless
> Attachments: FW  Testing.msg
>
>

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

2011-12-06 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163701#comment-13163701
 ] 

Jukka Zitting commented on TIKA-800:


Note that calling TikaInputStream.get(InputStream) expects that you'll 
explicitly close() the returned stream.
In revision 1211027 I changed the code to use the 
TikaInputStream.get(InputStream, TemporaryResources) method that works better 
in this situation.

> mark/reset not supported from POIFSContainerDetector
> 
>
> Key: TIKA-800
> URL: https://issues.apache.org/jira/browse/TIKA-800
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.0, 1.1
>Reporter: Andrzej Bialecki 
> Fix For: 1.1
>
>
> {code}
> bash-3.2$ touch test.txt
> bash-3.2$ zip test.zip test.txt
>   adding: test.txt (stored 0%)
> bash-3.2$ java -jar tika-app-1.1-SNAPSHOT.jar -z test.zip
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@2d58f9d3
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.io.IOException: mark/reset not supported
>   at java.io.InputStream.reset(InputStream.java:330)
>   at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:116)
>   at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
>   at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:676)
>   at org.apache.tika.parser.pkg.PackageExtractor.unpack(PackageExtractor.java:167)
>   at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:96)
>   at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:64)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>   ... 5 more
> bash-3.2$ 
> {code}





[jira] [Commented] (TIKA-801) ContentHandlerDecorator outputs invalid element

2011-12-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162795#comment-13162795
 ] 

Jukka Zitting commented on TIKA-801:


bq. EndDocumentShieldingContentHandler

IMHO we shouldn't be using the EDSCH mechanism. As noted by Nick in TIKA-646, 
the correct fix for cases like this would be to update the parsers to generate 
the metadata *before* they call endDocument. The EDSCH solution only fixes the 
symptoms but not the root cause of the problem.

> ContentHandlerDecorator outputs invalid element
> ---
>
> Key: TIKA-801
> URL: https://issues.apache.org/jira/browse/TIKA-801
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.0, 1.1
>Reporter: Andrzej Bialecki 
>

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

2011-12-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162734#comment-13162734
 ] 

Jukka Zitting commented on TIKA-800:


bq. If the POIFS detector (now run by default if the parser jar is 
available) can't mark/reset, it should decline to detect

The Detector interface explicitly asks for the given InputStream to support 
mark/reset, so I think it's fine for the detector to throw an IOException like 
it's doing in this case.
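
To illustrate the contract from the caller's side, here is a minimal sketch that uses only java.io. The MarkResetExample class and its two methods are hypothetical helpers, not part of Tika. The sketch assumes a caller that wants to hand an arbitrary stream to code that peeks at the first bytes and then rewinds.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, not Tika API: shows the mark/reset contract. */
final class MarkResetExample {

    /** Returns the stream itself if it supports mark/reset, else a buffered wrapper. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n leading bytes, then rewinds so the caller sees the full stream. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            // Same failure mode as the stack trace in this issue.
            throw new IOException("mark/reset not supported");
        }
        in.mark(n);
        try {
            byte[] buffer = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buffer, total, n - total);
                if (read == -1) {
                    break; // stream is shorter than n bytes
                }
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();
        }
    }
}
```

A caller that cannot guarantee mark/reset support would pass its stream through ensureMarkSupported first and keep using the returned stream afterwards, since the wrapper now owns the buffered bytes.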

> mark/reset not supported from POIFSContainerDetector
> 
>
> Key: TIKA-800
> URL: https://issues.apache.org/jira/browse/TIKA-800
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.0, 1.1
>Reporter: Andrzej Bialecki 
>





[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-12-01 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160919#comment-13160919
 ] 

Jukka Zitting commented on TIKA-623:


bq. Is there some way to proceed here without requiring libpst be mavenized?

Certainly. The only thing we'd need is to have the library available as a 
dependency on the central repository (otherwise we can't push out a Tika 
release with such a dependency). This requires no changes to the upstream 
library, just some extra metadata and appropriate -sources and -javadoc jars to 
accompany the upload. See 
https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository
 for details.

Anyone can volunteer to take care of this. See for example 
https://groups.google.com/d/topic/tagsoup-friends/vIUe_jSR5YQ/discussion for a 
thread where I volunteered and did this for a recent release of the TagSoup 
library.

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

2011-11-21 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154154#comment-13154154
 ] 

Jukka Zitting commented on TIKA-786:


Cool, looks good. I was simultaneously approaching this from a slightly 
different angle (see 
https://github.com/jukka/tika/commit/97a15bdcd79549d3c5147b7b8f9b6f46a9bb8fc5), 
but your changes look nicer (I like the way you can give preference to non-Tika 
detectors) so let's go with that.

> Tika CLI --detect returns incorrect content-type for files with altered 
> extensions
> --
>
> Key: TIKA-786
> URL: https://issues.apache.org/jira/browse/TIKA-786
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.1
> Environment: Windows
>Reporter: John Mastarone
>Priority: Minor
>
> From a discussion on the user mailing list on Nov. 11 2011, where the 
> following was requested as a new bug: Tika CLI will return incorrect content 
> type information when called with --detect for files that have had their 
> extensions modified (and nothing else).  MS Word (.doc) documents that have 
> their extension changed to .xls or .ppt will be incorrectly detected as Excel 
> or PowerPoint documents, whereas the --metadata option will determine the 
> content type correctly (as application/msword), based on the actual contents 
> of these mis-named files.  The same also occurs with other types of MS Office 
> 2003 documents, and could possibly occur with a wide range of document types. 
>  To quote Nick B., from the user mailing list: "If you look at the 
> TestMediaTypes class you'll see what you can get with just the mime magic and 
> filenames, and then there's TestContainerAwareDetector which shows the 
> correct detection happening by using the extra detectors available".   





[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

2011-11-21 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154135#comment-13154135
 ] 

Jukka Zitting commented on TIKA-786:


bq. Do we have any control over the ordering though?

Some. The type database always comes first, which for most use cases should be 
good enough.


bq. One situation where the mimetype detection is better is with truncated 
files.

Right. The good thing about the container detectors is that they only give a 
result (other than application/octet-stream) if they're really sure about the 
detection result. So with the proposed reverse detection order the type 
database would always be consulted last and be able to provide a fallback 
result in case none of the more accurate detectors worked.

> Tika CLI --detect returns incorrect content-type for files with altered 
> extensions
> --
>
> Key: TIKA-786
> URL: https://issues.apache.org/jira/browse/TIKA-786
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.1
> Environment: Windows
>Reporter: John Mastarone
>Priority: Minor
>





[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

2011-11-21 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154127#comment-13154127
 ] 

Jukka Zitting commented on TIKA-786:


Hmm, I didn't think of such a case when doing the DefaultDetector logic. My 
idea was that more accurate container detectors would just refine a more 
generic detection result from the basic detectors that are always run first. In 
this case though the basic detector ends up giving wrong results, which breaks 
my logic.

Since the container detectors give practically always correct results, I guess 
it's fine to always use their results. Or perhaps even better, we could check 
the detectors in reverse order so that the most accurate detection result is 
used as the starting point and less accurate detection based on things like the 
file name could only refine the detection result to a more specific media type.
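
The reverse-order idea can be sketched roughly as follows. This is not the DefaultDetector implementation: DetectionOrderExample, its PARENT table and the combine method are all hypothetical, and media types are modeled as plain strings. The only assumption taken from the comment is that a later, less accurate result may replace the current one only when it is a more specific form of it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of reverse-order detection; not Tika's DefaultDetector. */
final class DetectionOrderExample {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Illustrative supertype table: child type -> parent type. */
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/msword", "application/x-tika-msoffice");
        PARENT.put("application/vnd.ms-excel", "application/x-tika-msoffice");
        PARENT.put("application/x-tika-msoffice", OCTET_STREAM);
    }

    /** True if candidate is a strict descendant of base in the table above. */
    static boolean isSpecializationOf(String candidate, String base) {
        for (String t = PARENT.get(candidate); t != null; t = PARENT.get(t)) {
            if (t.equals(base)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Combines detector results ordered from most to least accurate.
     * A later result is accepted only if it refines the current type.
     */
    static String combine(List<String> resultsMostAccurateFirst) {
        String type = OCTET_STREAM;
        for (String candidate : resultsMostAccurateFirst) {
            if (isSpecializationOf(candidate, type)) {
                type = candidate;
            }
        }
        return type;
    }
}
```

For a Word document renamed to .xls, the container result application/msword comes first and the name-based application/vnd.ms-excel is rejected because it is not a specialization of it. If the container detector only returns application/octet-stream, the name-based result still serves as a fallback.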

> Tika CLI --detect returns incorrect content-type for files with altered 
> extensions
> --
>
> Key: TIKA-786
> URL: https://issues.apache.org/jira/browse/TIKA-786
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.1
> Environment: Windows
>Reporter: John Mastarone
>Priority: Minor
>





[jira] [Commented] (TIKA-784) Mimetype entry for DITA

2011-11-18 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152985#comment-13152985
 ] 

Jukka Zitting commented on TIKA-784:


bq. For now, I've invented mimetypes for the subtypes, but we might decide we 
want to have them all use the official type with parameters instead?

Is there code that needs to distinguish between these subtypes based on only 
the media type? The DITA type registration document seems to imply that all 
applications that process DITA should be able to deal with all subtypes, so 
having separate media types for them all is not needed.

I didn't quite understand how the "type" and "format" parameters of the type 
are supposed to be used. If they map to these separate kinds of DITA files, 
then using those to distinguish between the subtypes sounds like the correct 
approach.

If we do invent new media types, I would recommend that we stick with the 
.../x-tika-... pattern to make it clear that these are not generally used types.

> Mimetype entry for DITA
> ---
>
> Key: TIKA-784
> URL: https://issues.apache.org/jira/browse/TIKA-784
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.0
>Reporter: Nick Burch
>
> Currently, we don't have mimetype entries for DITA.
> According to 
> http://docs.oasis-open.org/dita/v1.2/cs01/spec/non-normative/DITA-mime-type.html
>  there is a provisional mimetype of "application/dita+xml" for DITA files
> There are then three kinds of DITA file, which apparently all use the same 
> mimetype:
>  DITA Topic - .dita
>  DITA Map - .ditamap
>  DITA Conditional Processing Profile - .ditaval
> DITA is XML based, so we should be able to do XML detection in addition to 
> filename matching





[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-11-17 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152405#comment-13152405
 ] 

Jukka Zitting commented on TIKA-734:


Did you see the parse() method [1] that returns a java.io.Reader instead of a 
String? That should achieve the same thing you're doing.

Note however that only some of the parsers in Tika support such streaming. 
Others like the MS Office parser will in any case parse the entire input 
document or at least significant parts of it before starting to output any of 
the extracted content.

[1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html#parse(java.io.File)
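
A rough sketch of consuming extracted text as a stream instead of one large String. The class and method names are made up for illustration, and a StringReader stands in for the Reader that Tika.parse(File) would return, so the example runs without Tika on the classpath:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingPatternScan {
    /**
     * Counts pattern matches line by line, without first collecting
     * the whole text into a single String. Matches that span a line
     * break are not found by this simple approach.
     */
    public static int countMatches(Reader text, Pattern pattern) {
        int count = 0;
        try (BufferedReader lines = new BufferedReader(text)) {
            String line;
            while ((line = lines.readLine()) != null) {
                Matcher m = pattern.matcher(line);
                while (m.find()) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in input. With Tika available, the Reader would instead
        // come from: new Tika().parse(new File("/path/to/file.xlsx"))
        Reader text = new StringReader("order 1001\nnote\norder 1002 and order 1003\n");
        System.out.println(countMatches(text, Pattern.compile("order \\d+")));
    }
}
```

As noted above, this only reduces the memory used on the consuming side; a parser that buffers the whole document internally will still need the heap to do so.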

> Out of memory exception with Xlsx file less than 5 MB
> -
>
> Key: TIKA-734
> URL: https://issues.apache.org/jira/browse/TIKA-734
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
> Environment: Windows Vista , JUnit test cases running in RAD, JVM 
> heap memory - 500MB
>Reporter: Anirban Mitra
> Attachments: Sample BIG Excel 2007 File.xls
>
>
> I am trying to parse and extract a pattern from Xlsx files. I tried using a 5 
> MB file, and when I run my JUnit test cases they fail with an out-of-memory 
> (heap space) exception. Is there a known resolution for this?





[jira] [Commented] (TIKA-778) NullPointerException in tika-app, parsing PDF content

2011-11-15 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150395#comment-13150395
 ] 

Jukka Zitting commented on TIKA-778:


Looks like the problem is coming from the HTML serializer rather than from 
PDFBox.

I can't reproduce this locally. Instead of sharing the test PDF here publicly, 
is it possible for you to send it to me in private? Alternatively, can you 
check if the following command works as expected (should produce valid HTML):

{code}
$ java -jar tika-app-1.0.jar --html /path/to/test.pdf
{code}

> NullPointerException in tika-app, parsing PDF content
> -
>
> Key: TIKA-778
> URL: https://issues.apache.org/jira/browse/TIKA-778
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser
>Affects Versions: 1.0
>Reporter: Bastian Mathes
>
> I try to extract text from some pdf files with the tika app. In version 0.10 
> the error 
> ERROR - Error: Could not parse predefined CMAP file for '--UCS2'
> is printed on the command line, but text extraction works and is correct.
> In version 1.0 I get the same error message on the command line, but also 
> receive an exception and no text is extracted:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.pdf.PDFParser@62bc36ff
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
>   at 
> javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>   at 
> javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>   at 
> javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>   at 
> javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>   at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>   at 
> javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
>   at 
> javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
>   at java.awt.Component.processMouseEvent(Component.java:6288)
>   at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>   at java.awt.Component.processEvent(Component.java:6053)
>   at java.awt.Container.processEvent(Container.java:2041)
>   at java.awt.Component.dispatchEventImpl(Component.java:4651)
>   at java.awt.Container.dispatchEventImpl(Container.java:2099)
>   at java.awt.Component.dispatchEvent(Component.java:4481)
>   at 
> java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4577)
>   at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>   at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>   at java.awt.Container.dispatchEventImpl(Container.java:2085)
>   at java.awt.Window.dispatchEventImpl(Window.java:2478)
>   at java.awt.Component.dispatchEvent(Component.java:4481)
>   at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:643)
>   at java.awt.EventQueue.access$000(EventQueue.java:84)
>   at java.awt.EventQueue$1.run(EventQueue.java:602)
>   at java.awt.EventQueue$1.run(EventQueue.java:600)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
>   at 
> java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
>   at java.awt.EventQueue$2.run(EventQueue.java:616)
>   at java.awt.EventQueue$2.run(EventQueue.java:614)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
>   at java.awt.EventQueue.dispatchEvent(EventQueue.java:613)
>   at 
> java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>   at 
> java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>   at 
> java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
>   at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
> Caused by: java.lang.NullPointerException
>   at 
> com.sun.org.apache.xml.int

[jira] [Commented] (TIKA-773) .NET version of Tika

2011-11-15 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150392#comment-13150392
 ] 

Jukka Zitting commented on TIKA-773:


There's now an ikvm profile in the tika-app POM that uses IKVM to produce a 
tika-app DLL. This DLL can be used in .NET environments to access pretty much 
all of Tika's functionality, but the API (mapped directly from Java) isn't very 
.NET-friendly.

My next goal is to figure out how to add the suggested API sugar to the DLL.

> .NET version of Tika
> 
>
> Key: TIKA-773
> URL: https://issues.apache.org/jira/browse/TIKA-773
> Project: Tika
>  Issue Type: New Feature
>  Components: packaging
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>  Labels: .NET, IKVM, NPanday
>
> As a followup to TIKA-212 and inspired by efforts like [1], I'd like to set 
> up a .NET version of Tika based on IKVM and NPanday. The goal would be to 
> produce a Tika DLL that contains all the parser libraries and can be used 
> natively in any .NET environment with some API sugar on top to make the Tika 
> facade class work more smoothly with .NET (for example, use 
> System.IO.FileInfo instead of java.io.File).
> [1] 
> http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm
>  





[jira] [Commented] (TIKA-774) ExifTool Parser

2011-11-09 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147342#comment-13147342
 ] 

Jukka Zitting commented on TIKA-774:


Some notes:

* We already have existing places for metadata schemas like Dublin Core and 
XMPDM. It would be better if the new metadata properties you're adding were 
located next to the already existing similar properties instead of in the 
separate new ExifTool interface.

* We already have parsers for JPEG, PNG and TIFF. Instead of adding a 
conflicting new parser for the same formats, it would be better if the existing 
parsers could be extended with this new functionality.

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool to be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, newbie, patch,
> Fix For: 1.1
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.





[jira] [Commented] (TIKA-775) Embed Capabilities

2011-11-09 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147339#comment-13147339
 ] 

Jukka Zitting commented on TIKA-775:


I'd like to have a concrete use case for introducing a new concept like this. 
What exact need are you addressing? Also, are there other existing tools that 
could be used instead of coming up with a new API? This seems like a pretty 
significant new feature, so it would be best if we did it right from the 
beginning.

Design-wise it would be better for the embed() method to write its results to 
an OutputStream given as an argument (just like the Parser interface takes a 
ContentHandler argument). Returning an InputStream brings up all sorts of 
issues about timing, error reporting, etc.
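
To make the suggestion concrete, here is a minimal sketch of an embed() method that writes to a caller-supplied OutputStream. The Embedder interface shown here is hypothetical and is not the API from the attached patch; a plain Map stands in for Tika's Metadata class, and the toy implementation just prepends header lines:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class EmbedSketch {
    /** Hypothetical interface: the caller owns both streams. */
    public interface Embedder {
        void embed(Map<String, String> metadata, InputStream original, OutputStream result)
                throws IOException;
    }

    /** Toy embedder: writes "# key: value" header lines, then copies the input. */
    public static class HeaderEmbedder implements Embedder {
        public void embed(Map<String, String> metadata, InputStream original, OutputStream result)
                throws IOException {
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                String header = "# " + e.getKey() + ": " + e.getValue() + "\n";
                result.write(header.getBytes(StandardCharsets.UTF_8));
            }
            byte[] buffer = new byte[4096];
            int n;
            while ((n = original.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
        }
    }

    /** Runs the toy embedder on in-memory streams and returns the result as text. */
    public static String demo() {
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("description", "sample");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new HeaderEmbedder().embed(metadata,
                    new ByteArrayInputStream("body\n".getBytes(StandardCharsets.UTF_8)), out);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

With this shape, any failure surfaces as an exception from embed() itself, and all the work is done by the time the call returns, which avoids the timing and error-reporting questions that come with handing back an InputStream.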

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.1
>
> Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.





[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144862#comment-13144862
 ] 

Jukka Zitting commented on TIKA-772:


The metacharacters you mention do sound suspicious. Here's what the attached 
it.html looks like inside; no weird metacharacters here:

{noformat}
$ od -c it.html | head
000   <   ?   x   m   l   v   e   r   s   i   o   n   =   "   1
020   .   0   "   e   n   c   o   d   i   n   g   =   "   U   T
040   F   -   8   "   ?   >  \n   <   h   t   m   l   >   <   p   >
060   P   a   r   e   r   e   d   e   l   C   o   m   i   t
100   a   t   o   e   c   o   n   o   m   i   c   o   e
120   s   o   c   i   a   l   e   e   u   r   o   p   e   o
140   s   u   l   t   e   m   a   I   l   r   u   o   l
160   o   d   e   l   l   a   s   o   c   i   e   t 303 240
200   c   i   v   i   l   e   n   e   l   l   e   r   e
220   l   a   z   i   o   n   i   U   E   -   S   e   r   b   i
{noformat}

I still get "text/html" when running the test against this file.

> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -
>
> Key: TIKA-772
> URL: https://issues.apache.org/jira/browse/TIKA-772
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 0.10
>Reporter: Joseph Vychtrle
>Assignee: Jukka Zitting
>  Labels: detection, media-type
> Attachments: html.zip, it.html, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types, 
> but when testing HTML documents of circa 5000 words that start with:
> 
> composed of root "html" element and "p" elements only, it always results in 
> text/plain instead of text/html ...
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     // Documents whose detected type differs from the expected one
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type)) {
>             failed.put(doc, type);
>         }
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: "
>                 + failed.get(doc) + ";  path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : "
>             + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?





[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144854#comment-13144854
 ] 

Jukka Zitting commented on TIKA-772:


The test case you added prints out "text/html" for me when run against the 
it.html file included in the zip you attached. Can you attach the exact 
"/tmp/docProv/html/it.html" file that produces the "text/plain" output?

> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -
>
> Key: TIKA-772
> URL: https://issues.apache.org/jira/browse/TIKA-772
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 0.10
>Reporter: Joseph Vychtrle
>Assignee: Jukka Zitting
>  Labels: detection, media-type
> Attachments: html.zip, tika.png
>
>





[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144849#comment-13144849
 ] 

Jukka Zitting commented on TIKA-772:


The latter method also makes the .html suffix available to the detector, which 
helps Tika guess the type of the document. Anyway, Tika should be able to 
detect the correct type with the former version as well.

Can you check what output you get from the following two commands:

{code}
$ java -jar tika-app-0.10.jar --detect < it.html
$ java -jar tika-app-0.10.jar --detect it.html
{code}

These calls are roughly equivalent to the two method calls you mentioned. On my 
computer both return text/html.
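
The difference between the two calls can be illustrated with a deliberately simplified sketch. This is not Tika's detection algorithm; the class, method names, and the "<html" check are assumptions chosen only to show how a file name can act as an extra hint when the content alone is inconclusive:

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class DetectSketch {
    /** Content-only guess: look for an "<html" tag in the leading bytes. */
    public static String detectByContent(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
        return head.contains("<html") ? "text/html" : "text/plain";
    }

    /** Content guess, refined by the file name when the content says text/plain. */
    public static String detectWithName(byte[] prefix, String fileName) {
        String byContent = detectByContent(prefix);
        if ("text/plain".equals(byContent) && fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".html") || lower.endsWith(".htm")) {
                return "text/html";
            }
        }
        return byContent;
    }

    public static void main(String[] args) {
        byte[] fragment = "<p>no root element</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectByContent(fragment));           // stream only
        System.out.println(detectWithName(fragment, "it.html")); // stream plus name
    }
}
```

Piping a file into tika-app corresponds to the first call (no name available), while passing the path corresponds to the second.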

> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -
>
> Key: TIKA-772
> URL: https://issues.apache.org/jira/browse/TIKA-772
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 0.10
>Reporter: Joseph Vychtrle
>Assignee: Jukka Zitting
>  Labels: detection, media-type
> Attachments: html.zip, tika.png
>
>





[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144836#comment-13144836
 ] 

Jukka Zitting commented on TIKA-772:


I piped the files to tika-app to prevent it from seeing the file extension.

I wonder if the problem has something to do with the way you're providing the 
files to Tika. Can you try the following code and paste the output here?

{code}
File file = doc.getFile();
Tika tika = new Tika();
String type = tika.detect(file);
if (!"text/html".equals(type)) {
    System.out.println(file.getName() + ": " + type);
    // Print the first 200 characters to show what the file starts with
    Reader reader = new FileReader(file);
    try {
        char[] c = new char[200];
        int n = reader.read(c);
        if (n > 0) { // read() returns -1 for an empty file
            System.out.println(new String(c, 0, n));
        }
    } finally {
        reader.close();
    }
}
{code}

Alternatively, can you modify your code to something I could run without access 
to the rest of your codebase? Without a test case that I can execute locally 
it's hard to tell where the problem may be.

> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -
>
> Key: TIKA-772
> URL: https://issues.apache.org/jira/browse/TIKA-772
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 0.10
>Reporter: Joseph Vychtrle
>Assignee: Jukka Zitting
>  Labels: detection, media-type
> Attachments: html.zip, tika.png
>
>





[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144763#comment-13144763
 ] 

Jukka Zitting commented on TIKA-772:


Can you attach an example document that illustrates this problem?

PS. You can simplify (and improve) your code by using 
{{tika.detect(doc.getFile())}} instead of 
{{tika.detect(TikaInputStream.get(doc.getFile()))}}.

> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -
>
> Key: TIKA-772
> URL: https://issues.apache.org/jira/browse/TIKA-772
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 0.10
>Reporter: Joseph Vychtrle
>  Labels: detection, media-type
>





[jira] [Commented] (TIKA-763) Update license metadata

2011-10-28 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138545#comment-13138545
 ] 

Jukka Zitting commented on TIKA-763:


I updated the embedded LICENSE files in revisions 1190449 and 1190451. They are 
now up to date with our list of dependencies.

Note that during a review of the new dependencies, I encountered the following 
LGPL files in the NetCDF source tree:

* ucar/nc2/iosp/fysat/Fysatiosp.java
* ucar/nc2/iosp/fysat/FysatHeader.java
* ucar/nc2/dataset/transform/VOceanSG1.java
* ucar/unidata/geoloc/vertical/OceanSG2.java
* ucar/unidata/geoloc/vertical/OceanSG1.java

Ideally we should contact NetCDF and ask them to relicense or replace these 
files, but that's probably not a short term solution. For tika-app and 
tika-bundle we could explicitly exclude these files from being included, but 
I'm not sure what the functional impact of that would be.

The following files are under ALv1.1:

* ucar/unidata/io/bzip2/BZip2Constants.java
* ucar/unidata/io/bzip2/CBZip2InputStream.java
* ucar/unidata/io/bzip2/CRC.java

These can be covered simply by including a copy of the ALv1.1 in the LICENSE 
files.

> Update license metadata
> ---
>
> Key: TIKA-763
> URL: https://issues.apache.org/jira/browse/TIKA-763
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Jukka Zitting
>Priority: Blocker
> Fix For: 1.0
>
>
> I was going through our dependency tree, and realized that we haven't updated 
> the LICENSE files in tika-app and tika-bundle to reflect some of the more 
> recent additions. We need to bring this license metadata up to date before 
> releasing 1.0.





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135257#comment-13135257
 ] 

Jukka Zitting commented on TIKA-761:


See revision 1188803 for a slightly modified version of your latest patch that 
seems to do the trick.

> Provide version number by CLI argument -V
> -
>
> Key: TIKA-761
> URL: https://issues.apache.org/jira/browse/TIKA-761
> Project: Tika
>  Issue Type: New Feature
>  Components: cli, general
>Reporter: Ingo Renner
>Priority: Minor
> Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff, 
> TIKA-761.diff
>
>
> I'd like to get the Apache Tika version number through CLI argument -V or 
> --version. The patch is trivial and basically finished. The only thing 
> missing (because Java is not my native programming language) is the actual 
> version number. Any hints where I can get that from?





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135167#comment-13135167
 ] 

Jukka Zitting commented on TIKA-761:


bq. the dots will still be replaced with slashes

Only if the path is relative. If the path starts with a slash, like in 
{{/META-INF/...}}, no dot replacement will occur.
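
For reference, the rule can be sketched as a small helper that re-implements the name resolution described in the Class.getResourceAsStream documentation. The class ResourceNames is illustrative only and simplified (it ignores array classes); note that only the dots of the package prefix are turned into slashes:

```java
public class ResourceNames {
    /**
     * Returns the absolute resource name that a lookup relative to
     * the given class would use.
     */
    public static String resolve(Class<?> owner, String name) {
        if (name.startsWith("/")) {
            // Absolute name: used as-is, minus the leading slash
            return name.substring(1);
        }
        String className = owner.getName();
        int lastDot = className.lastIndexOf('.');
        if (lastDot < 0) {
            return name; // class in the default package
        }
        // Relative name: prefixed with the package, dots turned into slashes
        return className.substring(0, lastDot).replace('.', '/') + "/" + name;
    }

    public static void main(String[] args) {
        System.out.println(resolve(String.class, "pom.properties"));
        System.out.println(resolve(String.class,
                "/META-INF/maven/org.apache.tika/tika-app/pom.properties"));
    }
}
```

The second call leaves the dots in "org.apache.tika" untouched because the name starts with a slash.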

> Provide version number by CLI argument -V
> -
>
> Key: TIKA-761
> URL: https://issues.apache.org/jira/browse/TIKA-761
> Project: Tika
>  Issue Type: New Feature
>  Components: cli, general
>Reporter: Ingo Renner
>Priority: Minor
> Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff, 
> TIKA-761.diff
>
>





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135129#comment-13135129
 ] 

Jukka Zitting commented on TIKA-761:


I'd simply hardcode the properties file path as 
{{/META-INF/maven/org.apache.tika/tika-app/pom.properties}}. It's not going to 
change any time soon.
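
Something along these lines should be enough; this is a sketch rather than the committed patch, and the class name VersionLookup and the "unknown" fallback are assumptions. It relies on Maven writing a "version" key into pom.properties:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class VersionLookup {
    /** Hardcoded location of the Maven-generated properties file. */
    public static final String POM_PROPERTIES =
            "/META-INF/maven/org.apache.tika/tika-app/pom.properties";

    /** Reads the "version" key from the stream, or returns "unknown". */
    public static String readVersion(InputStream in) {
        if (in == null) {
            return "unknown"; // resource not on the classpath
        }
        try {
            try {
                Properties properties = new Properties();
                properties.load(in);
                return properties.getProperty("version", "unknown");
            } finally {
                in.close();
            }
        } catch (IOException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        InputStream in = VersionLookup.class.getResourceAsStream(POM_PROPERTIES);
        System.out.println("Apache Tika " + readVersion(in));
    }
}
```

When run outside the packaged tika-app jar the resource is missing and the fallback value is printed, so the option still works in a development checkout.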

> Provide version number by CLI argument -V
> -
>
> Key: TIKA-761
> URL: https://issues.apache.org/jira/browse/TIKA-761
> Project: Tika
>  Issue Type: New Feature
>  Components: cli, general
>Reporter: Ingo Renner
>Priority: Minor
> Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff, 
> TIKA-761.diff
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134151#comment-13134151
 ] 

Jukka Zitting commented on TIKA-761:


bq. It seems - from what I read - these are only available after building the 
app, which would be too late( ? )

It's not too late. After all, you can only run the app after it has been built. 
:-) We need the version information only at runtime, so it's not a problem if 
it's not available at compile-time.

> Provide version number by CLI argument -V
> -
>
> Key: TIKA-761
> URL: https://issues.apache.org/jira/browse/TIKA-761
> Project: Tika
>  Issue Type: New Feature
>  Components: cli, general
>Reporter: Ingo Renner
>Priority: Minor
> Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff
>
>
> I'd like to get the Apache Tika version number through CLI argument -V or 
> --version. The patch is trivial and basically finished. The only thing 
> missing (because Java is not my native programming language) is the actual 
> version number. Any hints where I can get that from?





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134113#comment-13134113
 ] 

Jukka Zitting commented on TIKA-761:


+1 Looks good.

As a possible improvement, as Nick noted, you could read the version number 
from {{/META-INF/MANIFEST.MF}} or from 
{{/META-INF/maven/org.apache.tika/tika-app/pom.properties}} so we don't need to 
set up a custom properties file for this.

Also, it would be cool if that information could also be made available through 
the Java API in tika-core. For example the toString() method of the Tika facade 
class could return something like "Apache Tika x.y", and {{tika-app --version}} 
would simply output that string.
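
A rough sketch of the toString() idea, assuming the jar manifest carries an 
Implementation-Version entry. VersionedFacade is a hypothetical stand-in for 
the Tika facade, not the real class:

```java
// Hypothetical stand-in for the Tika facade; only the version string logic is shown.
final class VersionedFacade {
    // Builds the display string; a missing version yields just the product name.
    static String describe(String version) {
        if (version == null || version.isEmpty()) {
            return "Apache Tika";
        }
        return "Apache Tika " + version;
    }

    @Override
    public String toString() {
        // getImplementationVersion() reads Implementation-Version from MANIFEST.MF
        // and returns null when the class was not loaded from a jar that sets it.
        Package pkg = VersionedFacade.class.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return describe(version);
    }

    public static void main(String[] args) {
        // A --version option in the CLI could simply print the facade.
        System.out.println(new VersionedFacade());
    }
}
```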

bq. using -v (lowercase)

-1 for backwards compatibility reasons.

> Provide version number by CLI argument -V
> -
>
> Key: TIKA-761
> URL: https://issues.apache.org/jira/browse/TIKA-761
> Project: Tika
>  Issue Type: New Feature
>  Components: cli, general
>Reporter: Ingo Renner
>Priority: Minor
> Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff
>
>
> I'd like to get the Apache Tika version number through CLI argument -V or 
> --version. The patch is trivial and basically finished. The only thing 
> missing (because Java is not my native programming language) is the actual 
> version number. Any hints where I can get that from?





[jira] [Commented] (TIKA-755) Add getDetector() method to TikaConfig

2011-10-18 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129971#comment-13129971
 ] 

Jukka Zitting commented on TIKA-755:


Hmm, I looked at the interaction between Tika and TikaConfig, and actually I 
now think it's better if we leave the AutoDetectParser instantiation there.

This way the TikaConfig class is responsible for producing composite parsers 
and detectors based on explicit configuration or the default classloading 
mechanism, and the Tika facade (or the AutoDetectParser class directly) remains 
responsible for adding autodetection and other extra features on top of the 
basic configuration.

So I guess we can resolve this as fixed.

> Add getDetector() method to TikaConfig
> --
>
> Key: TIKA-755
> URL: https://issues.apache.org/jira/browse/TIKA-755
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 0.10
>Reporter: Nick Burch
>Assignee: Nick Burch
> Fix For: 1.0
>
>
> As discussed on the mailing list, we should add a getDetector() method to 
> TikaConfig. This would return a DefaultDetector that was created with the 
> same classloader as the DefaultParser was
> As part of this, we should update the Tika class to get the DefaultDetector 
> from the TikaConfig, rather than creating internally. We should also switch 
> the Tika class to not create its own AutoDetectParser, but instead use the 
> DefaultParser from TikaConfig
> Discussion is:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201110.mbox/%3calpine.deb.2.00.1110171330160.7...@urchin.earth.li%3E





[jira] [Commented] (TIKA-756) XMP output from Tika CLI

2011-10-18 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129957#comment-13129957
 ] 

Jukka Zitting commented on TIKA-756:


Rough first version committed in revision 1185805.

> XMP output from Tika CLI
> 
>
> Key: TIKA-756
> URL: https://issues.apache.org/jira/browse/TIKA-756
> Project: Tika
>  Issue Type: New Feature
>  Components: cli, metadata
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>  Labels: metadata, xmp
>
> It would be great if the Tika CLI could output metadata also in the XMP 
> format.





[jira] [Commented] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2011-10-18 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129637#comment-13129637
 ] 

Jukka Zitting commented on TIKA-754:


I don't think it's necessarily a good idea to make an assumption like this in 
XHTMLContentHandler.

It's better to explicitly emit a {{<br/>}} element in a specific Parser 
implementation when it's clear that the underlying document format really does 
indicate a hard line break instead of just a soft one that could/should be 
positioned differently if the displayed line length changes.
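
As an illustration only, here is how a parser that knows its line breaks are 
hard could emit them through plain SAX calls. HardBreakEmitter and its render() 
helper are hypothetical and use only the JDK's org.xml.sax classes, not Tika's 
XHTMLContentHandler:

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: emits an explicit br element between lines that the
// source format marks as hard breaks, instead of passing '\n' through as text.
final class HardBreakEmitter {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Writes each line as character data with an empty br element between lines.
    static void emitLines(ContentHandler handler, String[] lines) throws SAXException {
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                handler.startElement(XHTML, "br", "br", new AttributesImpl());
                handler.endElement(XHTML, "br", "br");
            }
            char[] chars = lines[i].toCharArray();
            handler.characters(chars, 0, chars.length);
        }
    }

    // Demo-only rendering: records the events as text so the result is easy to inspect.
    // It prints every element as self-closing, which is enough for br.
    static String render(String[] lines) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append("/>");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            emitLines(recorder, lines);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(new String[] {"first line", "second line"}));
    }
}
```

The decision to call emitLines() stays with the individual parser, which is the 
point: the shared handler never has to guess whether a '\n' was a hard break.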

> Automatic line break insertion (BR element) instead of '\n' in 
> XHTMLContentHandler
> --
>
> Key: TIKA-754
> URL: https://issues.apache.org/jira/browse/TIKA-754
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 0.10, 1.0
>Reporter: Pablo Queixalos
>Priority: Minor
> Fix For: 1.0
>
> Attachments: TIKA-754.poc.patch
>
>
> As seen with some parsers (PDF, PPT), some text blocks still contain raw 
> line feeds ('\n') in the output XHTML. 
> A global fix for this could be located in XHTMLContentHandler.characters(...).
> By analyzing the given char array, when a '\n' char is encountered insert a 
> BR element instead.





[jira] [Commented] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-10-13 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126938#comment-13126938
 ] 

Jukka Zitting commented on TIKA-657:


In revision 1183109 I increased the default line and header length limits to 
cope with valid messages in the Enron corpus. With that change I saw no more 
exceptions at least in the first 30k messages.

I'll run a full test over the entire corpus to see if there are any other 
problems left before we can resolve this issue.

> Email parser gets into trouble on malformed html in enron corpus
> 
>
> Key: TIKA-657
> URL: https://issues.apache.org/jira/browse/TIKA-657
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Benson Margulies
>Assignee: Julien Nioche
> Attachments: tika0.9-enron-errors-summary.txt, 
> tika1.0-snapshot-enron-errors.summary.txt
>
>
> There is a very large corpus of email addresses available: 
> http://www.cs.cmu.edu/~enron/.
> In processing even a subset of this corpus, I see numerous 'unexpected 
> RuntimeException' errors resulting from tagsoup throwing on truly awful html. 
> It seems to me that being able to do something with this entire stack would 
> make a good '1.0' criteria for tika's email parser.





[jira] [Commented] (TIKA-513) Support of Deja Vu (DjVu) format

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122639#comment-13122639
 ] 

Jukka Zitting commented on TIKA-513:


Is there a DjVu parser we could use?

> Support of Deja Vu (DjVu) format
> 
>
> Key: TIKA-513
> URL: https://issues.apache.org/jira/browse/TIKA-513
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Oleg Tikhonov
>
> It might be great if Tika could provide such a parser. Any 
> suggestions/thoughts? 





[jira] [Commented] (TIKA-272) Expose characters offsets information while parsing text-based inputs.

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122625#comment-13122625
 ] 

Jukka Zitting commented on TIKA-272:


See PDFBOX-577 for some related work in PDFBox.

> Expose characters offsets information while parsing text-based inputs.
> --
>
> Key: TIKA-272
> URL: https://issues.apache.org/jira/browse/TIKA-272
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.4
>Reporter: David Causse
>Priority: Minor
>
> It would be useful to have access to actual character offset information when 
> parsing text-based files (I don't know if it's interesting/usable/doable for 
> binary formats...).
> If I use Tika to parse HTML and inject the parsed strings into Lucene, I'm not 
> able to tell the Lucene analyzer where the actual character is in the 
> original input.
> If Tika exposed this information, it would be possible to use unmodified Lucene 
> analyzers behind Tika and implement, for example, pretty highlighting in search 
> results (see the Google cache view).
> With the new Lucene Attribute API it could be fairly easy to provide a sort of 
> TikaOffsetRectifierTokenFilter in Lucene contrib and use a stack like Tika -> 
> unmodified Lucene analyzer -> Tika offset correction.





[jira] [Commented] (TIKA-381) HtmlParser should strip linefeeds out of links

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122617#comment-13122617
 ] 

Jukka Zitting commented on TIKA-381:


The relevant TagSoup scanner state transitions are shown below:

{code}
S_QUOT, ' ', A_SP, S_QUOT,
S_QUOT, '\n', A_SP, S_QUOT,
S_QUOT, '\t', A_SP, S_QUOT,
{code}

Which, as you mention, transforms all linefeeds (and other whitespace like 
tabs) to spaces. Changing the A_SP action to A_SAVE should fix that, but I 
didn't yet look into how easy it would be to customize TagSoup like that.
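
For illustration, a sketch of the clean-up step that becomes possible once the 
raw attribute value reaches Tika with its linefeeds intact (that is, after an 
A_SP to A_SAVE style change). LinkCleaner is a hypothetical helper, and 
stripping tabs as well as CR/LF is an assumption about the desired behaviour. 
Once the scanner has already turned the linefeed into a space, this can no 
longer tell it apart from a genuine space:

```java
// Hypothetical helper: removes line breaks and tabs from a link target,
// leaving ordinary spaces alone because they may be a legitimate part of the value.
final class LinkCleaner {
    static String stripLinefeeds(String url) {
        StringBuilder cleaned = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c != '\n' && c != '\r' && c != '\t') {
                cleaned.append(c);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripLinefeeds("http://example.com/some/\nlong/path"));
    }
}
```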

> HtmlParser should strip linefeeds out of links
> --
>
> Key: TIKA-381
> URL: https://issues.apache.org/jira/browse/TIKA-381
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.6
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>
> A number of HTML pages contain links where the URL has a linefeed in the 
> middle of it.
> Browsers such as Firefox will automatically remove the character but Tika 
> passes it back, which results in a broken URL.





[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app

2011-10-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121157#comment-13121157
 ] 

Jukka Zitting commented on TIKA-636:


Do you still see this problem with Tika 0.10? If yes, please attach an example 
file that can be used to reproduce the issue.

> Taking very high heap space while parsing docx - Resulting in OOM in tha app
> 
>
> Key: TIKA-636
> URL: https://issues.apache.org/jira/browse/TIKA-636
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
> Environment: Linux box
> JDK 1.6
>Reporter: Jayesh K Rajpurohit
>
> I am using the Tika-core-0.9 jar together with the poi 3.2-Final and poi-3.7 
> jars for parsing documents. While parsing a 3 MB docx file it uses 500 MB of 
> RAM, which is too high and results in an OOM in the application.
> Is there a setting I can tweak to reduce the memory consumption?





[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-05 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121032#comment-13121032
 ] 

Jukka Zitting commented on TIKA-734:


Tika 0.10 is now available. If the problem still occurs, please attach an 
example file that can be used to reproduce the issue.

> Out of memory exception with Xlsx file less than 5 MB
> -
>
> Key: TIKA-734
> URL: https://issues.apache.org/jira/browse/TIKA-734
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
> Environment: Windows Vista , JUnit test cases running in RAD, JVM 
> heap memory - 500MB
>Reporter: Anirban Mitra
>
> I am trying to parse and extract a pattern from Xlsx files. I tried using a 5 
> MB file, and when I run my
> JUnit test cases, they fail with a heap out-of-memory exception. Is there 
> any resolution for this?





[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

2011-10-01 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118861#comment-13118861
 ] 

Jukka Zitting commented on TIKA-735:


A parser should always produce valid XHTML output. If there's an embedded 
document that's fed into a recursive parse() call, the EmbeddedContentHandler 
and BodyContentHandler class can (and should) be used to include only the 
extracted body content of the embedded document. See the 
ParsingEmbeddedDocumentExtractor class for how this is done. In fact I'd 
recommend simply using the ParsingEmbeddedDocumentExtractor class directly, 
just like the package, POIFS, and OOXML parsers already do.
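
To make the "body content only" idea concrete, here is a simplified, JDK-only 
stand-in. BodyOnlyHandler is hypothetical and is NOT Tika's 
EmbeddedContentHandler or BodyContentHandler. It just shows the kind of 
filtering those classes are described as doing above: document-level events 
and everything outside the body element are dropped before reaching the outer 
handler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical filter: forwards only what sits inside <body> to the outer handler.
// startDocument/endDocument are not forwarded (DefaultHandler's no-ops apply).
final class BodyOnlyHandler extends DefaultHandler {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final ContentHandler delegate;
    private int depthInsideBody = 0; // 0 = outside body, 1 = directly inside body

    BodyOnlyHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depthInsideBody > 0) {
            depthInsideBody++;
            delegate.startElement(uri, local, qName, atts);
        } else if ("body".equals(local)) {
            depthInsideBody = 1; // swallow the body element itself
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depthInsideBody > 1) {
            depthInsideBody--;
            delegate.endElement(uri, local, qName);
        } else if (depthInsideBody == 1) {
            depthInsideBody = 0; // closing body tag, also swallowed
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depthInsideBody > 0) {
            delegate.characters(ch, start, length);
        }
    }

    private static void text(ContentHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    // Simulates the SAX events of one embedded document and records what gets through.
    static String demo() {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        ContentHandler embedded = new BodyOnlyHandler(recorder);
        Attributes none = new AttributesImpl();
        try {
            embedded.startDocument();
            embedded.startElement(XHTML, "html", "html", none);
            embedded.startElement(XHTML, "head", "head", none);
            embedded.startElement(XHTML, "title", "title", none);
            text(embedded, "dropped title text");
            embedded.endElement(XHTML, "title", "title");
            embedded.endElement(XHTML, "head", "head");
            embedded.startElement(XHTML, "body", "body", none);
            embedded.startElement(XHTML, "p", "p", none);
            text(embedded, "Here is some embedded text on page 1");
            embedded.endElement(XHTML, "p", "p");
            embedded.endElement(XHTML, "body", "body");
            embedded.endElement(XHTML, "html", "html");
            embedded.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In real parser code the existing Tika classes named above should be preferred 
over a hand-rolled filter like this one.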

Anyway, as mentioned by Nick elsewhere, it's probably not worth fixing the 
current code, since it's likely to be rewritten to use the ODF toolkit 
anyway.

> OpenOffice parser: embedded OLE docs are extracted at the end, as extra 
> ...
> 
>
> Key: TIKA-735
> URL: https://issues.apache.org/jira/browse/TIKA-735
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
>Priority: Minor
> Attachments: embeddedText.odp
>
>
> When I have an OpenOffice presentation (ODP) that embeds (OLE)
> objects, in this case OpenOffice text, text from the embedded objects
> is at the end of the presentation.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides.  Each slide has its own
> text, and then embeds a text OLE object with text as well, and this is
> the output:
> {noformat}
>  xmlns="http://www.w3.org/1999/xhtml";>
> 
> 
>  content="application/vnd.oasis.opendocument.presentation"/>
> 
> 
> 
> 
> Main text on page 1
> 
> 
> 
> 
>  Main text on page 2
> 
> 
> 
> 
> 
> http://www.w3.org/1999/xhtml";>
> 
> 
>  content="application/vnd.oasis.opendocument.presentation"/>
> 
> 
> 
> Here is some embedded text on page 1
> http://www.w3.org/1999/xhtml";>
> 
> 
>  content="application/vnd.oasis.opendocument.presentation"/>
> 
> 
> 
> Here is some embedded text on page 2
> 
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).
