[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-19 Thread Jukka Zitting (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822055#comment-16822055
 ] 

Jukka Zitting commented on TIKA-2849:
-

SGTM! Alternatively you could overload the getFile() method with a version that 
takes a maxBytes argument.
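The capped spooling idea can be sketched in isolation. This is a standalone sketch using only the JDK, not the actual TikaInputStream code; the helper name spoolToTempFile and its maxBytes parameter are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class SpoolDemo {
    /**
     * Copies at most maxBytes from the stream into a temporary file.
     * Returns null instead of a file when the stream is larger, so the
     * caller can fall back to streaming detection instead of spooling
     * an unbounded amount of data to disk.
     */
    static Path spoolToTempFile(InputStream in, long maxBytes) throws IOException {
        Path tmp = Files.createTempFile("spool", ".tmp");
        long copied = 0;
        boolean overflow = false;
        byte[] buffer = new byte[8192];
        try (OutputStream out = Files.newOutputStream(tmp)) {
            int n;
            while (!overflow && (n = in.read(buffer)) != -1) {
                copied += n;
                if (copied > maxBytes) {
                    overflow = true; // over the cap: abandon spooling
                } else {
                    out.write(buffer, 0, n);
                }
            }
        }
        if (overflow) {
            Files.delete(tmp);
            return null;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // A stream under the cap gets spooled in full.
        Path p = spoolToTempFile(new ByteArrayInputStream(new byte[1024]), 1 << 20);
        if (p == null || Files.size(p) != 1024) {
            throw new AssertionError("small stream should spool");
        }
        Files.delete(p);
        // A stream over the cap is rejected instead of being spooled.
        if (spoolToTempFile(new ByteArrayInputStream(new byte[4096]), 1024) != null) {
            throw new AssertionError("oversized stream should not spool");
        }
    }
}
```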

> TikaInputStream copies the input stream locally
> ---
>
> Key: TIKA-2849
> URL: https://issues.apache.org/jira/browse/TIKA-2849
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Boris Petrov
>Assignee: Tim Allison
>Priority: Major
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-19 Thread Jukka Zitting (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822038#comment-16822038
 ] 

Jukka Zitting commented on TIKA-2849:
-

Exactly (sorry, s/n/maxBytesToSpool/).

That should work as before on most files, but would avoid the problematic case 
of potentially spooling unlimited amounts of data to local disk. We can make 
the max spooling limit configurable, or instruct users to call 
TikaInputStream.getFile() to force full spooling before calling detect() if 
they prefer the current logic of always looking at the full file.



[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-19 Thread Jukka Zitting (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821995#comment-16821995
 ] 

Jukka Zitting commented on TIKA-2849:
-

How about something like this:

{code:java}
    TikaInputStream tis = TikaInputStream.cast(input);
    if (tis != null && tis.hasFile()) {
        return detect(tis.getFile());
    }

    if (input == null || !hasOleHeader(input)) {
        return MediaType.OCTET_STREAM;
    }

    int maxBytesToSpool = 16 * 1024 * 1024; // Or some configurable limit.
    InputStream lookahead = new LookaheadInputStream(input, maxBytesToSpool);
    try {
        tis = TikaInputStream.get(lookahead);
        try {
            if (tis.getLength() < maxBytesToSpool) {
                return detect(tis.getFile());
            }
        } finally {
            tis.close();
        }
    } finally {
        lookahead.close();
    }

    return MediaType.OCTET_STREAM;
{code}



[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-09 Thread Jukka Zitting (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813593#comment-16813593
 ] 

Jukka Zitting commented on TIKA-2849:
-

There's a related TODO in {{detectZipFormat()}} to check for {{hasFile()}} 
before using the potentially expensive {{getFile()}} method. We could make the 
detector fall back to {{tryStreamingDetection()}} when the file is not 
available (or the stream is too large to be spooled into a temporary file).



[jira] [Commented] (TIKA-2001) Parsing XML outputs empty string

2016-06-09 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322710#comment-15322710
 ] 

Jukka Zitting commented on TIKA-2001:
-

By default Tika only extracts the text between XML tags, not things like 
attribute values. Since all the content in this XML file is in the attributes, 
nothing gets extracted.

What kind of output would make sense in this case?

Perhaps something like this:

{noformat}
0 0 2016-06-03 06:21:34 2016-06-03 06:21:37 0.002
  0 0 0 0 0 0 0 0 2016-06-03 06:21:37 no
{noformat}

or like this:

{noformat}
spocosy
  subscription-update subscriptionid 0 requestid 0 last_push 2016-06-03 
06:21:34 current_push 2016-06-03 06:21:37 exec 0.002
lineup id 0 event_participantsFK 0 participantFK 0 lineup_typeFK 0 
shirt_number 0 pos 0 enet_pos 0 n 0 ut 2016-06-03 06:21:37 del no
{noformat}
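The first style amounts to also emitting attribute values, which the default text extraction skips since it only reports character data between tags. A minimal standalone sketch with the JDK's SAX parser; the helper name attributeText is illustrative, not a Tika API:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeTextDemo {
    /** Collects attribute values, which the default text extraction skips. */
    static String attributeText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(atts.getValue(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String out = attributeText("<lineup id=\"0\" del=\"no\"/>");
        if (!out.contains("0") || !out.contains("no")) {
            throw new AssertionError(out);
        }
    }
}
```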


> Parsing XML outputs empty string
> 
>
> Key: TIKA-2001
> URL: https://issues.apache.org/jira/browse/TIKA-2001
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11, 1.12, 1.13
>Reporter: George L. Yermulnik
>Priority: Minor
>
> Can't get Tika to parse my xml files:
> {code}
> root@spring:/tmp# java -version
> java version "1.8.0_91"
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
> root@spring:/tmp# cat /tmp/xml/5751061032fbd-7148.xml
> <spocosy><subscription-update subscriptionid="0" requestid="0" 
> last_push="2016-06-03 06:21:34" current_push="2016-06-03 06:21:37" 
> exec="0.002"><lineup id="0" event_participantsFK="0" participantFK="0" 
> lineup_typeFK="0" shirt_number="0" pos="0" enet_pos="0" n="0" ut="2016-06-03 
> 06:21:37" del="no"/></subscription-update></spocosy>
> root@spring:/tmp# for i in 3 2 1; do
> echo -n "tika-app-1.1${i}.jar: "
> java -jar tika-app-1.1${i}.jar --text /tmp/xml/5751061032fbd-7148.xml
> done
> tika-app-1.13.jar:
> tika-app-1.12.jar:
> tika-app-1.11.jar:
> root@spring:/tmp#
> {code}
> Appreciate any help. Thanx.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725703#comment-14725703
 ] 

Jukka Zitting commented on TIKA-1726:
-

Could createTemporaryFile() be accompanied with createTemporaryPath()?

I prefer getPath() as the complement for getFile().

As for deprecation, I'm not too convinced that we should do that. There is a 
lot of code out there (not just Tika clients but other libraries and frameworks 
that our clients interact with) that works with Files, so forcing clients to 
upgrade to Paths seems unnecessary, at least as long as java.io.File itself 
remains undeprecated.
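The coexistence argument is easy to see from the JDK itself: File and Path convert freely in both directions, so a getPath() added beside getFile() costs callers nothing. A minimal standalone sketch:

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

class PathDemo {
    public static void main(String[] args) throws Exception {
        // java.io.File and java.nio.file.Path convert freely in both
        // directions, so a getPath() added beside getFile() doesn't force
        // existing File-based callers to change anything.
        Path path = Files.createTempFile("tika", ".tmp");
        File file = path.toFile();       // Path -> File for legacy callers
        Path roundTrip = file.toPath();  // File -> Path for new callers
        if (!roundTrip.equals(path)) {
            throw new AssertionError("conversion changed the path");
        }
        Files.delete(path);
    }
}
```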

> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
>
> In light of Java 7 already being EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provide a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, deprecating the old method until an unknown 
> tika major release.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> _tika-parsers:_
> - {{org.apache.tika.parser.ParsingReader}} constructor
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method which isn't accessed publicly and can be 
> defined as package-private, or for another reason, please comment.





[jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-08-30 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14722701#comment-14722701
 ] 

Jukka Zitting commented on TIKA-1672:
-

I'm actually not sure if we should do this. The benefit of keeping tika-java7 
as a separate component is that one needs to explicitly add the dependency to 
alter the behavior of the Files.probeContentType() method. Merging the 
component into tika-core might cause unexpected probeContentType() changes in 
applications that already depend on tika-core.
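For context, Files.probeContentType() consults java.nio.file.spi.FileTypeDetector implementations registered through the service-loader mechanism, which is how tika-java7 hooks in. A standalone sketch; the probed value depends on which detectors are installed, so nothing beyond completion of the call is assumed:

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ProbeDemo {
    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("sample", ".txt");
        // Files.probeContentType() consults FileTypeDetector implementations
        // registered through the service-loader mechanism. With no extra
        // detectors on the classpath the platform default answers; adding a
        // detector jar (as tika-java7 does) changes the result with no code
        // changes here. The value is platform-dependent and may be null.
        String type = Files.probeContentType(p);
        System.out.println("probed type: " + type);
        Files.delete(p);
    }
}
```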

> Integrate tika-java7 component
> --
>
> Key: TIKA-1672
> URL: https://issues.apache.org/jira/browse/TIKA-1672
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tyler Palsulich
> Fix For: 1.11
>
>
> Code requiring Java 7 doesn't need to be in a separate module now that 
> TIKA-1536 (upgrade to Java 7) is done.





[jira] [Resolved] (TIKA-1719) Utilize try-with-resources where it is trivial

2015-08-30 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1719.
-
Resolution: Fixed
  Assignee: Jukka Zitting

Committed in revision 1700195. Thanks a lot for this patch! I remember writing 
many of these blocks earlier and wishing that try-with-resources was already 
available for us...

Note that I broke some of the lines to fit within 80 columns. We aren't too 
consistent in applying that convention, but it's a good thing to strive for. 
Also, the review was complicated a bit by the extra whitespace changes included 
in the patch. All of those were changes in the right direction, so thanks!

> Utilize try-with-resources where it is trivial
> --
>
> Key: TIKA-1719
> URL: https://issues.apache.org/jira/browse/TIKA-1719
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, core, example, gui, packaging, parser, server
>Reporter: Yaniv Kunda
>Assignee: Jukka Zitting
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.11
>
> Attachments: TIKA-1719.patch
>
>
> The following type of resource usages:
> {code}
> AutoCloseable resource = ...;
> try {
> // do something with resource
> } finally {
> resource.close();
> }
> {code}
> {code}
> AutoCloseable resource = null;
> try {
> resource = ...;
> // do something with resource
> } finally {
> if (resource != null) {
> resource.close();
> }
> }
> {code}
> and similar constructs can be trivially replaced with Java 7's 
> try-with-resource statement:
> {code}
> try (AutoCloseable resource = ...) {
> // do something with resource
> }
> {code}
> This brings more concise code with less chance of causing resource leaks.





[jira] [Resolved] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()

2015-08-27 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1720.
-
Resolution: Fixed
  Assignee: Jukka Zitting

Thanks! Committed in revision 1698150.
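The improvement can be sketched in isolation; closeAll below is an illustrative stand-in for TemporaryResources.close(), not the actual implementation:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class CloseAllDemo {
    /**
     * Closes every resource; later failures are attached to the first one
     * via Throwable.addSuppressed(), so no stack trace is lost.
     */
    static void closeAll(List<Closeable> resources) throws IOException {
        IOException failure = null;
        for (Closeable c : resources) {
            try {
                c.close();
            } catch (IOException e) {
                if (failure == null) {
                    failure = e;
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }

    public static void main(String[] args) {
        Closeable ok = () -> { };
        Closeable bad1 = () -> { throw new IOException("first"); };
        Closeable bad2 = () -> { throw new IOException("second"); };
        try {
            closeAll(Arrays.asList(ok, bad1, bad2));
            throw new AssertionError("expected IOException");
        } catch (IOException e) {
            // The first failure propagates; the second rides along suppressed.
            if (!"first".equals(e.getMessage())) throw new AssertionError();
            if (e.getSuppressed().length != 1) throw new AssertionError();
        }
    }
}
```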

> Collect multiple exceptions in TemporaryResources.close() using 
> Throwable.addSuppressed()
> -
>
> Key: TIKA-1720
> URL: https://issues.apache.org/jira/browse/TIKA-1720
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1720.patch
>
>
> TemporaryResources.close() currently collects, in a list, the exceptions 
> thrown while closing its resources.
> When the time to propagate an exception comes, information is lost - the 
> thrown exception contains a message with the string descriptions of all 
> exceptions, and the first exception as the cause - there is no stack trace 
> describing what went wrong closing a resource.
> In addition, the thrown exception is IOExceptionWithCause, copied from 
> commons-io, which is redundant since Java 6.





[jira] [Resolved] (TIKA-1721) Replace IOExceptionWithCause in ForkClient

2015-08-27 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1721.
-
Resolution: Fixed
  Assignee: Jukka Zitting

Thanks! Committed in 1698101.

PS. Note that we break lines at 80 columns.

> Replace IOExceptionWithCause in ForkClient
> --
>
> Key: TIKA-1721
> URL: https://issues.apache.org/jira/browse/TIKA-1721
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1721.patch
>
>
> IOExceptionWithCause (copied from commons-io) is redundant since Java 6.





[jira] [Resolved] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL

2015-08-27 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1722.
-
Resolution: Fixed
  Assignee: Jukka Zitting

Thanks! Committed in revision 1698100.

My original thinking with these methods was to ensure that there is no 
difference in how a File or a file:// URL gets processed. I think that's 
already well covered, so there's not much need for the extra File->URL->File 
roundtrip.
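The roundtrip being removed is visible with plain JDK calls; a standalone sketch:

```java
import java.io.File;
import java.net.URL;

class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("tika", ".pdf");
        // The detour the patch removes: File -> URI -> URL ...
        URL url = file.toURI().toURL();
        // ... and back again inside the URL-based code path: URL -> URI -> File.
        File back = new File(url.toURI());
        if (!back.equals(file)) {
            throw new AssertionError("roundtrip changed the file");
        }
        file.delete();
    }
}
```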

> Tika methods that accept a File needlessly convert it to a URL
> --
>
> Key: TIKA-1722
> URL: https://issues.apache.org/jira/browse/TIKA-1722
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1722.patch
>
>
> The following methods:
> - Tika.detect(File)
> - Tika.parse(File)
> - Tika.parseToString(File)
> Convert the given File to a URL and use the corresponding overloaded method 
> that accepts a URL.
> This seems like a shortcut, but essentially does the following:
> # Converts the file to a URI
> # Converts the URI to a URL
> # Calls TikaInputStream.get(URL, Metadata), which then performs the following 
> special handling:
> # Checks if the protocol is "file"
> # Tries to convert the URL (back) to a URI
> # Creates a File around the URI
> # Checks if file.isFile() 
> # Calls TikaInputStream.get(File, Metadata)
> The special handling in TikaInputStream.get(URL/URI) is a good optimization 
> for in-the-wild file resources, but for internal uses it can be skipped - 
> making Tika call TikaInputStream.get(File, Metadata) directly.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716318#comment-14716318
 ] 

Jukka Zitting commented on TIKA-1706:
-

Note that o.a.tika.io is a part of the public API of tika-core, so even if we 
restore the commons-io dependency we should keep these classes for backwards 
compatibility (perhaps as dummies that just inherit the relevant commons-io 
classes or redirect static calls to there).

I don't have a strong opinion here. I do think that the "no dependencies" 
principle of tika-core is useful and worth the overhead of a dozen duplicated 
classes. And a 30% increase in the tika-core footprint because of the added 
dependency would still be non-trivial. On the other hand the argument about 
missing out on improvements in commons-io is valid.

Personally I'd start here by checking what exactly has changed in the classes 
we duplicate from commons-io. If it's just a few lines then I'd just merge 
those changes to Tika and be happy with that for the next five years. If there 
are more substantial improvements, switching back to a dependency is probably 
worth it.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technical debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-27 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009666#comment-14009666
 ] 

Jukka Zitting commented on TIKA-1294:
-

+1 to making this configurable and off by default.

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1287) Update NetCDF .jar file on Maven Central

2014-05-06 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-1287:


Issue Type: Improvement  (was: Bug)

> Update NetCDF .jar file on Maven Central
> 
>
> Key: TIKA-1287
> URL: https://issues.apache.org/jira/browse/TIKA-1287
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.5
>Reporter: Ann Burgess
>  Labels: jar, maven, netcdf, tika, unit-test, update
>
> I am working to update the NetCDFParser file.  When using the most-recent 
> .jar file available from http://www.unidata.ucar.edu/ at the command line I 
> receive a note about a deprecated API: 
> javac -classpath 
> ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar
>  org/apache/tika/parser/netcdf/NetCDFParser.java
> Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a 
> deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> After updating the NetCDFParser file with non-deprecated methods (e.g. 
> changing "dimension.getName()" to "dimension.getFullName()") however, I get 
> failed unit tests in maven, which I assume is because the Maven Central Repo 
> has an outdated version of the .jar file needed for NetCDF files (
> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22)
>  .
> Can anyone provide insight into how I get the updated .jar file into the 
> Maven Central Repository? Is there an alternative method to update Tika so I 
> can run my unit tests in Maven?





[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983167#comment-13983167
 ] 

Jukka Zitting commented on TIKA-1283:
-

I'm not sure it's a good idea to extract thumbnail images as regular 
embedded resources. A thumbnail is not a "part of" the document in the way an 
embedded image or an attached file is. Instead, a thumbnail is used to 
"describe" or "visualize" a document, and thus would IMHO be better expressed 
as part of the document metadata, as suggested in TIKA-90.

> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---
>
> Key: TIKA-1283
> URL: https://issues.apache.org/jira/browse/TIKA-1283
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Tim Allison
>Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded 
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each thumbnail image.  
> The consumer can then determine what to do with the embedded resource based 
> on the metadata.





[jira] [Resolved] (TIKA-1277) Magic bytes from Wikipedia

2014-04-23 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1277.
-

   Resolution: Fixed
Fix Version/s: 1.6

The list in Wikipedia is in fact quite incomplete and even has some errors. 
After double-checking the sources I integrated the most prominent bits of 
information into Tika in revision 1589558.

I think we can consider this resolved for now, until more magic works its way 
onto that page.

> Magic bytes from Wikipedia
> --
>
> Key: TIKA-1277
> URL: https://issues.apache.org/jira/browse/TIKA-1277
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
> Fix For: 1.6
>
>
> Wikipedia has a (currently quite short) list of known byte signatures of 
> various file formats (http://en.wikipedia.org/wiki/List_of_file_signatures). 
> It would be good to verify that Tika correctly recognizes all those formats.





[jira] [Created] (TIKA-1277) Magic bytes from Wikipedia

2014-04-23 Thread Jukka Zitting (JIRA)
Jukka Zitting created TIKA-1277:
---

 Summary: Magic bytes from Wikipedia
 Key: TIKA-1277
 URL: https://issues.apache.org/jira/browse/TIKA-1277
 Project: Tika
  Issue Type: Improvement
  Components: detector
Reporter: Jukka Zitting
Assignee: Jukka Zitting


Wikipedia has a (currently quite short) list of known byte signatures of 
various file formats (http://en.wikipedia.org/wiki/List_of_file_signatures). It 
would be good to verify that Tika correctly recognizes all those formats.






[jira] [Updated] (TIKA-936) encoding of ZipArchiveInputStream

2014-04-18 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-936:
---

Description: 
When extracting from zip files created on a Japanese Windows system, the 
extracted file names are garbled.

ZipArchiveInputStream has three constructors. After modifying the code as 
below to specify the SJIS encoding, the file names were no longer garbled.

{code:title=PackageExtractor|borderStyle=solid}
public void parse(InputStream stream)
 :
 //unpack(new ZipArchiveInputStream(stream), xhtml);  
 unpack(new ZipArchiveInputStream(stream,"SJIS",true), xhtml); 
 :
{code}

With the first constructor, UTF-8 is used. In my case my computer's encoding 
is UTF-8 while the zip file's encoding is SJIS, so the file names were 
garbled. File names will be garbled whenever the encoding assumed by this 
constructor differs from the encoding of the zip file.

I would like Tika to accept some kind of per-file encoding parameter when 
parsing zip files. Where should the encoding be given, somewhere in Metadata 
or in ParseContext? Please support this. I am using Tika via Solr (SolrCell), 
so when posting a zip file to Solr I want to add an encoding parameter to the 
request.


   Assignee: Jukka Zitting
 Issue Type: Improvement  (was: Wish)

In revision 1588474 I made it possible to pass a customized 
{{ArchiveStreamFactory}} instance through the parse context. The required 
client code looks like this:

{code}
ArchiveStreamFactory factory = new ArchiveStreamFactory();
factory.setEntryEncoding("SJIS");
context.set(ArchiveStreamFactory.class, factory);
parser.parse(..., context);
{code}

See also the test case I added in {{ZipParserTest}}.

Note that this feature applies also to the other archive types supported by 
Commons Compress. Also, if the UTF-8 flag of a particular zip file is set, then 
the given encoding is ignored and UTF-8 is used to decode entry names within 
that zip file.

> encoding of ZipArchiveInputStream
> -
>
> Key: TIKA-936
> URL: https://issues.apache.org/jira/browse/TIKA-936
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.1
>Reporter: Shinichiro Abe
>Assignee: Jukka Zitting
> Attachments: x-日本語メモ.zip
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1268) Extract images from PDF documents

2014-04-09 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1268.
-

   Resolution: Fixed
Fix Version/s: 1.6
 Assignee: Jukka Zitting

Implemented in revision 1586159 with help from [~shinichiro abe].

> Extract images from PDF documents
> -
>
> Key: TIKA-1268
> URL: https://issues.apache.org/jira/browse/TIKA-1268
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
> Fix For: 1.6
>
>
> It would be nice if images within PDF documents could be extracted much like 
> embedded attachments are now being handled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (TIKA-1268) Extract images from PDF documents

2014-04-09 Thread Jukka Zitting (JIRA)
Jukka Zitting created TIKA-1268:
---

 Summary: Extract images from PDF documents
 Key: TIKA-1268
 URL: https://issues.apache.org/jira/browse/TIKA-1268
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting


It would be nice if images within PDF documents could be extracted much like 
embedded attachments are now being handled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1255) WordExtractor - bold hyperlink not closed properly

2014-03-25 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-1255:


Fix Version/s: (was: 1.5)
   (was: 1.4)
   (was: 1.3)
   (was: 1.2)

> WordExtractor - bold hyperlink not closed properly
> --
>
> Key: TIKA-1255
> URL: https://issues.apache.org/jira/browse/TIKA-1255
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2, 1.3, 1.4, 1.5
> Environment: Any
>Reporter: Alan Hunter
>Priority: Minor
> Attachments: example.doc
>
>
> If a Word document contains a bold hyperlink, the resulting xhtml is:
> <b><a href="http://www.testdomain.com/support/workcentre-7232-7242/file-download/enus.html?operatingSystem=macosx108&fileLanguage=en&contentId=126220&from=downloads&viewArchived=false">Test
> link</b></a>
> The closing bold and anchor tags are transposed, which isn't valid XHTML.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1261) Commons Compress version should be 1.5

2014-03-25 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1261.
-

Resolution: Fixed
  Assignee: Jukka Zitting

Fixed in revision 1581402.

> Commons Compress version should be 1.5
> --
>
> Key: TIKA-1261
> URL: https://issues.apache.org/jira/browse/TIKA-1261
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ryan Quam
>Assignee: Jukka Zitting
>Priority: Minor
>
> I was building from the 1.5 maintenance branch and noticed that revision 
> 1574922 changed the version of commons-compress to 1.5.1-SNAPSHOT.  I think 
> the version of that artifact should be reverted back to 1.5 in 
> tika-parsers/pom.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1262) parseToString fails to detect content-type / charset

2014-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941336#comment-13941336
 ] 

Jukka Zitting commented on TIKA-1262:
-

The {{CharsetDetector}} class detects the character encoding based on the 
assumption that the given input is already known to be plain text in some 
encoding. Unfortunately this assumption won't hold for methods like 
{{Tika.parseToString()}} that need to also deal with binary file formats, which 
is why we currently can't auto-detect such documents.

What you could do here is pass the filename as input metadata to the parser, in 
which case it could make the assumption about the file being plain text. The 
easiest way to do this would be to use the {{TikaInputStream.get()}} factory 
method to collect the input metadata, like this:

{code}
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
{code}

(note the extra {{metadata}} argument)

More generally it would be possible to extend the existing {{TextStatistics}} 
class with information about the byte patterns used by the EUC-CN  (and KOI8-R) 
encoding, ideally with character usage statistics like 
http://www.zein.se/patrick/3000char.html (and 
http://www.sttmedia.com/characterfrequency-russian) to make the heuristics more 
accurate. With such information the {{TextDetector}} class should be able to 
detect more encodings than just the ASCII-based ones and UTF-8 it currently 
knows about.
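As a toy illustration of the kind of byte-pattern statistics described above (this is not Tika's actual {{TextStatistics}} implementation, just a sketch of the idea), one can count bytes that almost never occur in text while accepting the high-bit bytes that 8-bit encodings such as KOI8-R or EUC-CN use for letters:

```java
public class ByteStats {

    /**
     * Crude text heuristic: treat the data as likely text when almost no bytes
     * are ASCII control characters other than tab, newline and carriage return.
     * High-bit bytes (0x80-0xFF) are allowed, since 8-bit encodings such as
     * KOI8-R or EUC-CN use them heavily for letters.
     */
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int suspicious = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            boolean control = v < 0x20 && v != '\t' && v != '\n' && v != '\r';
            if (control || v == 0x7F) {
                suspicious++;
            }
        }
        // Tolerate a small fraction (here under 2%) of stray control bytes.
        return suspicious * 50 < data.length;
    }
}
```

A real detector would go further, as suggested above, by comparing byte distributions against per-encoding character frequency tables.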

> parseToString fails to detect content-type / charset
> 
>
> Key: TIKA-1262
> URL: https://issues.apache.org/jira/browse/TIKA-1262
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.5
> Environment: Java 1.7; Windows 7 64 bit
>Reporter: Jeremy McLain
> Attachments: ChineseTextExtraction.java, GB2312.txt, 
> russian-koi8-r.txt
>
>
> The code that demonstrates this bug can be found in attachment: 
> ChineseTextExtraction.java. 
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects 
> 'application/octet-stream' for the Content-Type and returns an empty string 
> for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode 
> string of the contents of the file.
> Notes:
> GB2312.txt is a plain text file containing some Chinese encoded with the 
> GB2312 charset. GB2312 is a very common charset and encoding. Tika should be 
> able to handle this without any problems. In fact, the CharsetDetector class 
> on its own accurately detects the charset as GB18030 which is a super set of 
> GB2312. CharsetDetector.getString() handles converting the GB2312 bytes to 
> Unicode just fine. I don't understand why the Tika facade fails.
> Edit:
> I have the same issue with the file russian-koi8-r.txt. koi8-r is also a 
> common charset. It appears that this isn't just a GB2312 issue. It seems to 
> work fine with ISO-8859-1 (English) files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1260) Detection result for zero-byte files is text/plain

2014-03-17 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1260.
-

   Resolution: Not A Problem
Fix Version/s: (was: 1.5)

What you're seeing is the result of using the file name as a hint of the type 
of the file. If the file name ends in {{.txt}} or some similar suffix, it 
probably should be treated as a text file, even if it doesn't contain anything. 
Only when no such hints are available will Tika fall back to 
{{application/octet-stream}}. See:

{code}
$ touch empty.txt
$ java -jar tika-app-1.5.jar --detect empty.txt
text/plain
$ java -jar tika-app-1.5.jar --detect < empty.txt
application/octet-stream
{code}

> Detection result for zero-byte files is text/plain
> --
>
> Key: TIKA-1260
> URL: https://issues.apache.org/jira/browse/TIKA-1260
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.5
> Environment: Linux Mint 16 
>Reporter: Johan van der Knijff
>Priority: Minor
>  Labels: empty, zero-length
>
> Running Tika with the -d (detection) option, any zero-byte files are 
> identified as "text/plain". I'm wondering if this is the intended behavior? I 
> know the Unix File tool reports "inode/x-empty" in such cases. Perhaps Tika 
> should do this as well?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1219) Add .svn to .gitignore

2014-01-14 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1219.
-

   Resolution: Not A Problem
Fix Version/s: (was: 1.5)

> Add .svn to .gitignore
> --
>
> Key: TIKA-1219
> URL: https://issues.apache.org/jira/browse/TIKA-1219
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Priority: Minor
> Attachments: TIKA-1219.patch
>
>
> This is for folks who may be working on TIKA issues on their own Git 
> branches. It is an extremely trivial change.  



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1219) Add .svn to .gitignore

2014-01-14 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870844#comment-13870844
 ] 

Jukka Zitting commented on TIKA-1219:
-

There actually is a better way. :-) You can clone Tika from 
https://github.com/apache/tika and use it to keep track of the latest trunk, as 
the GitHub mirror is automatically kept up to date. See also 
http://git.apache.org/.

> Add .svn to .gitignore
> --
>
> Key: TIKA-1219
> URL: https://issues.apache.org/jira/browse/TIKA-1219
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.5
>
> Attachments: TIKA-1219.patch
>
>
> This is for folks who may be working on TIKA issues on their own Git 
> branches. It is an extremely trivial change.  



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1219) Add .svn to .gitignore

2014-01-14 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870826#comment-13870826
 ] 

Jukka Zitting commented on TIKA-1219:
-

Why would you have a {{.svn}} directory if you're using a Git clone?

> Add .svn to .gitignore
> --
>
> Key: TIKA-1219
> URL: https://issues.apache.org/jira/browse/TIKA-1219
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.5
>
> Attachments: TIKA-1219.patch
>
>
> This is for folks who may be working on TIKA issues on their own Git 
> branches. It is an extremely trivial change.  



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-13 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869780#comment-13869780
 ] 

Jukka Zitting commented on TIKA-1217:
-

Thanks! I committed the patch in revision 1557795.

Before we resolve this as fixed, I think it would be better to use 
{{detect(File)}} directly, without first trying type detection based on just 
the file name. Name-based type lookup is less accurate than file-based, and 
AFAICT there are few applications where file type lookup is 
performance-critical (and if it is, like in a directory browser, the results 
are often cached). And assuming the file-based lookup was conditional on the 
existence of the file, a client could still do name-only lookups by prepending 
a fictional directory name to the path passed to the {{probeContentType()}} 
call. WDYT?
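For reference, the Java 7 SPI in question is small. A toy, extension-only detector (purely illustrative, not the committed patch; a real implementation registered via META-INF/services/java.nio.file.spi.FileTypeDetector would delegate to Tika's detection) looks like:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;

/**
 * Minimal FileTypeDetector sketch. Returning null tells
 * Files.probeContentType() to try the next registered detector.
 */
public class ToyTikaTypeDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) throws IOException {
        String name = path.getFileName().toString();
        if (name.endsWith(".txt")) {
            return "text/plain";
        }
        return null; // unknown: defer to other detectors
    }
}
```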

> Integrate with Java-7 FileTypeDetector API
> --
>
> Key: TIKA-1217
> URL: https://issues.apache.org/jira/browse/TIKA-1217
> Project: Tika
>  Issue Type: New Feature
>  Components: detector, mime
>Reporter: Peter Ansell
> Attachments: TIKA-1217-v2.patch, TIKA-1217.patch
>
>
> It would be useful if Tika natively provided Java-7 FileTypeDetector [1] 
> implementations. Adding the corresponding 
> META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the 
> use of Files.probeContentType [2] without any specific links to Tika for this 
> functionality.
> If you do not want to rely on Java-7 for the core, then this could be added 
> as an extension module.
> [1] 
> http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
> [2] 
> http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1214) Infinity Loop in Mpeg Stream

2014-01-13 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1214.
-

   Resolution: Duplicate
Fix Version/s: (was: 1.5)

Resolving as duplicate of TIKA-1179. Please reopen, preferably with a test 
case/document, if this problem still occurs.

> Infinity Loop in Mpeg Stream
> 
>
> Key: TIKA-1214
> URL: https://issues.apache.org/jira/browse/TIKA-1214
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: local system
>Reporter: Georg Hartmann
>
> Scanning MP3 files encounters an infinite loop in the MpegStream method 
> skipStream.
> The call to in.skip returns zero, so the loop never ends.
> A simple fix with a zero count is below:
> private static void skipStream(InputStream in, long count) throws 
> IOException {
> long size = count;
> long skipped = 0;
> // 5 Times zero equals Error break the loop
> int zeroCount = 5;
> while (size > 0 && skipped >= 0) {
> skipped = in.skip(size);
> if (skipped != -1) {
> size -= skipped;
> }
> 
> // Checking for zero to break the infinite loop
> if (skipped == 0) {
> zeroCount--;
> }
> if (zeroCount < 0) {
> break;
> }
> }
> }
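The zero-count workaround quoted above can be written more directly as a skipFully helper that uses read() to distinguish "no progress" from end of stream (a self-contained stdlib sketch, not the actual Tika patch; the class name is made up):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class SkipUtil {

    /** Skips exactly {@code count} bytes, or throws EOFException if the stream ends first. */
    static void skipFully(InputStream in, long count) throws IOException {
        while (count > 0) {
            long skipped = in.skip(count);
            if (skipped > 0) {
                count -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() confirms end of stream
                throw new EOFException("stream ended with " + count + " bytes left to skip");
            } else {
                count--; // read() consumed one byte, so progress was made
            }
        }
    }
}
```

This never spins forever: every loop iteration either skips bytes, consumes one byte via read(), or terminates with an exception.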



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1215.
-

Resolution: Not A Problem

You're misusing the {{ToHTMLContentHandler}} class:

{code}
ToHTMLContentHandler toHtmlContentHandler = new 
ToHTMLContentHandler(outputStream, "UTF-8");
WriteOutContentHandler handler = new 
WriteOutContentHandler(toHtmlContentHandler, (int) 400);
ContentHandler bodyHandler = new BodyContentHandler(handler);
{code}

The {{ToHTMLContentHandler}} javadoc says:

bq. The incoming SAX events are expected to be well-formed (properly nested, 
etc.) and valid HTML.

This is not true since you're using the {{BodyContentHandler}} to strip out 
anything outside the {{body}} element.

Thus resolving as Not A Problem. If you want to format the parse output as 
HTML, you should pass the {{ToHTMLContentHandler}} directly to the parser, 
without the {{BodyContentHandler}} wrapper.

> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
> tika-1215-without-wildcard.patch
>
>
> With attached file, 1.5 raises this exception on parsing. This file has no 
> problem on 1.4
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception

2014-01-13 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869662#comment-13869662
 ] 

Jukka Zitting commented on TIKA-1218:
-

Reproduced. It looks like the last frame that Tika can interpret is a TRAKTOR4 
PRIV frame at offset 114 with size of 335387 bytes. It could be that this frame 
is malformed (wrong size, etc.), or there might be a bug in the way Tika 
handles the frame. The tooling at http://dope.cz/code/ might be helpful in 
debugging this case.

> Unable to parse a mp3 file on 1.5 getting a exception
> -
>
> Key: TIKA-1218
> URL: https://issues.apache.org/jira/browse/TIKA-1218
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
> Environment: Win 7, Java 1.7
>Reporter: Sumeet Gorab
>Priority: Blocker
> Attachments: Save-the-World-Knife-Party-Remix.mp3
>
>
> Unable to parse a mp3 file on 1.5 getting following exception:
> Exception in thread "main" java.lang.NegativeArraySizeException
>   at 
> org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:417)
>   at 
> org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:382)
>   at 
> org.apache.tika.parser.mp3.ID3v2Frame$RawTagIterator.next(ID3v2Frame.java:371)
>   at 
> org.apache.tika.parser.mp3.ID3v24Handler.<init>(ID3v24Handler.java:49)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:174)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866722#comment-13866722
 ] 

Jukka Zitting commented on TIKA-1217:
-

Nice idea!

I think putting such a feature to a separate tika-java7 component (included in 
the build only when using Java 7 or higher) for now is the best solution, as 
otherwise we'd need to raise the requirements on build environments. Once we 
do that at some point in the future, the component can be merged into tika-core.

> Integrate with Java-7 FileTypeDetector API
> --
>
> Key: TIKA-1217
> URL: https://issues.apache.org/jira/browse/TIKA-1217
> Project: Tika
>  Issue Type: New Feature
>  Components: detector, mime
>Reporter: Peter Ansell
>
> It would be useful if Tika natively provided Java-7 FileTypeDetector [1] 
> implementations. Adding the corresponding 
> META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the 
> use of Files.probeContentType [2] without any specific links to Tika for this 
> functionality.
> If you do not want to rely on Java-7 for the core, then this could be added 
> as an extension module.
> [1] 
> http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
> [2] 
> http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1160) Add support for SolidWorks files

2013-12-27 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1160.
-

Resolution: Fixed

Excellent, thanks! I committed the latest patch and test files in revision 
1553779.

> Add support for SolidWorks files
> 
>
> Key: TIKA-1160
> URL: https://issues.apache.org/jira/browse/TIKA-1160
> Project: Tika
>  Issue Type: Wish
>  Components: mime
>Affects Versions: 1.4
>Reporter: gunter rombauts
>  Labels: patch
> Fix For: 1.5
>
> Attachments: custom-mimetypes.xml, solidworks2013SP2.zip, 
> solidworks2014SP0.zip, solidworks_patch1.zip
>
>
> It would be an advantage if the mime type for SolidWorks files could be 
> detected by tika. File extensions include *.slddrw and *.sldasm.
> Standard properties are stored in an Office-like format.
> Custom properties are not detected.
> I will include a custom-mimetypes.xml



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-12-27 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1193.
-

Resolution: Fixed

The patch is perfect, thanks! Committed in revision 1553774.

> Allow access to HtmlParser's HtmlSchema
> ---
>
> Key: TIKA-1193
> URL: https://issues.apache.org/jira/browse/TIKA-1193
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Jukka Zitting
> Fix For: 1.5
>
> Attachments: TIKA-1193-trunk.patch, TIKA-1193-trunk.patch
>
>
> TagSoup's HTMLSchema is not really well suited for HTML5 nor is it capable of 
> correctly handling some very strange quirks, e.g. table inside anchors. By 
> allowing access to the schema applications can modify the schema to suit 
> their needs on the fly.
> This would also mean that we don't have to rely on TIKA-985 getting 
> committed, we can change it from our own applications.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-245) Support of CHM Format

2013-12-27 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857520#comment-13857520
 ] 

Jukka Zitting commented on TIKA-245:


bq.  tika is not able to extract contents from chm documents

This was probably due to TIKA-1110, now fixed.

> Support of CHM Format
> -
>
> Key: TIKA-245
> URL: https://issues.apache.org/jira/browse/TIKA-245
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
> Environment: All
>Reporter: Karl Heinz Marbaise
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 0.10
>
> Attachments: TIKA-245.oleg.20110806.PATCH, 
> TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, 
> TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
>
>
> It might be a good idea to support the CHM File format of Windows. Some 
> information about 
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. 
> The CHM format contains HTML files which can be parsed by Tika. So the "only" 
> problem is to extract the data from the CHM file.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1122) Tika fails to parse chm files

2013-12-27 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1122.
-

   Resolution: Duplicate
Fix Version/s: (was: 1.5)

Seems like a duplicate of TIKA-1110.

> Tika fails to parse chm files
> -
>
> Key: TIKA-1122
> URL: https://issues.apache.org/jira/browse/TIKA-1122
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Tejas Patil
>Priority: Minor
>
> (reported by Jan Riewe over nutch user group, see 
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
> application/vnd.ms-htmlhelp
> Even after running tika-app in standalone manner (ie. not via nutch), I could 
> see not even a single chm file being parsed (I tried with 10-15 different chm 
> files of variable sizes).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-12-26 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting reassigned TIKA-1193:
---

Assignee: Jukka Zitting

> Allow access to HtmlParser's HtmlSchema
> ---
>
> Key: TIKA-1193
> URL: https://issues.apache.org/jira/browse/TIKA-1193
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Jukka Zitting
> Fix For: 1.5
>
> Attachments: TIKA-1193-trunk.patch, TIKA-1193-trunk.patch
>
>
> TagSoup's HTMLSchema is not really well suited for HTML5 nor is it capable of 
> correctly handling some very strange quirks, e.g. table inside anchors. By 
> allowing access to the schema applications can modify the schema to suit 
> their needs on the fly.
> This would also mean that we don't have to rely on TIKA-985 getting 
> committed, we can change it from our own applications.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1210) Address tika-parsers o.a.t.mime.TestMimeTypes TODO: Need a test flash file

2013-12-26 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1210.
-

Resolution: Fixed
  Assignee: Jukka Zitting

Thanks! Patch and test files committed in revision 1553622.

> Address tika-parsers o.a.t.mime.TestMimeTypes TODO: Need a test flash file
> --
>
> Key: TIKA-1210
> URL: https://issues.apache.org/jira/browse/TIKA-1210
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.5
>
> Attachments: TIKA-1210.patch, test1.swf, test2.swf, test3.swf
>
>
> AFAICS the TODO relates to the absence of suitable .swf/.SWF files to use 
> within the assertTypeByData(String st, String st) method.
> Over in Nutch we currently run some tests and have available .swf files which 
> can be used within Tika.
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/parse-swf/sample/
>   



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-12-26 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1152.
-

Resolution: Fixed
  Assignee: Jukka Zitting

Thanks! Patch committed in revision 1553621.

> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>Reporter: Hong-Thai Nguyen
>Assignee: Jukka Zitting
>Priority: Critical
> Fix For: 1.5
>
> Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
>
>
> By parsing [the attachment CHM file|^eventcombmt.chm] (MS Microsoft Help 
> Files), Java process stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-26 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1213.
-

Resolution: Not A Problem

Resolving as Not A Problem based on the discussion in PDFBOX-1821.

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -
>
> Key: TIKA-1213
> URL: https://issues.apache.org/jira/browse/TIKA-1213
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: I guess not relevant (except for the pdf file)
> + Win7 (8G memory)
> + java 6
> + jira 1.5 (and 1.5 snapshot)
>Reporter: Clemens Wyss
>Priority: Critical
> Attachments: takes3mins.pdf
>
>
> When I parse (extract all its content for Lucene) the attached pdf, the 
> extraction takes 3 minutes. This is specific to this particular file, though 
> I have others that misbehave similarly.
> My (unit testing) code looks like:
> ...
> Metadata metadata = new Metadata();
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> returnValue = handler.toString();
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1110) Incorrectly declared SUPPORTED_TYPES in ChmParser.

2013-12-26 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1110.
-

Resolution: Fixed
  Assignee: Jukka Zitting

Thanks, Vadim! I committed your patch (with spaces instead of tabs) in revision 
1553617.

I also made a few follow-up improvements (and added a test case) in revisions 
1553618 and 1553619 to make the integration with AutoDetectParser work more 
smoothly.

> Incorrectly declared SUPPORTED_TYPES in ChmParser.
> --
>
> Key: TIKA-1110
> URL: https://issues.apache.org/jira/browse/TIKA-1110
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3, 1.4
>Reporter: Andrzej Bialecki 
>Assignee: Jukka Zitting
> Fix For: 1.5
>
> Attachments: TIKA-1110.patch
>
>
> [This 
> link|http://www.iana.org/assignments/media-types/application/vnd.ms-htmlhelp] 
> assigns the official mime type for these files to 
> "application/vnd.ms-htmlhelp". In the wild there are also two other types 
> used:
> * application/chm
> * application/x-chm
> tika-mimetypes.xml uses the correct official mime type, but ChmParser 
> declares that it supports only "application/chm". For this reason content 
> that uses the official mime type (e.g. coming via Detector or parsed using 
> AutoDetectParser, or simply declared in metadata) fails to parse due to 
> unknown mime type.
> The fix seems simple - ChmParser should declare also all of the above types 
> in its SUPPORTED_TYPES.





[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-18 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825832#comment-13825832
 ] 

Jukka Zitting commented on TIKA-1193:
-

A cleaner approach would probably be to allow the caller to pass a custom 
schema through the ParseContext object:

{code}
ParseContext context = new ParseContext();
context.set(Schema.class, ...);
parser.parse(..., context);
{code}

The {{HtmlParser}} class could then get the custom schema from the context:

{code}
Schema schema = context.get(Schema.class, HTML_SCHEMA);
parser.setProperty(org.ccil.cowan.tagsoup.Parser.schemaProperty, schema);
{code}
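The class-keyed lookup with a default fallback that the snippet above relies on can be sketched with plain JDK types. The following is an illustrative stand-in for how such a context behaves, not Tika's actual {{ParseContext}} implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a type-safe, Class-keyed context with a default-value
// fallback, mirroring the context.get(Schema.class, HTML_SCHEMA) usage above.
class SimpleContext {
    private final Map<String, Object> entries = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    public <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key.getName());
        return value != null ? key.cast(value) : defaultValue;
    }
}

public class ContextDemo {
    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        // Nothing set yet: the default is returned.
        System.out.println(context.get(String.class, "default-schema"));
        context.set(String.class, "custom-schema");
        // A caller-supplied value overrides the default.
        System.out.println(context.get(String.class, "default-schema"));
    }
}
```

With this shape, a parser can always fall back to its built-in default when the caller supplies nothing.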

> Allow access to HtmlParser's HtmlSchema
> ---
>
> Key: TIKA-1193
> URL: https://issues.apache.org/jira/browse/TIKA-1193
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.5
>
> Attachments: TIKA-1193-trunk.patch
>
>
> TagSoup's HTMLSchema is not really well suited for HTML5 nor is it capable of 
> correctly handling some very strange quirks, e.g. table inside anchors. By 
> allowing access to the schema applications can modify the schema to suit 
> their needs on the fly.
> This would also mean that we don't have to rely on TIKA-985 getting 
> committed, we can change it from our own applications.





[jira] [Commented] (TIKA-1190) ZipContainerDetector.detect() can spool the entire stream to a temporary file

2013-11-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811392#comment-13811392
 ] 

Jukka Zitting commented on TIKA-1190:
-

bq. Isn't the right fix then to pull out that part of the detector to a new one?

Right, we could do that.

The reason I'm hesitant about that approach is that the way I've thought about 
the Detector mechanism is that it's always guaranteed to be an {{O(1)}} 
operation (that was one of my original design goals for the interface), i.e. 
independent of the size of the input document. The current behavior makes it a 
potentially {{O(n)}} operation, which was quite surprising at least to me in a 
case where we were using Tika.detect() on a large ZIP archive in transit over 
the network.

> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -
>
> Key: TIKA-1190
> URL: https://issues.apache.org/jira/browse/TIKA-1190
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls 
> {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP 
> archive) without using the {{hasFile()}} method to check whether a backing 
> file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the 
> entire stream getting spooled to a temporary file that might not be needed at 
> all after the detection.
> A better approach would be to only do the more detailed "full file" format 
> detection if the backing file is already available, i.e. if {{hasFile()}} 
> returns true.





[jira] [Commented] (TIKA-1190) ZipContainerDetector.detect() can spool the entire stream to a temporary file

2013-11-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811245#comment-13811245
 ] 

Jukka Zitting commented on TIKA-1190:
-

bq. We need to buffer it if it's a stream, otherwise the data won't be there 
for the parser!

The parser can still buffer the stream to a temporary file even if the detector 
doesn't do that. The only limitation for the parsing use case would be that the 
AutoDetectParser might not be able to directly dispatch the document to the 
correct parser, but it should be possible to work around that by doing the more 
detailed type detection in PackageParser and re-dispatching the parsing of the 
document if a more specific container format is detected.

bq. Isn't the right fix for people to just skip that Detector if they don't 
want the whole file used?

Doing so would also drop the advanced type header detection by 
commons-compress. That detection code doesn't need the whole file, but is also 
too complex to express in the MIME magic database.

bq. I worry that people will get very confused if some kinds of TikaInputStream 
do correct detection, and others don't

We already have the case that some kinds of InputStreams do correct detection 
and others don't, and that seems to work just fine. Instead of saying to people 
that just passing a TikaInputStream will give you advanced detection, it's IMHO 
better to explain that advanced type detection is possible when the document is 
available as a random-access file wrapped to a TikaInputStream. And it would 
still be possible for people to force the spooling (and thus enable the detailed 
zip detection) by calling TikaInputStream.getFile() before passing the stream 
to a detector.
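The policy proposed in this thread (deep inspection only when a backing file already exists, otherwise a cheap header check) can be sketched generically. This is an illustrative stand-in, not Tika's actual detector code; the method names and MIME strings are assumptions:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of an O(1) detection policy: look at the whole archive only when a
// random-access backing file is already available; otherwise fall back to a
// magic-byte check on the header, never spooling the stream to disk.
public class DetectPolicyDemo {
    static String detect(Path backingFile, byte[] header) {
        if (backingFile != null && Files.exists(backingFile)) {
            return "deep-zip-detection"; // safe: random access, no spooling
        }
        // O(1) fallback: inspect only the magic bytes.
        if (header.length >= 4 && header[0] == 'P' && header[1] == 'K') {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] zipMagic = {'P', 'K', 3, 4};
        // No backing file: only the generic ZIP type is reported.
        System.out.println(detect(null, zipMagic));
    }
}
```

Callers who want the detailed result would spool explicitly first (the TikaInputStream.getFile() trick mentioned above), making the cost visible at the call site.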

> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -
>
> Key: TIKA-1190
> URL: https://issues.apache.org/jira/browse/TIKA-1190
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls 
> {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP 
> archive) without using the {{hasFile()}} method to check whether a backing 
> file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the 
> entire stream getting spooled to a temporary file that might not be needed at 
> all after the detection.
> A better approach would be to only do the more detailed "full file" format 
> detection if the backing file is already available, i.e. if {{hasFile()}} 
> returns true.





[jira] [Created] (TIKA-1190) ZipContainerDetector.detect() can spool the entire stream to a temporary file

2013-10-31 Thread Jukka Zitting (JIRA)
Jukka Zitting created TIKA-1190:
---

 Summary: ZipContainerDetector.detect() can spool the entire stream 
to a temporary file
 Key: TIKA-1190
 URL: https://issues.apache.org/jira/browse/TIKA-1190
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Jukka Zitting
Assignee: Jukka Zitting


As noted in a TODO comment, currently the {{ZipContainerDetector}} calls 
{{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP 
archive) without using the {{hasFile()}} method to check whether a backing file 
is actually available.

This is troublesome as it can lead to unexpected performance loss due to the 
entire stream getting spooled to a temporary file that might not be needed at 
all after the detection.

A better approach would be to only do the more detailed "full file" format 
detection if the backing file is already available, i.e. if {{hasFile()}} 
returns true.





[jira] [Commented] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2013-10-31 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810987#comment-13810987
 ] 

Jukka Zitting commented on TIKA-817:


The tests were failing on Windows due to line ending differences in the 
generated XML output. I fixed that in revision 1537803.

> (PPT/PPTX) Missing date/time in text content.
> -
>
> Key: TIKA-817
> URL: https://issues.apache.org/jira/browse/TIKA-817
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.0
> Environment: Win7-64 + java version "1.6.0_26"
>Reporter: Albert L.
> Fix For: 1.5
>
>
> Missing date/time text in text content for PPT and PPTX files.
> The date and time are missing from the text content.  This occurs when one 
> chooses the following with MS-PowerPoint 2010:
> 1) "Insert"
> 2) "Date & Time"
> 3) "Update automatically"
> 4) save to PPT or PPTX





[jira] [Updated] (TIKA-1149) Improve parser lookup performance

2013-08-05 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-1149:


Attachment: 0001-TIKA-1149-Improve-parser-lookup-performance.patch

See the attached patch for a quick draft (doesn't pass all tests yet) of what 
this could look like.

The patch relies on a new SimpleParser class that would replace AbstractParser 
as the base class for all "simple" parsers that support a fixed set of 
(normalized) media types regardless of the context. Such an extra constraint 
allows CompositeParser to speed up parser lookups as it doesn't need to build 
the full parser map on each access.

> Improve parser lookup performance
> -
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3, 1.4
>Reporter: Luca Della Toffola
>Priority: Minor
>  Labels: performance
> Attachments: 0001-TIKA-1149-Improve-parser-lookup-performance.patch, 
> CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid 
> recomputing parsers map over and over 
> in CompositeParser.getParsers(...) if the context is empty and to cache the 
> returned value instead. 
> This can be done safely even under the assumption that the media-registry and 
> the list of component parsers do change while Tika is executing, by 
> invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of 
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the 
> cache if both media-registry and the list of component parsers change in the 
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
> (i.e., Java class library + Tika app + other apps), the patch reduces the 
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the 
> same order of magnitude are found also for smaller workloads.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1149) Improve parser lookup performance

2013-08-01 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-1149:


Summary: Improve parser lookup performance  (was: 12% performance 
improvement by caching in CompositeParser)

> Improve parser lookup performance
> -
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3, 1.4
>Reporter: Luca Della Toffola
>Priority: Minor
>  Labels: performance
> Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid 
> recomputing parsers map over and over 
> in CompositeParser.getParsers(...) if the context is empty and to cache the 
> returned value instead. 
> This can be done safely even under the assumption that the media-registry and 
> the list of component parsers do change while Tika is executing, by 
> invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of 
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the 
> cache if both media-registry and the list of component parsers change in the 
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
> (i.e., Java class library + Tika app + other apps), the patch reduces the 
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the 
> same order of magnitude are found also for smaller workloads.



[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715180#comment-13715180
 ] 

Jukka Zitting commented on TIKA-1149:
-

Note that for example {{DefaultParser.getParsers(ParseContext)}} can return a 
different set of parsers on each invocation, thanks to the dynamic service 
lookup mechanism in {{ServiceLoader}}. Thus caching the return value can lead 
to incorrect behavior.

An alternative optimization would be to refactor the 
{{CompositeParser.getParser(Metadata, ParseContext)}} method so that it doesn't 
need to always instantiate the full type->parser map. Instead it could for 
example restrict the search to only the specified type and its supertypes.
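The supertype-walk lookup suggested here can be sketched as follows. The type hierarchy, parser names, and method name are illustrative assumptions, not Tika's actual registry:

```java
import java.util.Map;

// Sketch of parser lookup that walks only the requested media type and its
// supertypes, instead of materializing the full type->parser map per call.
public class ParserLookupDemo {
    // Toy supertype chain and parser registrations for illustration.
    static final Map<String, String> SUPERTYPE = Map.of(
        "application/vnd.ms-htmlhelp", "application/x-chm",
        "application/x-chm", "application/octet-stream");
    static final Map<String, String> PARSERS = Map.of(
        "application/x-chm", "ChmParser");

    static String findParser(String type) {
        // Walk type -> supertype -> ... until a registered parser is found.
        for (String t = type; t != null; t = SUPERTYPE.get(t)) {
            String parser = PARSERS.get(t);
            if (parser != null) return parser; // most specific match wins
        }
        return "EmptyParser"; // nothing registered along the chain
    }

    public static void main(String[] args) {
        System.out.println(findParser("application/vnd.ms-htmlhelp"));
    }
}
```

The cost is bounded by the depth of the type hierarchy rather than the total number of registered parsers.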

> 12% performance improvement by caching in CompositeParser
> -
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3, 1.4
>Reporter: Luca Della Toffola
>Priority: Minor
>  Labels: performance
> Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid 
> recomputing parsers map over and over 
> in CompositeParser.getParsers(...) if the context is empty and to cache the 
> returned value instead. 
> This can be done safely even under the assumption that the media-registry and 
> the list of component parsers do change while Tika is executing, by 
> invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of 
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the 
> cache if both media-registry and the list of component parsers change in the 
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
> (i.e., Java class library + Tika app + other apps), the patch reduces the 
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the 
> same order of magnitude are found also for smaller workloads.



[jira] [Resolved] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

2013-05-14 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-881.


Resolution: Duplicate

This has been fixed meanwhile with the AutoDetectReader class that the 
HtmlParser is now using to detect the content encoding.

> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --
>
> Key: TIKA-881
> URL: https://issues.apache.org/jira/browse/TIKA-881
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.0
> Environment: Windows7, JDK1.5, JDK1.6
>Reporter: Klaus v. Einem
>  Labels: stability
> Attachments: BugfixHtmlParser.java, HtmlParser.java
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Attention, this is an ugly one: a non-deterministic bug. It fails only 1 out 
> of 10 runs (approximately). 
> java.io.IOException: Resetting to invalid mark
>   at java.io.BufferedInputStream.reset(Unknown Source)
>   at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
>   at 
> org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
>   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)
> In the getEncoding()-Method: To re-read() the input stream, the current read 
> position is marked and the readlimit (maximum number of bytes to be read 
> before the mark position gets invalidated) is given. 
> So far so good, but then an InputStreamReader comes into play. When you check 
> the API-Doc you see this: 
>  * ...
>  * To enable the efficient conversion of bytes to characters, more bytes may
>  * be read ahead from the underlying stream than are necessary to satisfy the
>  * current read operation.
>  * ...
> Please notice the term "may"... So, when this happens the following reset() 
> on the stream will throw the Exception because the mark position gets 
> invalidated (the number of read bytes exceeds the readlimit).
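The failure mode described above can be reproduced with JDK classes alone. This sketch (sizes chosen to make the read-ahead exceed the readlimit reliably) shows {{InputStreamReader}} buffering past the mark so that {{reset()}} throws:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Demonstrates the mark/reset pitfall: InputStreamReader reads ahead more
// bytes than the caller consumed, exceeding the mark's readlimit, so the
// later reset() fails with "Resetting to invalid mark".
public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64 * 1024];           // plenty of input
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(data), 16); // small internal buffer
        in.mark(16);                                 // tiny readlimit
        Reader reader = new InputStreamReader(in, "UTF-8");
        reader.read(new char[8192]);                 // reader buffers far ahead
        try {
            in.reset();                              // mark is now invalid
            System.out.println("reset ok");
        } catch (IOException e) {
            System.out.println("reset failed: " + e.getMessage());
        }
    }
}
```

The fix direction (used by the later AutoDetectReader mentioned in the resolution) is to buffer the bytes independently of the reader instead of relying on mark/reset across a decoding layer.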



[jira] [Commented] (TIKA-1103) Tika.parseToString(InputStream) does not output the same content as parseToString(File)

2013-04-11 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628753#comment-13628753
 ] 

Jukka Zitting commented on TIKA-1103:
-

It looks like the mentioned PDF starts with a single line-feed character (ASCII 
code 10) before the normal PDF prefix of "%PDF-...", which makes Tika's 
automatic type detection fail to recognize the file as a PDF document.
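The effect of that leading byte on magic-based detection can be sketched with a toy check. This is an illustration of the failure mode, not Tika's actual detection code; the lenient variant is a hypothetical workaround:

```java
import java.nio.charset.StandardCharsets;

// Shows why a single leading line feed hides the "%PDF-" signature from a
// strict offset-0 magic check, while a whitespace-skipping variant finds it.
public class PdfMagicDemo {
    static boolean strictPdf(byte[] b) {
        return new String(b, StandardCharsets.US_ASCII).startsWith("%PDF-");
    }

    static boolean lenientPdf(byte[] b) {
        int i = 0;
        while (i < b.length && (b[i] == '\n' || b[i] == '\r'
                || b[i] == ' ' || b[i] == '\t')) {
            i++;                       // skip leading whitespace bytes
        }
        return new String(b, i, Math.min(5, b.length - i),
                StandardCharsets.US_ASCII).equals("%PDF-");
    }

    public static void main(String[] args) {
        byte[] withLf = "\n%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(strictPdf(withLf));   // hidden by the line feed
        System.out.println(lenientPdf(withLf));  // found after skipping it
    }
}
```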

> Tika.parseToString(InputStream) does not output the same content as 
> parseToString(File)
> ---
>
> Key: TIKA-1103
> URL: https://issues.apache.org/jira/browse/TIKA-1103
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.1, 1.2, 1.3
> Environment: Windows 7 x64
> Java 6 Update 33
>Reporter: Antoine Libert
>
> Tika.parseToString(...) outputs different results with the following PDF file 
> (iPhone user guide in german, bug also happens with french).
> http://manuals.info.apple.com/de_DE/iphone_benutzerhandbuch.pdf
> 1.3 parseToString(File) : actual content (good)
> 1.2 parseToString(File) : actual content (good)
> 1.1 parseToString(File) : actual content (good)
> 1.3 parseToString(InputStream) : empty
> 1.2 parseToString(InputStream) : PDF binary shown as text
> 1.1 parseToString(InputStream) : PDF binary shown as text
> Simple test case:
> Tika tika = new Tika();
> File f = new File("iphone_benutzerhandbuch.pdf")
> TikaInputStream is2 = TikaInputStream.get(f);
> String st2 = tika.parseToString(is2); // inputstream
> String stt2 = tika.parseToString(f); // file
> assertTrue(st2.equals(stt2)); // false



[jira] [Commented] (TIKA-1101) XML parse error caused by org.xml.sax.SAXParseException;The entity "nbsp" was referenced, but not declared

2013-04-04 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623396#comment-13623396
 ] 

Jukka Zitting commented on TIKA-1101:
-

We already have 
[heuristics|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4506]
 for detecting fragments that start with a tag like {{}} or {{}} as 
HTML instead of XML. Perhaps we should add {{}} to that list.
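The shape of such a heuristic can be sketched as a prefix check. The tag names in the original comment were lost from this archive, so the list below is purely illustrative and not the actual tika-mimetypes.xml rule set:

```java
import java.util.List;
import java.util.Locale;

// Sketch of tag-based sniffing: a fragment starting with a known HTML-only
// tag is treated as text/html rather than application/xml.
public class HtmlSniffDemo {
    static final List<String> HTML_TAGS =
        List.of("<html", "<body", "<p", "<div", "<a ");

    static String sniff(String fragment) {
        String s = fragment.trim().toLowerCase(Locale.ROOT);
        for (String tag : HTML_TAGS) {
            if (s.startsWith(tag)) return "text/html";
        }
        return s.startsWith("<?xml") ? "application/xml" : "text/plain";
    }

    public static void main(String[] args) {
        // An HTML fragment with an undeclared &nbsp; entity, as in this issue:
        System.out.println(sniff("<div class=\"nav\">&nbsp;</div>"));
    }
}
```

Routing such fragments to the HTML parser avoids the strict XML entity error reported here, since HTML parsing tolerates {{&nbsp;}} without a DTD.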

> XML parse error caused by org.xml.sax.SAXParseException;The entity "nbsp" was 
> referenced, but not declared
> --
>
> Key: TIKA-1101
> URL: https://issues.apache.org/jira/browse/TIKA-1101
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.2, 1.3
> Environment: I'm using solr 4.0 final with tika 1.2 and ManifoldCF 
> v1.2 dev on tomcat 7 (RHL)
>Reporter: David Morana
>Assignee: Ken Krugler
> Fix For: 1.2, 1.3
>
>
> Good afternoon,
> This web page (see below) when crawled by ManifoldCF causes severe errors in 
> Solr and causes ManifoldCF to abort the current job.
> I verified the error by sending the URL to tika-app 1.2 and 1.3.
> I can't find any kind of a fix for this.
> Please advise...
> P.S. can you also provide a list of all tika supporting jars? (i.e. poi, 
> jempbox etc etc)
> Thanks,
> Here's the HTML
> {code}
> 
> 
>  Security and Information 
> Sciences Home ›
>  href="/mission/sec/publications/-publications.html">Publications ›
>  
>  href="/mission/sec/corpora/corpora.html">Corpora › 
> Software 
> Tools ›
>  Systems and 
> Operations ›
>   
>  href="/mission/sec/publications/-publications.html">Publications 
> ›
>  href="/mission/sec/CSO/biographies/CSObios.html">Biographies ›
>   
> 
>  Systems and 
> Technology › 
>  System 
> Assessments › 
>   Human Language 
> Technology ›
> Computing and 
> Analytics ›
>   
> 
> {code}
> Here's the error:
> {code}
> Apr 03, 2013 4:23:23 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: 
> org.apache.tika.exception.TikaException: XML parse error
>   at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>   at 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:581)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>   at 
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>   at 
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1686)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.tika.exception.TikaException: XML parse error
>   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at 
> org.apache.solr.handler.extra

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584294#comment-13584294
 ] 

Jukka Zitting commented on TIKA-1074:
-

bq. Wait, do you mean I should remove the handling entirely (not bother future 
proofing)?

If POI decides to declare IE (or just generic Exception) as thrown by their 
API, it'll break binary compatibility, and thus in any case we'll need to 
adjust our code. So adding future proofing code here doesn't win us anything, 
it just complicates the codebase for no gain.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584229#comment-13584229
 ] 

Jukka Zitting commented on TIKA-1074:
-

bq. InterruptedException is never thrown in these places today, so I can't add 
the separate catch clause (compiler is angry).

It's a checked exception, so if it isn't declared to be thrown by POI, it 
shouldn't get thrown here (even though the VM doesn't strictly prohibit that). 
So in that case the extra check shouldn't even be needed.

bq. I think it's cleaner to set the interrupt bit and let the next place that 
waits see the interrupt bit and throw IE?

I don't really like this approach. We're essentially saying: "Yes, you asked me 
to stop what I'm doing, but instead I'll just finish up what I was doing and 
ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to 
stop immediately, either by letting the IE bubble up or (where necessary) by 
throwing a TikaException that wraps the IE.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-21 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584034#comment-13584034
 ] 

Jukka Zitting commented on TIKA-1074:
-

If we get an InterruptedException, then we shouldn't just log it and continue. 
I'd wrap it to a TikaException and re-throw.

Also instead of:

{code}
} catch (Exception e) {
  ...
  if (e instanceof SomeException) {
  ...
  }
}
{code}

a better pattern might be:

{code}
} catch (SomeException e) {
  ...
} catch (Exception e) {
  ...
}
{code}
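Both suggestions combined (separate catch clauses; stop promptly on InterruptedException by wrapping it; record and skip other embedded-document failures) can be sketched as follows. {{ParseFailure}} is a stand-in for TikaException, and the document strings are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: interruption aborts parsing immediately (wrapped in a checked
// exception), while other per-document failures are recorded and skipped.
public class EmbeddedDemo {
    static class ParseFailure extends Exception {
        ParseFailure(Throwable cause) { super(cause); }
    }

    static List<String> parseAll(List<String> docs) throws ParseFailure {
        List<String> errors = new ArrayList<>();
        for (String doc : docs) {
            try {
                if (doc.equals("interrupt")) throw new InterruptedException();
                if (doc.equals("corrupt")) throw new RuntimeException("bad doc");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new ParseFailure(e);      // stop immediately, wrapped
            } catch (RuntimeException e) {
                errors.add(doc);                // record and continue
            }
        }
        return errors;
    }

    public static void main(String[] args) throws ParseFailure {
        System.out.println(parseAll(List.of("ok", "corrupt", "ok2")));
    }
}
```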


> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1080) Arabic characters under windows

2013-02-07 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573496#comment-13573496
 ] 

Jukka Zitting commented on TIKA-1080:
-

If you don't provide an option like --encoding=UTF-8 on the tika-app command 
line, the text mode will assume that you'll be using the default encoding of 
the system (as reported by the Java runtime). Any characters not supported by 
that encoding will end up as question marks. The simple fix here is to 
explicitly provide the desired encoding with such a command line option.

In contrast HTML output defaults to UTF-8 (though you can override it with the 
--encoding option), since unlike with plain text, HTML clients can 
automatically extract the encoding information from the head of the document 
and thus display the content correctly.
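The question-mark substitution can be reproduced with the plain JDK; this sketch (not Tika code) encodes Arabic text with a charset that cannot represent it — US-ASCII here stands in for a Windows default codepage — and every unmappable character comes back as '?':

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode and decode with the same charset; String.getBytes replaces
    // unmappable characters with the charset's substitution byte ('?' for ASCII).
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // "marhaba"
        System.out.println(roundTrip(arabic, StandardCharsets.US_ASCII)); // ?????
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8));    // preserved
    }
}
```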


> Arabic characters under windows
> ---
>
> Key: TIKA-1080
> URL: https://issues.apache.org/jira/browse/TIKA-1080
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Affects Versions: 1.3
> Environment: Windows 2003 or Windows 2008
>Reporter: Alberto Ornaghi
> Attachments: arabic.docx
>
>
> If tika is executed under windows the text mode (--text) is failing to 
> extract arabic chars and outputs only question marks. The same behaviour 
> occurs if tika is executed as a server. The issue is not present in the GUI, 
> only commandline. The issue is not present if the output is html.



[jira] [Commented] (TIKA-1062) Add list detection to RTFParser

2013-01-24 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562457#comment-13562457
 ] 

Jukka Zitting commented on TIKA-1062:
-

bq. coding style

We've generally tried to stick with the [standard Java 
conventions|http://www.oracle.com/technetwork/java/javase/documentation/codeconvtoc-136057.html]
 (with spaces instead of tabs), but haven't been too fundamental about that. If 
you write new code, you get to decide what it looks like (within reason :-). If 
you modify existing code, try to stick with the existing style.

> Add list detection to RTFParser
> ---
>
> Key: TIKA-1062
> URL: https://issues.apache.org/jira/browse/TIKA-1062
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Axel Dörfler
>Assignee: Michael McCandless
>Priority: Minor
>  Labels: patch
> Fix For: 1.4
>
> Attachments: testRTFListLibreOffice.rtf, 
> testRTFListMicrosoftWord.rtf, tika-rtf-lists.patch
>
>
> RTF supports lists, and the parser could support those, too, using HTML 
> <ol>/<ul>/<li> tags.
> I'm attaching a patch that implements basic support for Word 97 and newer 
> lists. Nested lists are not supported correctly, yet, though, and a number of 
> formatting options are ignored.
> I've also added test cases for this, and adapted existing tests where needed.



[jira] [Resolved] (TIKA-1060) Degrade gracefully when juniversalchardet not present

2013-01-20 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1060.
-

   Resolution: Fixed
Fix Version/s: 1.4

Fixed in revision 1436209.

> Degrade gracefully when juniversalchardet not present
> -
>
> Key: TIKA-1060
> URL: https://issues.apache.org/jira/browse/TIKA-1060
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.4
>
>
> The UniversalEncodingDetector class used by TXTParser depends on the presence 
> of the juniversalchardet library. The detector tries to fail gracefully when 
> the library is not present by catching unexpected exceptions, but in fact 
> that case leads to a LinkageError that is not caught by the current code.



[jira] [Created] (TIKA-1060) Degrade gracefully when juniversalchardet not present

2013-01-20 Thread Jukka Zitting (JIRA)
Jukka Zitting created TIKA-1060:
---

 Summary: Degrade gracefully when juniversalchardet not present
 Key: TIKA-1060
 URL: https://issues.apache.org/jira/browse/TIKA-1060
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor


The UniversalEncodingDetector class used by TXTParser depends on the presence 
of the juniversalchardet library. The detector tries to fail gracefully when 
the library is not present by catching unexpected exceptions, but in fact that 
case leads to a LinkageError that is not caught by the current code.



[jira] [Commented] (TIKA-775) Embed Capabilities

2012-12-14 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532205#comment-13532205
 ] 

Jukka Zitting commented on TIKA-775:


bq. {{ catch (InterruptedException ignore) { 
Thread.currentThread().interrupt(); } }}

Resetting the interrupt status of the thread is a good idea when we can't just 
re-throw the InterruptedException, but IMHO we should in that case *also* throw 
a TikaException to indicate that the requested operation couldn't be completed 
normally:

{code}
...
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new TikaException("Interrupted", e);
}
...
{code}

When the {{Process.waitFor()}} call is interrupted we can't just blindly assume 
that the process completed normally. Explicitly throwing an exception is a 
better approach in such cases.

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed_20121029.diff, embed.diff, 
> tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.



[jira] [Resolved] (TIKA-1041) Tika 1.2 universalcharset errors

2012-12-13 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1041.
-

   Resolution: Fixed
Fix Version/s: (was: 1.2)
 Assignee: Jukka Zitting

I fixed this in revision 1421141 by catching the NoClassDefFoundError and just 
ignoring the missing functionality when the required dependency is not present. 
A deployment can pass in a ServiceLoader with a custom LoadErrorHandler through 
the ParseContext to log or otherwise handle such dependency issues.
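The distinction matters because {{NoClassDefFoundError}} is a {{LinkageError}}, not an {{Exception}}, so a catch-all {{catch (Exception e)}} silently misses it. A self-contained sketch (the missing class name is made up for illustration):

```java
public class LinkageGuardDemo {
    // Simulates touching a class whose jar is absent at runtime.
    static void useOptionalDependency() {
        throw new NoClassDefFoundError("org/example/MissingCharsetListener"); // hypothetical
    }

    // catch (Exception) does NOT intercept LinkageError, so the error propagates.
    static String naiveGuard() {
        try {
            useOptionalDependency();
            return "ok";
        } catch (Exception e) {
            return "caught exception";
        }
    }

    // Catching LinkageError lets the caller skip the missing functionality.
    static String robustGuard() {
        try {
            useOptionalDependency();
            return "ok";
        } catch (LinkageError e) {
            return "skipped missing dependency";
        }
    }

    public static void main(String[] args) {
        System.out.println(robustGuard());
        try {
            naiveGuard();
        } catch (LinkageError e) {
            System.out.println("naive guard did not help: " + e.getMessage());
        }
    }
}
```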

> Tika 1.2 universalcharset errors
> 
>
> Key: TIKA-1041
> URL: https://issues.apache.org/jira/browse/TIKA-1041
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.2
> Environment: I'm running solr 4.0 with tika 1.2 on tomcat 7.0.8 with 
> manifoldcf v1.1dev 
>Reporter: David Morana
>Assignee: Jukka Zitting
> Fix For: 1.3
>
>
> This is somewhat confusing and frustrating. I successfully crawled Opentext 
> using all of the above. then I recrawled and it aborted almost immediately.
> It choked on images, so I excluded them for now. 
> but now it's choking on txt files! 
> sometimes I get this error
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> org/mozilla/universalchardet/CharsetListener
> and sometimes I get this one
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> org/apache/tika/parser/txt/UniversalEncodingListener



[jira] [Resolved] (TIKA-1034) MimeTypes seems to be doing unnecessary work in the detect method

2012-11-29 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1034.
-

Resolution: Won't Fix

See the {{Detector}} javadocs. You can pass {{null}} as the {{InputStream}} in 
such cases. If you do pass a non-{{null}} stream, it needs to support the 
mark/reset feature (you'll need to wrap the stream in {{TikaInputStream}} or 
{{BufferedInputStream}} if necessary).

The reason why we only check the type hint from the input metadata after trying 
the other detection methods is that often such type hints (for example coming 
from a remote web server) are not very accurate. Thus we only use them if a 
more specific type can't automatically be detected.

Resolving as Won't Fix.
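The mark/reset contract referenced above can be shown with plain JDK streams. This sketch (not Tika's actual detection code) peeks at a few leading bytes, as a detector would, then rewinds so the full content remains readable:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MarkResetDemo {
    // Read up to n leading bytes, then reset the stream to the start.
    // The stream must support mark/reset; wrap it in BufferedInputStream otherwise.
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] buf = new byte[n];
        int read = in.read(buf, 0, n);
        in.reset();
        return Arrays.copyOf(buf, Math.max(read, 0));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "PK\u0003\u0004...rest of a zip".getBytes("ISO-8859-1");
        InputStream raw = new ByteArrayInputStream(data);
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        byte[] magic = peek(in, 2);                          // detector inspects the prefix
        System.out.println(new String(magic, "ISO-8859-1")); // PK
        System.out.println(in.readAllBytes().length == data.length); // fully rewound
    }
}
```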

> MimeTypes seems to be doing unnecessary work in the detect method
> -
>
> Key: TIKA-1034
> URL: https://issues.apache.org/jira/browse/TIKA-1034
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.2
>Reporter: Bice Dibley
>
> The final section of MimeTypes.detect is always used to set the type if 
> provided in the metadata, but does this after using two other resolution 
> strategies. Would it be possible to move the following to the top of the 
> detect method
> {code}
> // Get type based on metadata hint (if available)
> String typeName = metadata.get(Metadata.CONTENT_TYPE);
> if (typeName != null) {
> try {
>  MediaType hint = forName(typeName).getType();
>  if (registry.isSpecializationOf(hint, type)) {
> type = hint;
>  }
> } catch (MimeTypeException e) {
> // Malformed type name, ignore
> }
> {code}
> and if the type is successfully set, return at that point rather than 
> continuing with the other resolution strategies?
> The reason I ask is that I'm experiencing a problem with MimeType.detect 
> causing the stream to be closed because the type of the stream being used is 
> read-once and so doesn't support mark/reset. However, I am passing the 
> content type of the file in as part of the metadata, so the detect method 
> shouldn't need to read from the stream.



[jira] [Commented] (TIKA-1027) Allow null values when setting metadata

2012-11-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500330#comment-13500330
 ] 

Jukka Zitting commented on TIKA-1027:
-

Hmm, good point. I'd argue that the simplicity and safety of not having to 
worry about null values outweighs the benefits of using nulls in the Embedder, 
but then I'm not using Embedder yet so I don't have a balanced view on the 
issue.

A somewhat hacky alternative to using null values to signify removals would be 
to pass a set of field names to remove, either as a separate argument or as a 
custom multi-valued metadata field.
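The convenience under discussion can be sketched with a toy map-backed class (a stand-in for illustration, not Tika's real {{Metadata}} implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataSketch {
    private final Map<String, String> fields = new HashMap<>();

    // Per the proposal, set(name, null) is equivalent to remove(name),
    // so callers can drop their `if (value != null)` guards.
    public void set(String name, String value) {
        if (value == null) {
            fields.remove(name);
        } else {
            fields.put(name, value);
        }
    }

    public String get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        MetadataSketch m = new MetadataSketch();
        m.set("title", "report.docx");
        m.set("title", null); // behaves like remove("title")
        System.out.println(m.get("title")); // null
    }
}
```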

> Allow null values when setting metadata
> ---
>
> Key: TIKA-1027
> URL: https://issues.apache.org/jira/browse/TIKA-1027
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.3
>
>
> Many file formats have optional metadata entries that result in a lot of {{if 
> (value == null)}} statements around {{metadata.set(...)}} calls. It would be 
> more convenient if the {{set}} method explicitly allowed a {{null}} value and 
> that a call like {{metadata.set(name, null)}} was essentially equivalent to a 
> {{metadata.remove(name)}} call.



[jira] [Resolved] (TIKA-1027) Allow null values when setting metadata

2012-11-19 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1027.
-

Resolution: Fixed

Done in revision 1411237.

> Allow null values when setting metadata
> ---
>
> Key: TIKA-1027
> URL: https://issues.apache.org/jira/browse/TIKA-1027
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.3
>
>
> Many file formats have optional metadata entries that result in a lot of {{if 
> (value == null)}} statements around {{metadata.set(...)}} calls. It would be 
> more convenient if the {{set}} method explicitly allowed a {{null}} value and 
> that a call like {{metadata.set(name, null)}} was essentially equivalent to a 
> {{metadata.remove(name)}} call.



[jira] [Reopened] (TIKA-775) Embed Capabilities

2012-11-19 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting reopened TIKA-775:



There's a few problems with the implementation.

* The ExternalEmbedderTest fails in a plain Windows environment since it can't 
find {{sed}}. I added a workaround in revision 1411238 that simply disables the 
test on Windows.
* It would be better if ExternalEmbedderTest was located in {{tika-core}} along 
with the ExternalEmbedder class itself. The use of TXTParser in the test case 
seems unnecessary.
* More generally the test case is quite complicated. Is it being reused 
elsewhere, or can we simplify it? I'd just drop all the extra logging, error 
handling and flag variables.
* The ExternalEmbedder class also seems quite complicated, though I notice much 
of it comes from ExternalParser. Can we for example refactor the common bits to 
a shared base class?
* See the ExternalParser class for how you can (and should) use the 
TemporaryResources class to avoid all the complex cleanup logic. Used properly, 
the {{dispose()}} method takes care of all that.
* It's usually a bad idea to capture InterruptedException and just ignore it. 
Throwing the exception (possibly wrapped into a TikaException) is probably a 
better approach.

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed_20121029.diff, embed.diff, 
> tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.



[jira] [Created] (TIKA-1027) Allow null values when setting metadata

2012-11-19 Thread Jukka Zitting (JIRA)
Jukka Zitting created TIKA-1027:
---

 Summary: Allow null values when setting metadata
 Key: TIKA-1027
 URL: https://issues.apache.org/jira/browse/TIKA-1027
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor
 Fix For: 1.3


Many file formats have optional metadata entries that result in a lot of {{if 
(value == null)}} statements around {{metadata.set(...)}} calls. It would be 
more convenient if the {{set}} method explicitly allowed a {{null}} value and 
that a call like {{metadata.set(name, null)}} was essentially equivalent to a 
{{metadata.remove(name)}} call.



[jira] [Resolved] (TIKA-1026) ServiceLoader should respect OSGi service ranking

2012-11-19 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1026.
-

   Resolution: Fixed
Fix Version/s: 1.3

Done in revision 148.

> ServiceLoader should respect OSGi service ranking
> -
>
> Key: TIKA-1026
> URL: https://issues.apache.org/jira/browse/TIKA-1026
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 1.2
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>  Labels: osgi
> Fix For: 1.3
>
>
> Currently the ServiceLoader and TikaActivator classes simply maintain an 
> unordered collection of the Parser and Detector services currently available 
> in an OSGi environment. This is troublesome in the case where there are for 
> example two Parser services that both cover the same media type, and one of 
> them should be preferred. The OSGi way for specifying such preferences is 
> through [service 
> ranking|http://www.osgi.org/javadoc/r4v42/org/osgi/framework/Constants.html#SERVICE%5FRANKING],
>  but that currently doesn't work with Tika.



[jira] [Created] (TIKA-1026) ServiceLoader should respect OSGi service ranking

2012-11-19 Thread Jukka Zitting (JIRA)
Jukka Zitting created TIKA-1026:
---

 Summary: ServiceLoader should respect OSGi service ranking
 Key: TIKA-1026
 URL: https://issues.apache.org/jira/browse/TIKA-1026
 Project: Tika
  Issue Type: Improvement
  Components: config
Affects Versions: 1.2
Reporter: Jukka Zitting
Assignee: Jukka Zitting


Currently the ServiceLoader and TikaActivator classes simply maintain an 
unordered collection of the Parser and Detector services currently available in 
an OSGi environment. This is troublesome in the case where there are for 
example two Parser services that both cover the same media type, and one of 
them should be preferred. The OSGi way for specifying such preferences is 
through [service 
ranking|http://www.osgi.org/javadoc/r4v42/org/osgi/framework/Constants.html#SERVICE%5FRANKING],
 but that currently doesn't work with Tika.



[jira] [Updated] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-11-07 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-953:
---

Fix Version/s: (was: 1.2)

As of COMPRESS-191, Commons Compress can detect this issue for us. Once Commons 
Compress 1.5 is available we can upgrade to that version and remove the extra 
octet-stream condition in ZipContainerDetector.detectArchiveFormat().

> Tika failed to recognize non-ustar Tar  file?
> -
>
> Key: TIKA-953
> URL: https://issues.apache.org/jira/browse/TIKA-953
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.1
>Reporter: Jing Li
> Attachments: test2.tar, test.tar
>
>
> The file type indeed is "POSIX tar archive (GNU)" when I use command "file" 
> in linux, but Tika recognize it as "application/xhtml+xml".  The class I used 
> with is DefaultDetector. 
> Below is the head data of the file:
> 99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55, 
> 57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48, 
> 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 
> 48, 0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49, 
> 0, 32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 117, 115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111, 
> 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 



[jira] [Commented] (TIKA-1012) Add additional fields to MimeType reader

2012-11-07 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492428#comment-13492428
 ] 

Jukka Zitting commented on TIKA-1012:
-

Looks good, though it would be better if such custom fields were namespaced as 
the shared mime-info database spec says: "Applications may also define their 
own elements, provided they are namespaced to prevent collisions."

> Add additional fields to MimeType reader
> 
>
> Key: TIKA-1012
> URL: https://issues.apache.org/jira/browse/TIKA-1012
> Project: Tika
>  Issue Type: New Feature
>  Components: mime
>Reporter: Ryan McKinley
>Priority: Minor
> Attachments: TIKA-1012-MimeMeta.patch
>
>
> Currently the MimeType class exposes a description (_comment).  It would be 
> nice to also expose:
>  * Acronym (this is already in tika-mimetypes.xml, see BMP)
>  * Links, add helper docs for some formats
>  * UTI, http://en.wikipedia.org/wiki/Uniform_Type_Identifier
> A sample entry would look like this:
> {code:xml}
> <mime-type type="image/x-ms-bmp">
>   <acronym>BMP</acronym>
>   <_comment>Windows bitmap</_comment>
>   <_link>http://en.wikipedia.org/wiki/BMP_file_format</_link>
>   <_uti>com.microsoft.bmp</_uti>
>   ...
> </mime-type>
> {code}



[jira] [Resolved] (TIKA-1009) Expose TextDocument in BoilerpipeContentHandler

2012-11-07 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1009.
-

Resolution: Fixed
  Assignee: Jukka Zitting  (was: Ken Krugler)

Thanks! Patch applied in revision 1406663.

> Expose TextDocument in BoilerpipeContentHandler
> ---
>
> Key: TIKA-1009
> URL: https://issues.apache.org/jira/browse/TIKA-1009
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.3
>
> Attachments: TIKA-1009-1.3-1.patch
>
>
> The BoilerpipeContentHandler builds a TextDocument but external programs 
> cannot access it. This issue introduces a getTextDocument() method.



[jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements

2012-11-07 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492342#comment-13492342
 ] 

Jukka Zitting commented on TIKA-1017:
-

The idea behind DefaultHtmlMapper is to try to normalize and simplify the 
incoming HTML as much as possible while still preserving the semantic structure 
of the document. We can add extra elements if there's a good use case that's 
not already covered by the IdentifyHtmlMapper class.

> DefaultHtmlMapper misses some safe elements
> ---
>
> Key: TIKA-1017
> URL: https://issues.apache.org/jira/browse/TIKA-1017
> Project: Tika
>  Issue Type: Bug
>Reporter: Daniel Bonniot de Ruisselet
>
> The code of DefaultHtmlMapper says that the list of "safe" elements is based 
> on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> Elements like  and  are not included in the safe list. Is this 
> intentional (a comment with the rationale would be useful) or should they be 
> added?



[jira] [Commented] (TIKA-93) OCR support

2012-11-07 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492336#comment-13492336
 ] 

Jukka Zitting commented on TIKA-93:
---

JavaOCR looks interesting, and it looks like it's also available on the central 
Maven repository.

> OCR support
> ---
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



[jira] [Resolved] (TIKA-799) ForkParser does not populate metadata object after completing a parse

2012-11-07 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-799.


   Resolution: Fixed
Fix Version/s: 1.3
 Assignee: Jukka Zitting

Fixed in revision 1406597 by capturing any output metadata from the XHTML head 
and adding it to the Metadata object.

> ForkParser does not populate metadata object after completing a parse
> -
>
> Key: TIKA-799
> URL: https://issues.apache.org/jira/browse/TIKA-799
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 1.0
> Environment: Mac OS X 10.6, Java 6, Tika 1.0
>Reporter: Arthur Meneau
>Assignee: Jukka Zitting
>  Labels: metadata, parser
> Fix For: 1.3
>
>
> ForkParser does not add any new metadata to the metadata object passed to the 
> parse method.
> See these two links which describe the problem:
> http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser
> http://mail-archives.apache.org/mod_mbox/tika-user/201112.mbox/%3CCAOFYJNZ7hsATwfFbMmECjYY9Cii5BuJH7qGfXC08Yrqx%2BDmTvA%40mail.gmail.com%3E



[jira] [Resolved] (TIKA-1018) tika-bundle-0.6 missing poi-ooxml-schemas.jar

2012-11-06 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1018.
-

Resolution: Incomplete
  Assignee: Jukka Zitting

Please upgrade to the latest Tika version (1.2) and reopen if the problem still 
occurs.

> tika-bundle-0.6 missing poi-ooxml-schemas.jar
> -
>
> Key: TIKA-1018
> URL: https://issues.apache.org/jira/browse/TIKA-1018
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 0.6
>Reporter: Akash Kotadia
>Assignee: Jukka Zitting
>Priority: Critical
>
> Tika-bundle-0.6 missing poi-ooxml-schemas.jar. As a result its throwing 
> classNotFound exception if you try to use tika with .docx or any microsoft 
> office documents. It would be nice if we can have complete bundle which 
> contains its all dependencies.



[jira] [Resolved] (TIKA-968) tika-bundle missing org.apache.commons.logging.LogFactory

2012-08-05 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-968.


   Resolution: Fixed
Fix Version/s: 1.3
 Assignee: Jukka Zitting

The lack of the Commons Logging dependency is actually intentional, though the 
fact that it's marked as optional in the bundle manifest is not. I fixed that 
in revision 1369624.

The reason why we'd rather not include the logging library inside tika-bundle 
is that a deployment probably wants to use something like Pax Logging or 
another OSGi Log service implementation to control how log messages are 
handled. That's only possible if tika-bundle imports the logging packages from 
another bundle.

> tika-bundle missing org.apache.commons.logging.LogFactory
> -
>
> Key: TIKA-968
> URL: https://issues.apache.org/jira/browse/TIKA-968
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.2
>Reporter: Gary Karasiuk
>Assignee: Jukka Zitting
> Fix For: 1.3
>
>
> Trying to invoke the PDFParser from the tika-bundle results in this error:
> java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
>   at org.apache.pdfbox.pdfparser.BaseParser.(BaseParser.java:58)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1089)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:421)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (TIKA-966) org.apache.tika.Tika missing from tika-bundle-1.2.jar

2012-08-05 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-966.


   Resolution: Fixed
Fix Version/s: 1.3
 Assignee: Jukka Zitting

I updated Tika documentation to more accurately describe the current version of 
{{tika-bundle}}.

The start sequence problem you describe is most likely just a matter of timing 
in your test case. Tika uses a {{ServiceTracker}} to keep track of all 
available parser services, so the order in which the bundles are started 
shouldn't matter.

Since your test case, as included in the Eclipse bug report you filed, already 
runs in the _activator_, I suspect that you're simply experiencing a race 
condition where the parser services have not yet become available by the time 
your test code runs. The explicit {{tikaBundle.start()}} call you show in the 
last comment breaks this race by forcing the bundle to finish activating itself 
before your test case is run.

A better solution would be to move your test code out of the activator to a 
place where it gets invoked only once all relevant bundles are active. 

> org.apache.tika.Tika missing from tika-bundle-1.2.jar
> -
>
> Key: TIKA-966
> URL: https://issues.apache.org/jira/browse/TIKA-966
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.2
>Reporter: Gary Karasiuk
>Assignee: Jukka Zitting
> Fix For: 1.3
>
>
> In version 0.9 I was able to use the org.apache.tika.Tika class in my OSGi 
> application simply by including the tika-bundle.
> I am now trying to move to version 1.2 and this class is missing from the 
> tika-bundle-1.2.jar
> The top level documentation says "Tika bundle. An OSGi bundle that includes 
> everything you need to use all Tika functionality in an OSGi environment." so 
> I think that this should continue to be included.
> I tried including the tika-core-1.2.jar into my application, but even after 
> fiddling with various start levels I can't get the Tika class to find the 
> parsers in the tika-bundle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-05 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-970.


   Resolution: Fixed
Fix Version/s: 1.3
 Assignee: Jukka Zitting

In revision 1369608 I merged the information to Tika's media type registry. 
Thanks!

> Full identification of the JPEG 2000 family of formats
> --
>
> Key: TIKA-970
> URL: https://issues.apache.org/jira/browse/TIKA-970
> Project: Tika
>  Issue Type: New Feature
>  Components: mime
>Affects Versions: 1.3
>Reporter: Andrew Jackson
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 1.3
>
> Attachments: custom-mimetype.xml
>
>
> Please find attached a suitable set of magic definitions for allowing Tika to 
> identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
> formats. It is based on the 'file' magic from 
> [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
> the example files supplied on that site.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-956) Embedded docs in Word doc are not inlined (text is always added to the end)

2012-08-05 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428851#comment-13428851
 ] 

Jukka Zitting commented on TIKA-956:


Instead of the non-standard {{embedded}} attribute, it would be better to use a 
construct like {{<div class="embedded" id="...">}} for this. A code 
snippet like the following should do the trick:

{code}
String id = "_" + field.getMarkSeparatorCharacterRun(r).getPicOffset();
AttributesImpl attributes = new AttributesImpl();
attributes.addAttribute("", "class", "class", "CDATA", "embedded");
attributes.addAttribute("", "id", "id", "CDATA", id);
handler.startElement(XHTML, "div", "div", attributes);
{code}

An even better approach would be to use something like {{<img src="..."/>}} or 
{{<a href="...">...</a>}}. See the {{XWPFWordExtractorDecorator}} class for an 
example of how embedded images are handled in OOXML Word documents.
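A standalone sketch of the wrapping approach suggested above, using only the SAX helper classes shipped with the JDK. The serializing handler and the {{_42}} id are illustrative assumptions, not Tika API:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class EmbeddedDivDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Minimal SAX handler that serializes elements so the output is visible.
    static class PrintingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }
        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    static String render(String id) {
        PrintingHandler handler = new PrintingHandler();
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "class", "class", "CDATA", "embedded");
        attributes.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement(XHTML, "div", "div", attributes);
        // ... the embedded document's XHTML would be emitted here ...
        handler.endElement(XHTML, "div", "div");
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("_42"));  // prints <div class="embedded" id="_42"></div>
    }
}
```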

> Embedded docs in Word doc are not inlined (text is always added to the end)
> ---
>
> Key: TIKA-956
> URL: https://issues.apache.org/jira/browse/TIKA-956
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: TIKA-956.patch, TIKA-956.patch
>
>
> You can see this with the recently added testWORD_embedded_pdf.doc
> (for TIKA-948): the "Bye Bye" text comes before the "Wer
> wjelrwoierj..." text from the embedded PDF, opposite of what you see
> when you open the doc with Word.
> Yet, the thumbnail images do seem to be extracted at the right place
> (inlined).
> This is because WordExtractor.java has a separate pass at the end to
> visit the embedded docs.
> Would it be possible to recurse into an embedded doc at the point when
> it's first encountered instead...?  Or maybe somehow correlate the
> images with their corresponding attachment (right now they are just
> named image1, image2, ...)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428183#comment-13428183
 ] 

Jukka Zitting commented on TIKA-970:


Normally when we directly copy code or data from other projects we record that 
explicitly in our 
[LICENSE.txt|http://svn.apache.org/repos/asf/tika/trunk/LICENSE.txt] file (see 
for example the entry about MIME type info from the file command). The licenses 
that are deemed acceptable for such use are outlined in 
http://www.apache.org/legal/resolved.html.

In this case though it seems like you've just applied information acquired from 
the upstream source. Since facts can't be copyrighted and the amount of 
information used is below what I think could reasonably be claimed to be a 
database covered by copyright (unlike our use of the MIME type registry of the 
file command), I don't believe there's any need for extra licensing apart from 
what you already did above.

> Full identification of the JPEG 2000 family of formats
> --
>
> Key: TIKA-970
> URL: https://issues.apache.org/jira/browse/TIKA-970
> Project: Tika
>  Issue Type: New Feature
>  Components: mime
>Affects Versions: 1.3
>Reporter: Andrew Jackson
>Priority: Minor
> Attachments: custom-mimetype.xml
>
>
> Please find attached a suitable set of magic definitions for allowing Tika to 
> identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
> formats. It is based on the 'file' magic from 
> [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
> the example files supplied on that site.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (TIKA-709) Tika network server does not print anything in response to, for example, Word documents

2012-08-02 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-709.


Resolution: Fixed

It's better to file a new issue than reopen an old one whose fix has already 
been released. Thus re-resolving.

Re: CloseShieldInputStream; I don't see the need for that or for explicitly 
closing the streams in the finally block. The socket.close() call already takes 
care of releasing all resources, and there shouldn't be any need to explicitly 
protect the input stream from being closed. Please follow up on dev@ or in a 
separate issue if I'm missing something.

> Tika network server does not print anything in response to, for example, Word 
> documents
> ---
>
> Key: TIKA-709
> URL: https://issues.apache.org/jira/browse/TIKA-709
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.9
> Environment: Debian Linux Sid
>Reporter: Vitaliy Filippov
>Assignee: Jukka Zitting
> Fix For: 0.10
>
> Attachments: tika-709.diff
>
>
> When trying to use Tika Server (java -jar tika-app-0.9.jar -t -p PORT) to 
> parse M$Word DOC/DOCX files, tika server reads the file and then doesn't do 
> anything more, it simply hangs, probably blocked on a socket read. This does 
> not happend with, for example, HTML documents. I don't know the mechanics of 
> this bug, but the following change definitely fixes the issue:
> Change
> type.process(socket.getInputStream(), output);
> to
> type.process(new CloseShieldInputStream(socket.getInputStream()), output);

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-885) Possible ConcurrentModificationException while accessing Metadata produced by ParsingReader

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426842#comment-13426842
 ] 

Jukka Zitting commented on TIKA-885:


What I had in mind was something like a {{Metadata.copyFrom(Metadata)}} method 
that would copy all metadata from one instance to another. We'd then have three 
{{Metadata}} instances, one for the client, one for the parser and a shared one 
for passing updates from the parser to the client. Each {{write()}} in the 
background parser would do something like:

{code}
synchronized (sharedMetadata) {
sharedMetadata.copyFrom(parserMetadata);
}
{code}

... and each {{read()}} by the client would do:

{code}
synchronized (sharedMetadata) {
clientMetadata.copyFrom(sharedMetadata);
}
{code}

It's not terribly elegant, but should avoid the need to make all {{Metadata}} 
instances thread-safe.

bq. customized versions of PipedReader and PipedWriter classes that work 
concurrently

I'm not sure I understand. Perhaps you could describe the idea in more detail 
either on the dev@ list or in a separate improvement issue.
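The three-instance pattern described above can be sketched in plain Java. This is a hypothetical simplification: {{copyFrom()}} is the method proposed in the comment, not an existing Tika API, and the toy {{Metadata}} class stands in for the real one:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedMetadataDemo {
    // Simplified stand-in for Tika's Metadata; copyFrom() is the method
    // proposed in the comment above, not an existing Tika API.
    static class Metadata {
        private final Map<String, String> map = new HashMap<>();
        void set(String name, String value) { map.put(name, value); }
        String get(String name) { return map.get(name); }
        void copyFrom(Metadata other) { map.putAll(other.map); }
    }

    static String demo() {
        Metadata parserMetadata = new Metadata();  // owned by the parser thread
        Metadata sharedMetadata = new Metadata();  // the synchronization point
        Metadata clientMetadata = new Metadata();  // owned by the client thread

        parserMetadata.set("title", "Example");

        // Parser side: publish a consistent snapshot on each write().
        synchronized (sharedMetadata) {
            sharedMetadata.copyFrom(parserMetadata);
        }

        // Client side: take a consistent snapshot on each read().
        synchronized (sharedMetadata) {
            clientMetadata.copyFrom(sharedMetadata);
        }
        return clientMetadata.get("title");
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints Example
    }
}
```

Only the shared instance is ever touched by both threads, and only under its own lock, so the client's and parser's private copies never need synchronization.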

> Possible ConcurrentModificationException while accessing Metadata produced by 
> ParsingReader
> ---
>
> Key: TIKA-885
> URL: https://issues.apache.org/jira/browse/TIKA-885
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 1.0
> Environment: jre 1.6_25 x64 and Windows7 Enterprise x64
>Reporter: Luis Filipe Nassif
>Priority: Minor
>  Labels: patch
>
> Oracle's PipedReader and PipedWriter classes have a bug that does not allow 
> them to execute concurrently, because they notify each other only when the 
> pipe is full or empty, and not after a char is read from or written to the 
> pipe. So I modified ParsingReader to use modified versions of PipedReader and 
> PipedWriter, similar to the GNU versions of them, that work concurrently. 
> However, sometimes, with certain files, I am getting the following error:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(Unknown Source)
> at java.util.HashMap$KeyIterator.next(Unknown Source)
> at java.util.AbstractCollection.toArray(Unknown Source)
> at org.apache.tika.metadata.Metadata.names(Metadata.java:146)
> It happens because the ParsingReader.ParsingTask thread is writing metadata 
> while it is being read by the ParsingReader thread, with files containing 
> metadata beyond their initial bytes. It will not occur with the current 
> implementation, because the Java PipedReader and PipedWriter block each 
> other, which is a performance bug that affects ParsingReader, but they could 
> be fixed in a future Java release. I think it would be a defensive approach 
> to make access to the private Metadata.metadata Map synchronized, which 
> could avoid a possible future problem when using ParsingReader.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-966) org.apache.tika.Tika missing from tika-bundle-1.2.jar

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426752#comment-13426752
 ] 

Jukka Zitting commented on TIKA-966:


See also the 
[BundleIT|http://svn.apache.org/repos/asf/tika/trunk/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java]
 test case that we use to automatically verify that code like the one you 
describe works also in an OSGi environment.

> org.apache.tika.Tika missing from tika-bundle-1.2.jar
> -
>
> Key: TIKA-966
> URL: https://issues.apache.org/jira/browse/TIKA-966
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.2
>Reporter: Gary Karasiuk
>
> In version 0.9 I was able to use the org.apache.tika.Tika class in my OSGi 
> application simply by including the tika-bundle.
> I am now trying to move to version 1.2 and this class is missing from the 
> tika-bundle-1.2.jar
> The top level documentation says "Tika bundle. An OSGi bundle that includes 
> everything you need to use all Tika functionality in an OSGi environment." so 
> I think that this should continue to be included.
> I tried including the tika-core-1.2.jar into my application, but even after 
> fiddling with various start levels I can't get the Tika class to find the 
> parsers in the tika-bundle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-966) org.apache.tika.Tika missing from tika-bundle-1.2.jar

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426748#comment-13426748
 ] 

Jukka Zitting commented on TIKA-966:


You're looking in the wrong place. The dynamic parser services are looked up in 
{{DefaultParser.getParsers()}} separately for each {{parse()}} call, not in the 
static {{.getDefaultParsers()}} method that's called only once during 
initialization.

> org.apache.tika.Tika missing from tika-bundle-1.2.jar
> -
>
> Key: TIKA-966
> URL: https://issues.apache.org/jira/browse/TIKA-966
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.2
>Reporter: Gary Karasiuk
>
> In version 0.9 I was able to use the org.apache.tika.Tika class in my OSGi 
> application simply by including the tika-bundle.
> I am now trying to move to version 1.2 and this class is missing from the 
> tika-bundle-1.2.jar
> The top level documentation says "Tika bundle. An OSGi bundle that includes 
> everything you need to use all Tika functionality in an OSGi environment." so 
> I think that this should continue to be included.
> I tried including the tika-core-1.2.jar into my application, but even after 
> fiddling with various start levels I can't get the Tika class to find the 
> parsers in the tika-bundle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-966) org.apache.tika.Tika missing from tika-bundle-1.2.jar

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426678#comment-13426678
 ] 

Jukka Zitting commented on TIKA-966:


bq. Nowhere in this code path are the dynamically added parsers referenced.

That's as it should be. Instead of looking up available parser services once 
when the instance is created, Tika does so only when you actually ask it to 
parse something. That way parser services can dynamically come and go without 
the need to acquire new Tika instances that match the currently available 
services.

Note that for the dynamic parser service lookups to work properly, the 
tika-core bundle needs to be *started* instead of just *resolved*. See also 
TIKA-951 where the same question came up recently.
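The lookup-per-call design described above can be illustrated without OSGi. In this hypothetical sketch a {{Supplier}} stands in for the service registry (in a real OSGi deployment it would be a {{ServiceTracker}}); the names are illustrative, not Tika's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class DynamicLookupDemo {
    interface Parser { String parse(String input); }

    // Simplified composite that queries its registry on every parse() call
    // instead of caching the parser list at construction time.
    static class Composite {
        private final Supplier<List<Parser>> registry;
        Composite(Supplier<List<Parser>> registry) { this.registry = registry; }
        String parse(String input) {
            for (Parser p : registry.get()) {  // fresh lookup, not a cached list
                return p.parse(input);
            }
            return "no parser available";
        }
    }

    static String[] demo() {
        List<Parser> services = new ArrayList<>();  // stands in for the service registry
        Composite composite = new Composite(() -> new ArrayList<>(services));

        String before = composite.parse("doc");     // no parser registered yet
        services.add(input -> "parsed: " + input);  // a parser bundle starts
        String after = composite.parse("doc");      // picked up without a new Composite
        return new String[] { before, after };
    }

    public static void main(String[] args) {
        for (String result : demo()) {
            System.out.println(result);
        }
    }
}
```

Because the composite never caches the list, a parser service that appears (or disappears) after construction is reflected in the very next {{parse()}} call.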

> org.apache.tika.Tika missing from tika-bundle-1.2.jar
> -
>
> Key: TIKA-966
> URL: https://issues.apache.org/jira/browse/TIKA-966
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.2
>Reporter: Gary Karasiuk
>
> In version 0.9 I was able to use the org.apache.tika.Tika class in my OSGi 
> application simply by including the tika-bundle.
> I am now trying to move to version 1.2 and this class is missing from the 
> tika-bundle-1.2.jar
> The top level documentation says "Tika bundle. An OSGi bundle that includes 
> everything you need to use all Tika functionality in an OSGi environment." so 
> I think that this should continue to be included.
> I tried including the tika-core-1.2.jar into my application, but even after 
> fiddling with various start levels I can't get the Tika class to find the 
> parsers in the tika-bundle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426550#comment-13426550
 ] 

Jukka Zitting commented on TIKA-965:


I see where you're going, but it's a really tricky path. I tried doing 
something like that earlier on, but I found no easy way to keep down the number 
of false positives.

The ICU4J classes are written with the assumption that the data you're working 
on is always text and they just figure out which character encoding is most 
likely. They fail to take into account the possibility of the document being in 
some unknown binary format.

That's why we currently run the full ICU4J encoding detection (using the 
{{o.a.t.parser.txt.Icu4jEncodingDetector}} and 
{{o.a.t.detect.AutoDetectReader}} classes, see TIKA-322 and TIKA-471) only once 
we already know by other means that we're dealing with textual data.

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> 
>
> Key: TIKA-965
> URL: https://issues.apache.org/jira/browse/TIKA-965
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.2
>Reporter: Ray Gauss II
> Attachments: 
> 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
>
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 
> characters the TextDetector and TextStatistics classes fail to detect it as 
> text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files

2012-08-01 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-965:
---

Attachment: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch

The attached patch implements the above idea. It seems to work fine with the 
UTF-8 demo in http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt, 
though I don't know if we can include that file in Tika as a test case. Old 
texts from China, the Middle East, or other non-Latin parts of the world might 
be a good source of copyright-free test data.

bq. Are we likely to run into similar issues with other encodings besides UTF-8?

Probably, though I think the best way to deal with them is case-by-case based 
on concrete issues people face. AFAICT there's no generic solution to this 
problem.

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> 
>
> Key: TIKA-965
> URL: https://issues.apache.org/jira/browse/TIKA-965
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.2
>Reporter: Ray Gauss II
> Attachments: 
> 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
>
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 
> characters the TextDetector and TextStatistics classes fail to detect it as 
> text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426490#comment-13426490
 ] 

Jukka Zitting commented on TIKA-965:


I'm not too big a fan of the {{Charset}} classes in {{o.a.t.parser.txt}}. We 
borrowed them from ICU4J, and though they cover a lot of exotic corner cases, 
they're pretty slow and cumbersome to use with the vast majority of text out 
there.

An alternative that should work fairly well is to leverage the existing 
{{TextStatistics}} class in {{tika-core}} for a quick check of whether there 
are as many UTF-8 continuation bytes in the text as there should be. Something 
like the following might be a good approximation:

{code}
public boolean looksLikeUTF8() {
int control = count(0, 0x20);
int utf8 = count(0x20, 0x80);
int safe = countSafeControl();

int expectedContinuation = 0;
int[] leading = new int[] {
count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
for (int i = 0; i < leading.length; i++) {
utf8 += leading[i];
expectedContinuation += (i + 1) * leading[i];
}

int continuation = count(0x80, 0xc0);
return utf8 > 0
&& continuation <= expectedContinuation
&& continuation >= expectedContinuation - 3
&& count(0xf8, 0x100) == 0
&& (control - safe) * 100 < utf8 * 2;
}
{code}
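The snippet above can be exercised standalone by backing {{count(from, to)}} with a simple byte histogram. This is a sketch under the assumption that {{count()}} returns the number of bytes whose unsigned value falls in {{[from, to)}} and that tab, LF, and CR are the "safe" control characters, which matches how {{TextStatistics}} is used in the snippet:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Heuristic {
    private final int[] histogram = new int[256];

    Utf8Heuristic(byte[] data) {
        // Build a histogram of unsigned byte values.
        for (byte b : data) histogram[b & 0xff]++;
    }

    // Number of bytes whose unsigned value falls in [from, to).
    int count(int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) n += histogram[i];
        return n;
    }

    // Tab, LF, and CR are control characters that commonly occur in text.
    int countSafeControl() {
        return histogram['\t'] + histogram['\n'] + histogram['\r'];
    }

    boolean looksLikeUTF8() {
        int control = count(0, 0x20);
        int utf8 = count(0x20, 0x80);
        int safe = countSafeControl();

        int expectedContinuation = 0;
        int[] leading = {
                count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
        for (int i = 0; i < leading.length; i++) {
            utf8 += leading[i];
            expectedContinuation += (i + 1) * leading[i];
        }

        int continuation = count(0x80, 0xc0);
        return utf8 > 0
                && continuation <= expectedContinuation
                && continuation >= expectedContinuation - 3
                && count(0xf8, 0x100) == 0
                && (control - safe) * 100 < utf8 * 2;
    }

    public static void main(String[] args) {
        byte[] greek = "καλημέρα κόσμε\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(new Utf8Heuristic(greek).looksLikeUTF8());  // true
    }
}
```

The mostly non-ASCII Greek sample passes because each two-byte character contributes exactly one lead byte and one continuation byte, so the observed continuation count matches the expected count.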

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> 
>
> Key: TIKA-965
> URL: https://issues.apache.org/jira/browse/TIKA-965
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.2
>Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 
> characters the TextDetector and TextStatistics classes fail to detect it as 
> text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Comment Edited] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files

2012-08-01 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426490#comment-13426490
 ] 

Jukka Zitting edited comment on TIKA-965 at 8/1/12 10:12 AM:
-

I'm not too big a fan of the {{Charset}} classes in {{o.a.t.parser.txt}}. We 
borrowed them from ICU4J, and though they cover a lot of exotic corner cases, 
they're pretty slow and cumbersome to use with the vast majority of text out 
there.

An alternative that should work fairly well is to leverage the existing 
{{TextStatistics}} class in {{tika-core}} for a quick check of whether there 
are as many UTF-8 continuation bytes in the text as there should be. Something 
like the following might be a good approximation:

{code}
public boolean looksLikeUTF8() {
int control = count(0, 0x20);
int utf8 = count(0x20, 0x80);
int safe = countSafeControl();

int expectedContinuation = 0;
int[] leading = new int[] {
count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
for (int i = 0; i < leading.length; i++) {
utf8 += leading[i];
expectedContinuation += (i + 1) * leading[i];
}

int continuation = count(0x80, 0xc0);
return utf8 > 0
&& continuation <= expectedContinuation
&& continuation >= expectedContinuation - 3
&& count(0xf8, 0x100) == 0
&& (control - safe) * 100 < utf8 * 2;
}
{code}

  was (Author: jukkaz):
I'm not too big a fan of the {{Charset}} classes in {{o.a.t.parser.txt}}. 
We borrowed them from ICU4J, and though they cover a lot of exotic corner 
cases, they're pretty slow and cumbersome to use with the vast majority of text 
out there.

An alternative that should work fairly well is to leverage the existing 
{{TextStatistics}} class in {{tika-core}} for a quick check of whether there 
are as many UTF-8 continuation bytes in the text as there should be. Something 
like the following might be a good approximation:

{code}
public boolean looksLikeUTF8() {
int control = count(0, 0x20);
int utf8 = count(0x20, 0x80);
int safe = countSafeControl();

int expectedContinuation = 0;
int[] leading = new int[] {
count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
for (int i = 0; i < leading.length; i++) {
utf8 += leading[i];
expectedContinuation += (i + 1) * leading[i];
}

int continuation = count(0x80, 0xc0);
return utf8 > 0
&& continuation <= expectedContinuation
&& continuation >= expectedContinuation - 3
&& count(0xf80, 0x100) == 0
&& (control - safe) * 100 < utf8 * 2;
}
{code}
  
> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> 
>
> Key: TIKA-965
> URL: https://issues.apache.org/jira/browse/TIKA-965
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.2
>Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 
> characters the TextDetector and TextStatistics classes fail to detect it as 
> text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files

2012-07-31 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425797#comment-13425797
 ] 

Jukka Zitting commented on TIKA-965:


In the {{TextDetector}} we could also look for the characteristic UTF-8 
multi-byte sequences, as they should be easy to spot and pretty reliable as 
indicators of UTF-8 encoded text.
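One concrete way to spot those sequences, sketched here outside of Tika (the method name and the stop-at-first-match behavior are illustrative, not {{TextDetector}}'s actual code): check that a lead byte in the 0xC0-0xF7 range is followed by the expected number of 0x80-0xBF continuation bytes.

```java
public class Utf8SequenceScan {
    static boolean hasValidMultiByteSequence(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xff;
            if (b < 0xc0 || b >= 0xf8) {
                continue;  // not a UTF-8 lead byte
            }
            int expected = b >= 0xf0 ? 3 : b >= 0xe0 ? 2 : 1;  // continuation bytes
            if (i + expected >= data.length) {
                return false;  // sequence truncated at end of buffer
            }
            for (int j = 1; j <= expected; j++) {
                int c = data[i + j] & 0xff;
                if (c < 0x80 || c >= 0xc0) {
                    return false;  // not a continuation byte: likely not UTF-8
                }
            }
            return true;  // found one well-formed multi-byte sequence
        }
        return false;  // pure ASCII, no multi-byte sequence at all
    }

    public static void main(String[] args) {
        byte[] utf8 = "naïve".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] latin1 = {'n', 'a', (byte) 0xef, 'v', 'e'};  // "naïve" in ISO-8859-1
        System.out.println(hasValidMultiByteSequence(utf8));    // true
        System.out.println(hasValidMultiByteSequence(latin1));  // false
    }
}
```

The ISO-8859-1 sample fails because its lone high byte (0xEF) would need two continuation bytes in UTF-8, but is followed by plain ASCII, which is exactly the kind of mismatch that makes these sequences a reliable indicator.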

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> 
>
> Key: TIKA-965
> URL: https://issues.apache.org/jira/browse/TIKA-965
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.2
>Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 
> characters the TextDetector and TextStatistics classes fail to detect it as 
> text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-966) org.apache.tika.Tika missing from tika-bundle-1.2.jar

2012-07-31 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425794#comment-13425794
 ] 

Jukka Zitting commented on TIKA-966:


In 1.0 we excluded tika-core from tika-bundle as it works better that way with 
dynamically loaded parser services. Unfortunately it looks like we forgot to 
update the relevant documentation.

You should be able to get your deployment working simply by including tika-core 
in addition to tika-bundle.

> org.apache.tika.Tika missing from tika-bundle-1.2.jar
> -
>
> Key: TIKA-966
> URL: https://issues.apache.org/jira/browse/TIKA-966
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.2
>Reporter: Gary Karasiuk
>
> In version 0.9 I was able to use the org.apache.tika.Tika class in my OSGi 
> application simply by including the tika-bundle.
> I am now trying to move to version 1.2 and this class is missing from the 
> tika-bundle-1.2.jar
> The top level documentation says "Tika bundle. An OSGi bundle that includes 
> everything you need to use all Tika functionality in an OSGi environment." so 
> I think that this should continue to be included.
> I tried including the tika-core-1.2.jar into my application, but even after 
> fiddling with various start levels I can't get the Tika class to find the 
> parsers in the tika-bundle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (TIKA-872) Tika --extract fails for RTF

2012-07-09 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-872:
---

Fix Version/s: (was: 1.3)
   Issue Type: New Feature  (was: Bug)

Our RTF parser doesn't yet support embedded documents, so this would be a new 
feature.

See the RTFParser and TextExtractor classes in o.a.t.parser.rtf inside the 
tika-parsers components for the place where something like this should be 
implemented.

> Tika --extract fails for RTF
> 
>
> Key: TIKA-872
> URL: https://issues.apache.org/jira/browse/TIKA-872
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows 7 with Java v1.6
>Reporter: Albert L.
> Attachments: embedded.rtf.zip
>
>
> A file that is embedded in an RTF file doesn't get extracted to disk.
> To "embed" a file into an RTF, simply drag-drop it into an RTF document when 
> using MS-Word 2010.  It will then create an EMF of the embedded file's 
> preview.
> See attached file "embedded.rtf.zip" for an example input file that fails 
> with Tika v1.0.





[jira] [Resolved] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2012-07-09 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-754.


   Resolution: Won't Fix
Fix Version/s: (was: 1.3)

Resolving as Won't Fix as explained above.

> Automatic line break insertion (BR element) instead of '\n' in 
> XHTMLContentHandler
> --
>
> Key: TIKA-754
> URL: https://issues.apache.org/jira/browse/TIKA-754
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 0.10, 1.0
>Reporter: Pablo Queixalos
>Priority: Minor
> Attachments: TIKA-754.poc.patch
>
>
> As seen with some parsers (PDF, PPT), some text blocks still contain 
> carriage returns ('\n') in the outputted XHTML. 
> A global fix for this could be located in XHTMLContentHandler.characters(...): 
> by analyzing the given char array, when a '\n' char is encountered, insert a 
> BR element instead.
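The proposed fix can be sketched as a scan over the char array that emits a br element wherever a '\n' appears. This is a hedged, self-contained sketch: output is built as a string purely for illustration, and the class name is invented; the real fix would call startElement/endElement on the handler instead.

```java
// Sketch of what the proposed fix would do inside
// XHTMLContentHandler.characters(...): walk the char array and substitute
// a <br/> element for each '\n' instead of passing the control character
// through. BrInserter is a hypothetical stand-in, not a Tika class.
class BrInserter {
    static String withBreaks(char[] ch, int start, int length) {
        StringBuilder out = new StringBuilder();
        for (int i = start; i < start + length; i++) {
            if (ch[i] == '\n') {
                out.append("<br/>"); // real fix: startElement("br"); endElement("br");
            } else {
                out.append(ch[i]);
            }
        }
        return out.toString();
    }
}
```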





[jira] [Resolved] (TIKA-815) Tika parsers should handle failures more gracefully

2012-07-08 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-815.


Resolution: Duplicate

Resolving this as a duplicate of all the followup issues mentioned above.

> Tika parsers should handle failures more gracefully
> ---
>
> Key: TIKA-815
> URL: https://issues.apache.org/jira/browse/TIKA-815
> Project: Tika
>  Issue Type: Test
>  Components: parser
>Affects Versions: 1.0
>Reporter: Jerome Lacoste
>
> We encountered an OOM while parsing a Word document. We will report the 
> failure to POI.
> This raises the question about the general robustness of the parsers.
> We've written a little test tool that reproduces the aforementioned OOM and 
> other potential issues that will be reported to the individual parsers. It's 
> the responsibility of the parsers to handle those failures gracefully.
> Yet it's easy to write generic tools at the Tika level to run these kinds of 
> tests.
> So we also submit this issue here to start a discussion on what role Tika 
> should have when it comes to validating its parsers.
> Code here: https://github.com/lacostej/tika-hardener





[jira] [Commented] (TIKA-885) Possible ConcurrentModificationException while accessing Metadata produced by ParsingReader

2012-07-08 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409102#comment-13409102
 ] 

Jukka Zitting commented on TIKA-885:


Hmm, that is a good point! I guess the best way to solve this, apart from 
making Metadata fully synchronized, would be to pass a copy of the given 
metadata object to the parsing process in the background thread, and then 
explicitly copy any updates back to the original Metadata instance when the 
client calls read() or other methods on the reader instance. A bit like how we 
handle the transmission of an exception across the threads.
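The copy-and-sync idea described above can be sketched with a plain map standing in for Tika's Metadata. This is a minimal sketch under stated assumptions: the class and method names are illustrative, not Tika API; the background thread writes only to its private copy, and the client thread publishes updates when it calls read().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the copy-and-sync pattern: the background parsing
// thread mutates only its own snapshot, and the reader thread copies new
// entries back to the original map whenever the client calls read().
class CopyBackMetadata {
    private final Map<String, String> original;               // owned by the client thread
    private final Map<String, String> copy = new HashMap<>(); // owned by the parser thread

    CopyBackMetadata(Map<String, String> original) {
        this.original = original;
        copy.putAll(original); // snapshot handed to the background thread
    }

    // Called only from the background parsing thread.
    void set(String name, String value) {
        synchronized (copy) {
            copy.put(name, value);
        }
    }

    // Called from the client thread (e.g. inside ParsingReader.read())
    // to publish any updates made so far.
    void syncBack() {
        synchronized (copy) {
            original.putAll(copy);
        }
    }
}
```

The original map is never touched by the background thread, so the client can iterate its names() without risking a ConcurrentModificationException.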

> Possible ConcurrentModificationException while accessing Metadata produced by 
> ParsingReader
> ---
>
> Key: TIKA-885
> URL: https://issues.apache.org/jira/browse/TIKA-885
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 1.0
> Environment: jre 1.6_25 x64 and Windows7 Enterprise x64
>Reporter: Luis Filipe Nassif
>Priority: Minor
>  Labels: patch
>
> Oracle PipedReader and PipedWriter classes have a bug that does not allow 
> them to execute concurrently, because they notify each other only when the 
> pipe is full or empty, and not after a char is read from or written to the 
> pipe. So I modified ParsingReader to use modified versions of PipedReader and 
> PipedWriter, similar to the GNU versions of them, that work concurrently. 
> However, sometimes and with certain files, I am getting the following error:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(Unknown Source)
> at java.util.HashMap$KeyIterator.next(Unknown Source)
> at java.util.AbstractCollection.toArray(Unknown Source)
> at org.apache.tika.metadata.Metadata.names(Metadata.java:146)
> It is because the ParsingReader.ParsingTask thread is writing metadata while 
> it is being read by the ParsingReader thread, for files containing metadata 
> beyond their initial bytes. It does not occur with the current implementation, 
> because the Java PipedReader and PipedWriter block each other, which is a 
> performance bug that affects ParsingReader, but they could be fixed in a 
> future Java release. I think it would be a defensive approach to make access 
> to the private Metadata.metadata Map synchronized, which could avoid a 
> possible future problem when using ParsingReader.





[jira] [Commented] (TIKA-456) Support timeouts for parsers

2012-07-08 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409101#comment-13409101
 ] 

Jukka Zitting commented on TIKA-456:


bq. Does ForkParser use a kind of timeout internally?

There is currently no timeout on the client side of the ForkParser, but the 
server uses a watchdog timer that forcibly shuts down the forked JVM process if 
there's been no interaction with the client in the last five seconds. That's 
basically the "some progress" approach outlined by Andrej above.

> Support timeouts for parsers
> 
>
> Key: TIKA-456
> URL: https://issues.apache.org/jira/browse/TIKA-456
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Ken Krugler
>Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common 
> case is when a parser is fed an incomplete document, such as what happens 
> when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and 
> then use this with a FutureTask. For example, when using a ParsedDatum POJO 
> for the results of the parse operation, I do something like this:
> parser = new AutoDetectParser();
> Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, 
> inputstream, metadata);
> FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
> Thread t = new Thread(task);
> t.start();
> ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable looks like:
> class TikaCallable implements Callable<ParsedDatum> {
> public TikaCallable(Parser parser, ContentHandler handler, InputStream 
> is, Metadata metadata) {
> _parser = parser;
> _handler = handler;
> _input = is;
> _metadata = metadata;
> ...
> }
> public ParsedDatum call() throws Exception {
> 
> _parser.parse(_input, _handler, _metadata, new ParseContext());
> 
> }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be 
> able to guarantee that none of the parsers being wrapped by Tika could ever 
> hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. 
> something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code 
> above) and use the specified timeout when calling task.get().
> One minus with this approach is that it creates a new thread for each parse 
> request, but I don't think the thread overhead is significant when compared 
> to the typical parser operation.
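The TimeoutParser idea quoted above boils down to running the potentially hanging call on a worker thread and bounding the wait with Future.get(timeout). A minimal, Tika-free sketch follows; TimeoutRunner and its method are hypothetical names, and the Callable stands in for a real parse() invocation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Generic timeout wrapper: submit the work to a single-thread executor and
// wait at most the given timeout. On overrun, Future.get throws
// TimeoutException and the finally block interrupts the hung task.
class TimeoutRunner {
    static <T> T runWithTimeout(Callable<T> work, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(work);
        try {
            return future.get(timeout, unit); // throws TimeoutException on overrun
        } finally {
            future.cancel(true);    // interrupt a hung parse
            executor.shutdownNow(); // don't leak the worker thread
        }
    }
}
```

As the poster notes, this costs one thread per parse, which is usually negligible next to the parse itself.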





[jira] [Resolved] (TIKA-892) Tika does not use the HTML5 meta charset tag when determining charset

2012-07-08 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-892.


   Resolution: Fixed
Fix Version/s: 1.2
 Assignee: Jukka Zitting  (was: Ken Krugler)

Thanks! I committed the patch in revision 1358872.

> Tika does not use the HTML5 meta charset tag when determining charset
> -
>
> Key: TIKA-892
> URL: https://issues.apache.org/jira/browse/TIKA-892
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
> Environment: Any
>Reporter: Chris Jones
>Assignee: Jukka Zitting
> Fix For: 1.2
>
> Attachments: tika-html5.patch
>
>
> HTML5 introduced a new way of specifying the charset for an HTML document:
> <meta charset="UTF-8">
> The code in HtmlParser only looks for the HTML4 style meta http-equiv tag:
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">





[jira] [Resolved] (TIKA-431) Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

2012-07-08 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-431.


   Resolution: Fixed
Fix Version/s: 1.2
 Assignee: Jukka Zitting  (was: Ken Krugler)

In revision 1358858 I made the text and html parsers return character encoding 
information in the charset parameter of the returned content type. The content 
encoding field is still present for backwards compatibility, but I added a note 
to the CHANGES.txt mentioning that it should be considered deprecated.

> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> ---
>
> Key: TIKA-431
> URL: https://issues.apache.org/jira/browse/TIKA-431
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Reporter: Erik Hetzner
>Assignee: Jukka Zitting
> Fix For: 1.2
>
> Attachments: TIKA-431.patch
>
>
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> Content-Encoding is not for the charset. It is for values like gzip, deflate, 
> compress, or identity.
> Charset is passed in with the Content-Type. For instance: text/html; 
> charset=iso-8859-1
> Tika should, in my opinion, do the following:
> 1. Stop using Content-Encoding, unless it wants me to be able to pass in 
> gzipped content in an input stream.
> 2. Parse and understand charset=... declarations if passed in the Metadata 
> object
> 3. Return charset=... declarations in the Metadata object if a charset is 
> detected.
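Point 2 above, parsing a charset=... declaration out of a Content-Type value such as "text/html; charset=iso-8859-1", can be sketched with plain string matching. The class name and regex here are illustrative only; Tika's own media type handling is more robust than this stand-in.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: extract the charset parameter from a Content-Type value.
// Handles an optionally quoted value; anything fancier (escaped quotes,
// parameter ordering) is out of scope for this illustration.
class CharsetFromContentType {
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset=\"?([^\";\\s]+)\"?");

    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null; // no header, no charset
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```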





[jira] [Resolved] (TIKA-242) Incremental configuration AutoDetectParser

2012-07-08 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-242.


Resolution: Duplicate

Resolving as a duplicate of the auto-loading mechanisms we added for detectors 
and parsers and the multiple type database support added in TIKA-746. The use 
case of this issue is already covered by those changes.

> Incremental configuration AutoDetectParser
> --
>
> Key: TIKA-242
> URL: https://issues.apache.org/jira/browse/TIKA-242
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 0.3
>Reporter: Jan Goyvaerts
> Attachments: TikaConfig-patch
>
>
> The AutoDetectParser accepts one totally new configuration or no 
> configuration. For the latter case a default configuration is used. It would 
> be nice if the former case is a configuration on top of the default 
> configuration.





[jira] [Resolved] (TIKA-430) Automatically let all valid XHTML 1.0 attributes through from HTML documents

2012-07-08 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-430.


Resolution: Incomplete

Resolving as incomplete since after two years there still isn't a patch for 
this issue. Feel free to reopen when there's something to add. The correct 
place to put something like this would be the SAFE_ATTRIBUTES map in 
DefaultHtmlMapper.
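A per-element attribute whitelist of the kind the SAFE_ATTRIBUTES map holds could look roughly like the sketch below. The element and attribute lists here are invented for illustration and are not Tika's actual map; the lookup is case-insensitive, matching how HTML names are usually normalized.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative element -> allowed-attributes whitelist, the shape of
// DefaultHtmlMapper's SAFE_ATTRIBUTES. Entries are examples only.
class SafeAttributes {
    private static final Map<String, Set<String>> SAFE = new HashMap<>();
    static {
        SAFE.put("a",   new HashSet<>(Arrays.asList("href", "rel", "name")));
        SAFE.put("img", new HashSet<>(Arrays.asList("src", "alt")));
    }

    // True when the attribute should be passed through for that element.
    static boolean isSafe(String element, String attribute) {
        Set<String> allowed = SAFE.get(element.toLowerCase(Locale.ROOT));
        return allowed != null && allowed.contains(attribute.toLowerCase(Locale.ROOT));
    }
}
```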

> Automatically let all valid XHTML 1.0 attributes through from HTML documents
> 
>
> Key: TIKA-430
> URL: https://issues.apache.org/jira/browse/TIKA-430
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>
> Many consumers of parse output wouldn't want to process the raw 
> (unnormalized) elements they'd get with the IdentityHtmlMapper, but they 
> would want to get any standard attributes. For example, with <a> elements 
> they would get any rel attributes.
> I believe this would require changing the DefaultHtmlMapper to "know" about 
> valid attributes for different elements.




