[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842403#comment-17842403
 ] 

Nick Burch commented on TIKA-4249:
--

I'd probably say we change the 0="From:" into "0=From" or "0=(UTF-8-BOM)From:", 
should be a little less likely to have false positives that way

First time I've come across a Byte Order Mark at the start of an email file 
though!
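
If someone wants to experiment locally in the meantime, here is an untested sketch of roughly what the second option might look like in tika-mimetypes.xml terms, with the EF BB BF BOM bytes written as octal escapes:

{code:xml}
<mime-type type="message/rfc822">
  <magic priority="50">
    <!-- existing match -->
    <match value="From:" type="string" offset="0"/>
    <!-- untested: same match, but allowing a UTF-8 BOM (EF BB BF) before it -->
    <match value="\357\273\277From:" type="string" offset="0"/>
  </magic>
</mime-type>
{code}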

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgraded from 3.9.0 to 3.9.2 and found that the attached 
> file is treated as a text file instead of an email file. Please look into this 
> issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Copilot license for open source?

2024-04-22 Thread Nick Burch

On Sun, 21 Apr 2024, Michael Wechner wrote:
Thanks for the pointer to the Generative Tooling rules, which I was not 
aware of so far.


At the bottom it says that the ASF does not tell developers what tools 
to use, but I think it would be useful to have some concrete 
examples, which would make the rules clearer.


(Not a lawyer, not an official ASF response)

There's nothing special about LLMs and this, other than perhaps the speed 
with which you can make mistakes... When including other people's code, 
it's all about license compatibility and attribution


The ASF started when a bunch of people started sharing patches for a web 
server, with attribution and code under a compatible license. The 
foundation grew during a period where it got easier to find code + code 
snippets online, including much that wasn't under a compatible license. 
Rules didn't change, other than clarifying processes for checking licenses 
and what was/wasn't compatible.


You weren't, and still aren't, allowed to copy + paste large chunks of 
someone else's code without a compatible license and suitable attribution. 
Using a LLM to read all the internet and suggest the code to copy doesn't 
change that. Well, other than the well-documented issues with getting LLMs 
to cite their sources...


LLMs have loads of great uses, including helping you learn new things, 
decoding error messages, finding common patterns, rubber-ducking etc. 
They're even worse than many internet forums for suggesting large chunks 
of code of unclear provenance to copy+paste


It doesn't matter if it's ChatGPT, Github Co-pilot, a local LLM, someone 
on StackOverflow, or a YouTube video that's giving you some code you want 
to copy. 3 characters are almost certainly fine, 3 pages are almost 
certainly not, a general idea is often fine, and you absolutely need to 
engage your brain before committing to ASF repos!



Otherwise, if you do still think more rules / examples / etc are needed, 
you'll be wanting legal-discuss@

https://lists.apache.org/list.html?legal-disc...@apache.org

Cheers
Nick


Re: Copilot license for open source?

2024-04-21 Thread Nick Burch

On Fri, 19 Apr 2024, Nicholas DiPiazza wrote:

Can I get an open source license for GitHub copilot?


I've not heard of anyone offering that. Some of the open and open-ish 
models are quite good on coding tasks, though you'd need to hop to a 
different interface to ask for help (unlike the in-line way with github 
co-pilot)


Whatever you opt for, make sure you read + understand + follow the ASF 
Generative Tooling rules though!

https://www.apache.org/legal/generative-tooling.html

Nick


Re: junk cves -- rant

2024-04-12 Thread Nick Burch

On Thu, 11 Apr 2024, Tim Allison wrote:

I just excluded joda-time because of this: CVE-2024-23080
https://nvd.nist.gov/vuln/detail/CVE-2024-23080

This is an NPE in joda-time version 2.12.5. That's two versions before the
current...is it actually still in there. And more importantly, an NPE is
not a CVE in Java. People, please.


Have you seen all the rants from the Curl folks?
https://daniel.haxx.se/blog/2024/02/21/disputed-not-rejected/
https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/

Nick


Re: Document chunking

2024-04-08 Thread Nick Burch

On Mon, 8 Apr 2024, Tim Allison wrote:
Not sure we should jump on the bandwagon, but anything we can do to 
support smart chunking would benefit us.


Could just be more integrations with parsers that turn out to be useful. I
haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
https://github.com/Filimoa/open-parse


I played around with chunking a bit late last year, but owing to not 
getting any of the AI jobs I went for, I didn't get it beyond a rough 
prototype. I can say that most people are doing a terrible job in their 
out-of-the-box configs...


My current suggested (but not fully tested) approach is (rough code sketch after the list):
 * Define a range of chunk sizes that you'd like (min / ideal / max)
 * Parse as XHTML with Tika
 * Keep track of headings and table headers
 * Break on headings
 * If a chunk is too big, break on other elements (eg div or p)
 * If a chunk is too small, and near other small chunks, join them
 * Include 1-2 headings above the current one at the top,
   as a targeted bit of Table of Contents. (eg chunk starts on H3, put
   the H2 in as well)
 * If you broke up a huge table, repeat the table headers at the
   start of every chunk
 * When you're done chunking + adding bits back at the top, convert
   to markdown on output
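
A very rough, untested Java sketch of just the heading/size-based splitting (class name and sizes are made up; the small-chunk joining, heading breadcrumbs and table-header repetition are left out):

// Rough, untested sketch: collect Tika's XHTML output into chunks,
// breaking on headings, and on p/div once a chunk passes a size limit.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingChunker extends DefaultHandler {
    private static final int MAX_CHARS = 2000;   // the "max" size, made up
    private final List<String> chunks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Always break on h1..h6; break on p/div only once the chunk is too big
        boolean isHeading = local.length() == 2 && local.charAt(0) == 'h'
                && Character.isDigit(local.charAt(1));
        boolean tooBig = current.length() > MAX_CHARS
                && ("p".equals(local) || "div".equals(local));
        if (isHeading || tooBig) {
            flush();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.append(ch, start, length);
    }

    private void flush() {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            chunks.add(text);
        }
        current.setLength(0);
    }

    public static List<String> chunk(Path file) throws Exception {
        HeadingChunker handler = new HeadingChunker();
        try (InputStream in = Files.newInputStream(file)) {
            // Tika streams XHTML SAX events into our handler as it parses
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        handler.flush();
        return handler.chunks;
    }
}

The point is simply that Tika streams XHTML SAX events, so the chunker only has to watch element boundaries rather than re-parse anything.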

Happy to explain more! But sadly lacking time right now to do much on that

Nick

[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-26 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830867#comment-17830867
 ] 

Nick Burch commented on TIKA-4223:
--

A lot of the early file extension allocations were taken from the HTTPD mime 
magics, which for obscure formats is unlikely to be representative of use 
today. So, for something like this, I'm +1 to moving the glob to a more 
common/popular format that also shares the same extension

> STL file exported with OpenSCAD not detected correctly
> --
>
> Key: TIKA-4223
> URL: https://issues.apache.org/jira/browse/TIKA-4223
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.9.1
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude_ascii.stl, linear_extrude_binary.stl
>
>
> STL files can be in ASCII or in binary format. Exporting this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad])
>  with OpenSCAD to STL, the ASCII result file is detected as text/plain.
> Also, the binary STL is detected as application/vnd.ms-pki.stl, which differs 
> from the model/stl mime-type Wikipedia lists for those files.
>  
> Used commands for attached files
> {code:java}
> openscad.exe --export-format asciistl -o result\linear_extrude_ascii.stl 
> examples\Basics\linear_extrude.scad {code}
> {code:java}
> openscad.exe --export-format binstl -o result\linear_extrude_binary.stl 
> examples\Basics\linear_extrude.scad
> {code}
> Refs:
> https://en.wikipedia.org/wiki/STL_(file_format)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827017#comment-17827017
 ] 

Nick Burch commented on TIKA-4210:
--

The attached file seems to be an RTF file. I'm not sure what a ".mega 
attachment" is, but this file doesn't seem to be one of them...

tika-app-2.9.1.jar is able to correctly identify this file as RTF

> Not able to identify tika extension
> ---
>
> Key: TIKA-4210
> URL: https://issues.apache.org/jira/browse/TIKA-4210
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: sample.DOC
>
>
> Hi Team,
> The attached embedded file contains .mega attachments whose extension Tika is 
> not able to identify. Tried with Tika versions 2.9.0 and 2.9.1; it is still 
> shown as empty. Please look into this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824965#comment-17824965
 ] 

Nick Burch commented on TIKA-4208:
--

I would expect that the json output version would need a bit more memory, as 
we'll have to hold all the content in memory before outputting instead of just 
streaming the text/html out as we go along. I wouldn't expect it to be 4gb vs 
32gb though!

Any ideas anyone? Is it possible we've got an extra layer (or 2?) of buffering 
above and beyond what we need for the {{-J}} option?

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Gregory Lepore
>Priority: Minor
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size 
> exceeds VM limit 
>    at java.base/java.util.Arrays.copyOf(Arrays.java:3537) 
>    at 
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>  
>    at 
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>  
>    at java.base/java.lang.StringBuffer.append(StringBuffer.java:410) 
>    at java.base/java.io.StringWriter.write(StringWriter.java:99) 
>    at 
> org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>  
>    at 
> org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>  
>    at 
> org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>  
>    at 
> org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47) 
>    at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>  
>    at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x7f94a022d1a8.write(Unknown
>  Source) 
>    at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106) 
>    at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>  
>    at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>  
>    at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>  
>    at 
> org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146) 
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
>    at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) 
>    at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153) 
>    at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>  
>    at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71) 
>    at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>  
>    at 
> org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824874#comment-17824874
 ] 

Nick Burch commented on TIKA-4208:
--

How much heap size do you have allocated?

The error suggests that Tika managed to decode the string in the SAS data file, 
but ran out of memory passing the string through the content handler stack to 
plain text. Generally things break at the decode step if they're going to, 
rather than the output!
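
If you haven't already, it is worth retrying with an explicitly larger heap, something along these lines (the exact jar name depends on your build, and the-file.arc.gz is a placeholder):

{code:java}
java -Xmx8g -jar tika-app-3.0.0-BETA.jar -J the-file.arc.gz > out.json
{code}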

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Gregory Lepore
>Priority: Minor
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size 
> exceeds VM limit 
>    at java.base/java.util.Arrays.copyOf(Arrays.java:3537) 
>    at 
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>  
>    at 
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>  
>    at java.base/java.lang.StringBuffer.append(StringBuffer.java:410) 
>    at java.base/java.io.StringWriter.write(StringWriter.java:99) 
>    at 
> org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>  
>    at 
> org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>  
>    at 
> org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>  
>    at 
> org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47) 
>    at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>  
>    at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x7f94a022d1a8.write(Unknown
>  Source) 
>    at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106) 
>    at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>  
>    at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>  
>    at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>  
>    at 
> org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146) 
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
>    at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) 
>    at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153) 
>    at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>  
>    at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71) 
>    at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>  
>    at 
> org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816788#comment-17816788
 ] 

Nick Burch commented on TIKA-3784:
--

From [https://datatracker.ietf.org/doc/rfc7292/] it looks like PKCS12 is based 
on PKCS7, so that's expected. There's a few more types defined in 
[https://www.rfc-editor.org/rfc/rfc7292.html#appendix-D] - not sure if we can 
find any of those to match on?

Though [https://www.cs.auckland.ac.nz/~pgut001/pubs/pfx.html] does suggest 
this isn't an ideal format...

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.26
>Reporter: Matthias Hofbauer
>Priority: Critical
> Attachments: dump_p12s.txt
>
>
> We are using tika to check if the MIME type of the file extensions matches 
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but 
> the tika detector returns "application/x-x509-key" instead.
> After checking the tika-mimetype.xml and comparing it to my .p12 file I found 
> the following MIME magic which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4148) Support Autodesk Inventor files (.ipt) (.iam) (.ipn) (.idw)

2023-11-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787608#comment-17787608
 ] 

Nick Burch commented on TIKA-4148:
--

For detection of the OLE2 based files, we don't need to find unique byte 
combinations, we only need to find unique OLE2 entry names / sets of names

See 
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/POIFSContainerDetector.java#L362]
 for an example of "must have this then one of those"

If you can run POIFSLister (and/or POIFSDumper) on a bunch of files, and spot 
the entry names that are common (+ ideally not already in POIFSContainerDetector 
for other ones), that's what we need
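
For anyone with sample files to hand, POIFSLister can be run straight from the command line with POI and its support jars on the classpath, along these lines (the jar version and sample.ipt are just placeholders):

{code:java}
java -cp "poi-5.2.5.jar:lib/*" org.apache.poi.poifs.dev.POIFSLister sample.ipt
{code}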

> Support Autodesk Inventor files (.ipt) (.iam) (.ipn) (.idw)
> ---
>
> Key: TIKA-4148
> URL: https://issues.apache.org/jira/browse/TIKA-4148
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexey Pismenskiy
>Priority: Major
>
> Add support for Autodesk Inventor files in Tika. 
> Examples of the files can be downloaded from 
> [https://www.autodesk.com/support/technical/article/caas/tsarticles/ts/3gnm93P9sPAWE6vndk7fjq.html]
> It would be great to start at least at the metadata level and then add 
> content parsing later. 
> I suspect it would be something similar to 
> [DWGParser|https://tika.apache.org/0.9/api/org/apache/tika/parser/dwg/DWGParser.html]; 
> any suggestions where to start looking are appreciated. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-4119:
-
Component/s: mime

> Return media type "text/javascript" instead of "application/javascript to 
> follow RFC-9239
> -
>
> Key: TIKA-4119
> URL: https://issues.apache.org/jira/browse/TIKA-4119
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Matthias Juchmes
>Priority: Major
>  Labels: tika-3x
>
> [RFC-9239|https://www.rfc-editor.org/rfc/rfc9239.html] obsoletes some 
> javascript media types, including "application/javascript", which is 
> currently returned by Tika for javascript files. "text/javascript" is defined 
> as the most widely supported one, so Tika should reflect this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-4119:
-
Labels: tika-3x  (was: )

> Return media type "text/javascript" instead of "application/javascript to 
> follow RFC-9239
> -
>
> Key: TIKA-4119
> URL: https://issues.apache.org/jira/browse/TIKA-4119
> Project: Tika
>  Issue Type: Improvement
>Reporter: Matthias Juchmes
>Priority: Major
>  Labels: tika-3x
>
> [RFC-9239|https://www.rfc-editor.org/rfc/rfc9239.html] obsoletes some 
> javascript media types, including "application/javascript", which is 
> currently returned by Tika for javascript files. "text/javascript" is defined 
> as the most widely supported one, so Tika should reflect this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-08-29 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759921#comment-17759921
 ] 

Nick Burch commented on TIKA-4119:
--

I wonder if this is a big enough change around Detection that we ought to wait 
for 3.x to make it. Thoughts anyone?

(We already define {{text/javascript}} as an alias for the type, so users can 
already define parsers etc for the text variant, but swapping the canonical and 
the alias is going to break a lot of detection uses if people don't update)

> Return media type "text/javascript" instead of "application/javascript to 
> follow RFC-9239
> -
>
> Key: TIKA-4119
> URL: https://issues.apache.org/jira/browse/TIKA-4119
> Project: Tika
>  Issue Type: Improvement
>Reporter: Matthias Juchmes
>Priority: Major
>
> [RFC-9239|https://www.rfc-editor.org/rfc/rfc9239.html] obsoletes some 
> javascript media types, including "application/javascript", which is 
> currently returned by Tika for javascript files. "text/javascript" is defined 
> as the most widely supported one, so Tika should reflect this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4062) OfflineContentHandler/ContentHandlerDecorator does not provide option for custom error handling

2023-08-02 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750344#comment-17750344
 ] 

Nick Burch commented on TIKA-4062:
--

Between holidays and the length of time needed for regression runs + votes, I 
suspect late August / early September

> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> ---
>
> Key: TIKA-4062
> URL: https://issues.apache.org/jira/browse/TIKA-4062
> Project: Tika
>  Issue Type: Bug
>  Components: tika-core
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
>Reporter: Ravi Ranjan Jha
>Priority: Critical
> Fix For: 2.8.1
>
>
> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> Prior to the change of passing OfflineContentHandler to SAX Parser in 
> XMLReaderUtils.parseSAX, one could pass a custom ContentHandlerDecorator to 
> handle exception or override error/warning etc methods. The same is not 
> possible now because the default impl for handleException in the 
> OfflineContentHandler's parent ContentHandlerDecorator just throws exception 
> as shown below:
>  
>  protected void handleException(SAXException exception) throws SAXException {
>         throw exception;
>     }
>  
> which could probably be (at minimum)
> public void handleException(SAXException exception) throws SAXException {
>         handler.handleException(exception);
>     }
>  
> This is breaking our app's behavior. Please take it as priority.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4064) Update to 2.8.1

2023-07-28 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748454#comment-17748454
 ] 

Nick Burch commented on TIKA-4064:
--

Depends if anyone else on the PMC has the time to be release manager for it 
(sadly I don't). If we're relying on Tim once more, I suspect early September, 
as Tim's busy for a few weeks before he could start the release process going

> Update to 2.8.1
> ---
>
> Key: TIKA-4064
> URL: https://issues.apache.org/jira/browse/TIKA-4064
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.8.0
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 2.8.1
>
>
> The latest maven versions plugin finds much more outdated stuff than the 
> previous one, so I'll do a few updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3948) Require Java 11 in 3.x

2023-07-28 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748452#comment-17748452
 ] 

Nick Burch commented on TIKA-3948:
--

[~solomax] I think the first task is to identify any other areas of Tika that 
will be affected by the switch. That may be an explicit dependency, but I fear 
it's more likely to be things a long way down the dependency tree in something 
(probably one of the scientific parsers with more sporadic updates). 

Once we know all the places that'll be affected, then we can come up with a 
plan for any changes needed directly in Tika, and a plan for any dependencies 
which need updates but where upstream haven't/won't do the matching ones. And 
then we can think about a preview release :)

> Require Java 11 in 3.x
> --
>
> Key: TIKA-3948
> URL: https://issues.apache.org/jira/browse/TIKA-3948
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>  Labels: tika-3x
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4098) Detection fails on PDF with garbage before header

2023-07-10 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741578#comment-17741578
 ] 

Nick Burch commented on TIKA-4098:
--

The more bytes beyond the start we check for the PDF marker, the more likely we 
are to mis-identify a different file as a PDF. The %PDF- marker is pretty 
unique at the start of a file, but progressively less so as the content 
continues. (Consider a markdown file of a talk on file formats, that could 
easily have the text "Look for %PDF- at the start" on page 10 and we don't want 
to mark the whole thing as a PDF!)

If you know for sure that a file is a PDF, just skip detection and tell Tika 
and we'll hand it off to the PDF parser!

If your use case has very few text-based formats, you can fairly safely bump 
the search window up. Out-of-the-box, I'd be very worried to push it much 
further due to the false positive risk
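
If you do accept that trade-off for your own deployment, here is an untested sketch of a custom-mimetypes.xml entry that adds a wider-window match on top of the built-in PDF definition:

{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <!-- Untested: extra, wider-window magic for PDF. A bigger window
       increases the false-positive risk described above. -->
  <mime-type type="application/pdf">
    <magic priority="50">
      <match value="%PDF-" type="string" offset="0:768"/>
    </magic>
  </mime-type>
</mime-info>
{code}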

> Detection fails on PDF with garbage before header
> -
>
> Key: TIKA-4098
> URL: https://issues.apache.org/jira/browse/TIKA-4098
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.8.0
>Reporter: Thierry Guérin
>Priority: Minor
> Attachments: garbageBeforeHeader.pdf
>
>
> PDF detection fails on files that contain too much garbage before the header 
> '%PDF-'.
> Those PDFs do not respect the specification, but are nonetheless correctly 
> handled by PDF viewers.
> The attached PDF is an example of the garbage found in a real-life PDF (looks 
> like email headers that 'leaked' onto the PDF file). The PDF itself is one 
> that I generated so that the example is small.
> The current magic for PDFs limits the search for the '%PDF-' header to 512 
> bytes, and in the attached PDF it's located after 702 garbage bytes.
> I looked at the sources of PdfBox and Ghostscript to see how they handle this 
> case and:
>  * Ghostscript searches through the entire file (see 
> [https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c] 
> lines 1323-1339)
>  * PdfBox reads the file line by line, and stops looking for the header when  
> it encounters a line that starts with a digit (see 
> [https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java]
>  lines 1561-)
> From the doc in tika-mimetypes.xml for the application/pdf MIME type, I 
> understand that increasing the maximum offset can trigger false positives. I 
> increased it to 768, and the unit tests pass, but I didn't find any PDF that  
> tests this particular case, so either it doesn't exist or there are 
> integration tests that aren't part of this repo ?
> How can I go about testing for regressions ? I can provide a pull request for 
> this change, but where do I put the test PDF and a unit test?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730728#comment-17730728
 ] 

Nick Burch commented on TIKA-4060:
--

I'm a muppet... had forgotten to escape the hex characters in the regexp when 
transposing into a Tika mime magic match!

Now fixed and applied. Thanks for helping us find this magic

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, below in PRONOM format curly 
> braces mean to look ahead between the two values for the subsequent patterns.
>  
> The first pattern is pretty basic, the second pattern is the first pattern 
> after a 2048 ID3 header.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-4060.
--
Fix Version/s: 2.8.1
   Resolution: Fixed

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Fix For: 2.8.1
>
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, below in PRONOM format curly 
> braces mean to look ahead between the two values for the subsequent patterns.
>  
> The first pattern is pretty basic, the second pattern is the first pattern 
> after a 2048 ID3 header.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730649#comment-17730649
 ] 

Nick Burch commented on TIKA-4060:
--

0x494433 is the string ID3, which I think ought to be at the start. It is in 
the handful of files I've found. The rest of the magic is pretty vague and a 
little prone to false positives, so I'm reluctant to match on the string "ID3" 
anywhere in the first 2kb and then the vague 3 bytes somewhere else further on.

I've tried to make the matches a little "tighter" to hopefully reduce false 
positives, just seem to have gone too tight - the test file I produced with ID3 
tags does have the ID3 at the start. The hex dump key sections are:

{{ 49 44 33 03 00 00 00 00 09 6b 54 50 45 31 00 00 |ID3..kTPE1..|}}
{{0010 00 0c 00 00 00 54 65 73 74 20 41 72 74 69 73 74 |.Test Artist|}}
{{...}}
{{0090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||}}
{{*}}
{{04f0 00 00 00 00 00 ff f1 50 80 32 5f fc de 02 00 4c |...P.2_L|}}

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, below in PRONOM format curly 
> braces mean to look ahead between the two values for the subsequent patterns.
>  
> The first pattern is pretty basic, the second pattern is the first pattern 
> after a 2048 ID3 header.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-07 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730304#comment-17730304
 ] 

Nick Burch commented on TIKA-4060:
--

I have created some small test AAC files using ffmpeg, and then had a go at 
adding the mime magic for the two cases. 

However, detection of the ID3 header case isn't working. Can anyone spot what 
I've done wrong? https://github.com/apache/tika/tree/TIKA-4060
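
For anyone else wanting to experiment, something along these lines should produce a similar plain ADTS test file (the ID3-tagged variant needs a separate tagging step):

{code:java}
ffmpeg -f lavfi -i "sine=frequency=440:duration=2" -c:a aac -f adts test.aac
{code}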

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, below in PRONOM format curly 
> braces mean to look ahead between the two values for the subsequent patterns.
>  
> The first pattern is pretty basic, the second pattern is the first pattern 
> after a 2048 ID3 header.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4051) Explore new parsers

2023-06-03 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728992#comment-17728992
 ] 

Nick Burch commented on TIKA-4051:
--

Last time I asked the MPXJ project they weren't interested in switching, but 
it's always worth another try after a few years! Very old plugin is 
https://github.com/Gagravarr/MPXJ-Tika if anyone wants to help bring it a bit 
more up-to-date?

> Explore new parsers
> ---
>
> Key: TIKA-4051
> URL: https://issues.apache.org/jira/browse/TIKA-4051
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> Let's use this ticket as a parking lot for links to parsers that might be 
> interesting to integrate.
> Here's an ASL 2.0 RTFParser: [https://github.com/joniles/rtfparserkit/] 
> single developer, and release was last year.  We'd want to do a bakeoff 
> before making the switch, but it would be nice to offload our custom 
> RTFParser.
>  
> This library parses project plans: [https://github.com/joniles/mpxj] It is 
> LGPL, which is incompatible with ASL 2.0.  So it is a non-starter now, but if 
> there's interest in integrating with Tika, we might ask the mpxj project if 
> they'd have any interest in changing their license.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3999) audio/xm audio/x-mod

2023-05-23 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725561#comment-17725561
 ] 

Nick Burch commented on TIKA-3999:
--

Oh, this brings back memories... good memories :)

Unless we can enlist the help of some dedicated members of the "demo scene", I 
think a parser is unlikely any time soon.

From the table provided (wow! thanks!), I think we can probably add a whole 
bunch of subtypes of {{audio/x-mod}} which we can then detect. Just need to 
use the regression suite to ensure that some of the shorter magic entries are 
sufficiently unique - the 2-4 byte ones worry me a little bit. May need to add 
some as subtype with file extension but not magic where it isn't unique enough

> audio/xm audio/x-mod
> 
>
> Key: TIKA-3999
> URL: https://issues.apache.org/jira/browse/TIKA-3999
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4045) DBF/MDB row count extraction

2023-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724302#comment-17724302
 ] 

Nick Burch commented on TIKA-4045:
--

I guess this could also apply for other row-based formats like SQLite or 
Spreadsheets? Though I'm not sure how best to output it on a per-table / 
per-sheet basis.

For the metadata keys, I guess we could re-use the same ones as we added for 
CSV in TIKA-3938 ?

> DBF/MDB row count extraction
> 
>
> Key: TIKA-4045
> URL: https://issues.apache.org/jira/browse/TIKA-4045
> Project: Tika
>  Issue Type: Improvement
>Reporter: Gregory Lepore
>Priority: Minor
>
> It would be quite helpful for my organization to extract the number of 
> records/rows in any given database file format like DBF or MDB. Along with 
> byte count this would give us a good idea of the amount of information stored 
> in the files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4025) Extract frame count from gifs

2023-05-02 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718674#comment-17718674
 ] 

Nick Burch commented on TIKA-4025:
--

Would a video metadata specification's frame count be a better home? 

XMP seems to have a pretty complex FrameCount type, from a quick glance I 
couldn't spot an obvious property using that but I feel like there ought to be 
one...

> Extract frame count from gifs
> -
>
> Key: TIKA-4025
> URL: https://issues.apache.org/jira/browse/TIKA-4025
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> Over on TIKA-4019, an animated gif example made me realize that we're not 
> currently extracting the number of frames for gifs into the metadata.  We 
> should do this.
>  
> Any recs for the name of the metadata key?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: idea about creation of accounts

2023-03-13 Thread Nick Burch

On Mon, 13 Mar 2023, Nicholas DiPiazza wrote:
can we require that the request form for creating a jira account 
contains the first issue they would like to create?


You'd need to ask on users@infra about that, it's an ASF wide thing (to 
avoid a huge spam problem) and not something our project currently can 
configure


Nick


[jira] [Commented] (TIKA-3981) Tika parser meets window system file

2023-02-24 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693140#comment-17693140
 ] 

Nick Burch commented on TIKA-3981:
--

Is this happening for all executables on your machine, or just some? And if so, 
is there any pattern to which executables are showing sensible dates and which 
are showing future ones?

Does Windows Explorer show a more sensible date?

Can anyone reproduce this with a small file from an open source project?

(We have 8 test files in our test suite, all of which are coming back with 
sensible dates, so need some help to track down more details on this bug!)

> Tika parser meets window system file
> 
>
> Key: TIKA-3981
> URL: https://issues.apache.org/jira/browse/TIKA-3981
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: ASK_Tika_Parser.docx
>
>
> Hi All,
>  
>    I execute the command "java -jar tika-app-2.7.0.jar" and load the 
> Windows system executable file where.exe. 
>   You can find the file on your own Windows system at 
> c:\Windows\system32\where.exe.
>   Tika reports dcterms:created as "2037-03-05T20:49:08Z", but I am 
> confused by this future time. 
>   Could you help check why Tika gets this unusual created date, please?  
>  
>  Attachment is also my testing with several tika versions, for your 
> reference. 
> Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689199#comment-17689199
 ] 

Nick Burch commented on TIKA-3973:
--

If you only care about container-aware detection for Ogg based formats, you 
should be fine right now with just

{code:java}
implementation 'org.apache.tika:tika-core:2.7.0'
implementation 'org.gagravarr:vorbis-java-tika:0.8'
{code}

The Vorbis Tika module should pull in the other things it needs (such as core)
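
With those on the classpath, an untested sketch of container-aware detection, which should pick the Ogg detector up via the service loader:

{code:java}
import java.nio.file.Paths;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectOgg {
    public static void main(String[] args) throws Exception {
        TikaConfig config = TikaConfig.getDefaultConfig();
        // TikaInputStream lets container-aware detectors spool to disk if needed
        try (TikaInputStream tis = TikaInputStream.get(Paths.get("speech_output.ogg"))) {
            MediaType type = config.getDetector().detect(tis, new Metadata());
            System.out.println(type);  // expect audio/opus rather than audio/vorbis
        }
    }
}
{code}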

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have a ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689176#comment-17689176
 ] 

Nick Burch commented on TIKA-3973:
--

For all container formats you want {{tika-parsers}} or {{tika-parsers-standard}}

If you only care about the Ogg formats, then {{vorbis-java-tika}} from 
{{org.gagravarr}} is enough

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have a ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161
 ] 

Nick Burch edited comment on TIKA-3973 at 2/15/23 2:38 PM:
---

For container-based detection (such as the Ogg container format), you really 
need to include the Tika Parsers jars too.

With the Ogg container detector enabled (which comes with the Tika media 
parsers), Tika can correctly detect the type as {{audio/opus}}

We have magic which will detect an opus file with a single stream if you're 
lucky, but with containers it's very hit-and-miss if you can tell with magic 
alone. Enabling the Ogg container detector is the best solution though, that 
should always work no matter what order the streams are in, what streams are 
contained etc


was (Author: gagravarr):
For container-based detection (such as the Ogg container format), you really 
need to include the Tika Parsers jars too.

With the Ogg container detector enabled (which comes with the Tika media 
parsers), Tika can correctly detect the type as {{audio/opus}}

We have magic which will detect an opus file with a single stream if you're 
lucky, but with containers it's very hit-and-miss if you can tell with magic 
alone. Enabling the Ogg container detector is the best solution though, that 
should always work no matter what order the streams are in, what streams are 
contained etc{{{}
{}}}

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have a ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161
 ] 

Nick Burch commented on TIKA-3973:
--

For container-based detection (such as the Ogg container format), you really 
need to include the Tika Parsers jars too.

With the Ogg container detector enabled (which comes with the Tika media 
parsers), Tika can correctly detect the type as {{audio/opus}}

We have magic which will detect an opus file with a single stream if you're 
lucky, but with containers it's very hit-and-miss if you can tell with magic 
alone. Enabling the Ogg container detector is the best solution though, that 
should always work no matter what order the streams are in, what streams are 
contained etc

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have a ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3960) PGP encrypted files get detected as application/octet-stream

2023-01-30 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682352#comment-17682352
 ] 

Nick Burch commented on TIKA-3960:
--

If possible, please include a small test file and update 
{{tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java}} to test 
the detection
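
A test along these lines is probably all that is needed, assuming the usual assertTypeByData/assertTypeByName helpers and a small sample (testPGP.pgp here is a hypothetical file name):

{code:java}
@Test
public void testPgpEncryptedDetection() throws Exception {
    // testPGP.pgp = hypothetical small encrypted sample under test-documents
    assertTypeByName("application/pgp-encrypted", "testPGP.pgp");
    assertTypeByData("application/pgp-encrypted", "testPGP.pgp");
}
{code}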

> PGP encrypted files get detected as application/octet-stream
> 
>
> Key: TIKA-3960
> URL: https://issues.apache.org/jira/browse/TIKA-3960
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.6.0
>Reporter: Tayseer Sabha
>Priority: Major
>
> We use Tika for detecting and validating uploaded files using their 
> content/magic bytes and not only their names/extension.
> Passing a PGP/GPG encrypted file to Tika.detect(InputStream stream) will 
> always return application/octet-stream instead of application/pgp-encrypted 
> defined in tika-mimetypes.xml
> The issue occurs because the application/pgp-encrypted mime-type defined in 
> tika-mimetypes.xml is lacking a magic match and only has <glob pattern="*.pgp"/>
> I managed to fix the issue for us temporarily by adding 
> application/pgp-encrypted including a magic match in our custom-mimetypes.xml 
> file. I will create a Pull Request on Github with the fix to resolve the 
> issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677364#comment-17677364
 ] 

Nick Burch commented on TIKA-3703:
--

I guess we could include a data package metadata file to better describe the 
other files in the zip? 
[https://specs.frictionlessdata.io/data-package/#introduction]

That might make it "more standard" for people to understand what they've got 
and why

> Consider adding a frictionless data package output format
> -
>
> Key: TIKA-3703
> URL: https://issues.apache.org/jira/browse/TIKA-3703
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for 
> thumbnails, or embedded images or embedded files or rendered pages, it would 
> be great to return that data in a standard format. Our current /unpack 
> endpoint uses a zip file but with our own "standard".
> I was thinking about heading down the pure json option by including these 
> byte streams as base64 encoded metadata values in our current metadata 
> object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>  
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server 
> (\{{/v2/unpack}} or something else?) and as a commandline option in tika-app.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677326#comment-17677326
 ] 

Nick Burch commented on TIKA-3703:
--

A zip file gives you compression, and most clients won't accidentally try to 
buffer it in memory. JSON with base-64 encoded data is negative compression, 
and a high risk of clients OOM-ing due to trying to fit all of the raw JSON and 
parsed JSON in memory at once

(If it was just thumbnails then I could see some advantages of JSON, but it 
also works on container formats with potentially huge contents)

In terms of recursion, I think it should be off on the default endpoint (as 
now), but with another that supports it. Maybe eg {{/unpack}} and 
{{/unpack/recursive}} ?

> Consider adding a frictionless data package output format
> -
>
> Key: TIKA-3703
> URL: https://issues.apache.org/jira/browse/TIKA-3703
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for 
> thumbnails, or embedded images or embedded files or rendered pages, it would 
> be great to return that data in a standard format. Our current /unpack 
> endpoint uses a zip file but with our own "standard".
> I was thinking about heading down the pure json option by including these 
> byte streams as base64 encoded metadata values in our current metadata 
> object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>  
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server 
> (\{{/v2/unpack}} or something else?) and as a commandline option in tika-app.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3955) separate dependencies from tika-app-2.6.0-noasm-nojson

2023-01-12 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675914#comment-17675914
 ] 

Nick Burch commented on TIKA-3955:
--

The Tika App is intended as a "batteries included" standalone app.

If you are adding Tika to a Java app, you should add the Java library. Include 
`tika-core` and as many of the `tika-parser-*` parsers as your application 
needs. Doing that via Maven or Gradle will allow you to manage any dependency 
clashes

> separate dependencies from tika-app-2.6.0-noasm-nojson 
> ---
>
> Key: TIKA-3955
> URL: https://issues.apache.org/jira/browse/TIKA-3955
> Project: Tika
>  Issue Type: Wish
>Reporter: Dhoka Pramod
>Priority: Major
>
> Hi Team,
> We are using tika-app-2.6.0-noasm-nojson.jar (uber jar) and it is bundled 
> with all the required third-party jars as mentioned below
> activation-1.1.1.jar
> bcmail-jdk18on-1.72.jar
> bcpkix-jdk18on-1.72.jar
> bcprov-jdk18on-1.72.jar
> byte-buddy-1.12.7.jar
> commons-cli-1.4.jar
> commons-codec-1.15.jar
> commons-collections4-4.1.jar
> commons-compress-1.21.jar
> commons-exec-1.0.jar
> commons-io-2.11.0.jar
> commons-lang3-3.8.1.jar
> commons-logging-1.1.1.jar
> gson-2.9.0.jar
> jackson-core-2.14.0.jar
> jackson-databind-2.14.0.jar
> jaxb-impl-2.1.13.jar
> jaxen-1.1.6.jar
> juniversalchardet-1.0.3.jar
> log4j-api-2.19.0.jar
> log4j-core-2.19.0.jar
> slf4j-api-1.7.36.jar
> xercesImpl.jar
> xmlbeans-3.1.0.jar
> Our application also adds the above jars as it requires. This is leading to 
> duplicate classes on the classpath. Could you provide the tika-app jar 
> (skinny jar) and a list of required dependencies so that we will add them to 
> our application classpath to avoid duplicates.
> Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656060#comment-17656060
 ] 

Nick Burch commented on TIKA-3952:
--

Is the PDF a scan? Are you doing OCR?

> Content mismatch 
> -
>
> Key: TIKA-3952
> URL: https://issues.apache.org/jira/browse/TIKA-3952
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Tika User
>Priority: Major
> Attachments: download.pdf
>
>
> While extracting content of attached file. We are seeing below content 
> mismatch.
> Native file content  : 95 (1972); Erznoznik v. City of Jacksonville
> Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville
>  
> Native file content   : 438 U.S.\n726
> Content we got from Tika : 438 {*}U-S{*}.\n726



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656049#comment-17656049
 ] 

Nick Burch commented on TIKA-3952:
--

Can you try following the steps in 
[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]
 ?

> Content mismatch 
> -
>
> Key: TIKA-3952
> URL: https://issues.apache.org/jira/browse/TIKA-3952
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Tika User
>Priority: Major
> Attachments: download.pdf
>
>
> While extracting content of attached file. We are seeing below content 
> mismatch.
> Native file content  : 95 (1972); Erznoznik v. City of Jacksonville
> Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville
>  
> Native file content   : 438 U.S.\n726
> Content we got from Tika : 438 {*}U-S{*}.\n726



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-2536) Move to later edu.ucar version to avoid EOL dependencies

2022-11-02 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627638#comment-17627638
 ] 

Nick Burch commented on TIKA-2536:
--

We can only depend on versions in maven central, we can't depend on versions 
hosted elsewhere

If newer versions have been formally released, ideally the project owners would 
upload them to central. If they can't/won't and we can get that confirmed, we 
may be able to get them uploaded on their behalf, but it's much better and 
easier if the project owners upload themselves! OSSRH is often the best way for 
independent maintainers not part of a bigger foundation to get their releases 
into central.

If the version currently in maven central will play nicely with a new version 
of a dependency, short-term we ought to be able to pull that in and exclude the 
old version. If it doesn't play nicely, our only option is to upgrade the whole 
lot, which needs to be in central

> Move to later edu.ucar version to avoid EOL dependencies
> 
>
> Key: TIKA-2536
> URL: https://issues.apache.org/jira/browse/TIKA-2536
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Priority: Major
>
> The currently referenced 4.5.5 versions of edu.ucar:grib and edu.ucar:cdm 
> (released in Mar 2015), as well as being branch EOL themselves, depend on 
> many other project/branch/version EOL artifacts for which much later and 
> active versions are often available. The list is as follows:
> - edu.ucar:grib depends on the project EOL bzip2. Much more recent versions 
> of edu.ucar:grib exist that no longer depend on bzip2 (note: Jbzip2 is hosted 
> on the Google Code site, which was shut down for active development in 2015.  
> The project was never migrated to another site, e.g. Github).
> - edu.ucar:grib depends on the 2.0.4 EOL version of org.jdom:jdom2
> - edu.ucar:cdm depends on the 2.6.2 branch EOL version of 
> net.sf.ehcache:ehcache-core
> - edu.ucar:cdm depends on the 2.2.0 EOL version of 
> org.quartz-scheduler:quartz for which active versions are available. In turn 
> org.quartz-scheduler:quartz depends on the 0.9.1.1 branch EOL version of 
> c3p0:c3p0. Later versions of quartz have moved to the active com.mchange:c3p0
> - edu.ucar:grib depends on the 2.5.0 branch EOL version of 
> com.google.protobuf:protobuf-java for which active versions are available.
> Request moving to a much later version of edu.ucar, or alternative artifacts 
> to address all the above EOL issues (lack of active support for 
> vulnerabilities and bugs).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620633#comment-17620633
 ] 

Nick Burch commented on TIKA-3890:
--

DOCX files are compressed XML. Text compresses very well. Already compressed 
images, audio, video don't.

An 8MB word document of pure text could fairly easily produce 10x that in 
text. An 8MB word document that's mostly images could produce just a few bytes 
of text

DOCX-specific, you could open the file in POI (use a File to save memory), and 
check the size of the word XML stream and the size of any attachments; that'd 
give you a vague idea. However, it won't give you a complete answer, as the word 
XML could have loads of complex stuff in it that doesn't end up as text 
output...

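Something along these lines (a rough, untested sketch - the part names are the 
usual DOCX ones, and you'd want to tune which parts you count) is what I mean:

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.openxml4j.opc.PackagePart;

public class DocxSizeEstimate {
    public static void main(String[] args) throws Exception {
        // open read-only from a File so the whole zip isn't buffered into memory
        OPCPackage pkg = OPCPackage.open(new File(args[0]), PackageAccess.READ);
        try {
            long xmlBytes = 0, mediaBytes = 0;
            for (PackagePart part : pkg.getParts()) {
                String name = part.getPartName().getName();
                if (name.equals("/word/document.xml")) {
                    xmlBytes += streamedSize(part);        // the main text stream
                } else if (name.startsWith("/word/media/")
                        || name.startsWith("/word/embeddings/")) {
                    mediaBytes += streamedSize(part);      // images / embedded files
                }
            }
            System.out.println("word xml: " + xmlBytes
                    + " bytes uncompressed, media/embeddings: " + mediaBytes + " bytes");
        } finally {
            pkg.revert();   // opened read-only, so revert rather than close/save
        }
    }

    // uncompressed size, counted by streaming rather than buffering the part
    private static long streamedSize(PackagePart part) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream is = part.getInputStream()) {
            int n;
            while ((n = is.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
{code}
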
Easiest way to know the size of the output is just to parse it on a beefy 
machine with suitable restarts / respawning in place, and see what you get!

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an  unreasonably large number of pages with extractable text. 
> For example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting  byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{}}
> {{}}
> {{  }}
> {{    }}
> {{       class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{       class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    }}
> {{    }}
> {{      }}
> {{        17500}}
> {{      }}
> {{    }}
> {{  }}
> {{  }}
> {{    }}
> {{      12}}
> {{      }}
> {{        -Xms2000m}}
> {{        -Xmx5000m}}
> {{      }}
> {{    }}
> {{  }}
> {{}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}} which is the preferred result.
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta w/out /ignore  
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620610#comment-17620610
 ] 

Nick Burch commented on TIKA-3890:
--

The only way to be sure of how many pages are in a Word document is to render 
it (to screen / PDF / printer)

Some Word files get lucky and have a sensible number in the metadata set by 
Word from when it last opened the file and felt like populating statistics, but 
that's by no means always the case

If you're fairly sure your documents have sensible metadata, you could always 
pre-process with Apache POI. If you provide a File object and only read the 
metadata streams, it's pretty memory efficient to query

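For a .docx that would look roughly like this (untested sketch - the exact 
accessor names vary a little between POI versions, and the number is only as 
good as whatever Word last wrote into the app.xml properties):

{code:java}
import java.io.File;

import org.apache.poi.ooxml.POIXMLProperties;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;

public class DocxDeclaredPages {
    public static void main(String[] args) throws Exception {
        OPCPackage pkg = OPCPackage.open(new File(args[0]), PackageAccess.READ);
        try {
            // only the property parts get parsed here, not the main document XML
            POIXMLProperties props = new POIXMLProperties(pkg);
            int pages = props.getExtendedProperties()
                             .getUnderlyingProperties().getPages();
            System.out.println("Declared page count: " + pages);
        } finally {
            pkg.revert();   // opened read-only, so revert rather than close/save
        }
    }
}
{code}
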
> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an  unreasonably large number of pages with extractable text. 
> For example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting  byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{}}
> {{}}
> {{  }}
> {{    }}
> {{       class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{       class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    }}
> {{    }}
> {{      }}
> {{        17500}}
> {{      }}
> {{    }}
> {{  }}
> {{  }}
> {{    }}
> {{      12}}
> {{      }}
> {{        -Xms2000m}}
> {{        -Xmx5000m}}
> {{      }}
> {{    }}
> {{  }}
> {{}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}} which is the preferred result.
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta w/out /ignore  
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch

On Thu, 6 Oct 2022, Tim Allison wrote:

Happy to chat. Please put them in touch.


Excellent, thanks Tim!

Other than your past talks, have we got any info (eg on the wiki?) about 
how to run the regression corpus?



I've been really impressed with what the POI team has done migrating
from ant to gradle.  On Tika, I don't think we have any special needs
that would require deep gradle knowledge, but given the number of
modules now, it will be non-trivial.  Also, I take Nick D's concerns
seriously.


We don't have to swap from Maven - they have a plugin that integrates it

Nick


Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch

On Wed, 5 Oct 2022, Nicholas DiPiazza wrote:

Are they offering the Gradle Build Cache stuff free for apache projects?


There's an announcement at ApacheCon in about an hour... I think the Infra 
team are still working out the details on how it'll all work.


However, there's an additional offer to Tika for us to get some help on 
our tests, especially the regression run. (I think it's open to other ASF 
projects with "interesting" tests but we're the first ones to ask!)


Nick


Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch

On Wed, 5 Oct 2022, Oleg Tikhonov wrote:

Honestly I am trying to port our project to Gradle, but it is not going well.
It is a good idea. If some folks can help, we can do it together.


Apparently Gradle Enterprise works with both Gradle and Maven! So we don't 
even have to change our build -

https://docs.gradle.com/enterprise/maven-extension/

Nick


Possibly speeding up tests with Gradle - anyone interested?

2022-10-05 Thread Nick Burch

Hi All

At ApacheCon this week, Bob and I ended up chatting with the folks 
from Gradle, who are keen to help ASF projects, and are discussing with 
the Infra team.


The easier bit - they think they might be able to help speed up our maven 
build, especially the running of tests. Anyone have some time to give that 
a try? Will pass details along to anyone with the volunteer cycles


The interesting bit - we told them about the regression corpus, and they 
got very excited as it sounds completely different to most of their normal 
"my build is slow" type problems. The size of it, and the fact that it 
isn't a simple pass/fail, seemed to catch their interest. Anyone (though 
probably only Tim...) interested in talking them through how it works, and 
maybe getting one of their team access to the VM?


Cheers
Nick


Re: GUI mods?

2022-09-25 Thread Nick Burch

On Sat, 24 Sep 2022, Tim Allison wrote:

Electron and which framework?


I'd say there's two choice mechanisms.

One is to pick whatever most excites you / is likely to look best on your 
next funding application, and say that since you're doing most of the 
initial work you can choose!


The other is to ask on dev@community and maybe general@incubator, and see 
what other projects have picked for something similar. Means we can maybe 
rope in a few people to help



FWIW For $DAYJOB we use React, and everything we decide we want to do is 
possible, though nothing is ever as quick to code as you might hope... 
There's a bunch of other competitors out there, but there's no JS 
framework that wins in all cases!


Nick


Re: GUI mods?

2022-09-24 Thread Nick Burch

On Sat, 24 Sep 2022, Tim Allison wrote:

Given that this is greenfields, should I start w javafx or stick w swing
or is there another framework I should try?


Give the Tika Server an optional snazzy web UI, then wrap it as an 
electron app for people who want a native program to start? (plus avoid a 
bunch of security restrictions that'd apply if run over the web)


Nick


RE: Issue related to file mime type detection

2022-09-15 Thread Nick Burch

On Thu, 15 Sep 2022, Sindhu Mahadevappa wrote:
We have been looking for the latest Tika 2.4.1 jar file, looks like it 
is not available anywhere.


You can get the Tika App and Tika Server jars for 2.4.1 from
https://tika.apache.org/download.html

For the core and parser jars, manually downloading is not recommended as 
you risk missing dependencies. Just ask Maven or Gradle and they'll pull 
the latest jars for you


Nick


[jira] [Commented] (TIKA-3850) Spanish text is incorrectly detected as Galician

2022-09-13 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603483#comment-17603483
 ] 

Nick Burch commented on TIKA-3850:
--

The kind of statistical language model used in Tika struggles with very short 
text. What happens if you feed a longer block of Spanish-language text in?

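Something like this (rough sketch, untested - the extra sentences are just 
stand-in text to give the model more to chew on):

{code:java}
import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class LangDetectCheck {
    public static void main(String[] args) throws Exception {
        // loadModels() with no arguments loads every bundled model;
        // pass a set of language codes instead to restrict the candidates
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();

        // a couple of sentences rather than a one-line greeting gives the
        // n-gram model much more to work with
        LanguageResult result = detector.detect(
                "Hola! Donde puedo contactar para una garantía? "
                + "Compré el producto hace dos semanas y todavía no he "
                + "recibido ninguna respuesta del servicio de atención al cliente.");
        System.out.println(result.getLanguage() + " " + result.getRawScore());
    }
}
{code}
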
> Spanish text is incorrectly detected as Galician
> 
>
> Key: TIKA-3850
> URL: https://issues.apache.org/jira/browse/TIKA-3850
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier
>Affects Versions: 2.4.1
> Environment: org.apache.tika:tika-langdetect-optimaize:2.4.1
> org.apache.tika:tika-core:2.4.1
>Reporter: Lenne Hendrickx
>Priority: Minor
>
> The following Spanish text is incorrectly detected as Galician.
> {noformat}
> Hola! Donde puedo contactar para una garantía?{noformat}
> The es and gl models are loaded into the language detector.
> Language result:
> {noformat}
> language: gl
> score: 0.95{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain

2022-09-12 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603038#comment-17603038
 ] 

Nick Burch commented on TIKA-3308:
--

Our HTML mime type has both root-XML tags for well-formed documents, and a 
bunch of magic for the rest. So, adding some magic as well for these documents 
is in theory possible

Checking for {{<svg xmlns="http://www.w3.org/2000/svg"}} with a decent priority 
should be fine, but I'm not sure we'd want to look for just {{<svg}}

> SVG file without xml declaration tag is detected as text/plain
> --
>
> Key: TIKA-3308
> URL: https://issues.apache.org/jira/browse/TIKA-3308
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.25
>Reporter: Anas Hammani
>Priority: Minor
> Attachments: logo-luma.svg
>
>
> The SVG file attached to the issue is interpreted as *text/plain* by
> {code:java}
> tika.detect(filePath){code}
>  
> If I add 
> {code:java}
> <?xml version="1.0" encoding="UTF-8"?>{code}
> at the beginning of the file, then tika detects it as  "image/svg+xml"
>  
> When I read the documentation I see that the xml declaration is not necessary 
> for a file to be well-formed
> [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd]
>  
> It will be great if tika can detect a file as a SVG without the prolog
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Issue related to file mime type detection

2022-09-09 Thread Nick Burch

On Fri, 9 Sep 2022, Sindhu Mahadevappa wrote:

We are using tika-parsers 1.23


Tika 1.23 was released in December 2019! You should really use something 
much more recent


for comparing uploaded file mime type from file name as well as from 
file content for security purpose.


Apache Tika's detection is not recommended for security purposes. We try 
our best to give an answer. Our detection does not defend against 
specially crafted files which look like one type but are actually a 
different one.


mime type from file name as audio/mp4 and mime type from file content as 
video/mp4 so it is validating as file type not supported.


Try with a more recent version of Apache Tika. Make sure you include the 
Tika Parsers jar and dependencies for container aware detection within MP4 
files. If you still have an issue with Tika 2.4.1, raise a bug and upload 
a triggering file so we can investigate


Nick


[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

2022-08-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575814#comment-17575814
 ] 

Nick Burch commented on TIKA-3832:
--

Any chance you could try with Apache PDFBox directly? They've got a handy 
command line tool you can use:

[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]

That will help us narrow down if it's a Tika bug, or one in the underlying 
PDFBox library

> Required array length is too large (OOM) error when reading a PDF file
> --
>
> Key: TIKA-3832
> URL: https://issues.apache.org/jira/browse/TIKA-3832
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
>Reporter: Lakatos Gyula
>Priority: Major
> Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf
>
>
> I'm working on a web crawler and it got obliterated with an OutOfMemory error 
> by a random PDF from the internet.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Required array length 
> 2147483638 + 14 is too large
>   at 
> java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
>   at 
> java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
>   at 
> java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
>   at 
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
>   at 
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>   at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>   at java.base/java.io.StringWriter.write(StringWriter.java:99)
>   at 
> org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at 
> org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at 
> org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>   at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>   at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>   at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>   at 
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
>   at 
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
>   at 
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
>   at 
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
>   at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
>   at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
> {code}
> I reproduced the error in this repository:
> [https://github.com/laxika/apache-tika-oom-reproduction|http://example.com/]
> Uploaded the PDF into the attachments as well. It can be opened and read by 
> the PDF readers I tried (Edge, Adobe, Chrome).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-3830) Kaspersky identified a file as riskware

2022-08-03 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-3830.
--
Resolution: Duplicate

> Kaspersky identified a file as riskware
> ---
>
> Key: TIKA-3830
> URL: https://issues.apache.org/jira/browse/TIKA-3830
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app
>Affects Versions: 2.4.1
> Environment: Windows OS
>Reporter: Haralambos Marmanis
>Priority: Major
>
> NOTE: The issue is with component tika-parsers but that doesn't appear in the 
> dropdown list above. 
> Kaspersky +detected and removed+ the following file: quine.gz
> Worth mentioning that such file (quine.gz) isn’t malware related but instead 
> has been categorized as a +Risk Ware+ (it infinitely decompresses itself).
> File Path: 
> C:\Code\tika-2.4.1\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pkg-module\src\test\resources\test-documents\
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3829) java.lang.IllegalArgumentException: The document is really a XLS file exception while parsing doc file

2022-08-03 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574656#comment-17574656
 ] 

Nick Burch commented on TIKA-3829:
--

Can you share a file that triggers this bug?

The method in question should only process the summary stream if it exists, so 
something very odd is going on here

> java.lang.IllegalArgumentException: The document is really a XLS file 
> exception while parsing doc file
> --
>
> Key: TIKA-3829
> URL: https://issues.apache.org/jira/browse/TIKA-3829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Dhanabal
>Priority: Major
>
> Getting following exception while parsing doc file:
> WARN  Ignoring unexpected exception while parsing summary entry 
> DocumentSummaryInformation
> java.lang.IllegalArgumentException: The document is really a XLS file
>     at 
> org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:322)
>     at 
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:82)
>     at 
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
>     at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
>     at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  
> What is the meaning of this exception? when it will be thrown?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566991#comment-17566991
 ] 

Nick Burch commented on TIKA-3814:
--

I have a feeling that the Text content handler might rely on these coming 
through in the character stream to nicely-ish format the text output?

I do agree that a custom content handler that tracks whether it's inside one of 
the "no breaks wanted" tags, and skips newlines in the character stream if so, 
is likely the best solution here

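Something shaped roughly like this (untested sketch - the tag list and class 
name are made up for illustration) is what I'd expect:

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class NewlineStrippingHandler extends ContentHandlerDecorator {
    // tags where a stray newline in the source shouldn't become a line break
    private static final Set<String> INLINE_TAGS =
            new HashSet<>(Arrays.asList("a", "b", "i", "span", "td"));

    private int depth = 0;

    public NewlineStrippingHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (INLINE_TAGS.contains(localName.toLowerCase())) {
            depth++;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (depth > 0 && INLINE_TAGS.contains(localName.toLowerCase())) {
            depth--;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            // swap newlines for spaces while inside one of the tracked tags
            char[] copy = new char[length];
            System.arraycopy(ch, start, copy, 0, length);
            for (int i = 0; i < length; i++) {
                if (copy[i] == '\n' || copy[i] == '\r') {
                    copy[i] = ' ';
                }
            }
            super.characters(copy, 0, length);
        } else {
            super.characters(ch, start, length);
        }
    }
}
{code}

You'd then wrap whatever handler you're already using, eg 
{{new NewlineStrippingHandler(new BodyContentHandler())}}, and pass that to the 
parser.
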
> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Minor
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = 
> this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
> parser.parse(stream, handler, metadata);
> System.out.println(handler);
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-11 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-3814:
-
Priority: Trivial  (was: Blocker)

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Trivial
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = 
> this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
> parser.parse(stream, handler, metadata);
> System.out.println(handler);
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562599#comment-17562599
 ] 

Nick Burch commented on TIKA-3811:
--

Maybe [~tallison] has an idea on the config part, he's been working on that 
area lately...

> Exclude NameDetector not working for Tika.detect(file)
> --
>
> Key: TIKA-3811
> URL: https://issues.apache.org/jira/browse/TIKA-3811
> Project: Tika
>  Issue Type: Bug
>  Components: config, core, detector
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Attachments: invalid_format.vtt, tika-config_test.xml
>
>
> I need to detect mime type for a file but for security reason I want to 
> exclude the detection by file name extension. 
> I added a tika-config_test.xml (see attached) to my unit test but it still 
> detects file by name extension.
> I attached a test file that is wrongly detected as text/vtt because of the 
> file extension, it should be text/plain in this case.
>  
> The code of my unit test:
> {code:java}
> File file = new 
> File(getClass().getClassLoader().getResource("invalid_format.vtt").getFile());
> TikaConfig tikaConfig = new TikaConfig(this.getClass()
> .getClassLoader()
> .getResourceAsStream("tika-config_test.xml"));
>  
> // returns text/vtt but should be text/plain
> String mimeType = new Tika(tikaConfig).detect(file); 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562537#comment-17562537
 ] 

Nick Burch commented on TIKA-3811:
--

You should not be using Apache Tika's detection for anything security related. 
We do not protect against people maliciously adding mime magic near the start 
of the file which still allows the underlying file to be processed by the 
correct application. We err on the side of giving a best-guess answer.

For the "what is this probably" case, Tika is great. For the "what parser is 
most likely to manage to get text out" case, Tika is great. For "what is this 
for certain even if it is malicious" you need a different tool for your 
detection.

See also 
[https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika]
 for advice on running Tika with untrusted input

> Exclude NameDetector not working for Tika.detect(file)
> --
>
> Key: TIKA-3811
> URL: https://issues.apache.org/jira/browse/TIKA-3811
> Project: Tika
>  Issue Type: Bug
>  Components: config, core, detector
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Attachments: invalid_format.vtt, tika-config_test.xml
>
>
> I need to detect mime type for a file but for security reason I want to 
> exclude the detection by file name extension. 
> I added a tika-config_test.xml (see attached) to my unit test but it still 
> detects file by name extension.
> I attached a test file that is wrongly detected as text/vtt because of the 
> file extension, it should be text/plain in this case.
>  
> The code of my unit test:
> {code:java}
> File file = new 
> File(getClass().getClassLoader().getResource("invalid_format.vtt").getFile());
> TikaConfig tikaConfig = new TikaConfig(this.getClass()
> .getClassLoader()
> .getResourceAsStream("tika-config_test.xml"));
>  
> // returns text/vtt but should be text/plain
> String mimeType = new Tika(tikaConfig).detect(file); 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-3810.
--
Fix Version/s: 2.4.2
   Resolution: Fixed

> Vtt file (encoding UTF-8 with BOM) seen as text/plain
> -
>
> Key: TIKA-3810
> URL: https://issues.apache.org/jira/browse/TIKA-3810
> Project: Tika
>  Issue Type: Bug
>  Components: core, detector, mime
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Fix For: 2.4.2
>
> Attachments: s5_windowEncoding_validFormat.vtt
>
>
> Vtt file created on Windows (UTF-8 {+}with BOM{+}) is incorrectly detected as 
> _text/plain_ type and it should be _text/vtt_ .
> The application using Tika and where the file is uploaded for mime type 
> detection is an Unix machine. 
> The vtt file is passed as inputstream to the Tika's default detector (we 
> don't want to detect mime type by the file extension).
> Please find attached the vtt file that Tika is detecting as text/plain .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562532#comment-17562532
 ] 

Nick Burch commented on TIKA-3810:
--

Looks like we had detection magic for the UTF16 variant BOMs but not the UTF8 
one. Fixed in 9d928bbf9

> Vtt file (encoding UTF-8 with BOM) seen as text/plain
> -
>
> Key: TIKA-3810
> URL: https://issues.apache.org/jira/browse/TIKA-3810
> Project: Tika
>  Issue Type: Bug
>  Components: core, detector, mime
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Attachments: s5_windowEncoding_validFormat.vtt
>
>
> Vtt file created on Windows (UTF-8 {+}with BOM{+}) is incorrectly detected as 
> _text/plain_ type and it should be _text/vtt_ .
> The application using Tika and where the file is uploaded for mime type 
> detection is an Unix machine. 
> The vtt file is passed as inputstream to the Tika's default detector (we 
> don't want to detect mime type by the file extension).
> Please find attached the vtt file that Tika is detecting as text/plain .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3809) OutOfMemoryError occurs while reading doc file

2022-07-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562484#comment-17562484
 ] 

Nick Burch commented on TIKA-3809:
--

If the uncompressed XML is 250MB, then you're going to need a heap a lot bigger 
than 750MB (3x the uncompressed size) if you want to use the DOM-based 
parsers. I'd try with about 3GB (a bit over 10x) and be prepared to go up to 
about 20x the uncompressed size for your heap

> OutOfMemoryError occurs while reading doc file
> --
>
> Key: TIKA-3809
> URL: https://issues.apache.org/jira/browse/TIKA-3809
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.23
>Reporter: earl
>Priority: Blocker
>
> OutOfMemoryError occurs while parsing a docx file of size 8 MB (uncompressed 
> size 250 MB). while analyzing the heapdump(.hprof), the thread that parses 
> the file consumes about 750 MB heap size. while looking into a 
> dominator_tree, 
> {code:java}
> org.apache.xmlbeans.impl.store.Xobj$ElementXobj
> {code}
>  This object has been created many times!
> I've also attached the stacktrace,
> {code:java}
> at 
> org.apache.xmlbeans.impl.store.Cur.createElementXobj(Lorg/apache/xmlbeans/impl/store/Locale;Ljavax/xml/namespace/QName;Ljavax/xml/namespace/QName;)Lorg/apache/xmlbeans/impl/store/Xobj;
>  (Cur.java:260)
>   at 
> org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Ljavax/xml/namespace/QName;)V
>  (Cur.java:2997)
>   at 
> org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/xml/sax/Attributes;)V
>  (Locale.java:3164)
>   at 
> org.apache.xerces.parsers.AbstractSAXParser.startElement(Lorg/apache/xerces/xni/QName;Lorg/apache/xerces/xni/XMLAttributes;Lorg/apache/xerces/xni/Augmentations;)V
>  (Unknown Source)
>   at 
> org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Lorg/apache/xerces/xni/QName;Lorg/apache/xerces/xni/XMLAttributes;Lorg/apache/xerces/xni/Augmentations;)V
>  (Unknown Source)
>   at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement()Z 
> (Unknown Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Z)Z
>  (Unknown Source)
>   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Z)Z 
> (Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Z)Z (Unknown Source)
>   at 
> org.apache.xerces.parsers.XML11Configuration.parse(Lorg/apache/xerces/xni/parser/XMLInputSource;)V
>  (Unknown Source)
>   at 
> org.apache.xerces.parsers.XMLParser.parse(Lorg/apache/xerces/xni/parser/XMLInputSource;)V
>  (Unknown Source)
>   at 
> org.apache.xerces.parsers.AbstractSAXParser.parse(Lorg/xml/sax/InputSource;)V 
> (Unknown Source)
>   at 
> org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Lorg/xml/sax/InputSource;)V
>  (Unknown Source)
>   at 
> org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Lorg/apache/xmlbeans/impl/store/Locale;Lorg/xml/sax/InputSource;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/impl/store/Cur;
>  (Locale.java:3422)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Ljava/io/InputStream;Lorg/apache/xmlbeans/SchemaType;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/XmlObject;
>  (Locale.java:1272)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Lorg/apache/xmlbeans/SchemaTypeLoader;Ljava/io/InputStream;Lorg/apache/xmlbeans/SchemaType;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/XmlObject;
>  (Locale.java:1259)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(Ljava/io/InputStream;Lorg/apache/xmlbeans/SchemaType;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/XmlObject;
>  (SchemaTypeLoaderBase.java:345)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Ljava/io/InputStream;Lorg/apache/xmlbeans/XmlOptions;)Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/DocumentDocument;
>  (Unknown Source)
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead()V 
> (XWPFDocument.java:178)
>   at 
> org.apache.poi.ooxml.POIXMLDocument.load(Lorg/apache/poi/ooxml/POIXMLFactory;)V
>  (POIXMLDocument.java:184)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.(Lorg/apache/poi/openxml4j/opc/OPCPackage;)V
>  (XWPFDocument.java:138)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.(Lorg/apache/poi/openxml4j/opc/OPCPackage;)V
>  (XWPFWordExtractor.java:60)
>   at 
> org.apache.poi.ooxml.extractor.ExtractorFactory.createEx

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557343#comment-17557343
 ] 

Nick Burch commented on TIKA-3798:
--

With no file, no thread dump and no stack trace, it won't be easy to find the 
relevant code in Tika that isn't behaving properly. As everyone working on Tika 
is a volunteer, you're probably going to have to help us a bit more...

Can you talk your client through taking a Java thread dump and get them to 
share it? Can you get the file, run it yourself through Tika to trigger the 
issue and take a thread dump? Can you share the file privately with one of us?

> Tika hangs up with some RAR archives
> 
>
> Key: TIKA-3798
> URL: https://issues.apache.org/jira/browse/TIKA-3798
> Project: Tika
>  Issue Type: Bug
> Environment: Windows, Tika 2.4.0
>Reporter: Mikhail Gushinets
>Priority: Major
> Attachments: MicrosoftTeams-image.png
>
>
> Passing to Tika rar archive might lead to hanging up.
> When trying to unrar this file manually I get this message: "Checksum is not 
> calculated right of file as there might be a change of the metadata"
> I understand that the probably reason is some kind of file corruption here 
> but it would be nice if Tika would just throw an exception in such case 
> rather than hanging up forever.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557319#comment-17557319
 ] 

Nick Burch commented on TIKA-3798:
--

Do you have a sample file that shows the problem? A thread dump showing the 
place that Tika gets stuck? Suggestions on how we can reproduce your issue?

> Tika hangs up with some RAR archives
> 
>
> Key: TIKA-3798
> URL: https://issues.apache.org/jira/browse/TIKA-3798
> Project: Tika
>  Issue Type: Bug
> Environment: Windows, Tika 2.4.0
>Reporter: Mikhail Gushinets
>Priority: Major
>
> Passing to Tika rar archive might lead to hanging up.
> When trying to unrar this file manually I get this message: "Checksum is not 
> calculated right of file as there might be a change of the metadata"
> I understand that the probably reason is some kind of file corruption here 
> but it would be nice if Tika would just throw an exception in such case 
> rather than hanging up forever.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552078#comment-17552078
 ] 

Nick Burch commented on TIKA-3768:
--

If we can put something into a properly typed + structured metadata field, we 
will!

The full list of metadata property definitions are spread across the interface 
in 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/package-summary.html]
 grouped by type. Wherever possible we re-use existing well known definitions

While we always store the metadata values as strings, the property definitions 
will help you turn them back into the underlying Java types, eg get the date 
back as a java.util.Date

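For example, something along these lines (illustrative only, not taken from the 
test suite - which keys a given message populates is easiest to confirm by 
dumping metadata.names()):

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Date;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class MailMetadataExample {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("email.txt"))) {
            new AutoDetectParser().parse(stream, new BodyContentHandler(), metadata);
        }
        // string-typed properties come back with plain get()
        String subject = metadata.get(TikaCoreProperties.TITLE);
        String from = metadata.get(TikaCoreProperties.CREATOR);
        // date-typed properties can be pulled back out as a java.util.Date
        Date sent = metadata.getDate(TikaCoreProperties.CREATED);
        System.out.println(subject + " / " + from + " / " + sent);
    }
}
{code}
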
> message/rfc822 does not include Headers in extracted text
> -
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there based on the include everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2022-06-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550223#comment-17550223
 ] 

Nick Burch commented on TIKA-3784:
--

We don't currently have any Mime Magic for PKCS12 files

Based on 
[https://stackoverflow.com/questions/33239875/jks-bks-and-pkcs12-file-formats] 
it won't be an easy one to cope with, since we don't currently have an ASN.1 
container detector

I think we can potentially get away with a slightly hacky approach similar to 
the PKCS7 signature, where we look for a few variants and hope the right entry 
comes first... "openssl asn1parse" should help with working out what to look for

(Assuming no-one has a bit of time to knock up an ASN1 container detector based 
on BouncyCastle's ASN.1 classes, using an approach similar to 
[https://stackoverflow.com/questions/10190795/parsing-asn-1-binary-data-with-java]
 )

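For the record, the sort of thing I mean (a very rough, untested sketch using 
BouncyCastle directly rather than a proper Tika Detector - a real one would 
also need to cap how much of the stream it reads):

{code:java}
import java.io.InputStream;

import org.bouncycastle.asn1.ASN1InputStream;
import org.bouncycastle.asn1.ASN1Integer;
import org.bouncycastle.asn1.ASN1Primitive;
import org.bouncycastle.asn1.ASN1Sequence;

public class Pkcs12Sniffer {
    // PKCS#12 PFX is a DER SEQUENCE whose first element is the INTEGER version 3
    public static boolean looksLikePkcs12(InputStream in) {
        try (ASN1InputStream asn1 = new ASN1InputStream(in)) {
            ASN1Primitive obj = asn1.readObject();
            if (obj instanceof ASN1Sequence) {
                ASN1Sequence seq = (ASN1Sequence) obj;
                return seq.size() > 0
                        && seq.getObjectAt(0) instanceof ASN1Integer
                        && ((ASN1Integer) seq.getObjectAt(0)).getValue().intValue() == 3;
            }
        } catch (Exception e) {
            // not parseable as DER, so definitely not a PKCS#12 keystore
        }
        return false;
    }
}
{code}
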
> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.26
>Reporter: Matthias Hofbauer
>Priority: Critical
>
> We are using tika to check if the MIME type of the file extensions matches 
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but 
> the tika detector returns "application/x-x509-key" instead.
> After checking the tika-mimetype.xml and comparing it to my .p12 file I found 
> the following MIME magic which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550216#comment-17550216
 ] 

Nick Burch commented on TIKA-3768:
--

I wouldn't expect to find those in the textual content after parsing, those 
fields should be ending up in the Metadata object instead

We have a bunch of unit tests for mail parsing which shows that, for our test 
files at least, that subject + from + to all coming through, see 
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java]

Are you able to compare your code with that in the unit test, and see any 
differences between the working test and yours? Bonus marks if you can write a 
small failing junit unit test that shows the issue with your file

> message/rfc822 does not include Headers in extracted text
> -
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there based on the include everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-20 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539993#comment-17539993
 ] 

Nick Burch commented on TIKA-3771:
--

The PNG magic is priority 50, which is also what our EML min-match 2 is at. 
That's probably fine for most of them, but \nX- is seemingly too general

I think we probably need to lower the priority on the 0:1024 cases, though I'm 
not sure if we can do that without moving that whole block down?

FWIW your PNG matches because it has a URL followed by a bunch of HTTP response 
headers at the end of it!

> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> Running regression tests in the process of upgrading to Tika-2.4.0 from 1.x, 
> I detected some hundreds of samples from 1M of different file types now are 
> being detected as EML. This is caused by the <match value="\nX-" type="string" 
> offset="0:1024"/> rule added in TIKA-3687 in the 
> minShouldMatch="2" clause. Attached is a sample PNG file that triggers this 
> (it also has another \nDate: value in the first 1024 bytes).
> Another not related thing, I tried to override the message/rfc822 mime 
> definition with a custom-tika-mimetypes.xml in classpath, but it had no 
> effect, it used to work in Tika-1.x. Was that change intentional? I think 
> user definitions should take precedence over Tika definitions, since they can 
> change depending on domain or context (e.g. the same extension may be used by 
> different applications). If it wasn't intentional, I'll open other issue.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594
 ] 

Nick Burch commented on TIKA-3710:
--

As a "normal" html file wouldn't start with these snippets, and they're already 
at a pretty high priority, I think just leave them in the 60 block along with 
the more typical starting tags we have there now

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582
 ] 

Nick Burch commented on TIKA-3710:
--

I was thinking we'd do (open)h1(close) or (open)h1(space) to cover both HTML 
cases but reduce the chances of a false positive match (+h2/h3)

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538896#comment-17538896
 ] 

Nick Burch commented on TIKA-3710:
--

The h1 isn't quite as unique as we might like, and maybe not as good as some of 
the other ones

How about changing that to <h1> or <h1 followed by a space?

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529977#comment-17529977
 ] 

Nick Burch commented on TIKA-3571:
--

Some formats support the concept of pages and we can pass that along (eg pdf, 
ppt), some don't store page related info in the file format so we can't no 
matter how much people might like us to (eg doc, rtf), and some don't have any 
real concept of a page / are only ever single page (eg jpg, mp3). Potentially 
also the category of ones which don't normally have a concept of a page until 
you try to print (eg xls, ods, CAD formats)

Paged formats are a bit of a special case, but in some systems also a common 
one!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-29 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529918#comment-17529918
 ] 

Nick Burch commented on TIKA-3742:
--

Sure! Potentially easiest is if you create your own fork of Tika on Github, 
create a branch, and work on that. You can then share that branch with us to 
review and give feedback on. When it's all working, you can then create a pull 
request for us to merge straight into Tika!

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces an dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529417#comment-17529417
 ] 

Nick Burch commented on TIKA-3742:
--

I believe {{readNBytes}} only came in with Java 9, and the particular 
{{readNBytes(int)}} in Java 11, so you'll need to use a newer JVM. Should be 
able to replace it with Commons IO calls once we're happy with the general 
logic + approach
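
The swap itself is tiny - a sketch of the sort of change meant here, assuming 
the read is for a fixed-length header:

{code:java}
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;

public class ReadBytesCompat {
    // Java 11+ only:
    //     byte[] header = stream.readNBytes(length);
    // Java 8-friendly equivalent using Commons IO (already a Tika dependency):
    public static byte[] readHeader(InputStream stream, int length) throws IOException {
        byte[] header = new byte[length];
        IOUtils.readFully(stream, header); // throws EOFException if the stream ends early
        return header;
    }
}
{code}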

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces an dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529101#comment-17529101
 ] 

Nick Burch commented on TIKA-3742:
--

Assuming we just want type=17 text elements of a DGNv7 file (as per 
[http://dgnlib.maptools.org/dgn.html#type17] ), a quick'n'dirty parser 
wouldn't be too bad - 
[https://gist.github.com/Gagravarr/90d390fec7c5f2c5cf966c0eedccac5c] is a basic 
reader that finds these text elements and prints them

Couldn't immediately spot any useful metadata elements to pull out, so I think 
a basic parser would just be the text for DGN7

Anyone fancy finishing this off into a "proper" Tika parser? :)
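
For anyone picking this up, the overall shape would be roughly this - a sketch 
only, with the gist's reading logic stubbed out behind a hypothetical 
DgnTextExtractor:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Dgn7Parser extends AbstractParser {
    private static final MediaType DGN7 = MediaType.parse("image/vnd.dgn; version=7");

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return Collections.singleton(DGN7);
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        metadata.set(Metadata.CONTENT_TYPE, DGN7.toString());

        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // Hypothetical helper standing in for the gist's type=17 walking logic:
        // for (String text : DgnTextExtractor.readTextElements(stream)) {
        //     xhtml.element("p", text);
        // }
        xhtml.endDocument();
    }
}
{code}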

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces an dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529038#comment-17529038
 ] 

Nick Burch commented on TIKA-3742:
--

In theory you shouldn't need any Java code at all if you don't want to - just an 
XML file with a magic well-known name

We've a couple already in Tika, mostly focused on metadata:

[https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml]

Pop your own one on the classpath and it should be picked up dynamically at 
runtime

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces an dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529029#comment-17529029
 ] 

Nick Burch commented on TIKA-3742:
--

If it can just be run standalone, then {{ExternalParser}} + 
{{tika-external-parsers.xml}} is probably the way to go - that already handles 
testing if the program is installed, spawning it, cleaning up, grabbing text etc

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces an dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-26 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528157#comment-17528157
 ] 

Nick Burch commented on TIKA-3731:
--

We already do a prefix for several other formats for custom metadata keys, so 
makes sense to me

> Tika CAD DWG reader not pulling meta data from new cad files
> 
>
> Key: TIKA-3731
> URL: https://issues.apache.org/jira/browse/TIKA-3731
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: AutoCAD 2018 format (1).dwg, testDWG-AC1027.dwg
>
>
>  
> The tika DWG reader is only pulling meta data from up to drawing format 
> AC1024  (see code snippet) where it looks to be AC1027 & AC1032 can also be 
> read from the same get2007and2010Props meta data extractor.
> {code:java}
>  switch (version) {
>             case "AC1015":
>                 metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
>                 if (skipTo2000PropertyInfoSection(stream, header)) {
>                     get2000Props(stream, metadata, xhtml);
>                 }
>                 break;
>             case "AC1018":
>                 metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
>                 if (skipToPropertyInfoSection(stream, header)) {
>                     get2004Props(stream, metadata, xhtml);
>                 }
>                 break;
>             case "AC1021":
>             case "AC1024":
>                 metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
>                 if (skipToPropertyInfoSection(stream, header)) {
>                     get2007and2010Props(stream, metadata, xhtml);
>                 }
>                 break;
>             default:
>                 throw new TikaException("Unsupported AutoCAD drawing version: 
> " + version);
>         } {code}
> Looks like the case statement just needs extending and for examples files to 
> be created for AC1027/AC1032. 
> Current versions of auto cad can be found here:
> https://knowledge.autodesk.com/support/autocad/learn-explore/caas/sfdcarticles/sfdcarticles/drawing-version-codes-for-autocad.html
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-24 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527158#comment-17527158
 ] 

Nick Burch commented on TIKA-3719:
--

Linux and Mac will need quotes around arguments containing spaces. As would 
Windows in the WSL subsystem

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: image-2022-04-21-18-52-50-706.png, localhost.jks
>
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3721) DGN parser

2022-04-23 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526776#comment-17526776
 ] 

Nick Burch commented on TIKA-3721:
--

We already have a few file types which we send to {{OfficeParser}} only for 
common metadata, no content. Project is one such format. As it's better than 
nothing, could always do that for DGN v8 files?

{{SummaryExtractor}} already supports custom properties with the 
{{Office.USER_DEFINED_METADATA_NAME_PREFIX}} prefix so I'd expect those to come 
through if you called OfficeParser (assuming they didn't do something odd and 
put their custom properties in one of the standard streams rather than the 
custom properties stream)

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: Screenshot from 2022-04-22 16-03-44.png, 
> dgn8s-dumped.txt, image-2022-04-22-20-00-45-704.png, 
> image-2022-04-22-20-01-09-564.png, image-2022-04-22-20-02-24-180.png
>
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526352#comment-17526352
 ] 

Nick Burch commented on TIKA-3721:
--

The mime types mentioned at 
[https://communities.bentley.com/products/projectwise/w/wiki/5617/5617] don't 
match our normal convention nor the conventions from other formats, so I'd 
propose

Common base with the globs - {{image/vnd.dgn}}

version 7 - {{image/vnd.dgn;version=7}}

version 8 - {{image/vnd.dgn;version=8}} with an alias of {{image/vnd.dgn;ver=8}}

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526336#comment-17526336
 ] 

Nick Burch commented on TIKA-3721:
--

We've had the OK from the author of the tika-dgn-detector

I'd propose to create an image/vnd.dgn type which gets the globs, then v7 with 
the magic as a subtype and the v8 with no magic which the detector would 
return. That's slightly different to what tika-dgn-detector has though, but 
more in keeping with our other "versions are actually very different kinds of 
files" formats.

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: Re-implementing tika-dgn-detector in Tika itself - any objections?

2022-04-22 Thread Nick Burch

Hi Steve

Great to hear, thanks. Apache License v2 is actually the ideal one for us 
:)


Thanks
Nick

On Fri, 22 Apr 2022, Steven Frew wrote:

Hi there,

Yeah, do as you please with it.

License-wise, I think that's just something I arbitrarily chose when I was
initially publishing it to Maven, so if it's a problem, feel free to change
or ignore.

Cheers

On Fri, 22 Apr 2022 at 11:57, Nick Burch  wrote:


Hi Steven

Over on https://issues.apache.org/jira/browse/TIKA-3721, one of our users
alerted us to your tika-dgn-detector github project.

If possible, we'd like to fold the detector logic and mime type
definitions into Tika itself. (Converting it to Java in the process and
putting the detector logic inside our existing POIFS detector)


Would you mind if we did that?

And before we go copying and pasting the key bits out of your project,
could you confirm it's under the Apache License v2 as per the
build.gradle.kts file?

Thanks
Nick





Re-implementing tika-dgn-detector in Tika itself - any objections?

2022-04-22 Thread Nick Burch

Hi Steven

Over on https://issues.apache.org/jira/browse/TIKA-3721, one of our users 
alerted us to your tika-dgn-detector github project.


If possible, we'd like to fold the detector logic and mime type 
definitions into Tika itself. (Converting it to Java in the process and 
putting the detector logic inside our existing POIFS detector)



Would you mind if we did that?

And before we go copying and pasting the key bits out of your project, 
could you confirm it's under the Apache License v2 as per the 
build.gradle.kts file?


Thanks
Nick


[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526324#comment-17526324
 ] 

Nick Burch commented on TIKA-3721:
--

That detector is written in Kotlin, but should be pretty easy to re-implement 
in Java (including it in the existing POIFS container detector). I've dropped 
an email to the author of that project to check they're happy with us doing that

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525747#comment-17525747
 ] 

Nick Burch commented on TIKA-3719:
--

Those look like the steps needed. I'd suggest we create ours as something like

{noformat}
keytool -genkeypair -alias tika-ssl-testing -keyalg RSA -keysize 2048 -keypass tika-secret -storepass tika-secret -validity  -keystore test-ssl.keystore.p12 -storetype PKCS12 -ext SAN=DNS:localhost,IP:127.0.0.1 -dname "CN=localhost, OU=Tika Testing"
{noformat}

That will create a PKCS12 formatted keystore with a self-signed key+cert, 
password of tika-secret, which can then be loaded for a test server

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-21 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525588#comment-17525588
 ] 

Nick Burch commented on TIKA-3725:
--

Something like OAuth would be pretty different to basic auth, due to the need 
to do all the redirects. SSL client auth would be different again.

Maybe just focus on basic auth with username and password to start with? If so, 
I'd lean towards an interface which takes username + password and returns 
true/false. Then have a single implementation which supports a single username 
and password: username defaults to Tika and can be changed with an ENV variable 
or config, password always required from an ENV variable or config. Supporting 
a DB of user details (even if only .htpasswd style or like tomcat-users.xml) 
feels like overkill for v1

That's assuming we can't just find some CXF plugin to do it all for us
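
In code terms, something as small as this is what I had in mind - purely a 
sketch, and the names are illustrative rather than final:

{code:java}
public interface TikaServerAuthenticator {
    boolean authenticate(String username, String password);
}

// Single-user implementation: username defaults to "tika" unless configured,
// the password always has to come from an ENV variable or the config file.
class SingleUserAuthenticator implements TikaServerAuthenticator {
    private final String expectedUser;
    private final String expectedPassword;

    SingleUserAuthenticator(String user, String password) {
        this.expectedUser = user == null ? "tika" : user;
        this.expectedPassword = password;
    }

    @Override
    public boolean authenticate(String username, String password) {
        return expectedPassword != null
                && expectedUser.equals(username)
                && expectedPassword.equals(password);
    }
}
{code}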

> Add Authorization to Tika Server (Suggest Basic to start off with)
> --
>
> Key: TIKA-3725
> URL: https://issues.apache.org/jira/browse/TIKA-3725
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> I would be good to get some Authentication/Authorization added to TIKA server 
> to be able to add another layer of security around the Tika Server Rest 
> service.
> This could become a rabbit hole with the number of options available around 
> Authentication/Authorization (Oauth, OpenId etc) so suggest as a starter 
> basic Auth is added. 
> How to store user(s)/password suggest looking at how other apache products do 
> the same?  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525578#comment-17525578
 ] 

Nick Burch commented on TIKA-3719:
--

For testing it, I'd be tempted to create a self-signed certificate for 
localhost valid for eg 30 years, with a well known password, and pop that into 
test/resources. Then have a test that starts the server passing in that, 
verifies it starts and does a call without error with all the ssl validation 
(eg untrusted) turned off. Likely to be simpler than doing it "properly" with a 
test CA issuing a test cert and a test verifying the cert with the CA.

Happy to create such a keystore if it'd help, it'd be pretty similar to what you 
need to do for Alfresco+SOLR so I've got notes somewhere on that!
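
The "validation turned off" side of the test client is the usual trust-all 
dance - fine against our own self-signed test cert, never for real use; a 
sketch:

{code:java}
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllForTests {
    public static SSLContext trustEverything() throws Exception {
        // Accept any server certificate - only for unit tests
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            }
        };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, new SecureRandom());
        return context;
    }
}
{code}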

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3721) DGN parser

2022-04-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524718#comment-17524718
 ] 

Nick Burch commented on TIKA-3721:
--

After a quick look, I can't spot any free tools or libraries for working with 
these files. OpenDGN appears to not use our normal sense of open, and seems to 
want an expensive SDK license

Did find a nice document on the DWG file format on the new OpenDGN site - 
[https://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf]
 - but nothing for the DGN format there that I can find

If you're able to locate a tool or library, we can look at adding support. 
Alternately if your company has licensed the SDK, it's fairly easy for you to 
build your own custom Tika parser to wrap it, see 
https://tika.apache.org/2.3.0/parser_guide.html

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517818#comment-17517818
 ] 

Nick Burch commented on TIKA-3571:
--

It has been quite a while since I last used jodconverter, but the underlying 
OpenOffice would crash or infinite loop rather more often than you'd normally 
like. Docker and a restart watchdog ought to help with that though!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-03 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516459#comment-17516459
 ] 

Nick Burch commented on TIKA-3711:
--

I'd lean towards putting the file name as an attribute of the img tag, along 
with the description as the alt text if the format supports it
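
In parser terms that would look something like this at the point the image 
reference is hit - a sketch, where the attribute names just follow normal HTML 
and the "embedded:" prefix follows what some existing parsers emit:

{code:java}
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class EmbeddedImageEmitter {
    public static void emit(XHTMLContentHandler xhtml, String fileName, String description)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        // File name goes on the img tag rather than into the body text
        attrs.addAttribute("", "src", "src", "CDATA", "embedded:" + fileName);
        if (description != null && !description.isEmpty()) {
            // Description becomes the alt text when the format provides one
            attrs.addAttribute("", "alt", "alt", "CDATA", description);
        }
        xhtml.startElement("img", attrs);
        xhtml.endElement("img");
    }
}
{code}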

> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Minor
> Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504378#comment-17504378
 ] 

Nick Burch commented on TIKA-3696:
--

Shouldn't it be more like {{application/x-wacz}}  since it isn't a standard / 
official one?

> Add detection for wacz files
> 
>
> Key: TIKA-3696
> URL: https://issues.apache.org/jira/browse/TIKA-3696
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> https://webrecorder.github.io/wacz-spec/1.2.0/
> Zip file with standard entries: 'archive', 'datapackage.json', 'indexes' and 
> 'pages'.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-10 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504150#comment-17504150
 ] 

Nick Burch commented on TIKA-3684:
--

Same as Tika 2.x - pass a {{--config}} flag when you start the server

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 2.1.0
>Reporter: Naama Hophstatder
>Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are using tika docker container as a linux service, when I want to extract 
> text from a word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Notice: We also have tika server v1.14, and this version returns the text 
> just as expected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-3694.
--
Fix Version/s: 2.3.1
   Resolution: Fixed

> Tika Server endpoint to return more details on a mime type
> --
>
> Key: TIKA-3694
> URL: https://issues.apache.org/jira/browse/TIKA-3694
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, server
>Affects Versions: 2.3.0
>    Reporter: Nick Burch
>Priority: Major
> Fix For: 2.3.1
>
>
> As raised on the user list - 
> [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
> calling the Java APIs are able to get additional details on a mime type, such 
> as common extensions and descriptions. Those calling the Tika Server can only 
> get limited information on mime types, such as which are known to Tika
> In addition to the current {{/mime-types}} endpoint (html/json/text), we 
> should add a more detailed one that takes a specific type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502627#comment-17502627
 ] 

Nick Burch commented on TIKA-3694:
--

I've added new HTML and JSON endpoints {{/mime-types/type/subtype}} which 
return additional details on the specified type (or 404 if unknown), eg
{code:java}
{
  "extensions" : [ ".cbor" ],
  "acronym" : "CBOR",
  "alias" : [ ],
  "description" : "Concise Binary Object Representation container",
  "links" : [ "http://tools.ietf.org/html/rfc7049; ],
  "type" : "application/cbor",
  "defaultExtension" : ".cbor"
}{code}
On the basis that people may have custom parsing around the all-types Text and 
JSON endpoints, no change made there to the output. The all-types HTML endpoint 
now returns a little bit more info, and links to the full details one.

> Tika Server endpoint to return more details on a mime type
> --
>
> Key: TIKA-3694
> URL: https://issues.apache.org/jira/browse/TIKA-3694
>     Project: Tika
>  Issue Type: Improvement
>  Components: mime, server
>Affects Versions: 2.3.0
>Reporter: Nick Burch
>Priority: Major
>
> As raised on the user list - 
> [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
> calling the Java APIs are able to get additional details on a mime type, such 
> as common extensions and descriptions. Those calling the Tika Server can only 
> get limited information on mime types, such as which are known to Tika
> In addition to the current {{/mime-types}} endpoint (html/json/text), we 
> should add a more detailed one that takes a specific type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
Nick Burch created TIKA-3694:


 Summary: Tika Server endpoint to return more details on a mime type
 Key: TIKA-3694
 URL: https://issues.apache.org/jira/browse/TIKA-3694
 Project: Tika
  Issue Type: Improvement
  Components: mime, server
Affects Versions: 2.3.0
Reporter: Nick Burch


As raised on the user list - 
[https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
calling the Java APIs are able to get additional details on a mime type, such 
as common extensions and descriptions. Those calling the Tika Server can only 
get limited information on mime types, such as which are known to Tika

In addition to the current {{/mime-types}} endpoint (html/json/text), we should 
add a more detailed one that takes a specific type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500804#comment-17500804
 ] 

Nick Burch commented on TIKA-3686:
--

Detecting types of text-based files with magic is always going to fail for some 
cases. There are no sure-fire things to match on, only guesses

If you're sure that your files have the right extensions on them, just ask Tika 
to detect by filename only, no contents
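
With the plain Tika facade that's a one-liner - a sketch:

{code:java}
import org.apache.tika.Tika;

public class NameOnlyDetect {
    public static void main(String[] args) {
        // Only the filename/glob rules are consulted, no content sniffing
        String type = new Tika().detect("smart_wizard_all.min.css");
        System.out.println(type); // expected: text/css
    }
}
{code}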

> CSS file detected as JavaScript (application/javascript)
> 
>
> Key: TIKA-3686
> URL: https://issues.apache.org/jira/browse/TIKA-3686
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.0.0-ALPHA
>Reporter: Marius Dumitru Florea
>Priority: Major
>
> The following CSS file 
> [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
>  is detected as {{application/javascript}} using:
> {noformat}
> TikaUtils.detect(InputStream stream, String name)
> {noformat}
> The reason seems to be that the CSS file starts with:
> {noformat}
> /*!
>  * jQuery
> {noformat}
> which matches the "jQuery" entry from 
> [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
>  used by Tika's {{MimeTypes}} detector.
> This is a regression introduced by 
> https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7
>  in TIKA-1141 (2.0.0-ALPHA).
> The implications are serious if the mime type returned by Tika is used to set 
> the content type on the HTTP request returning the CSS file to the browser: 
> the browser ignores the CSS.
> FTR, in my case the CSS file is not served directly from the file system but 
> from a WebJar (in this case 
> https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
> we're using Tika to determine the type of files requested from the WebJars.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3676) Consider making dl4j dependencies provided

2022-02-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489597#comment-17489597
 ] 

Nick Burch commented on TIKA-3676:
--

As long as we provide sensible instructions on what to do, I'm happy to make 
this like our other "large bundle of native code" case for sqlite and require 
users to add the relevant pom entry for their platform / kitchen sink it 
themselves

> Consider making dl4j dependencies provided
> --
>
> Key: TIKA-3676
> URL: https://issues.apache.org/jira/browse/TIKA-3676
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> Dl4j dependencies are rather large.  We can cut ~4-6 minutes off the build 
> time and prevent gigabytes transferring over various networks during the 
> release cycle (at least).  With the recent upgrade to dl4j, the jar is now 
> 1.4GB, up from ~800MB in our 1.x branch.
> We are currently packaging the kitchen-sink, e.g. every platform's native 
> libraries.  For folks using our wrappers/parsers around dl4j, they can a) 
> easily include the dependencies that are "provided" or b) tailor their 
> dependencies for their OS/architecture.
> What do you think? 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480955#comment-17480955
 ] 

Nick Burch commented on TIKA-3656:
--

That POM is your problem - you aren't including any of the container-aware 
dependencies which come with the parsers

Try adding a dependency such as tika-parsers-standard or 
tika-parser-microsoft-module

> Tika returns wrong content type for docx types.
> ---
>
> Key: TIKA-3656
> URL: https://issues.apache.org/jira/browse/TIKA-3656
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: Windows 10, Java 1.8
>Reporter: Ajesh
>Priority: Major
>
> Steps to reproduce
>  # Select a DOCX file say example.docx
>  # Rename the DOCX file to PDF say example.pdf
>  # Use Tika to detect the content type of the example.pdf file
>  # Returns application/zip instead  
> application/vnd.openxmlformats-officedocument.wordprocessingml.document



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-21 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479981#comment-17479981
 ] 

Nick Burch commented on TIKA-3656:
--

How are you calling Tika? And do you have the office parsers on your classpath 
along with all their dependencies?

> Tika returns wrong content type for docx types.
> ---
>
> Key: TIKA-3656
> URL: https://issues.apache.org/jira/browse/TIKA-3656
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: Windows 10, Java 1.8
>Reporter: Ajesh
>Priority: Major
>
> Steps to reproduce
>  # Select a DOCX file say example.docx
>  # Rename the DOCX file to PDF say example.pdf
>  # Use Tika to detect the content type of the example.pdf file
>  # Returns application/zip instead  
> application/vnd.openxmlformats-officedocument.wordprocessingml.document



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3646) MP4 files have their mime type detected as video/quicktime

2022-01-13 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475269#comment-17475269
 ] 

Nick Burch commented on TIKA-3646:
--

I think this is probably the same issue as TIKA-2935 - the same work described 
there still needs to be done by someone who has the time + energy + interest...

> MP4 files have their mime type detected as video/quicktime
> --
>
> Key: TIKA-3646
> URL: https://issues.apache.org/jira/browse/TIKA-3646
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Apachae Tika User
>Priority: Major
> Attachments: Video.mp4
>
>
> I was using ScreenToGif tool which allos to record screen and create gifs or 
> MP4 files (with ffmpeg). I've tried to use Tika Detector for such files but 
> the file is being detected as  video/quicktime with .qt extension. How is 
> that?
> Attaching small video for example which was generated with ScreenToGif and 
> saved as mp4.
> I see some other people complaining for same thing here
> [https://stackoverflow.com/questions/48021617/use-apache-tika-get-mp4-file-contenttype-got-video-quicktime]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

