[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Peter Kronenberg (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348945#comment-17348945
 ] 

Peter Kronenberg commented on TIKA-3361:


The code already explicitly checks for that, but I'll add some additional
range checking.

> Improve intelligence of OCRStrategy=AUTO
> ----------------------------------------
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt
> at improving OCRStrategy=AUTO.
> Currently, this strategy performs the following test:
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>     doOCROnCurrentPage(AUTO);
> }
> {code}
> I added a way to change the two numbers involved: the threshold for the total
> characters per page (below which we OCR the page) and the threshold for
> unmapped characters (above which we OCR the page).
> My main concern is with the unmapped characters. OCR adds a lot of overhead,
> which might not be necessary for just a few unmapped characters.
> I added a new config, *OCRStrategyAuto*, which is only used if
> OCRStrategy=AUTO. Its format is:
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts; more on them below.
> m is the threshold for the number of unmapped characters per page. It can
> also be specified as a percentage. So, m=20 means that a page with more than
> 20 unmapped characters will be OCR'd, and m=20% means that a page will be
> OCR'd if more than 20% of its characters are unmapped.
> n is the threshold for the total number of characters on the page. n is
> optional and defaults to 10, so
> {code:java}
> 20
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is shorthand for *20,10*
> {code:java}
> best
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is the default and is equivalent to the current behavior.
> *fast* is a shortcut for *10%, 10*, which avoids OCR unless more than 10% of
> the characters on the page are unmapped (or the page has fewer than 10
> characters):
> {code:java}
> fast
> {code}
> is equivalent to
> {code:java}
> 10%, 10
> {code}
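
The format described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name OcrStrategyAutoSketch, its fields, and the methods parse and shouldOcr are hypothetical and are not taken from the Tika code base. The shortcut values follow the description as written, and the handling of a page with zero characters is an assumption made to keep the percentage test away from a 0/0 division.

```java
import java.util.Locale;

/** Hypothetical sketch; names are illustrative, not Tika's actual classes. */
final class OcrStrategyAutoSketch {
    final double unmappedThreshold;  // m: a count, or a percentage when isPercent is true
    final boolean isPercent;
    final int totalCharsThreshold;   // n: OCR when the page has fewer characters than this

    OcrStrategyAutoSketch(double unmappedThreshold, boolean isPercent, int totalCharsThreshold) {
        this.unmappedThreshold = unmappedThreshold;
        this.isPercent = isPercent;
        this.totalCharsThreshold = totalCharsThreshold;
    }

    /** Parses "best", "fast", "m", "m%", "m, n" or "m%, n". */
    static OcrStrategyAutoSketch parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.equals("best")) {
            return new OcrStrategyAutoSketch(20, false, 10);  // values as given in the description
        }
        if (v.equals("fast")) {
            return new OcrStrategyAutoSketch(10, true, 10);
        }
        String[] parts = v.split(",");
        if (parts.length > 2) {
            throw new IllegalArgumentException("Expected 'm[%]' or 'm[%], n' but got: " + value);
        }
        String m = parts[0].trim();
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        double mValue = Double.parseDouble(m);
        // n is optional and defaults to 10
        int n = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : 10;
        return new OcrStrategyAutoSketch(mValue, percent, n);
    }

    /** Decides whether a page should be OCR'd under this configuration. */
    boolean shouldOcr(int totalChars, int unmappedChars) {
        if (totalChars < totalCharsThreshold) {
            return true;  // too little extracted text
        }
        if (!isPercent) {
            return unmappedChars > unmappedThreshold;
        }
        if (totalChars == 0) {
            return false;  // assumption: no characters means no ratio to test (avoids 0/0)
        }
        return 100.0 * unmappedChars / totalChars > unmappedThreshold;
    }

    public static void main(String[] args) {
        OcrStrategyAutoSketch fast = parse("fast");
        System.out.println(fast.shouldOcr(200, 15));
        System.out.println(fast.shouldOcr(200, 25));
    }
}
```

With the percentage form, a long page with a handful of unmapped characters stays below the threshold, which is the overhead concern raised in the description.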



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348848#comment-17348848
 ] 

Tim Allison commented on TIKA-3361:
---

Frankly, as long as we never get a divide by zero exception... :D



[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Peter Kronenberg (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348844#comment-17348844
 ] 

Peter Kronenberg commented on TIKA-3361:


No problem, I can take care of it.  What kind of range checks are you thinking 
of? Obviously, the percentages are 0-100.  For the numbers, the minimum is 0.  
What do you think the maximums should be?
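
A minimal sketch of the range checks mentioned here, assuming they run once when the configuration is initialized. The class and method names are hypothetical, not Tika's. Percentages are limited to 0-100 and counts to a minimum of 0; no maximum is enforced for counts because the thread leaves that question open.

```java
/** Hypothetical sketch of range checks at initialization; not Tika's actual code. */
final class OcrAutoRangeCheckSketch {

    /** Returns m unchanged if it is in range, otherwise throws. */
    static double checkUnmappedThreshold(double m, boolean isPercent) {
        if (Double.isNaN(m) || m < 0) {
            throw new IllegalArgumentException("Unmapped threshold must be >= 0: " + m);
        }
        if (isPercent && m > 100) {
            throw new IllegalArgumentException("Percentage must be between 0 and 100: " + m);
        }
        return m;
    }

    /** Returns n unchanged if it is in range, otherwise throws. */
    static int checkTotalCharsThreshold(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("Total characters threshold must be >= 0: " + n);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(checkUnmappedThreshold(10, true));
        System.out.println(checkTotalCharsThreshold(10));
    }
}
```

Failing at initialization surfaces a bad value when the configuration is loaded rather than on the first page that reaches the test.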



[jira] [Comment Edited] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348793#comment-17348793
 ] 

Tim Allison edited comment on TIKA-3361 at 5/20/21, 8:50 PM:
-

How about the PR as is, but change the terms to "faster" and "better"? 

If I wait to future proof this with the other things, it won't get in.  And I 
realize I'm the one holding this up. :(

Let's also add range checks on initialization.

I'm happy to do both of these if you're done with this. :D


was (Author: talli...@mitre.org):
How about the PR as is, but change the terms to "faster" and "better"? 

If we wait to future proof this with the other things, it won't get in.  And I 
realize I'm the one holding this up. :(



[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348793#comment-17348793
 ] 

Tim Allison commented on TIKA-3361:
---

How about the PR as is, but change the terms to "faster" and "better"? 

If we wait to future proof this with the other things, it won't get in.  And I 
realize I'm the one holding this up. :(



[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Peter Kronenberg (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348689#comment-17348689
 ] 

Peter Kronenberg commented on TIKA-3361:


[~tallison] Still thinking? :)



[jira] [Commented] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348688#comment-17348688
 ] 

Hudson commented on TIKA-3270:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #244 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/244/])
TIKA-3270 -- when rendering a page for OCR, do not include electronic text (as 
default) (tallison: 
[https://github.com/apache/tika/commit/8ec5d4483c953f6024cc470e780037e01530d7dd])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/NoTextPDFRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java


> Render non-text in PDFs for OCR
> -------------------------------
>
> Key: TIKA-3270
> URL: https://issues.apache.org/jira/browse/TIKA-3270
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: test-no-text.png, test.png, tiger-no-text.png, tiger.pdf
>
>
> When we render a PDF page for OCR, we are relying on PDFBox to render all of 
> the contents of the page, including text that may be available via regular 
> extraction methods.
> The result of this is that if a user selects ocr_and_text, there can be 
> duplicate text -- text as stored in PDFs and the text generated via OCR.  In 
> the xhtml output, we do mark a separate "div" for OCR so that users can 
> distinguish, but still, it might be useful not to have to run OCR on text 
> that was reliably extracted.
> One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a 
> technical/implementation recommendation by [~tilman] to subclass PDFRenderer 
> and PageDrawer to render only the image components of a page.
> This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.





[jira] [Commented] (TIKA-3408) Apache Tika 1.26 Metadata for MP4 and MP3.

2021-05-20 Thread Danny McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348679#comment-17348679
 ] 

Danny McKinney commented on TIKA-3408:
--

ExifTool Version Number : 12.25
File Name : B1.PC.000161970-10-min.mp4
Directory : .
File Size : 22 MiB
File Modification Date/Time : 2021:05:19 16:10:28-05:00
File Access Date/Time : 2021:05:20 13:28:04-05:00
File Creation Date/Time : 2021:05:19 16:10:00-05:00
File Permissions : -rw-rw-rw-
File Type : MP4
File Type Extension : mp4
MIME Type : video/mp4
Major Brand : MP4 Base Media v1 [IS0 14496-12:2003]
Minor Version : 0.2.0
Compatible Brands : isom, iso2, avc1, mp41
Media Data Size : 22682524
Media Data Offset : 48
Movie Header Version : 0
{color:#ffab00}Create Date : 0000:00:00 00:00:00{color}
{color:#ffab00}Modify Date : 0000:00:00 00:00:00{color}
Time Scale : 1000
Duration : 0:00:50
Preferred Rate : 1
Preferred Volume : 100.00%
Preview Time : 0 s
Preview Duration : 0 s
Poster Time : 0 s
Selection Time : 0 s
Selection Duration : 0 s
Current Time : 0 s
Next Track ID : 3
Track Header Version : 0
{color:#ffab00}Track Create Date : 0000:00:00 00:00:00{color}
{color:#ffab00}Track Modify Date : 0000:00:00 00:00:00{color}
Track ID : 1
Track Duration : 0:00:50
Track Layer : 0
Track Volume : 0.00%
Image Width : 1280
Image Height : 720
Graphics Mode : srcCopy
Op Color : 0 0 0
Compressor ID : avc1
Source Image Width : 1280
Source Image Height : 720
X Resolution : 72
Y Resolution : 72
Bit Depth : 24
Buffer Size : 0
Max Bitrate : 3500685
Average Bitrate : 3500685
Video Frame Rate : 30
Matrix Structure : 1 0 0 0 1 0 0 0 1
Media Header Version : 0
{color:#ffab00}Media Create Date : 0000:00:00 00:00:00{color}
{color:#ffab00}Media Modify Date : 0000:00:00 00:00:00{color}
Media Time Scale : 44100
Media Duration : 0:00:50
Media Language Code : und
Handler Description : Core Media Audio
Balance : 0
Audio Format : mp4a
Audio Channels : 2
Audio Bits Per Sample : 16
Audio Sample Rate : 44100
Handler Type : Metadata
Handler Vendor ID : Apple
Encoder : Lavf58.76.100
Image Size : 1280x720
Megapixels : 0.922
Avg Bitrate : 3.63 Mbps
Rotation : 0

 

The date/time values above are actually set to 0 in the metadata of the MP4
file. Tika seems to report them as the old Mac Classic epoch date,
"1904-01-01T00:00:00Z". Is there any way to change this default behavior?
The following is the output from a sample project, along with the code from
that project:

 

May 20, 2021 1:37:00 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Lavf58.76.100

date: 1904-01-01T00:00:00Z
X-Parsed-By: org.apache.tika.parser.DefaultParser
xmp:CreatorTool: Lavf58.76.100
{color:#ffab00}meta:creation-date: 1904-01-01T00:00:00Z{color}
{color:#ffab00}Creation-Date: 1904-01-01T00:00:00Z{color}
tiff:ImageLength: 720
{color:#ffab00}dcterms:created: 1904-01-01T00:00:00Z{color}
{color:#ffab00}dcterms:modified: 1904-01-01T00:00:00Z{color}
{color:#ffab00}Last-Modified: 1904-01-01T00:00:00Z{color}
{color:#ffab00}Last-Save-Date: 1904-01-01T00:00:00Z{color}
xmpDM:audioSampleRate: 1000
{color:#ffab00}meta:save-date: 1904-01-01T00:00:00Z{color}
{color:#ffab00}modified: 1904-01-01T00:00:00Z{color}
tiff:ImageWidth: 1280
xmpDM:duration: 50.0
Content-Type: video/mp4

BUILD SUCCESSFUL in 5s
2 actionable tasks: 2 executed
1:37:03 PM: Task execution finished 'Main.main()'.

 

Program: 

===

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        File file = new File(".\\Data\\B1.PC.000161970-10-min.mp4");

        // Parser method parameters
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream is = new FileInputStream(file)) {
            parser.parse(is, handler, metadata, context);
            System.out.println(handler);
            // Print every metadata element that was extracted
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        } catch (TikaException | IOException | SAXException e) {
            e.printStackTrace();
        }
    }
}

 

Version of Tika Core and Parsers used was 1.26.
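
The 1904 value is what a stored timestamp of 0 turns into, because MP4 creation and modification times count seconds from 1904-01-01T00:00:00Z. The sketch below shows that arithmetic and one way a caller could treat a zero timestamp as "not set" in its own post-processing. The class and method names are hypothetical, and this is not a Tika setting; whether Tika offers a switch for this is not established in this thread.

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical helper for post-processing extracted dates; not part of Tika. */
final class Mp4EpochSketch {
    // MP4 creation/modification times are seconds since this instant
    static final Instant MP4_EPOCH = Instant.parse("1904-01-01T00:00:00Z");

    /** Converts a raw MP4 timestamp to an Instant. */
    static Instant toInstant(long secondsSince1904) {
        return MP4_EPOCH.plusSeconds(secondsSince1904);
    }

    /** Treats a raw value of 0 as "no date recorded" instead of 1904-01-01. */
    static Optional<Instant> toInstantIfSet(long secondsSince1904) {
        return secondsSince1904 == 0
                ? Optional.empty()
                : Optional.of(toInstant(secondsSince1904));
    }

    /** True if an ISO-8601 date string equals the epoch, i.e. the raw value was 0. */
    static boolean looksUnset(String isoDate) {
        return MP4_EPOCH.equals(Instant.parse(isoDate));
    }

    public static void main(String[] args) {
        System.out.println(toInstant(0));
        System.out.println(toInstantIfSet(0).isPresent());
        System.out.println(looksUnset("1904-01-01T00:00:00Z"));
    }
}
```

In the sample program, a check like looksUnset(metadata.get(name)) could be applied to the date fields before printing them, so that unset dates are skipped or reported as missing.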

> Apache Tika 1.26 Metadata for MP4 and MP3.
> --
>
> Key: 

[jira] [Reopened] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-3270:
---

Have to rework the logic a bit.  The current default rendering strategy is
"render with no text and then run OCR".  However, we should make the default
a bit smarter...an AUTO rendering mode.

If you're in AUTO (OCR mode) and OCR is triggered because of missing Unicode
code points, then you'd want to run OCR on everything. If it is triggered
because of too few characters, then you'd still want to run OCR on everything.

If you're in OCR_ONLY mode, you'd want to run OCR on everything (or maybe 
_only_ the text?)

If you're in TEXT_AND_OCR mode, you'd want OCR on the not-text bits.
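
The three cases above can be sketched as a small lookup. This is a sketch only: the enum and method names are hypothetical rather than Tika's API, and for AUTO it assumes OCR has already been triggered for the page. The open question for OCR_ONLY (everything, or only the text) is resolved here as "everything" purely for illustration.

```java
/** Hypothetical sketch; enum and method names are illustrative, not Tika's API. */
final class RenderChoiceSketch {
    enum OcrMode { AUTO, OCR_ONLY, TEXT_AND_OCR }
    enum RenderTarget { EVERYTHING, NON_TEXT_ONLY }

    /** For AUTO, call this only once OCR has already been triggered for the page. */
    static RenderTarget choose(OcrMode mode) {
        switch (mode) {
            case AUTO:       // either trigger (unmapped code points, too few chars)
            case OCR_ONLY:   // assumption: render everything
                return RenderTarget.EVERYTHING;
            case TEXT_AND_OCR:
                return RenderTarget.NON_TEXT_ONLY;  // text is extracted separately
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        for (OcrMode mode : OcrMode.values()) {
            System.out.println(mode + " -> " + choose(mode));
        }
    }
}
```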





[jira] [Commented] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348657#comment-17348657
 ] 

Tim Allison commented on TIKA-3270:
---

This is a breaking change.  The default is now to render the page without text 
when running OCR.  We can revisit this if this causes havoc.



[jira] [Resolved] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3270.
---
Fix Version/s: 2.0.0
 Assignee: Tim Allison
   Resolution: Fixed

This is now in 2.0.0.  I realize that we might want to add a render-text-only
strategy for cases where there's electronic text but the Unicode mappings are
broken... This may need further tweaks, but the rendering without text was
easy.  Thank you [~tilman] and [~lfcnassif]!



[jira] [Commented] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348620#comment-17348620
 ] 

Tim Allison commented on TIKA-3270:
---

[~tilman] it really is that easy! :D



[jira] [Updated] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3270:
--
Attachment: tiger-no-text.png



[jira] [Updated] (TIKA-3270) Render non-text in PDFs for OCR

2021-05-20 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3270:
--
Attachment: test-no-text.png
test.png

> Render non-text in PDFs for OCR
> ---
>
> Key: TIKA-3270
> URL: https://issues.apache.org/jira/browse/TIKA-3270
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
> Attachments: test-no-text.png, test.png, tiger.pdf
>
>
> When we render a PDF page for OCR, we are relying on PDFBox to render all of 
> the contents of the page, including text that may be available via regular 
> extraction methods.
> The result of this is that if a user selects ocr_and_text, there can be 
> duplicate text -- text as stored in PDFs and the text generated via OCR.  In 
> the xhtml output, we do mark a separate "div" for OCR so that users can 
> distinguish, but still, it might be useful not to have to run OCR on text 
> that was reliably extracted.
> One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a 
> technical/implementation recommendation by [~tilman] to subclass PDFRenderer 
> and PageDrawer to render only the image components of a page.
> This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.





[jira] [Commented] (TIKA-3410) Clean up logging in PipesServer

2021-05-20 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348597#comment-17348597
 ] 

Hudson commented on TIKA-3410:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #243 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/243/])
TIKA-3410 -- implement actual logging in PipesServer (tallison: 
[https://github.com/apache/tika/commit/da1ba84c32d53f7d9d42da0356a9b4b76b275e5e])
* (edit) tika-core/pom.xml
* (add) 
tika-pipes/tika-pipes-integration-tests/src/test/resources/pipes-fork-server-custom-log4j2.xml
* (add) tika-core/src/main/resources/pipes-fork-server-default-log4j2.xml
* (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTikaBinTest.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesClient.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java
* (edit) 
tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/solrtest/TikaPipesSolrTestBase.java


> Clean up logging in PipesServer
> ---
>
> Key: TIKA-3410
> URL: https://issues.apache.org/jira/browse/TIKA-3410
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0.0
>
>
> We should clean up logging to allow actual logging in the PipesServer.  We 
> should include a default log4j2.xml file for PipesServer, but allow users to 
> configure their logging as they want via 





Re: Recording/Streaming Apache Tika Virtual Meetings to YouTube

2021-05-20 Thread Rich Bowen



On 2021/05/20 09:07:16, Bertrand Delacretaz  wrote: 
..
> > I am not aware of any such service (or tool) that is provided by ASF
> > to the projects to host the meetings...
> 
> I think this is changing, our conferences team might soon be able to
> provide conferencing services for projects to host virtual events.

Our conferences platform (Hopin - where we ran ApacheCon) is ideal for things 
that look like conferences - multi-presentation, possibly multi-day. There are 
tools for "audience" interaction, but it is primarily a presenter/audience 
dynamic.

That is not a "no", but 1) it might not be the right tool for a community 
meeting and 2) it might not be cost-effective for a monthly/weekly meeting kind 
of scenario. Each "event" is a standalone thing, and each iteration of the 
meeting would have to be created, separately, as an "event", and people would 
have to register for *that* event. That is, each person has to register (and I 
have to pay) each week/month.


Re: Recording/Streaming Apache Tika Virtual Meetings to YouTube

2021-05-20 Thread Swapnil M Mane
Great, thank you Bertrand and Sally!
Lewis, wishing the best to you and the Tika community for the event!

Best Regards,
Swapnil M Mane,
www.apache.org

On Thu, May 20, 2021 at 3:11 PM Sally Khudairi  wrote:
>
> +1; thank you, Bertrand.
>
> If memory serves me correctly, we recently hosted a Cassandra event using our 
> Hopin account. The turnout was more than 4x what was anticipated.
>
> Here's hoping you'll have a great Tika community event! Once you have the 
> recordings, we'll be happy to help post to the ASF's YouTube channel.
>
> Best,
> Sally
>
> - - -
> Vice President Marketing & Publicity
> Vice President Sponsor Relations
> The Apache Software Foundation
>
> Tel +1 617 921 8656 | s...@apache.org
>
> On Thu, May 20, 2021, at 05:07, Bertrand Delacretaz wrote:
> > Hi,
> >
> > On Thu, May 20, 2021 at 10:51 AM Swapnil M Mane  
> > wrote:
> > > On Wed, May 19, 2021 at 11:27 PM lewis john mcgibbney
> > >  wrote:
> > > > ...The meeting was hosted on a paid version of WebEx. It would be great 
> > > > if we could move away from this for the next meeting.
> > > >
> > >
> > > I am not aware of any such service (or tool) that is provided by ASF
> > > to the projects to host the meetings...
> >
> > I think this is changing, our conferences team might soon be able to
> > provide conferencing services for projects to host virtual events.
> >
> > Lewis, I suggest you get in touch with plann...@apachecon.com for more
> > info about that.
> >
> > -Bertrand
> >


[jira] [Commented] (TIKA-3408) Apache Tika 1.26 Metadata for MP4 and MP3.

2021-05-20 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348193#comment-17348193
 ] 

Nick Burch commented on TIKA-3408:
--

I'm not sure what you mean by an epoch date here, and I can't see any reference 
to it in your sample program. Can you please clarify what metadata field you 
are / aren't seeing the date in, and what date you are expecting there vs what 
you get?

> Apache Tika 1.26 Metadata for MP4 and MP3.
> --
>
> Key: TIKA-3408
> URL: https://issues.apache.org/jira/browse/TIKA-3408
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.26
>Reporter: Danny McKinney
>Priority: Minor
>
> Currently the parser is returning incorrect date information from mp3 files and 
> mpeg4 files. Our sample is returning date fields with epoch date values which 
> start at 1904. Also, the mp3 file is not returning a date value although one is 
> part of the header information. I have attached a sample program and data files.
>  
> Files (Upload Did not Work): 
> [https://drive.google.com/file/d/1qQmRcqABkwfrR1uuO_m3scXl2dKzxtv4/view?usp=sharing]





Re: Recording/Streaming Apache Tika Virtual Meetings to YouTube

2021-05-20 Thread Sally Khudairi
+1; thank you, Bertrand.

If memory serves me correctly, we recently hosted a Cassandra event using our 
Hopin account. The turnout was more than 4x what was anticipated.

Here's hoping you'll have a great Tika community event! Once you have the 
recordings, we'll be happy to help post to the ASF's YouTube channel.

Best,
Sally

- - - 
Vice President Marketing & Publicity
Vice President Sponsor Relations
The Apache Software Foundation

Tel +1 617 921 8656 | s...@apache.org

On Thu, May 20, 2021, at 05:07, Bertrand Delacretaz wrote:
> Hi,
> 
> On Thu, May 20, 2021 at 10:51 AM Swapnil M Mane  
> wrote:
> > On Wed, May 19, 2021 at 11:27 PM lewis john mcgibbney
> >  wrote:
> > > ...The meeting was hosted on a paid version of WebEx. It would be great 
> > > if we could move away from this for the next meeting.
> > >
> >
> > I am not aware of any such service (or tool) that is provided by ASF
> > to the projects to host the meetings...
> 
> I think this is changing, our conferences team might soon be able to
> provide conferencing services for projects to host virtual events.
> 
> Lewis, I suggest you get in touch with plann...@apachecon.com for more
> info about that.
> 
> -Bertrand
> 


Re: Recording/Streaming Apache Tika Virtual Meetings to YouTube

2021-05-20 Thread Bertrand Delacretaz
Hi,

On Thu, May 20, 2021 at 10:51 AM Swapnil M Mane  wrote:
> On Wed, May 19, 2021 at 11:27 PM lewis john mcgibbney
>  wrote:
> > ...The meeting was hosted on a paid version of WebEx. It would be great if 
> > we could move away from this for the next meeting.
> >
>
> I am not aware of any such service (or tool) that is provided by ASF
> to the projects to host the meetings...

I think this is changing, our conferences team might soon be able to
provide conferencing services for projects to host virtual events.

Lewis, I suggest you get in touch with plann...@apachecon.com for more
info about that.

-Bertrand


Re: Recording/Streaming Apache Tika Virtual Meetings to YouTube

2021-05-20 Thread Swapnil M Mane
Hi Lewis,
Please find my comments inline.

On Wed, May 19, 2021 at 11:27 PM lewis john mcgibbney
 wrote:
>
> Hi Swapnil,
> Excellent, thank you. Replies inline below.
>
> On Wed, May 19, 2021 at 9:53 AM Swapnil M Mane  
> wrote:
>>
>>
>> If it is a community meetup where the participant has active
>> involvement in conversation, we should not go for YouTube live.
>
>
> It IS a community meetup in which participants actively engage and trade 
> conversation and opinions. So it sounds like YouTube live is not the correct 
> solution.

+1

>
>>
>> One of the popular tools used for live streams is Streamyard. You can
>> find more details here [1].
>
>
> I had never heard of it, thanks for the pointer.
>
>>
>>
>> By the way, which tool did the community use for the last meeting (Zoom,
>> Google Meet, or something else)?
>
>
> The meeting was hosted on a paid version of WebEx. It would be great if we 
> could move away from this for the next meeting.
>

I am not aware of any such service (or tool) that is provided by ASF
to the projects to host the meetings.

> lewismc