
[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

2019-05-20 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843830#comment-16843830
 ] 

ASF GitHub Bot commented on TIKA-2293:
--

changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler 
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-493923095
 
 
   17:33:28.423 [main] ERROR net.sourceforge.tess4j.Tesseract - Unsupported image format. May need to install JAI Image I/O package.
   https://github.com/jai-imageio/jai-imageio-core
   java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
   https://github.com/jai-imageio/jai-imageio-core
       at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:214)
       at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
       at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:397)
       at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:242)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
       at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
       at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:391)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:264)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:206)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:139)
       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:156)
       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
       at org.apache.tika.Tika.parseToString(Tika.java:608)
       at org.apache.tika.Tika.parseToString(Tika.java:723)
       at com.tika.test.tt.main(tt.java:20)
   17:33:28.423 [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
   https://github.com/jai-imageio/jai-imageio-core
   net.sourceforge.tess4j.TesseractException: java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
   https://github.com/jai-imageio/jai-imageio-core
       at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:245)
       at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
       at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:397)
       at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:242)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
       at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
       at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:391)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:264)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:206)
       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:139)
       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:156)
       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
       at org.apache.tika.Tika.parseToString(Tika.java:608)
       at org.apache.tika.Tika.parseToString(Tika.java:723)
       at com.tika.test.tt.main(tt.ja

[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

2019-05-20 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844488#comment-16844488
 ] 

ASF GitHub Bot commented on TIKA-2293:
--

changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler 
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-494233047
 
 
   I found the main cause of this problem: when Tika parses an image embedded 
in a Word document, it writes the image to a temporary file with a .tmp suffix 
and passes that file to Tess4J for recognition, and Tess4J does not recognize 
it. How can I change the file type that Tika generates so that Tess4J accepts 
it? Can you give me some ideas?
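
One possible direction, sketched below, is to give the temporary file a suffix that matches the image's media type instead of a generic .tmp suffix. This assumes that Tess4J picks its image reader from the file name extension, which is what the diagnosis above implies but is not confirmed here. The class name `OcrTempFiles`, the `tika-ocr-` prefix, and the media type table are hypothetical and are not part of Tika or Tess4J.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;

final class OcrTempFiles {
    // Illustrative mapping only (an assumption); extend it for the formats you need.
    private static final Map<String, String> SUFFIX_BY_MEDIA_TYPE = Map.of(
            "image/png", ".png",
            "image/jpeg", ".jpg",
            "image/tiff", ".tif",
            "image/bmp", ".bmp",
            "image/gif", ".gif");

    /** Returns a file suffix for the media type, or ".tmp" if the type is unknown. */
    static String suffixFor(String mediaType) {
        if (mediaType == null) {
            return ".tmp";
        }
        return SUFFIX_BY_MEDIA_TYPE.getOrDefault(mediaType.toLowerCase(Locale.ROOT), ".tmp");
    }

    /** Writes the image bytes to a temp file whose name ends with the matching suffix. */
    static Path writeTempImage(String mediaType, byte[] imageBytes) throws IOException {
        Path file = Files.createTempFile("tika-ocr-", suffixFor(mediaType));
        Files.write(file, imageBytes);
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeTempImage("image/png", new byte[0]);
        System.out.println(file.getFileName());
        Files.deleteIfExists(file);
    }
}
```

The resulting file could then be handed to the OCR step in place of the .tmp file. Unknown media types still fall back to .tmp, so those images would fail in the same way as before.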
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Thejan Wijesinghe
>Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J 
> API instead of executing tesseract out of process via Runtime.exec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2019-05-24 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847514#comment-16847514
 ] 

ASF GitHub Bot commented on TIKA-2630:
--

saitho commented on issue #255: TIKA-2630: Wrong height and width metadata for 
JPEG images
URL: https://github.com/apache/tika/pull/255#issuecomment-495606025
 
 
   What's the status on this? :)
 



> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73&zoom=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>     metadata.set(Metadata.IMAGE_WIDTH,
>             trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
> }
> // Guard on the height tag before reading the height.
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>     metadata.set(Metadata.IMAGE_LENGTH,
>             trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
> }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels
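
The precedence the report asks for can be illustrated with a small standalone sketch: prefer the Exif SubIFD dimension and fall back to the IFD0 value only when it is missing. It does not use metadata-extractor or Tika; the class `ExifDimensionPreference`, the map keys, and the local `trimPixels` stand-in are assumptions made for illustration, and the sample values are the ones listed in the report.

```java
import java.util.HashMap;
import java.util.Map;

final class ExifDimensionPreference {
    /** Local stand-in for a trimPixels helper: "1535 pixels" becomes "1535". */
    static String trimPixels(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        String unit = " pixels";
        return trimmed.endsWith(unit)
                ? trimmed.substring(0, trimmed.length() - unit.length())
                : trimmed;
    }

    /** Returns the preferred tag's value if present, otherwise the fallback tag's value. */
    static String resolve(Map<String, String> tags, String preferredKey, String fallbackKey) {
        String value = tags.get(preferredKey);
        if (value == null) {
            value = tags.get(fallbackKey);
        }
        return trimPixels(value);
    }

    /** Sample values taken from the metadata-extractor output quoted in the report. */
    static Map<String, String> sampleTags() {
        Map<String, String> tags = new HashMap<>();
        tags.put("Exif IFD0/Image Width", "2173 pixels");
        tags.put("Exif IFD0/Image Height", "520 pixels");
        tags.put("Exif SubIFD/Exif Image Width", "1535 pixels");
        tags.put("Exif SubIFD/Exif Image Height", "367 pixels");
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> tags = sampleTags();
        String width = resolve(tags, "Exif SubIFD/Exif Image Width", "Exif IFD0/Image Width");
        String height = resolve(tags, "Exif SubIFD/Exif Image Height", "Exif IFD0/Image Height");
        System.out.println(width + " x " + height);
    }
}
```

With the sample values, the SubIFD dimensions win, which matches the JPEG dimensions reported by metadata-extractor.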




[jira] [Commented] (TIKA-2888) Add wmv2 codec detection to ASF container

2019-05-31 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853398#comment-16853398
 ] 

ASF GitHub Bot commented on TIKA-2888:
--

avendasora commented on pull request #272: TIKA-2888 Add wmv2 codec detection 
for WMV files
URL: https://github.com/apache/tika/pull/272
 
 
   Changes to enable correct detection of .wmv files (ASF Container files) with 
video tracks encoded using the "wmv2" codec. Previously, their MediaType was 
incorrectly detected as `audio/x-ms-wma` instead of `video/x-ms-wmv`.
 



> Add wmv2 codec detection to ASF container
> -
>
> Key: TIKA-2888
> URL: https://issues.apache.org/jira/browse/TIKA-2888
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Affects Versions: 1.21
>Reporter: David Avendasora
>Priority: Major
>  Labels: codec, container, easyfix, video
> Attachments: Video1.WMV, sample.wmv
>
>
> The attached files are .wmv files (ASF container) with video tracks encoded 
> using the {{WMV2}} codec. They are incorrectly detected as audio 
> ({{audio/x-ms-wma}}) files instead of video ({{video/x-ms-wmv}}) files. 
> Adding the following line to the {{tika-mimetypes.xml}} file fixes the issue:
> {{    }}
> Test Files:
>  * [http://techslides.com/demos/samples/sample.wmv]
>  * [http://www.lehman.edu/faculty/hoffmann/itc/techteach/video/Video1.WMV] 
> Related to TIKA-939
> I will submit a pull request with the above changes.




[jira] [Commented] (TIKA-2896) NullPointerException in MimeTypesReader.releaseParser()

2019-06-17 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865970#comment-16865970
 ] 

ASF GitHub Bot commented on TIKA-2896:
--

dannysmyda commented on pull request #274: TIKA-2896 Null check before 
releasing parser in MimeTypesReader
URL: https://github.com/apache/tika/pull/274
 
 
   
 



> NullPointerException in MimeTypesReader.releaseParser()
> ---
>
> Key: TIKA-2896
> URL: https://issues.apache.org/jira/browse/TIKA-2896
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.21
>Reporter: Eamonn Saunders
>Priority: Major
>
> We have encountered a situation where the call to parser.reset() in the 
> following code snippet results in a NullPointerException.
> {code:java}
>     private static void releaseParser(SAXParser parser) {
>         try {
>             parser.reset();
>         } catch (UnsupportedOperationException e) {
>             //ignore
>         }
> {code}
> releaseParser() is called in the finally block of MimeTypesReader.read():
> {code:java}
>     public void read(InputStream stream) throws IOException, MimeTypeException {
>         SAXParser parser = null;
>         try {
>             parser = acquireSAXParser();
>             parser.parse(stream, this);
>         } catch (TikaException e) {
>             throw new MimeTypeException("Unable to create an XML parser", e);
>         } catch (SAXException e) {
>             throw new MimeTypeException("Invalid type configuration", e);
>         } finally {
>             releaseParser(parser);
>         }
>     }{code}
> The parser variable in read() is still null when the finally block runs if 
> acquireSAXParser() is called on a thread that is interrupted: the 
> InterruptedException is caught and rethrown as a TikaException in the 
> following code, so the assignment never happens:
> {code:java}
>     private static SAXParser acquireSAXParser()
>             throws TikaException {
>         while (true) {
>             SAXParser parser = null;
>             try {
>                 READ_WRITE_LOCK.readLock().lock();
>                 parser = SAX_PARSERS.poll(10, TimeUnit.MILLISECONDS);
>             } catch (InterruptedException e) {
>                 throw new TikaException("interrupted while waiting for SAXParser", e);
>             } finally {
>                 READ_WRITE_LOCK.readLock().unlock();
>             }
>             if (parser != null) {
>                 return parser;
>             }
>         }
>     }{code}
> A simple fix would be to check for null before calling releaseParser() in the 
> finally block.
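
The null check proposed above can be sketched in isolation as follows. This is a simplified model rather than the Tika code: `FakeParser`, `acquire`, `release`, and `read` are hypothetical stand-ins for SAXParser, acquireSAXParser(), releaseParser(), and MimeTypesReader.read(), and the guard is placed inside the release method here, which has the same effect as checking before the call.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class NullSafeRelease {
    /** Hypothetical stand-in for a pooled parser. */
    static final class FakeParser {
        final AtomicInteger resets = new AtomicInteger();

        void reset() {
            resets.incrementAndGet();
        }
    }

    /** Fails before returning a parser when 'interrupted' is true. */
    static FakeParser acquire(boolean interrupted) throws Exception {
        if (interrupted) {
            throw new Exception("interrupted while waiting for parser");
        }
        return new FakeParser();
    }

    /** Does nothing when acquisition failed and no parser was ever assigned. */
    static void release(FakeParser parser) {
        if (parser == null) {
            return;
        }
        parser.reset();
    }

    static String read(boolean interrupted) {
        FakeParser parser = null;
        try {
            parser = acquire(interrupted);
            return "parsed";
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        } finally {
            // Without the guard in release(), a failed acquire would cause
            // a NullPointerException here and hide the original failure.
            release(parser);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(false));
        System.out.println(read(true));
    }
}
```

The point of the guard is that the original failure stays visible to the caller instead of being replaced by a NullPointerException thrown from the finally block.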




[jira] [Commented] (TIKA-2896) NullPointerException in MimeTypesReader.releaseParser()

2019-06-18 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866357#comment-16866357
 ] 

ASF GitHub Bot commented on TIKA-2896:
--

sberyozkin commented on pull request #274: TIKA-2896 Null check before 
releasing parser in MimeTypesReader
URL: https://github.com/apache/tika/pull/274
 
 
   
 







[jira] [Commented] (TIKA-2922) Regression issue with detecting .dotx and .xlam MS Office mime-types

2019-08-12 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904905#comment-16904905
 ] 

ASF GitHub Bot commented on TIKA-2922:
--

essiembre commented on pull request #279: Fix for TIKA-2922 contributed by 
pascal.essiembre
URL: https://github.com/apache/tika/pull/279
 
 
   
 



> Regression issue with detecting .dotx and .xlam MS Office mime-types
> 
>
> Key: TIKA-2922
> URL: https://issues.apache.org/jira/browse/TIKA-2922
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.22
> Environment: N/A
>Reporter: Pascal Essiembre
>Priority: Minor
>
> After upgrading to 1.22, .dotx and .xlam files are no longer detected 
> properly. 
> They are now detected as:
>  
> {noformat}
> .dotx -> vnd.ms-word.template.macroenabled.12
> .xlam -> application/x-tika-ooxml{noformat}
>  
> They should be detected like they originally were: 
> {noformat}
> .dotx -> vnd.openxmlformats-officedocument.wordprocessingml.template
> .xlam -> application/vnd.ms-excel.addin.macroenabled.12{noformat}
> Reference: 
> [https://docs.microsoft.com/en-us/previous-versions/office/office-2007-resource-kit/ee309278(v=office.12)]
> It is happening in StreamingZipContainerDetector and ZipContainerDetectorBase.
> I will submit a pull request shortly with the correct mapping.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Commented] (TIKA-2922) Regression issue with detecting .dotx and .xlam MS Office mime-types

2019-08-12 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905423#comment-16905423
 ] 

ASF GitHub Bot commented on TIKA-2922:
--

tballison commented on issue #279: Fix for TIKA-2922 contributed by 
pascal.essiembre
URL: https://github.com/apache/tika/pull/279#issuecomment-520522861
 
 
   Thank you @essiembre !
 


[jira] [Commented] (TIKA-2922) Regression issue with detecting .dotx and .xlam MS Office mime-types

2019-08-12 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/TIKA-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905422#comment-16905422
 ] 

ASF GitHub Bot commented on TIKA-2922:
--

tballison commented on pull request #279: Fix for TIKA-2922 contributed by 
pascal.essiembre
URL: https://github.com/apache/tika/pull/279
 
 
   
 


[jira] [Commented] (TIKA-2931) Tika CLI shouldn't log with System.out.println

2019-08-29 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918731#comment-16918731
 ] 

ASF GitHub Bot commented on TIKA-2931:
--

tballison commented on issue #281: TIKA-2931
URL: https://github.com/apache/tika/pull/281#issuecomment-526248388
 
 
   I'm good with this (once squashed, which I can do at merge time).
   
   I suspect this code was added before we had logging.  I wonder if we should 
use LOG.info() instead, and change the unit test to check for the 
existence/size of the output file. 
 



> Tika CLI shouldn't log with System.out.println
> --
>
> Key: TIKA-2931
> URL: https://issues.apache.org/jira/browse/TIKA-2931
> Project: Tika
>  Issue Type: Improvement
>Reporter: Eric Pugh
>Assignee: Tim Allison
>Priority: Minor
>
> Running tika-app on the command line, I expect the output on STDOUT to be a 
> single JSON response, with logging going to STDERR. That is what happens, 
> except when the document has an embedded image: then there is what I think is 
> a stray System.out.println:
> https://github.com/apache/tika/blob/72f4f9bd999569797360b16f92b02ea92216ac22/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L1054
> This causes my output to be a mix of regular text and JSON! See the example 
> below.
> Extracting 'image0.tif' (image/tiff) to ./image0.tif
> [
>   {
> "Author": "Federal Reserve Board",
> "Content-Length": "345888"
>   }
> ]
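
The separation the report expects can be sketched as below: status messages go to one stream and the JSON payload to another, so that STDOUT carries only the JSON. This is a minimal illustration, not the TikaCLI code; the class `CliOutput` and its methods are hypothetical, and writing status lines to STDERR is a simplification of the logging-based approach suggested in the comment above.

```java
import java.io.PrintStream;

final class CliOutput {
    private final PrintStream data;   // machine-readable output, e.g. System.out
    private final PrintStream status; // human-readable progress, e.g. System.err

    CliOutput(PrintStream data, PrintStream status) {
        this.data = data;
        this.status = status;
    }

    /** Progress or diagnostic message; never written to the data stream. */
    void status(String message) {
        status.println(message);
    }

    /** The actual result, kept free of any status text. */
    void result(String json) {
        data.println(json);
    }

    public static void main(String[] args) {
        CliOutput out = new CliOutput(System.out, System.err);
        out.status("Extracting 'image0.tif' (image/tiff) to ./image0.tif");
        out.result("[{\"Author\": \"Federal Reserve Board\"}]");
    }
}
```

With this split, redirecting STDOUT to a file yields parseable JSON, while the "Extracting ..." line still appears on the terminal.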



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Commented] (TIKA-2931) Tika CLI shouldn't log with System.out.println

2019-09-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922426#comment-16922426
 ] 

ASF GitHub Bot commented on TIKA-2931:
--

tballison commented on pull request #281: TIKA-2931
URL: https://github.com/apache/tika/pull/281
 
 
   
 


[jira] [Commented] (TIKA-2949) Update Jackson to 2.9.10

2019-09-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937917#comment-16937917
 ] 

ASF GitHub Bot commented on TIKA-2949:
--

coheigea commented on pull request #283: TIKA-2949 - Update Jackson to 2.9.10
URL: https://github.com/apache/tika/pull/283
 
 
   Jackson should be updated to the latest 2.9.10 version to pick up fixes for 
some CVEs (e.g. CVE-2019-14540)
 



> Update Jackson to 2.9.10
> 
>
> Key: TIKA-2949
> URL: https://issues.apache.org/jira/browse/TIKA-2949
> Project: Tika
>  Issue Type: Bug
>Reporter: Colm O hEigeartaigh
>Priority: Major
> Fix For: 2.0.0
>
>
> Jackson should be updated to the latest 2.9.10 version to pick up fixes for 
> some CVEs (e.g. CVE-2019-14540)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2954) Remove Magic Numbers from ImageMetadataExtractor

2019-10-02 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942824#comment-16942824
 ] 

ASF GitHub Bot commented on TIKA-2954:
--

chriszappia commented on pull request #284: Fix for TIKA-2954 contributed by 
chriszappia
URL: https://github.com/apache/tika/pull/284
 
 
   See https://issues.apache.org/jira/browse/TIKA-2954
 



> Remove Magic Numbers from ImageMetadataExtractor
> 
>
> Key: TIKA-2954
> URL: https://issues.apache.org/jira/browse/TIKA-2954
> Project: Tika
>  Issue Type: Improvement
>Reporter: Chris Z
>Priority: Trivial
>
> There are magic numbers used in 
> {{ImageMetadataExtractor.TiffPageNumberHandler}}, and a comment suggesting 
> they be removed once the MetadataExtractor dependency has been updated.
> Since this has happened, these can be cleaned up.




[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946386#comment-16946386
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

LukeButters commented on pull request #285: Fix for TIKA-2955 filter out 
invalid HTML characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285
 
 
   
 



> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Attachments: 314.pdf, fix_with_tests.txt
>
>
> Hi, I am trying to parse [^314.pdf].
> When I try to convert it to XHTML, my XML parser fails with:
> {code}
> 14:35:12.876 [main] ERROR com.funnelback.common.filter.TikaFilterProvider - Unable to filter stream with document type '.pdf'
> org.xml.sax.SAXException: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>  at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:538) ~[Saxon-HE-9.9.0-2.jar:?]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:274) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:229) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:556) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267) ~[pdfbox-2.0.12.jar:2.0.12]
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> [removed section of trace]
> Caused by: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>  at net.sf.saxon.serialize.HTMLEmitter.writeEscape(HTMLEmitter.java:379) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.XMLEmitter.characters(XMLEmitter.java:662) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLEmitter.characters(HTMLEmitter.java:441) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLIndenter.characters(HTMLIndenter.java:216) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.SequenceNormalizer.characters(SequenceNormalizer.java:183) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:646) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:526) ~[Saxon-HE-9.9.0-2.jar:?]
>  ... 43 more
> {code}
> It looks like Tika is asking the XML library to handle character 147 (i.e. 
> 0x93), which is not allowed in HTML.
> The Saxon XML library rejects it. I think the default Java one doesn't 
> complain when given the invalid character, but Tika is still probably wrong 
> to write out that character when writing XHTML.
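
A minimal sketch of the kind of filtering discussed for this issue, assuming
the goal is simply to drop characters 0x7F to 0x9F (DEL plus the C1 control
range, which includes 0x93) before text reaches the XHTML handler. This is
not the code from fix_with_tests.txt or from Tika itself; the class name
HtmlCharFilter and both of its methods are hypothetical, chosen only for
illustration.

```java
public final class HtmlCharFilter {
    private HtmlCharFilter() {}

    /** Hypothetical helper: true for DEL (0x7F) and the C1 controls (0x80 to 0x9F). */
    public static boolean isInvalidHtmlChar(char c) {
        return c >= 0x7F && c <= 0x9F;
    }

    /** Hypothetical helper: returns a copy of the input without characters 0x7F to 0x9F. */
    public static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // The filtered range lies outside the surrogate range, so
            // scanning char by char leaves supplementary characters intact.
            if (!isInvalidHtmlChar(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the Windows-1252 curly quotes misread as C1 controls.
        String raw = "quoted " + (char) 0x93 + "text" + (char) 0x94;
        System.out.println(strip(raw)); // prints "quoted text"
    }
}
```

Dropping the characters is only one option; mapping 0x93 and 0x94 to real
quotation marks would keep more of the original text, at the cost of guessing
the source encoding.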



--
This message was sent by Atlassian Jira
(v8.3

[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948024#comment-16948024
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

LukeButters commented on issue #285: Fix for TIKA-2955 filter out invalid HTML 
characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285#issuecomment-540217923
 
 
   @tballison the PR for the invalid HTML chars
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Attachments: 314.pdf, fix_with_tests.txt
>
>
> Hi, I am trying to parse: [^314.pdf]
> what is happening when I try to convert it to XHTML is my XML parser fails 
> because:
> {code}
> 14:35:12.876 [main] ERROR com.funnelback.common.filter.TikaFilterProvider - 
> Unable to filter stream with document type '.pdf'
> org.xml.sax.SAXException: net.sf.saxon.trans.XPathException: Illegal HTML 
> character: decimal 147
>  at 
> net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:538)
>  ~[Saxon-HE-9.9.0-2.jar:?]
>  at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:274)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:229)
>  ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:556)
>  ~[tika-parsers-1.19.1.jar:1.19.1]
>  at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267) 
> ~[pdfbox-2.0.12.jar:2.0.12]
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) 
> ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) 
> ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) 
> ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) 
> ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) 
> ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> [removed section of trace]
> Caused by: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 
> 147
>  at net.sf.saxon.serialize.HTMLEmitter.writeEscape(HTMLEmitter.java:379) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.XMLEmitter.characters(XMLEmitter.java:662) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLEmitter.characters(HTMLEmitter.java:441) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLIndenter.characters(HTMLIndenter.java:216) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at 
> net.sf.saxon.event.SequenceNormalizer.characters(SequenceNormalizer.java:183) 
> ~[Saxon-HE-9.9.0-2.jar:?]
>  at 
> net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:646)
>  ~[Saxon-HE-9.9.0-2.jar:?]
>  at 
> net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:526)
>  ~[Saxon-HE-9.9.0-2.jar:?]
>  ... 43 more
> {code}
> It looks like Tika is asking the XML library to handle character 147 (i.e. 
> 0x93), which is not allowed in HTML.
> The Saxon XML library rejects it. I think the default Java one doesn't 
> complain when given the invalid character, but Tika is still probably wrong 
> to write out that character when writing XHTML.

[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948147#comment-16948147
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

tballison commented on pull request #285: Fix for TIKA-2955 filter out invalid 
HTML characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285
 
 
   
 



> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Attachments: 314.pdf, fix_with_tests.txt
>
>



--
This message was sent by Atlassian Jira
(v8.3.4

[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948191#comment-16948191
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

LukeButters commented on issue #285: Fix for TIKA-2955 filter out invalid HTML 
characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285#issuecomment-540346949
 
 
   Thanks, Tim. Do I now need to do something to cherry-pick it into a 1.x 
version?
 



> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Fix For: 1.23
>
> Attachments: 314.pdf, fix_with_tests.txt
>
>

[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948193#comment-16948193
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

tballison commented on issue #285: Fix for TIKA-2955 filter out invalid HTML 
characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285#issuecomment-540348989
 
 
   I took care of it in branch_1x. You’ve done plenty. Thank you!
   
   On Wed, Oct 9, 2019 at 9:22 PM Luke Butters wrote:
   
   > Thanks, Tim. Do I now need to do something to cherry-pick it into a 1.x
   > version?
   
 



> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Fix For: 1.23
>
> Attachments: 314.pdf, fix_with_tests.txt
>
>

[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949041#comment-16949041
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

LukeButters commented on issue #285: Fix for TIKA-2955 filter out invalid HTML 
characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285#issuecomment-540860215
 
 
   ah neat, does a date exist for when that will be released?
 



> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Fix For: 1.23
>
> Attachments: 314.pdf, fix_with_tests.txt
>
>

[jira] [Commented] (TIKA-2949) Update Jackson to 2.9.10

2019-10-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950247#comment-16950247
 ] 

ASF GitHub Bot commented on TIKA-2949:
--

alexott commented on issue #283: TIKA-2949 - Update Jackson to 2.9.10
URL: https://github.com/apache/tika/pull/283#issuecomment-541400442
 
 
   it's already in the master.
 



> Update Jackson to 2.9.10
> 
>
> Key: TIKA-2949
> URL: https://issues.apache.org/jira/browse/TIKA-2949
> Project: Tika
>  Issue Type: Bug
>Reporter: Colm O hEigeartaigh
>Priority: Major
> Fix For: 2.0.0
>
>
> Jackson should be updated to the latest 2.9.10 version to pick up fixes for 
> some CVEs (e.g. CVE-2019-14540)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2964) Upgrade Jackson Databind dependency to 2.9.10.1 or 2.10.0 to fix latest CVEs

2019-10-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950254#comment-16950254
 ] 

ASF GitHub Bot commented on TIKA-2964:
--

alexott commented on pull request #287: [TIKA-2964] Upgrade Jackson Databind to 
2.10.0 to fix latest CVEs
URL: https://github.com/apache/tika/pull/287
 
 
   
 



> Upgrade Jackson Databind dependency to 2.9.10.1 or 2.10.0 to fix latest CVEs
> 
>
> Key: TIKA-2964
> URL: https://issues.apache.org/jira/browse/TIKA-2964
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Alex Ott
>Priority: Major
>
> When compiling the latest version of the source code, the following error is 
> reported:
> {noformat}
> [ERROR] Failed to execute goal 
> org.sonatype.ossindex.maven:ossindex-maven-plugin:3.0.4:audit 
> (audit-dependencies) on project tika-parsers: Detected 1 vulnerable 
> components:
> [ERROR]   com.fasterxml.jackson.core:jackson-databind:jar:2.9.10:compile; 
> https://ossindex.sonatype.org/component/pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10
> [ERROR] * [CVE-2019-16943] A Polymorphic Typing issue was discovered in 
> FasterXML jackson-databind 2.0.0 th... (0.0); 
> https://ossindex.sonatype.org/vuln/f4f0c103-c9d9-4308-bd8f-489f2a632680
> [ERROR] * [CVE-2019-16942] A Polymorphic Typing issue was discovered in 
> FasterXML jackson-databind 2.0.0 th... (0.0); 
> https://ossindex.sonatype.org/vuln/07632245-fcef-4eb3-82b6-aadbbfd2b33e
> {noformat}
> We need to bump the version once 2.9.10.1 is released, or consider switching 
> to 2.10, which isn't vulnerable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2953) Vulnerable "commons-compress : 1.18" is present in tika-bundle 1.22.

2019-10-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950294#comment-16950294
 ] 

ASF GitHub Bot commented on TIKA-2953:
--

alexott commented on pull request #288: TIKA-2953 bump version of 
commons-compress to fix CVE
URL: https://github.com/apache/tika/pull/288
 
 
   
 



> Vulnerable "commons-compress : 1.18" is present in tika-bundle 1.22.  
> -
>
> Key: TIKA-2953
> URL: https://issues.apache.org/jira/browse/TIKA-2953
> Project: Tika
>  Issue Type: Bug
>Reporter: Aman Mishra
>Priority: Major
>
> commons-compress 1.18 is present in the tika-bundle 1.22 jar. The latest 
> commons-compress release, 1.19, is not vulnerable.
>  
> Please confirm whether CVE-2019-12402 affects Tika. Also, can we upgrade 
> commons-compress from 1.18 to 1.19 locally after downloading the Tika source 
> code? Are there any obstacles to doing so?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2949) Update Jackson to 2.9.10

2019-10-14 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950810#comment-16950810
 ] 

ASF GitHub Bot commented on TIKA-2949:
--

coheigea commented on pull request #283: TIKA-2949 - Update Jackson to 2.9.10
URL: https://github.com/apache/tika/pull/283
 
 
   
 



> Update Jackson to 2.9.10
> 
>
> Key: TIKA-2949
> URL: https://issues.apache.org/jira/browse/TIKA-2949
> Project: Tika
>  Issue Type: Bug
>Reporter: Colm O hEigeartaigh
>Priority: Major
> Fix For: 2.0.0
>
>
> Jackson should be updated to the latest 2.9.10 version to pick up fixes for 
> some CVEs (e.g. CVE-2019-14540)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2019-10-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956463#comment-16956463
 ] 

ASF GitHub Bot commented on TIKA-2624:
--

epugh commented on issue #232: Fix for TIKA-2624 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/232#issuecomment-544718260
 
 
   Interesting patch. Do you have any examples of this being an issue that you 
can share? I never really thought about having them be the same.
 



> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> -
>
> Key: TIKA-2624
> URL: https://issues.apache.org/jira/browse/TIKA-2624
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ewan Mellor
>Assignee: Tim Allison
>Priority: Major
>
> Tika has two properties in {{PDFParser.properties}} that control what happens 
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
> for OCR.  These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default 
> 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which uses it as the 
> metadata in the image (i.e. it doesn't control scaling at all, it's just an 
> advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which 
> uses it to specify the scale for rendering.  This value is such that 1.0 == 
> 72dpi, and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then 
> advertising 300dpi in the image metadata.  This makes no sense to me, and is 
> surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same 
> DPI value in both places.
> We should keep the existing default DPI value, since Tesseract is trained at 
> 300dpi by default, so this will mean that all stages between PDFRenderer and 
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the 
> PDF rendering and Tesseract will be 4x larger (144dpi to 300dpi).  This will 
> have a memory and temporary disk space impact, but I think that it's still 
> best to have the whole pipeline using 300dpi.  People who have memory 
> constraints will need to reduce ocrDPI and make the corresponding changes on 
> the Tesseract side.
>  
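
The arithmetic in the description above can be checked with a small sketch. It relies only on the relationship stated in the issue (scale 1.0 == 72dpi); the class and method names are hypothetical and are not part of Tika or PDFBox.

```java
// Hypothetical helper, not Tika or PDFBox code: it only reproduces the
// scale/DPI arithmetic described in TIKA-2624.
class OcrDpiMath {
    // Per the issue text, a render scale of 1.0 corresponds to 72 dpi.
    static final float DPI_AT_SCALE_ONE = 72f;

    /** DPI that results from rendering at the given scale factor. */
    public static float effectiveDpi(float scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    /** Scale factor needed to render at the given DPI. */
    public static float scaleForDpi(float dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    /** Ratio of pixel counts between two DPIs; pixel count grows with the square of DPI. */
    public static double pixelCountRatio(float fromDpi, float toDpi) {
        double linear = toDpi / (double) fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        System.out.println(effectiveDpi(2.0f));          // 144.0 (the default ocrImageScale)
        System.out.println(scaleForDpi(300f));           // roughly 4.17
        System.out.println(pixelCountRatio(144f, 300f)); // roughly 4.34
    }
}
```

Under these assumptions, the "4x larger" figure in the description is a slight underestimate: going from 144dpi to 300dpi multiplies the pixel count by about 4.3.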




[jira] [Commented] (TIKA-2581) testOCROutputsHOCR fails with Tesseract 4.0

2019-10-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956468#comment-16956468
 ] 

ASF GitHub Bot commented on TIKA-2581:
--

epugh commented on issue #221: Fix for TIKA-2581 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/221#issuecomment-544719955
 
 
   At this point, does it make sense to support Tesseract 3 when running tests? 
Maybe update the documentation at 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR to note that the output 
format is slightly different?
 



> testOCROutputsHOCR fails with Tesseract 4.0
> ---
>
> Key: TIKA-2581
> URL: https://issues.apache.org/jira/browse/TIKA-2581
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ewan Mellor
>Priority: Minor
>
> TesseractOCRParserTest.testOCROutputsHOCR fails with Tesseract 4.0.
> With 3.x, the output is Happy but with 4.0 the output is 
> Happy.
>  




[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2019-10-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956572#comment-16956572
 ] 

ASF GitHub Bot commented on TIKA-2624:
--

ewanmellor commented on issue #232: Fix for TIKA-2624 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/232#issuecomment-544766754
 
 
   I don't have examples that I can share, sorry.  This work was for a 
proprietary product, and I am no longer working on it.
   
   I remember it making an absolutely enormous difference in OCR quality though 
(with Tesseract 4).
 


[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2019-10-22 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957364#comment-16957364
 ] 

ASF GitHub Bot commented on TIKA-2624:
--

tballison commented on pull request #232: Fix for TIKA-2624 contributed by 
ewanmellor.
URL: https://github.com/apache/tika/pull/232
 
 
   
 


[jira] [Commented] (TIKA-2953) Vulnerable "commons-compress : 1.18" is present in tika-bundle 1.22.

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958104#comment-16958104
 ] 

ASF GitHub Bot commented on TIKA-2953:
--

tballison commented on issue #288: TIKA-2953 bump version of commons-compress 
to fix CVE
URL: https://github.com/apache/tika/pull/288#issuecomment-545567624
 
 
   Thank you!
 



> Vulnerable "commons-compress : 1.18" is present in tika-bundle 1.22.  
> -
>
> Key: TIKA-2953
> URL: https://issues.apache.org/jira/browse/TIKA-2953
> Project: Tika
>  Issue Type: Bug
>Reporter: Aman Mishra
>Priority: Major
>
> commons-compress 1.18 is bundled in the tika-bundle 1.22 jar. The latest 
> commons-compress release, 1.19, is not vulnerable.
>  
> Please confirm whether CVE-2019-12402 affects Tika.
> Also, can we upgrade commons-compress from 1.18 to 1.19 locally after 
> downloading the Tika source code? Are there any known obstacles to doing so?




[jira] [Commented] (TIKA-2953) Vulnerable "commons-compress : 1.18" is present in tika-bundle 1.22.

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958103#comment-16958103
 ] 

ASF GitHub Bot commented on TIKA-2953:
--

tballison commented on pull request #288: TIKA-2953 bump version of 
commons-compress to fix CVE
URL: https://github.com/apache/tika/pull/288
 
 
   
 


[jira] [Commented] (TIKA-2964) Upgrade Jackson Databind dependency to 2.9.10.1 or 2.10.0 to fix latest CVEs

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958108#comment-16958108
 ] 

ASF GitHub Bot commented on TIKA-2964:
--

tballison commented on pull request #287: [TIKA-2964] Upgrade Jackson Databind 
to 2.10.0 to fix latest CVEs
URL: https://github.com/apache/tika/pull/287
 
 
   
 



> Upgrade Jackson Databind dependency to 2.9.10.1 or 2.10.0 to fix latest CVEs
> 
>
> Key: TIKA-2964
> URL: https://issues.apache.org/jira/browse/TIKA-2964
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Alex Ott
>Priority: Major
>
> When compiling the latest version of the source code, the following error is 
> reported:
> {noformat}
> [ERROR] Failed to execute goal 
> org.sonatype.ossindex.maven:ossindex-maven-plugin:3.0.4:audit 
> (audit-dependencies) on project tika-parsers: Detected 1 vulnerable 
> components:
> [ERROR]   com.fasterxml.jackson.core:jackson-databind:jar:2.9.10:compile; 
> https://ossindex.sonatype.org/component/pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10
> [ERROR] * [CVE-2019-16943] A Polymorphic Typing issue was discovered in 
> FasterXML jackson-databind 2.0.0 th... (0.0); 
> https://ossindex.sonatype.org/vuln/f4f0c103-c9d9-4308-bd8f-489f2a632680
> [ERROR] * [CVE-2019-16942] A Polymorphic Typing issue was discovered in 
> FasterXML jackson-databind 2.0.0 th... (0.0); 
> https://ossindex.sonatype.org/vuln/07632245-fcef-4eb3-82b6-aadbbfd2b33e
> {noformat}
> We need to bump the version once 2.9.10.1 is released, or consider switching 
> to 2.10, which isn't vulnerable.




[jira] [Commented] (TIKA-2964) Upgrade Jackson Databind dependency to 2.9.10.1 or 2.10.0 to fix latest CVEs

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958107#comment-16958107
 ] 

ASF GitHub Bot commented on TIKA-2964:
--

tballison commented on issue #287: [TIKA-2964] Upgrade Jackson Databind to 
2.10.0 to fix latest CVEs
URL: https://github.com/apache/tika/pull/287#issuecomment-545568253
 
 
   Done. Thank you!
 


[jira] [Commented] (TIKA-2954) Remove Magic Numbers from ImageMetadataExtractor

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958109#comment-16958109
 ] 

ASF GitHub Bot commented on TIKA-2954:
--

tballison commented on pull request #284: Fix for TIKA-2954 contributed by 
chriszappia
URL: https://github.com/apache/tika/pull/284
 
 
   
 



> Remove Magic Numbers from ImageMetadataExtractor
> 
>
> Key: TIKA-2954
> URL: https://issues.apache.org/jira/browse/TIKA-2954
> Project: Tika
>  Issue Type: Improvement
>Reporter: Chris Z
>Priority: Trivial
>
> There are magic numbers used in 
> {{ImageMetadataExtractor.TiffPageNumberHandler}}, and a comment suggesting 
> they be removed once the MetadataExtractor dependency has been updated.
> Since this has happened, these can be cleaned up.




[jira] [Commented] (TIKA-2954) Remove Magic Numbers from ImageMetadataExtractor

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958110#comment-16958110
 ] 

ASF GitHub Bot commented on TIKA-2954:
--

tballison commented on issue #284: Fix for TIKA-2954 contributed by chriszappia
URL: https://github.com/apache/tika/pull/284#issuecomment-545568851
 
 
   Thank you!
 


[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2019-10-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958116#comment-16958116
 ] 

ASF GitHub Bot commented on TIKA-2630:
--

tballison commented on issue #255: TIKA-2630: Wrong height and width metadata 
for JPEG images
URL: https://github.com/apache/tika/pull/255#issuecomment-545569640
 
 
   Unless there are objections, let's put this in 1.23?
 



> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73&zoom=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>     metadata.set(Metadata.IMAGE_WIDTH,
>         trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
> }
> // Check for the height tag here, not the width tag a second time.
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>     metadata.set(Metadata.IMAGE_LENGTH,
>         trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
> }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image; these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels
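
The preference described above (take the dimensions from the Exif SubIFD rather than from IFD0) can be sketched without any library. Everything here is hypothetical: the class, the `Size` holder, and in particular the fallback order after "Exif SubIFD" are assumptions for illustration, not what Tika or metadata-extractor actually do. The directory names and numbers are the ones from the sample listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not Tika or metadata-extractor code.
class JpegDimensionChooser {
    /** Simple width/height pair for one metadata directory. */
    static final class Size {
        final int width;
        final int height;

        Size(int width, int height) {
            this.width = width;
            this.height = height;
        }

        @Override
        public String toString() {
            return width + "x" + height;
        }
    }

    // Assumed lookup order: Exif SubIFD first (PixelXDimension/PixelYDimension),
    // then the JPEG directory, and IFD0 only as a last resort.
    private static final String[] PRIORITY = {"Exif SubIFD", "JPEG", "Exif IFD0"};

    /** Returns the size from the highest-priority directory present, or null if none is. */
    public static Size choose(Map<String, Size> byDirectory) {
        for (String directory : PRIORITY) {
            Size size = byDirectory.get(directory);
            if (size != null) {
                return size;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Size> sample = new LinkedHashMap<>();
        sample.put("JPEG", new Size(1535, 367));
        sample.put("Exif IFD0", new Size(2173, 520));
        sample.put("Exif SubIFD", new Size(1535, 367));
        System.out.println(choose(sample)); // 1535x367
    }
}
```

With the sample values, this picks 1535x367 instead of the 2173x520 pair that currently ends up in `tiff:ImageWidth` and `tiff:ImageLength`.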




[jira] [Commented] (TIKA-2975) XLIFF 1.2 Parser

2019-10-26 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960478#comment-16960478
 ] 

ASF GitHub Bot commented on TIKA-2975:
--

dameikle commented on pull request #293: TIKA-2975: Add parser for XLIFF v1.2 
files
URL: https://github.com/apache/tika/pull/293
 
 
   
 



> XLIFF 1.2 Parser
> 
>
> Key: TIKA-2975
> URL: https://issues.apache.org/jira/browse/TIKA-2975
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
>
> Basic parser for XLIFF 1.2 files




[jira] [Commented] (TIKA-2975) XLIFF 1.2 Parser

2019-10-26 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960480#comment-16960480
 ] 

ASF GitHub Bot commented on TIKA-2975:
--

dameikle commented on pull request #293: TIKA-2975: Add parser for XLIFF v1.2 
files
URL: https://github.com/apache/tika/pull/293
 
 
   
 


[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2019-10-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960544#comment-16960544
 ] 

ASF GitHub Bot commented on TIKA-2630:
--

dameikle commented on issue #255: TIKA-2630: Wrong height and width metadata 
for JPEG images
URL: https://github.com/apache/tika/pull/255#issuecomment-546676220
 
 
   @tballison - I agree, let's go with this one
 


[jira] [Commented] (TIKA-2900) Removing comments from .docx, .pdf files

2019-10-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960546#comment-16960546
 ] 

ASF GitHub Bot commented on TIKA-2900:
--

dameikle commented on pull request #294: TIKA-2900: Add ability to exclude 
comments in Word extraction
URL: https://github.com/apache/tika/pull/294
 
 
   Adds an option in OfficeParserConfig to allow comments to be explicitly 
included or excluded in Word extractions. The default remains as included for 
backwards compatibility.
 



> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx, 
> Document_with_Comments_Text_extarction_Tika_APP.docx.txt
>
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. 
> Sometimes a file contains comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments when extracting 
> text?
> Here is the code I am using:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}




[jira] [Commented] (TIKA-2894) Add support for WebAssembly (Content-Type application/wasm, or .wasm extension)

2019-10-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961518#comment-16961518
 ] 

ASF GitHub Bot commented on TIKA-2894:
--

dameikle commented on pull request #295: TIKA-2894: Add mime type detection 
support for WebAssembly
URL: https://github.com/apache/tika/pull/295
 
 
   
 



> Add support for WebAssembly (Content-Type application/wasm, or .wasm 
> extension)
> ---
>
> Key: TIKA-2894
> URL: https://issues.apache.org/jira/browse/TIKA-2894
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Affects Versions: 1.21
>Reporter: Fredrik Söderström
>Assignee: Dave Meikle
>Priority: Major
>
> Right now I cannot find any support for wasm (WebAssembly) files, so I need to 
> add an external if statement in my Spring Boot project:
> {quote}String path = resource.getFile().getPath();
> if (path.endsWith(".wasm")) {
>   servletResponse.setContentType("application/wasm");
> } else {
>   servletResponse.setContentType(tika.detect(path));
> }
> {quote}
> It would be nice to add support for this format as well.
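
As a rough illustration of what detection for this format could look like, the sketch below combines the extension check from the workaround above with a check for the WebAssembly binary magic bytes (`\0asm`, i.e. 0x00 0x61 0x73 0x6D). The class and method names are hypothetical, it does not use the Tika API, and it is not necessarily how the linked pull request implements detection.

```java
import java.util.Locale;

// Hypothetical stand-alone sketch; not the Tika detector implementation.
class WasmSniffer {
    // WebAssembly binary modules begin with the 4-byte magic "\0asm".
    private static final byte[] WASM_MAGIC = {0x00, 0x61, 0x73, 0x6D};

    /** True if the given leading bytes start with the WebAssembly magic number. */
    public static boolean hasWasmMagic(byte[] prefix) {
        if (prefix == null || prefix.length < WASM_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < WASM_MAGIC.length; i++) {
            if (prefix[i] != WASM_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns application/wasm on a name or magic match, otherwise the supplied fallback type. */
    public static String detect(String fileName, byte[] prefix, String fallbackType) {
        boolean byName = fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".wasm");
        return (byName || hasWasmMagic(prefix)) ? "application/wasm" : fallbackType;
    }

    public static void main(String[] args) {
        byte[] header = {0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00};
        System.out.println(detect("module.bin", header, "application/octet-stream"));
        System.out.println(detect("notes.txt", new byte[0], "text/plain"));
    }
}
```

Checking the magic bytes as well as the name means a module served under a different file name is still recognized, which the extension-only workaround cannot do.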




[jira] [Commented] (TIKA-2894) Add support for WebAssembly (Content-Type application/wasm, or .wasm extension)

2019-10-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961531#comment-16961531
 ] 

ASF GitHub Bot commented on TIKA-2894:
--

dameikle commented on pull request #295: TIKA-2894: Add mime type detection 
support for WebAssembly
URL: https://github.com/apache/tika/pull/295
 
 
   
 


[jira] [Commented] (TIKA-2976) Add an XLZ parser

2019-10-29 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961894#comment-16961894
 ] 

ASF GitHub Bot commented on TIKA-2976:
--

dameikle commented on pull request #296: TIKA-2976: Add an XLZ Parser
URL: https://github.com/apache/tika/pull/296
 
 
   Adds basic support for XLZ archives, using the XLIFF parser to process 
the internal content.
 



> Add an XLZ parser
> -
>
> Key: TIKA-2976
> URL: https://issues.apache.org/jira/browse/TIKA-2976
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
>
> Add an XLZ parser that processes the embedded XLF content.
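
As a rough illustration of the approach described above, the sketch below locates the XLIFF entries inside an XLZ file using only the JDK. It assumes an XLZ file is a ZIP archive whose translatable content lives in entries ending in .xlf (or .xliff); the class and method names are hypothetical and this is not the code from the pull request. A real parser would hand each matching entry to the XLIFF parser instead of only collecting names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration; not the Tika XLZ parser. */
final class XlzEntries {
    private XlzEntries() {}

    /** Lists the names of XLIFF entries found in an XLZ (ZIP) stream. */
    static List<String> xliffEntryNames(InputStream xlz) throws IOException {
        List<String> names = new ArrayList<>();
        // The caller owns the stream, so the ZipInputStream is deliberately not closed here.
        ZipInputStream zip = new ZipInputStream(xlz);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String lower = entry.getName().toLowerCase(Locale.ROOT);
            if (!entry.isDirectory() && (lower.endsWith(".xlf") || lower.endsWith(".xliff"))) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```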




[jira] [Commented] (TIKA-2976) Add an XLZ parser

2019-10-29 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961895#comment-16961895
 ] 

ASF GitHub Bot commented on TIKA-2976:
--

dameikle commented on pull request #296: TIKA-2976: Add an XLZ Parser
URL: https://github.com/apache/tika/pull/296
 
 
   
 


[jira] [Commented] (TIKA-2996) Add dropThreshold to PDFParserConfig

2019-11-22 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980267#comment-16980267
 ] 

ASF GitHub Bot commented on TIKA-2996:
--

fsonntag commented on pull request #297: fix for TIKA-2996 by fsonntag
URL: https://github.com/apache/tika/pull/297
 
 
   
 



> Add dropThreshold to PDFParserConfig
> 
>
> Key: TIKA-2996
> URL: https://issues.apache.org/jira/browse/TIKA-2996
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Felix Sonntag
>Priority: Minor
>
> {{PDFTextStripper}} has the property {{dropThreshold}}, which currently 
> cannot be set when using the Tika {{PDFParser}}. Other properties of 
> {{PDFTextStripper}} can be set via {{PDFParserConfig}}, so it makes sense to 
> also expose this one as a setting.
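
The request amounts to a pass-through setting: the config object stores an optional value and forwards it to the stripper only when the user has set one. The sketch below shows that pattern with hypothetical names (PdfConfigSketch, DropThresholdTarget); it does not use the real PDFParserConfig or PDFTextStripper classes, and the only detail borrowed from PDFBox is that the stripper exposes a setDropThreshold(float) setter.

```java
/** Stand-in for the stripper's setter; hypothetical, for illustration only. */
interface DropThresholdTarget {
    void setDropThreshold(float value);
}

/** Hypothetical config holder; not the real PDFParserConfig. */
final class PdfConfigSketch {
    // null means "leave the stripper's own default untouched"
    private Float dropThreshold;

    Float getDropThreshold() {
        return dropThreshold;
    }

    void setDropThreshold(Float dropThreshold) {
        this.dropThreshold = dropThreshold;
    }

    /** Forwards the setting to the stripper only if it was explicitly configured. */
    void configure(DropThresholdTarget stripper) {
        if (dropThreshold != null) {
            stripper.setDropThreshold(dropThreshold);
        }
    }
}
```

Keeping the field nullable avoids hard-coding a copy of the library default in the config class, so an unset value never overrides whatever the stripper ships with.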




[jira] [Commented] (TIKA-2996) Add dropThreshold to PDFParserConfig

2019-11-22 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980405#comment-16980405
 ] 

ASF GitHub Bot commented on TIKA-2996:
--

tballison commented on pull request #297: fix for TIKA-2996 contributed by 
fsonntag
URL: https://github.com/apache/tika/pull/297
 
 
   
 


[jira] [Commented] (TIKA-3001) Throw TaggedIOException when we open the HWP file with the Tika-App GUI

2019-11-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982210#comment-16982210
 ] 

ASF GitHub Bot commented on TIKA-3001:
--

tandara0 commented on pull request #298: fix for TIKA-3001 contributed by 
tandara0
URL: https://github.com/apache/tika/pull/298
 
 
   
 



> Throw TaggedIOException when we open the HWP file with the Tika-App GUI
> ---
>
> Key: TIKA-3001
> URL: https://issues.apache.org/jira/browse/TIKA-3001
> Project: Tika
>  Issue Type: Bug
>  Components: gui
>Affects Versions: 1.22
>Reporter: Kim Ju Young
>Priority: Major
> Fix For: 1.23
>
> Attachments: F.hwp
>
>
> When we open the attached HWP file with the Tika-App GUI, it throws 
> TaggedIOException.
> The full exception stack trace is included below:
> org.apache.tika.io.TaggedIOException
>  at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>  at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>  at org.apache.tika.io.BoundedInputStream.read(BoundedInputStream.java:78)
>  at org.apache.tika.parser.digest.InputStreamDigester.updateDigest(InputStreamDigester.java:198)
>  at org.apache.tika.parser.digest.InputStreamDigester.digestStream(InputStreamDigester.java:179)
>  at org.apache.tika.parser.digest.InputStreamDigester.digest(InputStreamDigester.java:133)
>  at org.apache.tika.parser.digest.CompositeDigester.digest(CompositeDigester.java:46)
>  at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:82)
>  at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)
>  at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:406)
>  at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:310)
>  at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:268)
>  at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
>  at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
>  at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
>  at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
>  at javax.swing.AbstractButton.doClick(AbstractButton.java:376)
>  at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:842)
>  at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:886)
>  at java.awt.Component.processMouseEvent(Component.java:6539)
>  at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
>  at java.awt.Component.processEvent(Component.java:6304)
>  at java.awt.Container.processEvent(Container.java:2239)
>  at java.awt.Component.dispatchEventImpl(Component.java:4889)
>  at java.awt.Container.dispatchEventImpl(Container.java:2297)
>  at java.awt.Component.dispatchEvent(Component.java:4711)
>  at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
>  at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4535)
>  at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4476)
>  at java.awt.Container.dispatchEventImpl(Container.java:2283)
>  at java.awt.Window.dispatchEventImpl(Window.java:2746)
>  at java.awt.Component.dispatchEvent(Component.java:4711)
>  at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
>  at java.awt.EventQueue.access$500(EventQueue.java:97)
>  at java.awt.EventQueue$3.run(EventQueue.java:709)
>  at java.awt.EventQueue$3.run(EventQueue.java:703)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
>  at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
>  at java.awt.EventQueue$4.run(EventQueue.java:733)
>  at java.awt.EventQueue$4.run(EventQueue.java:731)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
>  at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
>  at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
>  at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
>  at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
>  at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
>  at j

[jira] [Commented] (TIKA-3001) Throw TaggedIOException when we open the HWP file with the Tika-App GUI

2019-11-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982212#comment-16982212
 ] 

ASF GitHub Bot commented on TIKA-3001:
--

tandara commented on issue #298: fix for TIKA-3001 contributed by tandara0
URL: https://github.com/apache/tika/pull/298#issuecomment-558500070
 
 
   Prevent the POIFSFileSystem from closing the underlying InputStream by 
wrapping it in a CloseShieldInputStream.
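
   For readers unfamiliar with the close-shield idea, here is a minimal JDK-only sketch of it. The class name CloseShield is hypothetical and this is a simplified stand-in, not the Commons IO implementation or the code in this pull request: closing the wrapper detaches it from the underlying stream instead of closing that stream, so code further up the call chain can keep reading.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

/** Simplified, hypothetical close shield for illustration. */
final class CloseShield extends FilterInputStream {
    CloseShield(InputStream in) {
        super(in);
    }

    /** Detaches from the wrapped stream instead of closing it. */
    @Override
    public void close() {
        // Later reads through this wrapper see end-of-stream; the wrapped stream stays open.
        in = new ByteArrayInputStream(new byte[0]);
    }
}
```

   A consumer that insists on closing its input, as the comment above describes for POIFSFileSystem, would be given new CloseShield(stream) rather than stream itself.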
 


[jira] [Commented] (TIKA-3001) Throw TaggedIOException when we open the HWP file with the Tika-App GUI

2019-11-26 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982350#comment-16982350
 ] 

ASF GitHub Bot commented on TIKA-3001:
--

tballison commented on pull request #298: fix for TIKA-3001 contributed by 
tandara0
URL: https://github.com/apache/tika/pull/298
 
 
   
 


[jira] [Commented] (TIKA-3003) Remove unused dependencies

2019-11-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984536#comment-16984536
 ] 

ASF GitHub Bot commented on TIKA-3003:
--

cesarsotovalero commented on pull request #299: fix for TIKA-3003 contributed 
by cesarsotovalero
URL: https://github.com/apache/tika/pull/299
 
 
   I'm making this pull request because I noticed that dependency 
`org.jsoup:jsoup:1.12.1` is declared in module `tika-parsers` to avoid pulling 
in a vulnerable version from edu.ucar:grib. However, this dependency is not 
used and can therefore be removed to make the `pom` clearer and the 
dependency tree of this module less complex.
   
   In addition, dependency `net.sf.ehcache:ehcache-core`, pulled in transitively 
from `edu.ucar:cdm:4.5.5`, is not used and can be excluded safely. The 
`ehcache-core` `jar` is around 1.3MB, so removing it reduces the size of 
module `tika-parsers`.
 



> Remove unused dependencies
> --
>
> Key: TIKA-3003
> URL: https://issues.apache.org/jira/browse/TIKA-3003
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.0.0
>Reporter: César Soto Valero
>Priority: Minor
> Fix For: 2.0.0
>
>
> I noticed that dependency *org.jsoup:jsoup:1.12.1* is declared in module 
> *tika-parsers* to avoid pulling in a vulnerable version from 
> *edu.ucar:grib*. However, this dependency is not used and can therefore 
> be removed to make the pom clearer and the dependency tree of this module 
> less complex.
> In addition, dependency *net.sf.ehcache:ehcache-core*, pulled in transitively 
> from *edu.ucar:cdm:4.5.5*, is not used and can be excluded safely. The 
> *ehcache-core* jar is around 1.3MB, so removing it reduces the size of 
> *tika-parsers*.




[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2019-12-02 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986268#comment-16986268
 ] 

ASF GitHub Bot commented on TIKA-2630:
--

tballison commented on pull request #255: TIKA-2630: Wrong height and width 
metadata for JPEG images
URL: https://github.com/apache/tika/pull/255
 
 
   
 



> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73&zoom=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
> // Exif SubIFD width (PixelXDimension)
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>     metadata.set(Metadata.IMAGE_WIDTH,
>         trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
> }
> // Exif SubIFD height (PixelYDimension)
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>     metadata.set(Metadata.IMAGE_LENGTH,
>         trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
> }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels




[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992729#comment-16992729
 ] 

ASF GitHub Bot commented on TIKA-2224:
--

tballison commented on issue #300: TIKA-2224 - OneNote parser
URL: https://github.com/apache/tika/pull/300#issuecomment-564126279
 
 
   Need to review statics to make sure this parser will be thread safe.
 



> OneNote formats support - Mime Magic and Parser
> ---
>
> Key: TIKA-2224
> URL: https://issues.apache.org/jira/browse/TIKA-2224
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.14
>Reporter: Nick Burch
>Priority: Major
> Attachments: Sample1.json, Sample1.one, note-ssn-test-.one
>
>
> As raised at 
> http://stackoverflow.com/questions/41272195/onenote-support-for-apache-tika-parsers,
>  we don't have any magic for the OneNote formats. Several years ago we dug 
> out the file format specs (see 
> http://lucene.472066.n3.nabble.com/Tika-OneNote-Support-td4020393.html), but 
> didn't have volunteer energy to implement a parser. However, armed with those 
> specs, we should be able to come up with some mime magic for detection




[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992736#comment-16992736
 ] 

ASF GitHub Bot commented on TIKA-2224:
--

tballison commented on issue #300: TIKA-2224 - OneNote parser
URL: https://github.com/apache/tika/pull/300#issuecomment-564126279
 
 
   Need to review statics to make sure this parser will be thread safe.
   Remove json unless critical.
   
   I'm working on these and a few of the above now.
 


[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992734#comment-16992734
 ] 

ASF GitHub Bot commented on TIKA-2224:
--

tballison commented on issue #300: TIKA-2224 - OneNote parser
URL: https://github.com/apache/tika/pull/300#issuecomment-564126279
 
 
   Need to review statics to make sure this parser will be thread safe.
   Remove json unless critical.
 


[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992881#comment-16992881
 ] 

ASF GitHub Bot commented on TIKA-2224:
--

tballison commented on pull request #300: TIKA-2224 - OneNote parser
URL: https://github.com/apache/tika/pull/300
 
 
   
 

[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-12 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994865#comment-16994865
 ] 

ASF GitHub Bot commented on TIKA-2224:
--

nddipiazza commented on pull request #303: TIKA-2224 OneNote parser support
URL: https://github.com/apache/tika/pull/303
 
 
   # OneNote parser
   
   The following adds `.one` file format parsing support. 
   `application/onenote; format=one`
   
   Supports embedded documents as well. 
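
   The media type string above carries a `format` parameter next to the base type. As a rough illustration only, here is a minimal JDK-only sketch of splitting such a string into its base type and parameters. The class `MediaTypeParams` and its methods are hypothetical names made up for this example; they are not part of the pull request and do not use Tika's own `MediaType` class.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not Tika code.
class MediaTypeParams {

    /** Returns the part before the first ';', trimmed and lower-cased. */
    static String baseType(String mediaType) {
        int semi = mediaType.indexOf(';');
        String base = semi < 0 ? mediaType : mediaType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the name=value pairs that follow the base type, in order. */
    static Map<String, String> parameters(String mediaType) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = mediaType.split(";");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                String name = parts[i].substring(0, eq).trim().toLowerCase(Locale.ROOT);
                String value = parts[i].substring(eq + 1).trim();
                params.put(name, value);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String type = "application/onenote; format=one";
        System.out.println(baseType(type));
        System.out.println(parameters(type));
    }
}
```

   Keeping the parameter separate from the base type is what would let a caller tell `.one` files apart from other OneNote variants that share `application/onenote`.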
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2019-12-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997545#comment-16997545
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

chrismattmann commented on issue #305: WIP: TIKA-3010 Install and run 
Tika-Server as a Service
URL: https://github.com/apache/tika/pull/305#issuecomment-566190422
 
 
   LGTM!
 



> Tika needs service installation script 
> ---
>
> Key: TIKA-3010
> URL: https://issues.apache.org/jira/browse/TIKA-3010
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.23
>Reporter: David Eric Pugh
>Priority: Major
>
> With motion towards removing the tight integration of Tika into Solr, and the 
> fact that many folks deploy Tika-Server as a microservice, we should have a 
> community supported way of installing Tika.
> I'm thinking of something modeled on what Solr does: 
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script




[jira] [Commented] (TIKA-3010) Tika needs service installation script

2019-12-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997666#comment-16997666
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

epugh commented on issue #305: WIP: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-566260935
 
 
   Thanks, @chrismattmann. I suspect that adding this will flush out some 
other deploy-related issues, but I'm excited to have this done.
 


[jira] [Commented] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-12-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997668#comment-16997668
 ] 

ASF GitHub Bot commented on TIKA-2224:
--

tballison commented on pull request #303: TIKA-2224 OneNote parser support
URL: https://github.com/apache/tika/pull/303
 
 
   
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2019-12-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997672#comment-16997672
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

tballison commented on issue #305: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-566267570
 
 
   Oh, this is great!  Let me play with it a bit.
   
   Any objection to making -spawnChild the default?
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2019-12-17 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998423#comment-16998423
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

epugh commented on issue #305: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-566674939
 
 
   I think that is a great idea. I did not want to get *too* ambitious; 
however, please push your changes. I think I set the branch up so that you can 
push changes if you want.
   
   
   
 


[jira] [Commented] (TIKA-3014) XLIFF12Parser fails with ToXMLHandler

2019-12-18 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999581#comment-16999581
 ] 

ASF GitHub Bot commented on TIKA-3014:
--

dameikle commented on pull request #306: TIKA-3014: Update to fix XLIFF12Parser 
failures with ToXMLHandler
URL: https://github.com/apache/tika/pull/306
 
 
   
 



> XLIFF12Parser fails with ToXMLHandler 
> --
>
> Key: TIKA-3014
> URL: https://issues.apache.org/jira/browse/TIKA-3014
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> XLIFF12Parser fails with ToXMLHandler because xml namespace isn't set, but is 
> needed for "xml:lang".
> One option would be to remove the namespace on the lang attribute?
> [~dmeikle], any recommendations?
> To see the problem:
> 1) Make XLIFF12ParserTest extend TikaTest
> 2) add this test:
> {noformat}
> @Test
> public void testToXMLHandler() throws Exception {
>     String xml = getXML("testXLIFF12.xlf").xml;
>     assertContains("Another trans-unit", xml);
>     assertContains("Un autre trans-unit", xml);
> }
> {noformat}
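
The `xml` prefix is reserved and always bound to `http://www.w3.org/XML/1998/namespace`. The sketch below uses only the JDK's SAX helpers, not Tika's handlers, to show how a `lang` attribute looks to a namespace-aware consumer with and without that namespace URI. `XmlLangAttribute` and its methods are hypothetical names for illustration, and this is one possible reading of the failure described above, not the fix applied in the pull request.

```java
import javax.xml.XMLConstants;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration; not code from Tika or from the pull request.
class XmlLangAttribute {

    /** Builds a single xml:lang attribute, with or without the reserved XML namespace URI. */
    static AttributesImpl langAttribute(boolean withNamespace, String lang) {
        String uri = withNamespace ? XMLConstants.XML_NS_URI : "";
        AttributesImpl atts = new AttributesImpl();
        // Arguments: namespace URI, local name, qualified name, type, value.
        atts.addAttribute(uri, "lang", "xml:lang", "CDATA", lang);
        return atts;
    }

    /** Looks the attribute up by (namespace URI, local name), as a namespace-aware consumer would. */
    static String lookupByNamespace(boolean withNamespace) {
        return langAttribute(withNamespace, "fr")
                .getValue(XMLConstants.XML_NS_URI, "lang");
    }

    public static void main(String[] args) {
        System.out.println("with namespace:    " + lookupByNamespace(true));
        System.out.println("without namespace: " + lookupByNamespace(false));
    }
}
```

When the attribute is added with an empty namespace URI, the namespace-aware lookup finds nothing even though the qualified name still reads `xml:lang`. A handler that relies on the URI is affected by this kind of mismatch.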




[jira] [Commented] (TIKA-3010) Tika needs service installation script

2020-01-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010131#comment-17010131
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

epugh commented on issue #305: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-571795129
 
 
   @tballison I've made `-spawnChild` a default option, and you can adjust 
exactly what you want via the `TIKA_SPAWN_CHILD_OPTS` 
setting.
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2020-02-03 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028935#comment-17028935
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

tballison commented on issue #305: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-581414001
 
 
   @epugh any last commits?  Is this ready to go?
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2020-02-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030191#comment-17030191
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

epugh commented on issue #305: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-582157359
 
 
   It’s ready for commit!!!
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2020-02-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030196#comment-17030196
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

tballison commented on pull request #305: TIKA-3010 Install and run Tika-Server 
as a Service
URL: https://github.com/apache/tika/pull/305
 
 
   
 


[jira] [Commented] (TIKA-3010) Tika needs service installation script

2020-02-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030197#comment-17030197
 ] 

ASF GitHub Bot commented on TIKA-3010:
--

tballison commented on issue #305: TIKA-3010 Install and run Tika-Server as a 
Service
URL: https://github.com/apache/tika/pull/305#issuecomment-582160103
 
 
   Thank you, @epugh!!!
 


[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030942#comment-17030942
 ] 

ASF GitHub Bot commented on TIKA-3037:
--

epugh commented on pull request #310: TIKA-3037 Update docs for Tika Server
URL: https://github.com/apache/tika/pull/310
 
 
   Introduce the Docker image, bump some versions, and link to the right place 
on the wiki.
   
   I changed all the versions to 1.24, but maybe everything needs to be 
2.0-SNAPSHOT?
 



> Tika Docs should highlight Tika-Server
> --
>
> Key: TIKA-3037
> URL: https://issues.apache.org/jira/browse/TIKA-3037
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.23
>Reporter: David Eric Pugh
>Priority: Major
> Attachments: gettingstarted.apt.patch
>
>
> Currently the Tika website and many of the project docs don't surface the 
> Tika Server project.  This is a ticket to track fixing those issues.




[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-06 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032033#comment-17032033
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

epugh commented on pull request #311: TIKA-3039 Remove dockerfile:build mvn goal
URL: https://github.com/apache/tika/pull/311
 
 
   Per mailing list discussion, remove mvn goal.
 



> Remove mvn dockerfile:build goal from tika-server
> -
>
> Key: TIKA-3039
> URL: https://issues.apache.org/jira/browse/TIKA-3039
> Project: Tika
>  Issue Type: Task
>  Components: server
>Affects Versions: 1.23
>Reporter: David Eric Pugh
>Priority: Major
>
> Per 
> https://lucene.472066.n3.nabble.com/Do-we-have-a-community-supported-approach-for-deploying-Tika-Server-in-production-tp4453263p4455044.html
>  drop the mvn dockerfile:build goal.
> It's an unsupported plugin, and we have a new tika-docker project!




[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033837#comment-17033837
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

epugh commented on issue #311: TIKA-3039 Remove dockerfile:build mvn goal
URL: https://github.com/apache/tika/pull/311#issuecomment-584277746
 
 
   @dameikle any chance of a review?
 


[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033839#comment-17033839
 ] 

ASF GitHub Bot commented on TIKA-3037:
--

epugh commented on issue #310: TIKA-3037 Update docs for Tika Server
URL: https://github.com/apache/tika/pull/310#issuecomment-584278297
 
 
   @tballison thoughts on these changes?
 


[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034513#comment-17034513
 ] 

ASF GitHub Bot commented on TIKA-3037:
--

tballison commented on pull request #310: TIKA-3037 Update docs for Tika Server
URL: https://github.com/apache/tika/pull/310
 
 
   
 


[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-24 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043788#comment-17043788
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

tballison commented on issue #311: TIKA-3039 Remove dockerfile:build mvn goal
URL: https://github.com/apache/tika/pull/311#issuecomment-590515079
 
 
   @dameikle, unless you have objections, I'll merge this shortly.
 


[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-24 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043789#comment-17043789
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

tballison commented on issue #311: TIKA-3039 Remove dockerfile:build mvn goal
URL: https://github.com/apache/tika/pull/311#issuecomment-590515154
 
 
   @epugh, would you mind fixing the conflicts? Sorry!
 


[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044951#comment-17044951
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

epugh commented on issue #311: TIKA-3039 Remove dockerfile:build mvn goal
URL: https://github.com/apache/tika/pull/311#issuecomment-591099155
 
 
   Okay @tballison, it was a small change; ready for merge.
 


[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044961#comment-17044961
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

tballison commented on issue #311: TIKA-3039 Remove dockerfile:build mvn goal
URL: https://github.com/apache/tika/pull/311#issuecomment-591103494
 
 
   Thank you!
 


[jira] [Commented] (TIKA-3039) Remove mvn dockerfile:build goal from tika-server

2020-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044960#comment-17044960
 ] 

ASF GitHub Bot commented on TIKA-3039:
--

tballison commented on pull request #311: TIKA-3039 Remove dockerfile:build mvn 
goal
URL: https://github.com/apache/tika/pull/311
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove mvn dockerfile:build goal from tika-server
> -
>
> Key: TIKA-3039
> URL: https://issues.apache.org/jira/browse/TIKA-3039
> Project: Tika
>  Issue Type: Task
>  Components: server
>Affects Versions: 1.23
>Reporter: David Eric Pugh
>Priority: Major
>
> Per 
> https://lucene.472066.n3.nabble.com/Do-we-have-a-community-supported-approach-for-deploying-Tika-Server-in-production-tp4453263p4455044.html
>  drop the mvn dockerfile:build goal.
> It's an unsupported plugin, and we have a new tika-docker project!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3077) OneNote parser - very inefficient when parsing OneNote <= 2007 files

2020-03-24 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066317#comment-17066317
 ] 

ASF GitHub Bot commented on TIKA-3077:
--

nddipiazza commented on pull request #314: address TIKA-3077 - very slow 
parsing performance on OneNote <= 2007 docs.
URL: https://github.com/apache/tika/pull/314
 
 
   The OneNote 2007 code I wrote did not account for the direct file resource 
utility having no byte buffer. Setting the position on the stream over and over 
again while parsing OneNote 2007 bytes was therefore extremely inefficient. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OneNote parser - very inefficient when parsing OneNote <= 2007 files
> 
>
> Key: TIKA-3077
> URL: https://issues.apache.org/jira/browse/TIKA-3077
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> The code I put in place for OneNote 2007 files is horribly inefficient. I 
> hadn't realized that the OneNoteDirectFileResource that I extracted from 
> another parser was not buffering the bytes. So every time I did a set 
> position, it was very expensive. 
> The fix is to buffer the bytes into chunks and operate on them instead. 
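
The chunk-buffering idea described above can be sketched roughly as follows. This is an illustration only, not the code from the pull request: ChunkedReader, Source and readAt are hypothetical names, and Source merely stands in for an unbuffered resource where every positional read is expensive. Setting the position becomes a plain field update, and the underlying resource is touched only when the requested byte lies outside the cached chunk.

```java
/** Hypothetical sketch; none of these names come from the Tika code base. */
public class ChunkedReader {

    /** Stand-in for an unbuffered resource where every positional read is costly. */
    public interface Source {
        /** Copies up to dst.length bytes starting at offset into dst; returns the count, or -1 at end of data. */
        int readAt(long offset, byte[] dst);
    }

    private final Source source;
    private final byte[] chunk;
    private long chunkStart = 0;  // offset in the source of chunk[0]
    private int chunkLength = 0;  // number of valid bytes in chunk; 0 means nothing is cached
    private long position = 0;

    /** chunkSize is assumed to be positive. */
    public ChunkedReader(Source source, int chunkSize) {
        this.source = source;
        this.chunk = new byte[chunkSize];
    }

    /** Moving the position is just a field update; no I/O happens here. */
    public void position(long newPosition) {
        if (newPosition < 0) {
            throw new IllegalArgumentException("negative position");
        }
        this.position = newPosition;
    }

    /** Returns the next byte as 0..255, or -1 past the end of the data. */
    public int read() {
        boolean cached = position >= chunkStart && position < chunkStart + chunkLength;
        if (!cached) {
            // Align to a chunk boundary so that nearby seeks land in the same chunk.
            chunkStart = (position / chunk.length) * chunk.length;
            chunkLength = Math.max(source.readAt(chunkStart, chunk), 0);
            if (position >= chunkStart + chunkLength) {
                return -1;
            }
        }
        return chunk[(int) (position++ - chunkStart)] & 0xFF;
    }
}
```

With a chunk size of 64, for example, seeking to offset 10 and then back to offset 5 costs a single call to the underlying source, because both offsets fall in the same cached chunk.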



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3077) OneNote parser - very inefficient when parsing OneNote <= 2007 files

2020-03-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070996#comment-17070996
 ] 

ASF GitHub Bot commented on TIKA-3077:
--

tballison commented on pull request #314: address TIKA-3077 - very slow parsing 
performance on OneNote <= 2007 docs.
URL: https://github.com/apache/tika/pull/314
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OneNote parser - very inefficient when parsing OneNote <= 2007 files
> 
>
> Key: TIKA-3077
> URL: https://issues.apache.org/jira/browse/TIKA-3077
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> The code I put in place for OneNote 2007 files is horribly inefficient. I 
> hadn't realized that the OneNoteDirectFileResource that I extracted from 
> another parser was not buffering the bytes. So every time I did a set 
> position, it was very expensive. 
> The fix is to buffer the bytes into chunks and operate on them instead. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-04-06 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076133#comment-17076133
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

lewismc commented on pull request #315: TIKA-3082 OpenAPI for tika-server
URL: https://github.com/apache/tika/pull/315
 
 
   _PLEASE DO NOT REVIEW JUST NOW THIS IS A WIP_
   
   Hi folks, this is a progress update on 
[TIKA-3082](https://issues.apache.org/jira/browse/TIKA-3082).
   So far I've covered all **detector**, **information**, **language** and 
**metadata** resources. 
   Work still to do
   * create entries for **recursive metadata and content**, **tika**, 
**translate** and **unpack** resources.
   * provide descriptions for absolutely everything
   * run IBM's validator and linter to improve the quality of the OpenAPI
   
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3089) Text should be wrapped in pre-tags instead of in p-tags

2020-04-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082318#comment-17082318
 ] 

ASF GitHub Bot commented on TIKA-3089:
--

pweerd commented on pull request #317: fix for TIKA-3089 contributed by 
pvanderweerd
URL: https://github.com/apache/tika/pull/317
 
 
   Wrapping text in pre-tags instead of p-tags will preserve formatting much 
better
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Text should be wrapped in pre-tags instead of in p-tags
> ---
>
> Key: TIKA-3089
> URL: https://issues.apache.org/jira/browse/TIKA-3089
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.24
>Reporter: Peter van der Weerd
>Priority: Minor
>
> Currently text is treated as normal HTML, which badly damages the layout: 
> line endings are not honored, the font is not fixed-width, etc.
>  
> By wrapping in pre-tags, the layout will be much better.
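
To make the difference concrete, here is a minimal sketch of the two wrappings. PreWrapper and its methods are hypothetical names used only for illustration; this is not the TextAndCSVParser change from the pull request. An HTML renderer collapses runs of whitespace and line breaks inside a p element, whereas a pre element keeps them and is normally shown in a fixed-width font.

```java
/** Hypothetical helper for illustration; not part of Tika. */
public class PreWrapper {

    /** Escapes the characters that would otherwise be read as markup. */
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** A renderer will collapse the line breaks and indentation inside this element. */
    public static String asParagraph(String text) {
        return "<p>" + escape(text) + "</p>";
    }

    /** A renderer keeps line breaks and indentation inside this element as written. */
    public static String asPre(String text) {
        return "<pre>" + escape(text) + "</pre>";
    }
}
```

Both methods emit the same characters between the tags; only the element, and therefore the rendering, differs.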



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3089) Text should be wrapped in pre-tags instead of in p-tags

2020-04-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082377#comment-17082377
 ] 

ASF GitHub Bot commented on TIKA-3089:
--

tballison commented on pull request #317: fix for TIKA-3089 contributed by 
pvanderweerd
URL: https://github.com/apache/tika/pull/317#discussion_r407521811
 
 

 ##
 File path: 
tika-parsers/src/main/java/org/apache/tika/parser/csv/TextAndCSVParser.java
 ##
 @@ -306,7 +306,6 @@ private CSVParams getOverride(Metadata metadata) {
 
 String delimiterString = mediaType.getParameters().get(DELIMITER);
 if (delimiterString == null) {
-return new CSVParams(mediaType, charset);
 
 Review comment:
   Why remove this?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Text should be wrapped in pre-tags instead of in p-tags
> ---
>
> Key: TIKA-3089
> URL: https://issues.apache.org/jira/browse/TIKA-3089
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.24
>Reporter: Peter van der Weerd
>Priority: Minor
>
> Currently text is treated as normal HTML, which badly damages the layout: 
> line endings are not honored, the font is not fixed-width, etc.
>  
> By wrapping in pre-tags, the layout will be much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3089) Text should be wrapped in pre-tags instead of in p-tags

2020-04-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082464#comment-17082464
 ] 

ASF GitHub Bot commented on TIKA-3089:
--

pweerd commented on pull request #317: fix for TIKA-3089 contributed by 
pvanderweerd
URL: https://github.com/apache/tika/pull/317#discussion_r407574495
 
 

 ##
 File path: 
tika-parsers/src/main/java/org/apache/tika/parser/csv/TextAndCSVParser.java
 ##
 @@ -306,7 +306,6 @@ private CSVParams getOverride(Metadata metadata) {
 
 String delimiterString = mediaType.getParameters().get(DELIMITER);
 if (delimiterString == null) {
-return new CSVParams(mediaType, charset);
 
 Review comment:
   This removal was unintentional. Apologies for not double checking the diff.
   Shall I create a new merge request?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Text should be wrapped in pre-tags instead of in p-tags
> ---
>
> Key: TIKA-3089
> URL: https://issues.apache.org/jira/browse/TIKA-3089
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.24
>Reporter: Peter van der Weerd
>Priority: Minor
>
> Currently text is treated as normal HTML, which badly damages the layout: 
> line endings are not honored, the font is not fixed-width, etc.
>  
> By wrapping in pre-tags, the layout will be much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3089) Text should be wrapped in pre-tags instead of in p-tags

2020-06-03 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125083#comment-17125083
 ] 

ASF GitHub Bot commented on TIKA-3089:
--

KranthiGV commented on a change in pull request #317:
URL: https://github.com/apache/tika/pull/317#discussion_r434687071



##
File path: 
tika-parsers/src/main/java/org/apache/tika/parser/csv/TextAndCSVParser.java
##
@@ -306,7 +306,6 @@ private CSVParams getOverride(Metadata metadata) {
 
 String delimiterString = mediaType.getParameters().get(DELIMITER);
 if (delimiterString == null) {
-return new CSVParams(mediaType, charset);

Review comment:
   You can revert it and commit the changes to your branch. It'd 
automatically show up in this PR.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Text should be wrapped in pre-tags instead of in p-tags
> ---
>
> Key: TIKA-3089
> URL: https://issues.apache.org/jira/browse/TIKA-3089
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.24
>Reporter: Peter van der Weerd
>Priority: Minor
>
> Currently text is treated as normal HTML, which badly damages the layout: 
> line endings are not honored, the font is not fixed-width, etc.
>  
> By wrapping in pre-tags, the layout will be much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3008) Word Doc/Docx Formatting Extraction - Superscript/Subscript

2020-06-14 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135116#comment-17135116
 ] 

ASF GitHub Bot commented on TIKA-3008:
--

deathy opened a new pull request #321:
URL: https://github.com/apache/tika/pull/321


   adds handling of superscript/subscript in Word parsers as described in 
TIKA-3008



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Word Doc/Docx Formatting Extraction - Superscript/Subscript
> ---
>
> Key: TIKA-3008
> URL: https://issues.apache.org/jira/browse/TIKA-3008
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Cristian Vat
>Priority: Major
>
> Word extraction from .doc/.docx doesn't handle Superscript/Subscript at all.
> This changes the extracted text: character runs that differ only in sup/sub 
> are merged together, because no tags are generated between them.
> Found to be especially problematic in case of some legal documents where 
> getting "according to Art 51" instead of "according to Art 5^1^" completely 
> changes the meaning.
>  
> The problem seems to affect parsing of both the old Word .doc and the OOXML .docx formats.
> Sub/sup can be present on actual character run or on the document style 
> assigned to a character run.
>  
> I'm already working on fixes and test documents, will comment with work in 
> progress branch.
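
The merging problem described above can be shown with a small sketch. RunMarkup, Run and VerticalAlign are hypothetical types invented for this example; they are not the POI or Tika classes and this is not the fix in the pull request. The point is only that emitting a tag around a superscript or subscript run keeps "Art 5" and "1" apart, while plain concatenation does not.

```java
import java.util.List;

/** Hypothetical model of character runs; not the POI or Tika API. */
public class RunMarkup {

    public enum VerticalAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    public static final class Run {
        final String text;
        final VerticalAlign align;

        public Run(String text, VerticalAlign align) {
            this.text = text;
            this.align = align;
        }
    }

    /** Concatenates run text with no markup, which is how the two runs end up merged. */
    public static String plainText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.text);
        }
        return sb.toString();
    }

    /** Wraps raised or lowered runs in sup/sub tags (escaping is omitted for brevity). */
    public static String toXhtml(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            switch (run.align) {
                case SUPERSCRIPT:
                    sb.append("<sup>").append(run.text).append("</sup>");
                    break;
                case SUBSCRIPT:
                    sb.append("<sub>").append(run.text).append("</sub>");
                    break;
                default:
                    sb.append(run.text);
            }
        }
        return sb.toString();
    }
}
```

For the two runs "according to Art 5" (baseline) and "1" (superscript), plainText gives "according to Art 51", while toXhtml gives "according to Art 5<sup>1</sup>".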



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2830) Detect Media type of HEIF file correctly

2020-06-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136814#comment-17136814
 ] 

ASF GitHub Bot commented on TIKA-2830:
--

tballison merged pull request #278:
URL: https://github.com/apache/tika/pull/278


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Detect Media type of HEIF file correctly
> 
>
> Key: TIKA-2830
> URL: https://issues.apache.org/jira/browse/TIKA-2830
> Project: Tika
>  Issue Type: New Feature
>  Components: detector
>Affects Versions: 1.20
>Reporter: Laurent Grangier
>Priority: Major
> Fix For: 1.23
>
>
> When using the Tika Detector, the returned media type is "video/quicktime" 
> rather than "image/heif" as expected.
> {code:java}
> new Tika().detect(heifFile);
> {code}
> If I try with one example of the following page 
> [https://nokiatech.github.io/heif/examples.html], I get the media type 
> "video/quicktime". I expect to get the correct MIME-type of HEIF (image/heif).
>  
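
For background on why the two types are easy to confuse: HEIF and QuickTime files both start with an ISO base media "ftyp" box, so a detector has to look at the brand that follows it. The sketch below illustrates that idea only. FtypSniffer is a hypothetical class, the brand table is a deliberately small subset chosen as an assumption for the example, and this is not how Tika implements detection.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical brand sniffer for illustration; not Tika's detector. */
public class FtypSniffer {

    private static final String UNKNOWN = "application/octet-stream";

    /** Expects at least the first 12 bytes of the file. */
    public static String detect(byte[] header) {
        if (header.length < 12) {
            return UNKNOWN;
        }
        // Bytes 4..7 hold the box type, bytes 8..11 the major brand.
        String boxType = new String(header, 4, 4, StandardCharsets.US_ASCII);
        if (!"ftyp".equals(boxType)) {
            return UNKNOWN;
        }
        String brand = new String(header, 8, 4, StandardCharsets.US_ASCII);
        switch (brand) {
            case "mif1":
                return "image/heif";
            case "heic":
            case "heix":
                return "image/heic";
            case "qt  ":
                return "video/quicktime";
            default:
                return UNKNOWN; // many other brands exist; this table is intentionally tiny
        }
    }
}
```

A check keyed only on the "ftyp" box, without reading the brand, cannot tell these files apart, which is consistent with the "video/quicktime" result reported above.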



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2830) Detect Media type of HEIF file correctly

2020-06-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136815#comment-17136815
 ] 

ASF GitHub Bot commented on TIKA-2830:
--

tballison commented on pull request #278:
URL: https://github.com/apache/tika/pull/278#issuecomment-644886130


   @makepanic I'm sorry this took forever.  We had to do some unpleasant 
shimming to upgrade drewnoakes' metadata extractor.  We've done this now, and 
this _should_ just work now.  THANK YOU!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Detect Media type of HEIF file correctly
> 
>
> Key: TIKA-2830
> URL: https://issues.apache.org/jira/browse/TIKA-2830
> Project: Tika
>  Issue Type: New Feature
>  Components: detector
>Affects Versions: 1.20
>Reporter: Laurent Grangier
>Priority: Major
> Fix For: 1.23
>
>
> When using the Tika Detector, the returned media type is "video/quicktime" 
> rather than "image/heif" as expected.
> {code:java}
> new Tika().detect(heifFile);
> {code}
> If I try with one example of the following page 
> [https://nokiatech.github.io/heif/examples.html], I get the media type 
> "video/quicktime". I expect to get the correct MIME-type of HEIF (image/heif).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-2888) Add wmv2 codec detection to ASF container

2020-06-16 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136819#comment-17136819
 ] 

ASF GitHub Bot commented on TIKA-2888:
--

tballison merged pull request #272:
URL: https://github.com/apache/tika/pull/272


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add wmv2 codec detection to ASF container
> -
>
> Key: TIKA-2888
> URL: https://issues.apache.org/jira/browse/TIKA-2888
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Affects Versions: 1.21
>Reporter: David Avendasora
>Priority: Major
>  Labels: codec, container, easyfix, video
> Attachments: Video1.WMV, sample.wmv
>
>
> The attached files are .wmv files (ASF container) with video tracks encoded 
> using the {{WMV2}} codec. They are incorrectly detected as audio 
> ({{audio/x-ms-wma}}) files instead of video ({{video/x-ms-wmv}}) files. 
> Adding the following line to the {{tika-mimetypes.xml}} file fixes the issue:
> {{    }}
> Test Files:
>  * [http://techslides.com/demos/samples/sample.wmv]
>  * [http://www.lehman.edu/faculty/hoffmann/itc/techteach/video/Video1.WMV] 
> Related to TIKA-939
> I will submit a pull request with the above changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151369#comment-17151369
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

nddipiazza commented on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653789371


   @lewismc  @tballison What do you think about swagger? 
   I want to take what Lewis did here and introduce swagger-annotations + 
swagger-jaxrs. This would remove the need for the openapi yaml file and would 
instead put that documentation for the api inside the API itself. This would 
then let the swagger.yaml output stay in sync with the API more easily 
in the future, while still getting the benefit of the codegen. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151370#comment-17151370
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

nddipiazza edited a comment on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653789371


   @lewismc  @tballison What do you think about swagger? 
   I want to take what Lewis did here and introduce swagger-annotations + 
swagger-jaxrs. This would remove the need for the openapi yaml file and would 
instead put that documentation for the api inside the java code within 
annotations. This would then make it so the swagger.yaml output will easier 
stay in sync with the api in the feature, while still getting the benefit of 
the codegen. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151371#comment-17151371
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

nddipiazza edited a comment on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653789371


   @lewismc  @tballison What do you think about swagger? 
   I want to take what Lewis did here and introduce swagger-annotations + 
swagger-jaxrs. This would remove the need for the openapi yaml file and would 
instead put that documentation for the API inside the Java code within 
annotations. This would then let the swagger.yaml output stay in sync with 
the API more easily in the future, while still getting the benefit of 
the codegen / seamless import into Postman / etc. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151374#comment-17151374
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

nddipiazza edited a comment on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653789371


   @lewismc  @tballison What do you think about swagger? 
   I want to take what Lewis did here and put the documentation within 
swagger-annotations + swagger-jaxrs. This would remove the need for the openapi 
yaml file and would instead put that documentation for the api inside the java 
code within annotations. Then 
`${tikaServerEndpoint}/swagger.yaml` would produce the OpenAPI YAML output and 
stay in sync with the API more easily in the future, while still getting the 
benefit of the codegen / seamless import into Postman / etc. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151372#comment-17151372
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

nddipiazza edited a comment on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653789371


   @lewismc  @tballison What do you think about swagger? 
   I want to take what Lewis did here and introduce swagger-annotations + 
swagger-jaxrs. This would remove the need for the openapi yaml file and would 
instead put that documentation for the API inside the Java code within 
annotations. Then 
`${tikaServerEndpoint}/swagger.yaml` would produce the OpenAPI YAML output and 
stay in sync with the API more easily in the future, while still getting the 
benefit of the codegen / seamless import into Postman / etc. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151417#comment-17151417
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

lewismc commented on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653809226


   Hi Nicholas, this work is nearly completed. We will update within the week.
   We can review then... thank you for your interest.
   
   
   On Sat, Jul 4, 2020 at 10:04 Nicholas DiPiazza 
   wrote:
   
   > @lewismc  @tballison
   >  What do you think about swagger?
   > I want to take what Lewis did here and introduce swagger-annotations +
   > swagger-jaxrs. This would remove the need for the openapi yaml file and
   > would instead put that documentation for the api inside the API itself.
   > This would then let the swagger.yaml output stay in sync with the API
   > more easily in the future, while still getting the benefit of the codegen.
   >
   -- 
   
   *Lewis*
   Dr. Lewis J. McGibbney Ph.D, B.Sc
   *Skype*: lewis.john.mcgibbney
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-05 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151624#comment-17151624
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

nddipiazza commented on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653914555


   @lewismc cool! do you mean the openapi yaml work you have in this PR? or do 
you mean swagger implementation? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OpenAPI for tika-server
> ---
>
> Key: TIKA-3082
> URL: https://issues.apache.org/jira/browse/TIKA-3082
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Lewis John McGibbney
>Priority: Major
>
> On TIKA-2253, [~lewismc] asked:
> bq. I was planning on putting together an OpenAPI specification for Tika. Is 
> anyone in favor of this?
> What do people think?  How much will it change the current tika-server?  What 
> are the benefits?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-07-05 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151653#comment-17151653
 ] 

ASF GitHub Bot commented on TIKA-3082:
--

lewismc commented on pull request #315:
URL: https://github.com/apache/tika/pull/315#issuecomment-653934091


   Both the OpenAPI and the implementation.
   We will be delivering the jaxrs generated project with the existing tika
   server implementation ported over.
   
   On Sun, Jul 5, 2020 at 10:17 Nicholas DiPiazza 
   wrote:
   
   > @lewismc  cool! do you mean the openapi yaml
   > work you have in this PR? or do you mean swagger implementation?
   >
   -- 
   
   *Lewis*
   Dr. Lewis J. McGibbney Ph.D, B.Sc
   *Skype*: lewis.john.mcgibbney
   




[jira] [Commented] (TIKA-3126) Consider new endpoint (metadata + content non recursive)

2020-07-08 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154129#comment-17154129
 ] 

ASF GitHub Bot commented on TIKA-3126:
--

nddipiazza commented on pull request #323:
URL: https://github.com/apache/tika/pull/323#issuecomment-655851164


   @tballison just dropping you a ping to see if you get a chance to review 
this one. 





> Consider new endpoint (metadata + content non recursive)
> 
>
> Key: TIKA-3126
> URL: https://issues.apache.org/jira/browse/TIKA-3126
> Project: Tika
> Issue Type: Wish
> Reporter: Carina Antunes
> Priority: Trivial
>
> Please consider providing an endpoint like /rmeta which would return metadata 
> + content, but non-recursive, i.e. combine the output from /tika and /meta.




[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2020-07-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155690#comment-17155690
 ] 

ASF GitHub Bot commented on TIKA-1570:
--

michaelwda opened a new pull request #324:
URL: https://github.com/apache/tika/pull/324


   See https://issues.apache.org/jira/browse/TIKA-1570
   
   Add a stop method that shuts down the watchdog process and terminates the 
JVM. This is useful with Apache Commons Daemon, because it lets a user define 
the StopClass and StopMethod. On Windows, this allows a user to run 
tika-server as a Windows service with correct start/stop behavior.
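
   As a rough illustration of the start/stop pairing described above, here is a minimal sketch of a class whose static methods could be named as StartMethod and StopMethod. The class name, the latch, and `isRunning` are hypothetical and are not taken from the pull request. The real change also shuts down the watchdog and terminates the JVM; that step is left out here so the sketch can be exercised in a test.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical lifecycle wrapper; names are illustrative, not Tika's actual code.
class DaemonLifecycleSketch {
    // Released once when a stop is requested.
    private static final CountDownLatch STOP_SIGNAL = new CountDownLatch(1);
    private static volatile boolean running = false;

    // Candidate start method: blocks until stop() is called.
    static void start(String[] args) throws InterruptedException {
        running = true;
        // A real server would launch its listener (and any watchdog) here.
        STOP_SIGNAL.await();
        // A real stop path would shut down the watchdog and exit the JVM here.
        running = false;
    }

    // Candidate stop method: called by the service wrapper from another thread.
    static void stop(String[] args) {
        STOP_SIGNAL.countDown();
    }

    static boolean isRunning() {
        return running;
    }
}
```

   The point of the design is that stop only signals; the blocked start method then returns and performs the cleanup, so the service wrapper sees a clean exit.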





> Seeking a stop method for better use with Apache Commons Daemon
> ---
>
> Key: TIKA-1570
> URL: https://issues.apache.org/jira/browse/TIKA-1570
> Project: Tika
> Issue Type: Improvement
> Components: server
> Affects Versions: 1.7
> Reporter: Jason Borg
> Priority: Minor
>
> I've got tika-server-1.7.jar from http://tika.apache.org/download.html
> I've downloaded v1.0.15 of the Windows binaries for Apache Commons Daemon 
> from http://commons.apache.org/proper/commons-daemon/binaries.html
> I can get Tika started as a service, but I can't determine what to use for a 
> stop method.
> prunsrv.exe //IS//tika-daemon --DisplayName "Tika Daemon" --Classpath 
> "C:\Tika Service\tika-server-1.7.jar" --StartClass 
> "org.apache.tika.server.TikaServerCli" --StopClass 
> "org.apache.tika.server.TikaServerCli" --StartMethod main --StopMethod main 
> --Description "Tika Daemon Windows Service" --StartMode java --StopMode java
> This starts, and works as I'd hope, but when trying to stop the service it 
> doesn't respond. Obviously org.apache.tika.server.TikaServerCli.main(String[] 
> args) isn't a suitable stop method, but I'm lost for alternatives.
> Using Daemon in exe mode works for start, but gives inconsistent results for 
> stop. Adding a stop method to Tika would be ideal.




[jira] [Commented] (TIKA-3131) PDFParserConfig default values were accidentally swapped

2020-07-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155788#comment-17155788
 ] 

ASF GitHub Bot commented on TIKA-3131:
--

clarkperkins opened a new pull request #325:
URL: https://github.com/apache/tika/pull/325


   …olerance to match PDFBox defaults





> PDFParserConfig default values were accidentally swapped
> 
>
> Key: TIKA-3131
> URL: https://issues.apache.org/jira/browse/TIKA-3131
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24.1
> Reporter: Clark Perkins
> Priority: Major
>
> When default values were added for averageCharTolerance and spacingTolerance 
> as a part of TIKA-3091, their values appear to have been inadvertently 
> swapped.
> From PDFBox:
> {noformat}
> private float spacingTolerance = .5f;
> private float averageCharTolerance = .3f;
> {noformat}
> From tika 1.24.1:
> {noformat}
> // The character width-based tolerance value used to estimate where spaces in text should be added.
> // Default taken from PDFBox.
> private Float averageCharTolerance = 0.5f;
> // The space width-based tolerance value used to estimate where spaces in text should be added.
> // Default taken from PDFBox.
> private Float spacingTolerance = 0.3f;
> {noformat}
> This effective change in defaults has caused PDFParser to start adding more 
> spaces than it did in 1.24 and earlier.
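
To see why swapping the two defaults can produce extra spaces, here is a simplified sketch of a gap test in the style of a PDF text stripper. It assumes that a space is inserted when the horizontal gap between two glyphs exceeds the smaller of two thresholds (space width times spacingTolerance, average character width times averageCharTolerance). The exact formula, the class name, and the glyph metrics used below are assumptions for illustration, not code from PDFBox or Tika.

```java
// Hypothetical, simplified model of a word-gap heuristic; not PDFBox's actual code.
class SpacingToleranceSketch {
    // Returns true when the gap between two glyphs is wide enough to count as a space.
    static boolean insertsSpace(double gap, double spaceWidth, double avgCharWidth,
                                double spacingTolerance, double averageCharTolerance) {
        // Assumed rule: the smaller of the two scaled widths is the threshold.
        double threshold = Math.min(spaceWidth * spacingTolerance,
                                    avgCharWidth * averageCharTolerance);
        return gap > threshold;
    }
}
```

With made-up metrics (gap 1.0, space width 2.5, average character width 5.0), the PDFBox defaults (0.5 and 0.3) give a threshold of 1.25, so no space is inserted; the swapped values (0.3 and 0.5) give a threshold of 0.75, so the same gap becomes a space.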




[jira] [Commented] (TIKA-3126) Consider new endpoint (metadata + content non recursive)

2020-07-14 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157446#comment-17157446
 ] 

ASF GitHub Bot commented on TIKA-3126:
--

nddipiazza commented on pull request #323:
URL: https://github.com/apache/tika/pull/323#issuecomment-658233475


   Closing this PR; reopening under a new Jira issue, TIKA-3133, specifically 
for adding these two headers.




[jira] [Commented] (TIKA-3126) Consider new endpoint (metadata + content non recursive)

2020-07-14 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157444#comment-17157444
 ] 

ASF GitHub Bot commented on TIKA-3126:
--

nddipiazza closed pull request #323:
URL: https://github.com/apache/tika/pull/323


   




[jira] [Commented] (TIKA-3133) /rmeta endpoint should not hard code writeLimit and maxEmbeddedResources

2020-07-14 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157450#comment-17157450
 ] 

ASF GitHub Bot commented on TIKA-3133:
--

nddipiazza opened a new pull request #326:
URL: https://github.com/apache/tika/pull/326


   see https://issues.apache.org/jira/browse/TIKA-3133
   and https://issues.apache.org/jira/browse/TIKA-3126
   
   This adds new parameters to the `rmeta` REST endpoint:
   
   `writeLimit` - max number of characters to store; if < 0, the handler will 
store all characters.
   `maxEmbeddedResources` - max number of embedded resources that will be 
parsed; if < 0, it will handle unlimited embedded resources.
   
   This lets us control how many embedded docs are parsed in a call to rmeta, 
and how many characters are written to the body. 
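
   For illustration, a client might pass the two limits like this. This is a sketch only: it assumes tika-server is listening on localhost:9998, that the document is sent with PUT, and that the limits are sent as request headers named exactly `writeLimit` and `maxEmbeddedResources`. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client helper; host, port, and header names are assumptions.
class RmetaRequestSketch {
    // Builds (but does not send) a PUT request to /rmeta with both limits set.
    static HttpRequest buildRmetaRequest(byte[] document, int writeLimit,
                                         int maxEmbeddedResources) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/rmeta"))
                // Max characters of content to keep; a negative value means no limit.
                .header("writeLimit", Integer.toString(writeLimit))
                // Max embedded resources to parse; a negative value means no limit.
                .header("maxEmbeddedResources", Integer.toString(maxEmbeddedResources))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

   The request can then be sent with `java.net.http.HttpClient`; passing -1 for either value keeps the previous unlimited behavior.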





> /rmeta endpoint should not hard code writeLimit and maxEmbeddedResources
> 
>
> Key: TIKA-3133
> URL: https://issues.apache.org/jira/browse/TIKA-3133
> Project: Tika
> Issue Type: Improvement
> Components: server
> Reporter: Nicholas DiPiazza
> Priority: Trivial
>
> When parsing using the /rmeta endpoint, you are stuck with an unlimited 
> writeLimit and an unlimited number of maxEmbeddedResources.
> We should add these as optional headers so that callers can control them.



