[jira] [Commented] (TIKA-1982) Add language (and possibly other fields) to /rmeta endpoint

2018-07-16 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546037#comment-16546037
 ] 

Chris A. Mattmann commented on TIKA-1982:
-

[~talli...@apache.org] any chance we could get this into 1.19?

> Add language (and possibly other fields) to /rmeta endpoint
> ---
>
> Key: TIKA-1982
> URL: https://issues.apache.org/jira/browse/TIKA-1982
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.14
> Environment: Debian Jessie
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Tika 1.14 
>Reporter: Philipp Steinkrueger
>Assignee: Tim Allison
>Priority: Minor
>  Labels: features
>
> While the /meta endpoint includes the detected language of a document sent to 
> the server, the /rmeta endpoint does not.
> This may apply to other metadata as well. In general, I think the /meta 
> endpoint should be a subset of the /rmeta endpoint, so that there is nothing 
> in /meta that is not also available in /rmeta.
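
The subset expectation in the description can be checked mechanically once both responses are available. What follows is only a minimal sketch in Java, assuming the two responses have already been parsed into maps; the class name, the method name and the sample keys are illustrative and are not part of Tika:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: reports metadata keys present in /meta but missing from /rmeta. */
final class MetaSubsetCheck {

    /** Returns the keys of {@code meta} that do not appear in {@code rmeta}, sorted. */
    static Set<String> missingFromRmeta(Map<String, String> meta, Map<String, String> rmeta) {
        Set<String> missing = new TreeSet<>(meta.keySet());
        missing.removeAll(rmeta.keySet());
        return missing;
    }

    public static void main(String[] args) {
        // Sample keys and values are made up for illustration only.
        Map<String, String> meta = new HashMap<>();
        meta.put("Content-Type", "text/plain");
        meta.put("language", "en");

        Map<String, String> rmeta = new HashMap<>();
        rmeta.put("Content-Type", "text/plain");

        // An empty result would mean /meta is a subset of /rmeta for this document.
        System.out.println(missingFromRmeta(meta, rmeta));
    }
}
```

An empty set for every test document would confirm the invariant the reporter asks for; any key that shows up is a field /rmeta would need to add.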



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta

2018-07-06 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534984#comment-16534984
 ] 

Chris A. Mattmann commented on TIKA-2672:
-

GREAT WORK [~ThejanWijesinghe] thanks my guy

> Upgrade dl4j to 1.0.0-beta
> --
>
> Key: TIKA-2672
> URL: https://issues.apache.org/jira/browse/TIKA-2672
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: TIKA-2672.patch
>
>
> Let's try to upgrade dl4j.  I think I got us most of the way there, but I got 
> this error when reading the json config file.  Can someone with more 
> knowledge of layer specs help ([~thammegowda], perhaps :))?
> {noformat}
> org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid 
> configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) 
> for width dimension:  Invalid input configuration for kernel width. Require 0 
> < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0)
> Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides 
> = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution 
> mode = Truncate
> {noformat}
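
The condition quoted in the exception can be restated in a few lines to see why this configuration is rejected. This is a sketch of the inequality from the message only, not DL4J code; the class and method names are hypothetical:

```java
/** Hypothetical restatement of the constraint quoted in the exception message. */
final class KernelWidthCheck {

    /** True when 0 < kW <= inWidth + 2*padW, the condition the error says is required. */
    static boolean isValid(int kW, int inWidth, int padW) {
        return kW > 0 && kW <= inWidth + 2 * padW;
    }

    public static void main(String[] args) {
        // Values from the reported error: kW=3, inWidth=1, padW=0.
        System.out.println(isValid(3, 1, 0));   // false: 3 > 1 + 2*0
        // For comparison, a width of 149 (the reported input height) would pass.
        System.out.println(isValid(3, 149, 0)); // true
    }
}
```

With an input width of 1 and no padding, a 3-wide kernel cannot fit, which is exactly what the message reports.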





[jira] [Commented] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-05 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533837#comment-16533837
 ] 

Chris A. Mattmann commented on TIKA-2684:
-

gotcha, well if you tell me the GDAL commands necessary to get the header info, 
I can probably update the parser and see if you get the information you are 
looking for. I'm not a GDAL expert so I would need to know the commands to run 
from the GDAL CLI since Tika just wraps it IIRC.

> Tika does not extract *.fits header text, just file level metadata
> --
>
> Key: TIKA-2684
> URL: https://issues.apache.org/jira/browse/TIKA-2684
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, mime, parser
>Affects Versions: 1.18
>Reporter: Susan
>Priority: Minor
>
> Tika only pulls file-level metadata for *.fits (Flexible Image Transport 
> System) files, using:
> java -jar tika-app-1.18.jar --gui
> Content-Length: 699840
> Content-Type: application/fits
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.gdal.GDALParser
> X-TIKA:digest:MD5: d93e8f4654902c45c7f3e4f4bf5f63e2
> X-TIKA:digest:SHA256: 
> da7c0f1b6643850856cba100e9b3e8db76b80e91583eb088635c416a2b4161b3
> resourceName: WFPC2u5780205r_c0fx.fits
> Rather than text from the header (extracted with astropy.py):
> SIMPLE  =                    T / file does conform to FITS standard
> BITPIX  =                  -32 / number of bits per data pixel
> NAXIS   =                    3 / number of data axes
> NAXIS1  =                  200 / length of data axis 1
> NAXIS2  =                  200 / length of data axis 2
> NAXIS3  =                    4 / length of data axis 3
> EXTEND  =                    T / FITS dataset may contain extensions
> COMMENT   FITS (Flexible Image Transport System) format is defined in 'Astronomy
> COMMENT   and Astrophysics', volume 376, page 359; bibcode: 2001A&A...376..359H
> BSCALE  =                1.0E0 / REAL = TAPE*BSCALE + BZERO
> BZERO   =                0.0E0 /
> OPSIZE  =                 2112 / PSIZE of original image
> ORIGIN  = 'STScI-STSDAS'       / Fitsio version 21-Feb-1996
> FITSDATE= '2004-01-09'         / Date FITS file was created
> FILENAME= 'u5780205r_cvt.c0h'  / Original filename
> ALLG-MAX=               3.01E3 / Data max in all groups
> ALLG-MIN=          -7.319537E1 / Data min in all groups
> ODATTYPE= 'FLOATING'           / Original datatype: Single precision real
> SDASMGNU=                    4 / Number of groups in original image
>  
> This capability was mentioned in TIKA-874. I'm looking at netCDF 
> files/headers as a model for this behaviour. 
> Thank you!
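
For reference, the header text the reporter wants is a sequence of fixed-width "cards": the keyword sits in the first eight columns, "= " follows in columns 9 and 10 when the card carries a value, and an optional comment comes after a "/". The sketch below splits one card into those parts. It is an illustration only, under the assumption that a card arrives as a single string; it is not how Tika or GDAL handle FITS, and the class name is hypothetical:

```java
/** Hypothetical parser for a single FITS header card; not Tika or GDAL code. */
final class FitsCard {
    final String keyword;
    final String value;   // empty for commentary cards such as COMMENT
    final String comment;

    private FitsCard(String keyword, String value, String comment) {
        this.keyword = keyword;
        this.value = value;
        this.comment = comment;
    }

    static FitsCard parse(String card) {
        String keyword = card.length() >= 8 ? card.substring(0, 8).trim() : card.trim();
        // Columns 9-10 hold "= " only when the card carries a value.
        if (card.length() < 10 || !card.startsWith("= ", 8)) {
            String text = card.length() > 8 ? card.substring(8).trim() : "";
            return new FitsCard(keyword, "", text);
        }
        String rest = card.substring(10);
        // Find the first '/' that is not inside a quoted string value.
        boolean inQuotes = false;
        int slash = -1;
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c == '\'') {
                inQuotes = !inQuotes;
            } else if (c == '/' && !inQuotes) {
                slash = i;
                break;
            }
        }
        String value = (slash < 0 ? rest : rest.substring(0, slash)).trim();
        String comment = slash < 0 ? "" : rest.substring(slash + 1).trim();
        return new FitsCard(keyword, value, comment);
    }

    public static void main(String[] args) {
        FitsCard c = parse("BITPIX  =                  -32 / number of bits per data pixel");
        System.out.println(c.keyword + " -> " + c.value + " (" + c.comment + ")");
    }
}
```

Emitting each parsed card as a metadata entry keyed by the keyword would be one way to surface the header, in the spirit of the netCDF behaviour mentioned above.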





[jira] [Comment Edited] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-05 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533824#comment-16533824
 ] 

Chris A. Mattmann edited comment on TIKA-2684 at 7/5/18 3:41 PM:
-

hehe, well I know GDAL handles fits, and Tika is [integrated with 
GDAL|https://wiki.apache.org/tika/TikaGDAL]. Give that guide a try and see if 
you get the metadata and header info you are looking for :)


was (Author: chrismattmann):
hehe, well I know GDAL handles fits, and Tika is [integrated with 
GDAL|http://wiki.apache.org/tika/GDALParser]. Give that guide a try and see if 
you get the metadata and header info you are looking for :)



[jira] [Comment Edited] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-05 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533824#comment-16533824
 ] 

Chris A. Mattmann edited comment on TIKA-2684 at 7/5/18 3:40 PM:
-

hehe, well I know GDAL handles fits, and Tika is [integrated with 
GDAL|http://wiki.apache.org/tika/GDALParser]. Give that guide a try and see if 
you get the metadata and header info you are looking for :)


was (Author: chrismattmann):
hehe, well I know GDAL handles fits, and Tika is [integrated with 
GDAL|[http://wiki.apache.org/tika/GDALParser]]. Give that guide a try and see 
if you get the metadata and header info you are looking for :)



[jira] [Commented] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-05 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533824#comment-16533824
 ] 

Chris A. Mattmann commented on TIKA-2684:
-

hehe, well I know GDAL handles fits, and Tika is [integrated with 
GDAL|[http://wiki.apache.org/tika/GDALParser]]. Give that guide a try and see 
if you get the metadata and header info you are looking for :)



[jira] [Assigned] (TIKA-94) Speech recognition

2018-06-11 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-94:
-

Assignee: Chris A. Mattmann

> Speech recognition
> --
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-94) Speech recognition

2018-06-11 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508258#comment-16508258
 ] 

Chris A. Mattmann commented on TIKA-94:
---

Yes [~ThejanWijesinghe] let's start with your Tensorflow one then we can use 
[https://github.com/USCDataScience/dl4j-kerasimport-examples/] to bring it into 
DL4J and tika-dl.



[jira] [Commented] (TIKA-94) Speech recognition

2018-06-10 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507635#comment-16507635
 ] 

Chris A. Mattmann commented on TIKA-94:
---

great to hear. Check this out: 
[https://deeplearning4j.org/opendata#speech-datasets] 

These could easily be integrated into tika-dl as a module, similar to inception 
and VGG16.



[jira] [Commented] (TIKA-94) Speech recognition

2018-06-08 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506106#comment-16506106
 ] 

Chris A. Mattmann commented on TIKA-94:
---

Thanks for asking [~edwinyeozl]. There is an opportunity here to use the 
tika-dl module nowadays and DL4J and to do some speech recognition. 
[~ThejanWijesinghe] and I have been discussing it briefly. Would you like to 
contribute?



[jira] [Commented] (TIKA-2646) Tika parse["content"] returns jumbled text across cells of a table in a pdf

2018-05-26 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491792#comment-16491792
 ] 

Chris A. Mattmann commented on TIKA-2646:
-

[~adidier] see comment above from [~lfcnassif]

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---
>
> Key: TIKA-2646
> URL: https://issues.apache.org/jira/browse/TIKA-2646
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.18
> Environment: MacOS Sierra 10.12.6
>Reporter: Annie Didier
>Priority: Trivial
>  Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes 
> mixed and the words get concatenated together. For example:
>  
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>  
> In other, more serious cases, the text within a cell becomes scrambled with 
> text from another cell, such as:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / 
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>  
> TESTERS
>  
> 33 P -
>  
> FLOWBACK /
>  
> TESTING
> Note that the value of the second column has moved to the first column, and 
> the "-" within the first column is misordered. The last two columns have 
> switched places.





[jira] [Commented] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request

2018-05-24 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489926#comment-16489926
 ] 

Chris A. Mattmann commented on TIKA-2520:
-

Integrated into 2.x master too:
{noformat}
[INFO]
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/2.0.0-SNAPSHOT/tika-2.0.0-SNAPSHOT.pom
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ............................... SUCCESS [  1.977 s]
[INFO] Apache Tika core ................................. SUCCESS [ 30.959 s]
[INFO] Apache Tika parsers .............................. SUCCESS [03:25 min]
[INFO] Apache Tika XMP .................................. SUCCESS [  2.420 s]
[INFO] Apache Tika serialization ........................ SUCCESS [  1.955 s]
[INFO] Apache Tika batch ................................ SUCCESS [01:58 min]
[INFO] Apache Tika language detection ................... SUCCESS [  2.731 s]
[INFO] Apache Tika application .......................... SUCCESS [01:07 min]
[INFO] Apache Tika OSGi bundle .......................... SUCCESS [ 31.078 s]
[INFO] Apache Tika translate ............................ SUCCESS [  3.269 s]
[INFO] Apache Tika server ............................... SUCCESS [ 21.436 s]
[INFO] Apache Tika examples ............................. SUCCESS [ 15.475 s]
[INFO] Apache Tika Java-7 Components .................... SUCCESS [  3.467 s]
[INFO] Apache Tika eval ................................. SUCCESS [ 40.324 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ...... SUCCESS [01:02 min]
[INFO] Apache Tika Natural Language Processing .......... SUCCESS [ 25.107 s]
[INFO] Apache Tika ...................................... SUCCESS [  0.030 s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 10:34 min
[INFO] Finished at: 2018-05-24T14:30:18-07:00
[INFO] Final Memory: 203M/1743M
[INFO]
nonas:tika2.0.0 mattmann$ git push -u origin master
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (11/11), 1.38 KiB | 1.38 MiB/s, done.
Total 11 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To github.com:/apache/tika.git
   e24e6afb1..5c1143b30  master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
nonas:tika2.0.0 mattmann${noformat}

> OptimaizeLangDetector#loadMod

[jira] [Comment Edited] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request

2018-05-24 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489790#comment-16489790
 ] 

Chris A. Mattmann edited comment on TIKA-2520 at 5/24/18 8:56 PM:
--

{noformat}
nonas:tika2.0.0 mattmann$ git push -u origin branch_1x
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (14/14), 1.72 KiB | 252.00 KiB/s, done.
Total 14 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:/apache/tika.git
   cdca0f726..7e3e34caf  branch_1x -> branch_1x
Branch 'branch_1x' set up to track remote branch 'branch_1x' from 'origin'.{noformat}


was (Author: chrismattmann):
{noformat}
nonas:tika2.0.0 mattmann$ git push -u origin branch_1x
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (14/14), 1.72 KiB | 252.00 KiB/s, done.
Total 14 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:/apache/tika.git
   cdca0f726..7e3e34caf  branch_1x -> branch_1x
Branch 'branch_1x' set up to track remote branch 'branch_1x' from 'origin'.
nonas:tika2.0.0 mattmann${noformat}

> OptimaizeLangDetector#loadModels() should not be called for every single 
> langdetect HTTP request
> 
>
> Key: TIKA-2520
> URL: https://issues.apache.org/jira/browse/TIKA-2520
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.16
>Reporter: Vincent van Donselaar
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: performance
> Fix For: 1.19
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Tika REST server's `/language` resource invokes the relatively heavy 
> `loadModels` operation for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
>   LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
>   String detectedLang = language.getLanguage();
>   LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
>   return detectedLang;
> }
> {code}
> This could be optimized by (lazy?) loading the models only once and keeping 
> them in memory. I assume the `LanguageDetector` is not thread safe, so I expect 
> this requires an ExecutorService with language detectors.
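
The "load once, keep in memory" idea from the description can be sketched with the initialization-on-demand holder idiom. This is an illustration under stated assumptions and not the change that was committed for this issue: `Models` is a hypothetical stand-in for the expensive detector, and the counter exists only to show how often loading happens.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for an expensive model set; hypothetical, not a Tika class. */
final class Models {
    static final AtomicInteger LOADS = new AtomicInteger();

    static Models load() {
        LOADS.incrementAndGet(); // pretend this reads language profiles from disk
        return new Models();
    }

    String detect(String text) {
        return text.isEmpty() ? "unknown" : "en"; // placeholder result
    }
}

/** Loads the models on first use and reuses them for every later call. */
final class LazyModels {
    private LazyModels() {}

    // The JVM initializes Holder only when get() is first called, and does so once.
    private static final class Holder {
        static final Models INSTANCE = Models.load();
    }

    static Models get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            LazyModels.get().detect("some text");
        }
        System.out.println(Models.LOADS.get()); // prints 1
    }
}
```

Whether a single shared instance is safe depends on the thread safety of the real detector, which the description leaves open; if it is not thread safe, the same holder could guard a pool or one instance per thread instead.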





[jira] [Resolved] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request

2018-05-24 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2520.
-
   Resolution: Fixed
Fix Version/s: 1.19

{noformat}
nonas:tika2.0.0 mattmann$ git push -u origin branch_1x
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (14/14), 1.72 KiB | 252.00 KiB/s, done.
Total 14 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:/apache/tika.git
   cdca0f726..7e3e34caf  branch_1x -> branch_1x
Branch 'branch_1x' set up to track remote branch 'branch_1x' from 'origin'.
nonas:tika2.0.0 mattmann${noformat}



[jira] [Assigned] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request

2018-05-24 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-2520:
---

Assignee: Chris A. Mattmann



[jira] [Commented] (TIKA-2646) Tika parse["content"] returns jumbled text across cells of a table in a pdf

2018-05-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484338#comment-16484338
 ] 

Chris A. Mattmann commented on TIKA-2646:
-

Tim thanks - this is for a project at JPL and I asked Annie to raise this 
issue. Thanks for the pointer to tabula-java...appreciate it. We will circle 
back here.



[jira] [Resolved] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-11-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2400.
-
Resolution: Fixed
  Assignee: Chris A. Mattmann

merged!

> Standardizing current Object Recognition REST parsers
> -
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring current Object Recognition 
> REST parsers,
> # Refactoring dockerfiles related to those parsers.
> #  Moving the logic related to checking minimum confidence into servers
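
A minimal sketch of the third item, assuming a server that produces a list of label/confidence predictions. The function name, the dictionary keys, and the threshold are hypothetical and are not taken from the actual REST servers:

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def filter_by_confidence(
    predictions: List[Prediction], min_confidence: float
) -> List[Prediction]:
    """Keep predictions at or above the threshold, highest confidence first."""
    kept = [p for p in predictions if float(p["confidence"]) >= min_confidence]
    return sorted(kept, key=lambda p: float(p["confidence"]), reverse=True)
```

Filtering on the server side means the client only receives objects that already meet the threshold, so the Java parsers do not each need their own copy of the check.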



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2503) Try to upgrade httpclient to >=4.5.3

2017-11-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249913#comment-16249913
 ] 

Chris A. Mattmann commented on TIKA-2503:
-

Thanks Tim. No, we don't have coverage. I bet that exclusion would work. I can 
try this week.

> Try to upgrade httpclient to >=4.5.3
> 
>
> Key: TIKA-2503
> URL: https://issues.apache.org/jira/browse/TIKA-2503
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>






[jira] [Commented] (TIKA-2503) Try to upgrade httpclient to >=4.5.3

2017-11-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249877#comment-16249877
 ] 

Chris A. Mattmann commented on TIKA-2503:
-

For OPeNDAP datasets I believe we would need this...yep

> Try to upgrade httpclient to >=4.5.3
> 
>
> Key: TIKA-2503
> URL: https://issues.apache.org/jira/browse/TIKA-2503
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>






[jira] [Resolved] (TIKA-2464) No PIL found while running the docker image 'InceptionVideoRestDockerfile'

2017-09-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2464.
-
   Resolution: Fixed
Fix Version/s: 1.17

Committed, thanks [~armathur]!

> No PIL found while running the docker image 'InceptionVideoRestDockerfile'
> --
>
> Key: TIKA-2464
> URL: https://issues.apache.org/jira/browse/TIKA-2464
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.16
>Reporter: Aman R Mathur
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.17
>
>
> Adding 'pip install flask requests pillow' in InceptionVideoRestDockerfile 
> line 68. 
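
The "No PIL found" failure could also be surfaced at container start rather than at request time. A small sketch, assuming the server needs `flask`, `requests`, and `PIL` (the module name that the pillow package installs); `missing_modules` is a hypothetical helper and is not part of the Dockerfile or the server:

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["flask", "requests", "PIL"])` before starting the server returns an empty list when the pip install step has run, and names the absent packages otherwise.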





[jira] [Assigned] (TIKA-2464) No PIL found while running the docker image 'InceptionVideoRestDockerfile'

2017-09-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-2464:
---

Assignee: Chris A. Mattmann

> No PIL found while running the docker image 'InceptionVideoRestDockerfile'
> --
>
> Key: TIKA-2464
> URL: https://issues.apache.org/jira/browse/TIKA-2464
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.16
>Reporter: Aman R Mathur
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.17
>
>
> Adding 'pip install flask requests pillow' in InceptionVideoRestDockerfile 
> line 68. 





[jira] [Resolved] (TIKA-2332) Output SNOMED codes for CUIs in CTAKES output?

2017-08-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2332.
-
Resolution: Fixed

Committed, thanks!

> Output SNOMED codes for CUIs in CTAKES output?
> --
>
> Key: TIKA-2332
> URL: https://issues.apache.org/jira/browse/TIKA-2332
> Project: Tika
>  Issue Type: New Feature
>Reporter: Dillon Welch
>  Labels: memex
>
> I am trying to use CTAKES in an environment where I need to process large 
> amounts of documents automatically, therefore I am looking to use the TIKA 
> app for running it. Unfortunately, I have found that the server output only 
> gives the CUIs for any results, but I need the SNOMED code associated with 
> the CUI. 
> I found the code at 
> https://github.com/apache/tika/blob/9130bbc1fa6d69419b2ad294917260d6b1cced08/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java#L204
>  that is responsible for this output. 
> Is there any other solution besides editing the code there to provide more 
> output/adding a new option for additional output?





[jira] [Updated] (TIKA-2332) Output SNOMED codes for CUIs in CTAKES output?

2017-08-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2332:

Fix Version/s: 1.17

> Output SNOMED codes for CUIs in CTAKES output?
> --
>
> Key: TIKA-2332
> URL: https://issues.apache.org/jira/browse/TIKA-2332
> Project: Tika
>  Issue Type: New Feature
>Reporter: Dillon Welch
>  Labels: memex
> Fix For: 1.17
>
>
> I am trying to use CTAKES in an environment where I need to process large 
> amounts of documents automatically, therefore I am looking to use the TIKA 
> app for running it. Unfortunately, I have found that the server output only 
> gives the CUIs for any results, but I need the SNOMED code associated with 
> the CUI. 
> I found the code at 
> https://github.com/apache/tika/blob/9130bbc1fa6d69419b2ad294917260d6b1cced08/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java#L204
>  that is responsible for this output. 
> Is there any other solution besides editing the code there to provide more 
> output/adding a new option for additional output?





[jira] [Updated] (TIKA-2332) Output SNOMED codes for CUIs in CTAKES output?

2017-08-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2332:

Labels: memex  (was: )

> Output SNOMED codes for CUIs in CTAKES output?
> --
>
> Key: TIKA-2332
> URL: https://issues.apache.org/jira/browse/TIKA-2332
> Project: Tika
>  Issue Type: New Feature
>Reporter: Dillon Welch
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.17
>
>
> I am trying to use CTAKES in an environment where I need to process large 
> amounts of documents automatically, therefore I am looking to use the TIKA 
> app for running it. Unfortunately, I have found that the server output only 
> gives the CUIs for any results, but I need the SNOMED code associated with 
> the CUI. 
> I found the code at 
> https://github.com/apache/tika/blob/9130bbc1fa6d69419b2ad294917260d6b1cced08/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java#L204
>  that is responsible for this output. 
> Is there any other solution besides editing the code there to provide more 
> output/adding a new option for additional output?





[jira] [Assigned] (TIKA-2332) Output SNOMED codes for CUIs in CTAKES output?

2017-08-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-2332:
---

Assignee: Chris A. Mattmann

> Output SNOMED codes for CUIs in CTAKES output?
> --
>
> Key: TIKA-2332
> URL: https://issues.apache.org/jira/browse/TIKA-2332
> Project: Tika
>  Issue Type: New Feature
>Reporter: Dillon Welch
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.17
>
>
> I am trying to use CTAKES in an environment where I need to process large 
> amounts of documents automatically, therefore I am looking to use the TIKA 
> app for running it. Unfortunately, I have found that the server output only 
> gives the CUIs for any results, but I need the SNOMED code associated with 
> the CUI. 
> I found the code at 
> https://github.com/apache/tika/blob/9130bbc1fa6d69419b2ad294917260d6b1cced08/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java#L204
>  that is responsible for this output. 
> Is there any other solution besides editing the code there to provide more 
> output/adding a new option for additional output?





[jira] [Updated] (TIKA-2355) Cache trained mode while running ObjectRecognition server from Docker builds

2017-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2355:

Component/s: parser

> Cache trained mode while running ObjectRecognition server from Docker builds 
> -
>
> Key: TIKA-2355
> URL: https://issues.apache.org/jira/browse/TIKA-2355
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.17
>
>
> DockerBuilds of ObjectRecognition downloads model every time server starts. 
> This can be prevented by initializing code once before we exit from 
> DockerBuild so model is downloaded once and always available at subsequent 
> server startup
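
The idea in this description (fetch the model once, then reuse it on every later start) can be sketched as a download-if-missing helper. This is only an illustration under stated assumptions: `ensure_model`, its `download` callback, and the file layout are hypothetical and do not describe the actual ObjectRecognition server code:

```python
import os
import tempfile
from typing import Callable


def ensure_model(path: str, download: Callable[[], bytes]) -> bool:
    """Fetch the model into `path` only if it is not already there.

    Returns True when a download happened, False when the cached file was reused.
    """
    if os.path.exists(path):
        return False
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file first and rename it afterwards, so that an
    # interrupted download is never mistaken for a complete model.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(download())
    os.replace(tmp_path, path)
    return True
```

Running such a step once during the image build would leave the model file in the image, so the same call at server startup finds the file and skips the download.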





[jira] [Updated] (TIKA-2355) Cache trained mode while running ObjectRecognition server from Docker builds

2017-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2355:

Labels: memex  (was: )

> Cache trained mode while running ObjectRecognition server from Docker builds 
> -
>
> Key: TIKA-2355
> URL: https://issues.apache.org/jira/browse/TIKA-2355
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.17
>
>
> DockerBuilds of ObjectRecognition downloads model every time server starts. 
> This can be prevented by initializing code once before we exit from 
> DockerBuild so model is downloaded once and always available at subsequent 
> server startup





[jira] [Updated] (TIKA-2355) Cache trained mode while running ObjectRecognition server from Docker builds

2017-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2355:

Description: DockerBuilds of ObjectRecognition downloads model every time 
server starts. This can be prevented by initializing code once before we exit 
from DockerBuild so model is downloaded once and always available at subsequent 
server startup  (was: DockerBuilds of ObjectRecognition downloads model every 
time server starts. This can be prevented by initializing code once before we 
exit from DockerBuild so model is downloaded once and always available at 
sunsequent server startup)

> Cache trained mode while running ObjectRecognition server from Docker builds 
> -
>
> Key: TIKA-2355
> URL: https://issues.apache.org/jira/browse/TIKA-2355
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Madhav Sharan
>  Labels: memex
> Fix For: 1.17
>
>
> DockerBuilds of ObjectRecognition downloads model every time server starts. 
> This can be prevented by initializing code once before we exit from 
> DockerBuild so model is downloaded once and always available at subsequent 
> server startup





[jira] [Updated] (TIKA-2355) Cache trained mode while running ObjectRecognition server from Docker builds

2017-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2355:

Fix Version/s: 1.17

> Cache trained mode while running ObjectRecognition server from Docker builds 
> -
>
> Key: TIKA-2355
> URL: https://issues.apache.org/jira/browse/TIKA-2355
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.17
>
>
> DockerBuilds of ObjectRecognition downloads model every time server starts. 
> This can be prevented by initializing code once before we exit from 
> DockerBuild so model is downloaded once and always available at subsequent 
> server startup





[jira] [Resolved] (TIKA-2355) Cache trained mode while running ObjectRecognition server from Docker builds

2017-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2355.
-
Resolution: Fixed
  Assignee: Chris A. Mattmann

Fixed!
{noformat}
LMC-053601:tf mattmann$ git commit -m "record changes for TIKA-2265."
[master c6b6b17] record changes for TIKA-2265.
 1 file changed, 2 insertions(+), 1 deletion(-)
LMC-053601:tf mattmann$ git push -u origin master
Counting objects: 13, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (13/13), 1.06 KiB | 0 bytes/s, done.
Total 13 (delta 6), reused 0 (delta 0)
remote: Resolving deltas: 100% (6/6), completed with 5 local objects.
To https://github.com/apache/tika.git
   2bcc0a7..c6b6b17  master -> master
Branch master set up to track remote branch master from origin.
LMC-053601:tf mattmann$ 
{noformat}

Thanks Madhav.

> Cache trained mode while running ObjectRecognition server from Docker builds 
> -
>
> Key: TIKA-2355
> URL: https://issues.apache.org/jira/browse/TIKA-2355
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: memex
>
> DockerBuilds of ObjectRecognition downloads model every time server starts. 
> This can be prevented by initializing code once before we exit from 
> DockerBuild so model is downloaded once and always available at subsequent 
> server startup





[jira] [Commented] (TIKA-2434) Language detection slow, cpu intensive, CLI interrupts work

2017-08-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120263#comment-16120263
 ] 

Chris A. Mattmann commented on TIKA-2434:
-

[~talli...@apache.org] ping

> Language detection slow, cpu intensive, CLI interrupts work
> ---
>
> Key: TIKA-2434
> URL: https://issues.apache.org/jira/browse/TIKA-2434
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.16
> Environment: OS X 10.11.6, JRE 1.8.0_25
>Reporter: Stefan Karner
>
> Since version 1.16, when using tika -l FILE, it takes a lot longer than e.g. 
> 1.15.
> Also, when batch processing a bunch of files in the background, the Java 
> runtime icon pops up when processing the next file, stealing the input focus 
> from whatever other application I'm currently working on, thus constantly 
> interrupting my work.
> Also, the Java runtime uses from 100% to 400% CPU when executing Tika.





[jira] [Updated] (TIKA-2402) Support all image formats in Object Recognition REST Parser

2017-08-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2402:

Labels: memex  (was: )

> Support all image formats in Object Recognition REST Parser
> ---
>
> Key: TIKA-2402
> URL: https://issues.apache.org/jira/browse/TIKA-2402
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.17
>
>
> Currently the object recognition REST parser only supports parsing the JPEG 
> image type. The objective of this task is to add support for all image 
> formats by converting any image into JPEG format at the server's end.





[jira] [Resolved] (TIKA-2402) Support all image formats in Object Recognition REST Parser

2017-08-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2402.
-
Resolution: Fixed
  Assignee: Chris A. Mattmann

- fixed!

> Support all image formats in Object Recognition REST Parser
> ---
>
> Key: TIKA-2402
> URL: https://issues.apache.org/jira/browse/TIKA-2402
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.17
>
>
> Currently the object recognition REST parser only supports parsing the JPEG 
> image type. The objective of this task is to add support for all image 
> formats by converting any image into JPEG format at the server's end.





[jira] [Commented] (TIKA-2434) Language detection slow, cpu intensive, CLI interrupts work

2017-08-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109908#comment-16109908
 ] 

Chris A. Mattmann commented on TIKA-2434:
-

Tim, for #1, once you know how it runs against the regression corpus, we can 
open a PR in homebrew-core and get it updated.


> Language detection slow, cpu intensive, CLI interrupts work
> ---
>
> Key: TIKA-2434
> URL: https://issues.apache.org/jira/browse/TIKA-2434
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.16
> Environment: OS X 10.11.6, JRE 1.8.0_25
>Reporter: Stefan Karner
>
> Since version 1.16, when using tika -l FILE, it takes a lot longer than e.g. 
> 1.15.
> Also, when batch processing a bunch of files in the background, the Java 
> runtime icon pops up when processing the next file, stealing the input focus 
> from whatever other application I'm currently working on, thus constantly 
> interrupting my work.
> Also, the Java runtime uses from 100% to 400% CPU when executing Tika.





[jira] [Updated] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2262:

Labels: deeplearning gsoc2017 machine_learning memex  (was: deeplearning 
gsoc2017 machine_learning)

> Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
> 
>
> Key: TIKA-2262
> URL: https://issues.apache.org/jira/browse/TIKA-2262
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: deeplearning, gsoc2017, machine_learning, memex
> Fix For: 1.17
>
>
> h2. Background:
> Image captions are a small piece of text, usually one line long, added to the 
> metadata of images to provide a brief summary of the scenery in the image. 
> Generating them is a challenging and interesting problem in computer vision. 
> Tika already has support for image recognition via the [Object Recognition 
> Parser, TIKA-1993| https://issues.apache.org/jira/browse/TIKA-1993], which 
> uses an InceptionV3 model pre-trained on the ImageNet dataset using TensorFlow. 
> Captioning an image is a very useful feature since it helps text-based 
> Information Retrieval (IR) systems to "understand" the scenery in images.
> h2. Technical details and references:
> * Google long ago open-sourced their 'show and tell' neural network and 
> its model for autogenerating captions. [Source Code| 
> https://github.com/tensorflow/models/tree/master/im2txt], [Research blog| 
> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html]
> * Integrate it the same way as the ObjectRecognitionParser
> ** Create a RESTful API Service [similar to this| 
> https://wiki.apache.org/tika/TikaAndVision#A2._Tensorflow_Using_REST_Server] 
> ** Extend or enhance ObjectRecognitionParser or one of its implementations
> h2. {skills, learning, homework} for GSoC students
> * Knowledge of languages: Java AND Python, and the Maven build system
> * RESTful APIs 
> * TensorFlow/Keras
> * Deep learning
> 
> Alternatively, a slightly harder path for experienced students:
> [Import a Keras/TensorFlow model into 
> deeplearning4j|https://deeplearning4j.org/model-import-keras ] and run it 
> natively inside the JVM.
> h4. Benefits
> * No RESTful integration required, thus no external dependencies
> * Easy to distribute on Hadoop/Spark clusters
> h4. Hurdles:
> * This is a work-in-progress feature in deeplearning4j, so expect plenty of 
> trouble along the way! 





[jira] [Commented] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079637#comment-16079637
 ] 

Chris A. Mattmann commented on TIKA-2262:
-

Documentation is here: https://wiki.apache.org/tika/ImageCaption

> Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
> 
>
> Key: TIKA-2262
> URL: https://issues.apache.org/jira/browse/TIKA-2262
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: deeplearning, gsoc2017, machine_learning, memex
> Fix For: 1.17
>
>
> h2. Background:
> Image captions are a small piece of text, usually one line long, added to the 
> metadata of images to provide a brief summary of the scenery in the image. 
> Generating them is a challenging and interesting problem in computer vision. 
> Tika already has support for image recognition via the [Object Recognition 
> Parser, TIKA-1993| https://issues.apache.org/jira/browse/TIKA-1993], which 
> uses an InceptionV3 model pre-trained on the ImageNet dataset using TensorFlow. 
> Captioning an image is a very useful feature since it helps text-based 
> Information Retrieval (IR) systems to "understand" the scenery in images.
> h2. Technical details and references:
> * Google long ago open-sourced their 'show and tell' neural network and 
> its model for autogenerating captions. [Source Code| 
> https://github.com/tensorflow/models/tree/master/im2txt], [Research blog| 
> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html]
> * Integrate it the same way as the ObjectRecognitionParser
> ** Create a RESTful API Service [similar to this| 
> https://wiki.apache.org/tika/TikaAndVision#A2._Tensorflow_Using_REST_Server] 
> ** Extend or enhance ObjectRecognitionParser or one of its implementations
> h2. {skills, learning, homework} for GSoC students
> * Knowledge of languages: Java AND Python, and the Maven build system
> * RESTful APIs 
> * TensorFlow/Keras
> * Deep learning
> 
> Alternatively, a slightly harder path for experienced students:
> [Import a Keras/TensorFlow model into 
> deeplearning4j|https://deeplearning4j.org/model-import-keras ] and run it 
> natively inside the JVM.
> h4. Benefits
> * No RESTful integration required, thus no external dependencies
> * Easy to distribute on Hadoop/Spark clusters
> h4. Hurdles:
> * This is a work-in-progress feature in deeplearning4j, so expect plenty of 
> trouble along the way! 





[jira] [Resolved] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2262.
-
   Resolution: Fixed
Fix Version/s: 1.17

Congratulations [~ThejanWijesinghe] your work is now merged with master! Thanks 
[~tgow...@gmail.com] and [~talli...@apache.org] for the help!

> Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
> 
>
> Key: TIKA-2262
> URL: https://issues.apache.org/jira/browse/TIKA-2262
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: deeplearning, gsoc2017, machine_learning
> Fix For: 1.17
>
>
> h2. Background:
> Image captions are a small piece of text, usually one line long, added to the 
> metadata of images to provide a brief summary of the scenery in the image. 
> Generating them is a challenging and interesting problem in computer vision. 
> Tika already has support for image recognition via the [Object Recognition 
> Parser, TIKA-1993| https://issues.apache.org/jira/browse/TIKA-1993], which 
> uses an InceptionV3 model pre-trained on the ImageNet dataset using TensorFlow. 
> Captioning an image is a very useful feature since it helps text-based 
> Information Retrieval (IR) systems to "understand" the scenery in images.
> h2. Technical details and references:
> * Google long ago open-sourced their 'show and tell' neural network and 
> its model for autogenerating captions. [Source Code| 
> https://github.com/tensorflow/models/tree/master/im2txt], [Research blog| 
> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html]
> * Integrate it the same way as the ObjectRecognitionParser
> ** Create a RESTful API Service [similar to this| 
> https://wiki.apache.org/tika/TikaAndVision#A2._Tensorflow_Using_REST_Server] 
> ** Extend or enhance ObjectRecognitionParser or one of its implementations
> h2. {skills, learning, homework} for GSoC students
> * Knowledge of languages: Java AND Python, and the Maven build system
> * RESTful APIs 
> * TensorFlow/Keras
> * Deep learning
> 
> Alternatively, a slightly harder path for experienced students:
> [Import a Keras/TensorFlow model into 
> deeplearning4j|https://deeplearning4j.org/model-import-keras ] and run it 
> natively inside the JVM.
> h4. Benefits
> * No RESTful integration required, thus no external dependencies
> * Easy to distribute on Hadoop/Spark clusters
> h4. Hurdles:
> * This is a work-in-progress feature in deeplearning4j, so expect plenty of 
> trouble along the way! 





[jira] [Assigned] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-2262:
---

Assignee: Chris A. Mattmann  (was: Thamme Gowda)

> Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
> 
>
> Key: TIKA-2262
> URL: https://issues.apache.org/jira/browse/TIKA-2262
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: deeplearning, gsoc2017, machine_learning
>
> h2. Background:
> Image captions are a small piece of text, usually one line long, added to the 
> metadata of images to provide a brief summary of the scenery in the image. 
> Generating them is a challenging and interesting problem in computer vision. 
> Tika already has support for image recognition via the [Object Recognition 
> Parser, TIKA-1993| https://issues.apache.org/jira/browse/TIKA-1993], which 
> uses an InceptionV3 model pre-trained on the ImageNet dataset using TensorFlow. 
> Captioning an image is a very useful feature since it helps text-based 
> Information Retrieval (IR) systems to "understand" the scenery in images.
> h2. Technical details and references:
> * Google long ago open-sourced their 'show and tell' neural network and 
> its model for autogenerating captions. [Source Code| 
> https://github.com/tensorflow/models/tree/master/im2txt], [Research blog| 
> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html]
> * Integrate it the same way as the ObjectRecognitionParser
> ** Create a RESTful API Service [similar to this| 
> https://wiki.apache.org/tika/TikaAndVision#A2._Tensorflow_Using_REST_Server] 
> ** Extend or enhance ObjectRecognitionParser or one of its implementations
> h2. {skills, learning, homework} for GSoC students
> * Knowledge of languages: Java AND Python, and the Maven build system
> * RESTful APIs 
> * TensorFlow/Keras
> * Deep learning
> 
> Alternatively, a slightly harder path for experienced students:
> [Import a Keras/TensorFlow model into 
> deeplearning4j|https://deeplearning4j.org/model-import-keras ] and run it 
> natively inside the JVM.
> h4. Benefits
> * No RESTful integration required, thus no external dependencies
> * Easy to distribute on Hadoop/Spark clusters
> h4. Hurdles:
> * This is a work-in-progress feature in deeplearning4j, so expect plenty of 
> trouble along the way! 





[jira] [Commented] (TIKA-1988) Age Detection Tika Recogniser

2017-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078876#comment-16078876
 ] 

Chris A. Mattmann commented on TIKA-1988:
-

For now yes [~talli...@mitre.org] until we fix 
https://github.com/USCDataScience/AgePredictor/issues/11 in a 1.1 release later.

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.17
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below -
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-1988) Age Detection Tika Recogniser

2017-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078283#comment-16078283
 ] 

Chris A. Mattmann commented on TIKA-1988:
-

Sounds good to me...almost done with tika-nlp, will commit shortly.

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below -
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-1988) Age Detection Tika Recogniser

2017-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078237#comment-16078237
 ] 

Chris A. Mattmann commented on TIKA-1988:
-

Agree on #3. I'm going to take a first cut at tika-nlp. In the future, when we 
unify our recognisers for Object/Text, we should think about moving the NER 
stuff from tika-parsers into tika-nlp. I'm not going to bother now, b/c it 
would create a situation where people who previously had NER support in 
tika-app would have to include tika-nlp in the future.

The other thing I think we should seriously consider: tika-app's size 
ballooned, as you put it, but who cares? I'll gladly take a 181MB jar file 
if it gives me capabilities A, B, C and D all in a box. Two thoughts there. 
First, we could stop worrying about keeping tika-app so small. Pros: easy, 
doesn't require anything special; Cons: size aficionados will be disappointed ;) 
Second, we could make a tika-app-full module and tika-server-full that is 
tika-app, plus tika-dl and tika-nlp. Thoughts there?

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-1988) Age Detection Tika Recogniser

2017-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078155#comment-16078155
 ] 

Chris A. Mattmann commented on TIKA-1988:
-

#1 - absolutely - I thought putting the model download in Thamme's 
ModelGetter.groovy script would ensure that the models were available even in 
proxy environments. Tim, why weren't the models available for you?

#2 - sure. Jiminey Christmas - wow, that's a lot of dependencies. What do you 
think about tika-nlp, with this as the first entry?

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077692#comment-16077692
 ] 

Chris A. Mattmann commented on TIKA-2298:
-

docs added here: https://wiki.apache.org/tika/AgeDetectionParser and linked 
from front page

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser, gsoc, memex
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Resolved] (TIKA-1988) Age Detection Tika Recogniser

2017-07-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1988.
-
Resolution: Fixed

- merged into master thanks [~msha...@usc.edu], [~tgow...@gmail.com] and 
[~talli...@apache.org]

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2017-07-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1988:

Labels: age machine_learning memex nlp opennlp  (was: age memex nlp opennlp)

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2017-07-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1988:

Fix Version/s: 1.16

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Assigned] (TIKA-1988) Age Detection Tika Recogniser

2017-07-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1988:
---

Assignee: Chris A. Mattmann

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2017-07-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1988:

Labels: age memex nlp opennlp  (was: )

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.16
>
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075859#comment-16075859
 ] 

Chris A. Mattmann commented on TIKA-2298:
-

Fixed. It was a simple typo: the config object was never set to the new 
TikaConfig.
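
For anyone reading along, here is a minimal sketch of this kind of slip. It is not the actual DL4JVGG16NetTest code. `Config` and `Facade` are hypothetical stand-ins for TikaConfig and Tika, assumed here only so the example runs without Tika on the classpath. The point is that a config variable that is never assigned stays null, and a constructor that dereferences it throws a NullPointerException.

```java
public class ConfigAssignmentSketch {

    /** Hypothetical stand-in for a configuration object such as TikaConfig. */
    static final class Config {
        final String name;

        Config(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for a facade whose constructor reads the config. */
    static final class Facade {
        final String configName;

        Facade(Config config) {
            // Dereferences the argument, so a null config fails right here.
            this.configName = config.name;
        }
    }

    /** Buggy variant: the new Config is created but never assigned. */
    static boolean buggySetupThrows() {
        Config config = null;
        new Config("default"); // result discarded, config is still null
        try {
            new Facade(config);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Fixed variant: the config variable is set to the new object. */
    static String fixedSetup() {
        Config config = new Config("default");
        return new Facade(config).configName;
    }

    public static void main(String[] args) {
        System.out.println("buggy setup throws NPE: " + buggySetupThrows());
        System.out.println("fixed setup config name: " + fixedSetup());
    }
}
```

The only difference between the two variants is the assignment, which is why this sort of failure tends to surface as a bare NullPointerException inside a constructor rather than anywhere near the missing line.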

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser, gsoc, memex
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075855#comment-16075855
 ] 

Chris A. Mattmann commented on TIKA-2298:
-

docs added in: https://wiki.apache.org/tika/TikaAndVisionDL4J

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser, gsoc, memex
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075849#comment-16075849
 ] 

Chris A. Mattmann commented on TIKA-2298:
-

[~talli...@apache.org] your latest update causes Jenkins and my local build to 
fail:

{noformat}
---
 T E S T S
---
Running org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.268 sec - in 
org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest
Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.353 sec <<< 
FAILURE! - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest
recognise(org.apache.tika.dl.imagerec.DL4JVGG16NetTest)  Time elapsed: 6.353 
sec  <<< ERROR!
java.lang.NullPointerException: null
at org.apache.tika.Tika.<init>(Tika.java:109)
at org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:40)


Results :

Tests in error: 
  DL4JVGG16NetTest.recognise:40 » NullPointer

Tests run: 2, Failures: 0, Errors: 1, Skipped: 0

[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent . SUCCESS [  1.169 s]
[INFO] Apache Tika core ... SUCCESS [ 23.745 s]
[INFO] Apache Tika parsers  SUCCESS [03:20 min]
[INFO] Apache Tika XMP  SUCCESS [  1.323 s]
[INFO] Apache Tika serialization .. SUCCESS [  1.114 s]
[INFO] Apache Tika batch .. SUCCESS [01:47 min]
[INFO] Apache Tika language detection . SUCCESS [  2.683 s]
[INFO] Apache Tika application  SUCCESS [ 43.016 s]
[INFO] Apache Tika OSGi bundle  SUCCESS [ 18.439 s]
[INFO] Apache Tika translate .. SUCCESS [  1.794 s]
[INFO] Apache Tika server . SUCCESS [ 36.437 s]
[INFO] Apache Tika examples ... SUCCESS [  5.494 s]
[INFO] Apache Tika Java-7 Components .. SUCCESS [  1.815 s]
[INFO] Apache Tika eval ... SUCCESS [ 22.354 s]
[INFO] Apache Tika Deep Learning (powered by DL4J)  FAILURE [ 14.242 s]
[INFO] Apache Tika  SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 08:01 min
[INFO] Finished at: 2017-07-05T18:33:59-07:00
[INFO] Final Memory: 126M/1659M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on 
project tika-dl: There are test failures.
[ERROR] 
[ERROR] Please refer to 
/Users/mattmann/tmp/tika1.15/tika-dl/target/surefire-reports for the individual 
test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :tika-dl
LMC-053601:tika1.15 mattmann$ 

{noformat}

I'm going to try and fix real quick.


> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser, gsoc, memex
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or python.  Since there was nothing that runs within JVM, we tried
> several ways to glue them to

[jira] [Updated] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2298:

Labels: ObjectRecognitionParser gsoc memex  (was: ObjectRecognitionParser 
memex)

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser, gsoc, memex
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Updated] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2298:

Labels: ObjectRecognitionParser memex  (was: ObjectRecognitionParser)

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser, gsoc, memex
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075755#comment-16075755
 ] 

Chris A. Mattmann commented on TIKA-2298:
-

YES sounds perfect thanks [~talli...@apache.org]

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Resolved] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-07-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2298.
-
Resolution: Fixed
  Assignee: Chris A. Mattmann

Thanks to [~asmehra95], [~thammegowda] and [~talli...@mitre.org] for their 
help. This is now merged into 1.16!

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>Assignee: Chris A. Mattmann
>  Labels: ObjectRecognitionParser
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the best-known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: easier to distribute, and it cuts the I/O
> time.





[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2017-06-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1988:

Summary: Age Detection Tika Recogniser  (was: Tika parser for extracting 
text based features)

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-1804) Tika use no free json.org

2017-06-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059429#comment-16059429
 ] 

Chris A. Mattmann commented on TIKA-1804:
-

great work Tim & Adam & team!

> Tika use no free json.org
> -
>
> Key: TIKA-1804
> URL: https://issues.apache.org/jira/browse/TIKA-1804
> Project: Tika
>  Issue Type: Bug
>Reporter: gil cattaneo
>Priority: Blocker
> Attachments: deps_new.txt
>
>
> Hi
> Your project is licensed under Apache License Version 2,
> but your code pulls in code from json.org under Douglas Crockford’s bad 
> licence [1], which is non-free [2].
> Such usage restriction makes the license incompatible with The Open Source 
> Definition and
> The Free Software Definition. Because Tika binary distribution includes this 
> software,
> it effectively becomes proprietary software itself.
> You may also comment that the json.org license is valid for You but for many 
> Linux distributions it is not acceptable.
> I hope to continue to maintain Tika for Fedora, without having to run into 
> these problems.
> Please try to replace it with one of the many free alternatives.
> Regards
> [1]
> ./tika-1.11/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
> ./tika-1.11/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
> ./tika-1.11/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java
> [2]
> https://wiki.debian.org/qa.debian.org/jsonevil
> http://www.sonatype.com/people/2012/03/use-json-well-youd-better-not-be-evil/
> http://tanguy.ortolo.eu/blog/article46/json-license





[jira] [Commented] (TIKA-1988) Tika parser for extracting text based features

2017-06-21 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058744#comment-16058744
 ] 

Chris A. Mattmann commented on TIKA-1988:
-

sorry I missed it! will look now

> Tika parser for extracting text based features
> --
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-1988) Tika parser for extracting text based features

2017-06-21 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058731#comment-16058731
 ] 

Chris A. Mattmann commented on TIKA-1988:
-

sounds great [~msha...@usc.edu] any progress?

> Tika parser for extracting text based features
> --
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>
> Author age can be the first feature, and more can be added later
> --
> Integrating work done on age classification. More details about the classifier 
> are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika





[jira] [Commented] (TIKA-2368) Clean up SentimentParser dependencies

2017-06-08 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043056#comment-16043056
 ] 

Chris A. Mattmann commented on TIKA-2368:
-

hey [~talli...@apache.org] we're working on this right now, and hope to have it 
fixed in time for 1.16. You can see the work going on here: 
http://github.com/USCDataScience/SentimentAnalysisParser/pulls

> Clean up SentimentParser dependencies
> -
>
> Key: TIKA-2368
> URL: https://issues.apache.org/jira/browse/TIKA-2368
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>
> Is there any way to avoid reliance on edu.usc.ir's sentiment-analysis-parser? 
>  I ask because:
> {noformat}
> [WARNING] sentiment-analysis-parser-0.1.jar, tika-parsers-1.15-SNAPSHOT.jar 
> define 1 overlapping classes: 
> [WARNING]   - org.apache.tika.parser.sentiment.analysis.SentimentParser
> [WARNING] tika-core-1.15-SNAPSHOT.jar, tika-translate-1.15-SNAPSHOT.jar 
> define 4 overlapping classes: 
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator$1
> [WARNING]   - org.apache.tika.language.translate.EmptyTranslator
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator
> [WARNING]   - org.apache.tika.language.translate.Translator
> {noformat}
> We should be ok keeping things as they are and excluding SentimentParser and 
> tika-translate, but can we easily move the code that's still in edu.usc.ir's 
> package into Tika?
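
The overlapping-classes warnings quoted above come from the build. As a rough way to double-check a pair of jars by hand, something like the following sketch could be used. It relies only on the standard `java.util.jar` API; the class name `OverlapSketch` and its method names are made up for illustration and are not part of Tika or Maven. Note that jar entries use slash-separated paths ending in `.class`, whereas the warning prints dotted class names.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

public class OverlapSketch {

    /** Collects the names of all .class entries in a jar, sorted. */
    static SortedSet<String> classEntries(Path jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            return jar.stream()
                    .map(JarEntry::getName)
                    .filter(name -> name.endsWith(".class"))
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /** Returns the entry names present in both collections, sorted. */
    static SortedSet<String> overlap(Collection<String> first, Collection<String> second) {
        SortedSet<String> common = new TreeSet<>(first);
        common.retainAll(second);
        return common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java OverlapSketch <first.jar> <second.jar>");
            return;
        }
        SortedSet<String> common = overlap(
                classEntries(Paths.get(args[0])),
                classEntries(Paths.get(args[1])));
        System.out.println(common.size() + " overlapping classes:");
        for (String name : common) {
            System.out.println("  - " + name);
        }
    }
}
```

Running it against the two jars named in a warning should list the same classes the build reports, assuming both jars are available locally.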



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2373) Fix licenses via rat before 1.15 release

2017-05-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019789#comment-16019789
 ] 

Chris A. Mattmann commented on TIKA-2373:
-

exclude away!

> Fix licenses via rat before 1.15 release
> 
>
> Key: TIKA-2373
> URL: https://issues.apache.org/jira/browse/TIKA-2373
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Blocker
>






[jira] [Commented] (TIKA-1334) Add presentation layer for results of each run

2017-05-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019785#comment-16019785
 ] 

Chris A. Mattmann commented on TIKA-1334:
-

awesome! they look great!

> Add presentation layer for results of each run
> --
>
> Key: TIKA-1334
> URL: https://issues.apache.org/jira/browse/TIKA-1334
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: 1-Landing page.png, 2-File Types.png, 3-Mime Types.png, 
> 4-Detected Extensions.png, 5-Conflicts between actual and detected 
> extension.png, static_stats.zip
>
>
> If I'm doing this, it'll probably be vintage mid-90s html.  If someone with 
> some .js kung-fu wants to take this, please do.





[jira] [Resolved] (TIKA-1106) CLAVIN Integration

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1106.
-
Resolution: Won't Fix

we already have the GeoTopicParser so going to close this one out.

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: entity, geospatial, new-parser
> Fix For: 1.16
>
>
> I've been evaluating CLAVIN as a way to extract location information from 
> unstructured text. It seems like meshing it with Tika in some way would make 
> a lot of sense. From CLAVIN website...
> {quote}
> CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
> software package for document geotagging and geoparsing that employs 
> context-based geographic entity resolution. It combines a variety of open 
> source tools with natural language processing techniques to extract location 
> names from unstructured text documents and resolve them against gazetteer 
> records. Importantly, CLAVIN does not simply "look up" location names; 
> rather, it uses intelligent heuristics in an attempt to identify precisely 
> which "Springfield" (for example) was intended by the author, based on the 
> context of the document. CLAVIN also employs fuzzy search to handle 
> incorrectly-spelled location names, and it recognizes alternative names 
> (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic 
> entity. By enriching text documents with structured geo data, CLAVIN enables 
> hierarchical geospatial search and advanced geospatial analytics on 
> unstructured data.
> {quote}
> There was only one other instance of the word "clavin" mentioned in the ASF 
> jira site so I thought it was definitely worth posting here.
> https://github.com/Berico-Technologies/CLAVIN





[jira] [Resolved] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1815.
-
   Resolution: Fixed
Fix Version/s: (was: 1.16)
   1.15

> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.15
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, the Tika#parseToString() and other 
> parse() methods produce an empty string.





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1106:

Fix Version/s: (was: 1.15)
   1.16

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: entity, geospatial, new-parser
> Fix For: 1.16
>
>
> I've been evaluating CLAVIN as a way to extract location information from 
> unstructured text. It seems like meshing it with Tika in some way would make 
> a lot of sense. From CLAVIN website...
> {quote}
> CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
> software package for document geotagging and geoparsing that employs 
> context-based geographic entity resolution. It combines a variety of open 
> source tools with natural language processing techniques to extract location 
> names from unstructured text documents and resolve them against gazetteer 
> records. Importantly, CLAVIN does not simply "look up" location names; 
> rather, it uses intelligent heuristics in an attempt to identify precisely 
> which "Springfield" (for example) was intended by the author, based on the 
> context of the document. CLAVIN also employs fuzzy search to handle 
> incorrectly-spelled location names, and it recognizes alternative names 
> (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic 
> entity. By enriching text documents with structured geo data, CLAVIN enables 
> hierarchical geospatial search and advanced geospatial analytics on 
> unstructured data.
> {quote}
> There was only one other instance of the word "clavin" mentioned in the ASF 
> jira site so I thought it was definitely worth posting here.
> https://github.com/Berico-Technologies/CLAVIN





[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1505:

Fix Version/s: (was: 1.15)
   1.16

> chmparser breaks down when extracting from file of CHM format v3
> 
>
> Key: TIKA-1505
> URL: https://issues.apache.org/jira/browse/TIKA-1505
> Project: Tika
>  Issue Type: Bug
>Reporter: Bin Hawking
> Fix For: 1.16
>
>
> chmparser throws exception or returns faulty text when:
> 1. extracting from file of CHM format version 3
> 2. chm file with lzx reset interval > 2
> 3. chm file with >5000 objects
> I am making the fix now.





[jira] [Updated] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1329:

Fix Version/s: (was: 1.15)
   1.16

> Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
> ---
>
> Key: TIKA-1329
> URL: https://issues.apache.org/jira/browse/TIKA-1329
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.16
>
> Attachments: test_recursive_embedded.docx, TIKA-1329-site.patch, 
> TIKA-1329v2.patch
>
>
> Jukka and Nick have a great demo of parsing metadata recursively on the 
> [wiki|http://wiki.apache.org/tika/RecursiveMetadata].  For TIKA-1302, I'd 
> like to use something similar, and I think that others may find it useful for 
> tika-app and tika-server.
> I took the code from the wiki and made some modifications.  I'm not sure if 
> we should put this in parsers or in a new module for "examples."  Given that 
> I think this would be useful for tika-app and tika-server, I'd prefer 
> parsers, but I'm open to any input...including "let's not."





[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1577:

Fix Version/s: (was: 1.15)
   1.16

> NetCDF Data Extraction
> --
>
> Key: TIKA-1577
> URL: https://issues.apache.org/jira/browse/TIKA-1577
> Project: Tika
>  Issue Type: Improvement
>  Components: handler, parser
>Affects Versions: 1.7
>Reporter: Ann Burgess
>Assignee: Ann Burgess
>  Labels: features, handler
> Fix For: 1.16
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file 
> comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and 
> variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables 
> that don't have an unlimited dimension; and variable-size data, containing 
> the data for variables that have an unlimited dimension.
> The NetCDFparser currently extracts the "header part".  
>  -- text extracts file Dimensions and Variables
>  -- metadata extracts Global Attributes
> We want the option to extract the "data part" of NetCDF files.  
> Let's use the NetCDF test file for our dev testing:  
> tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
>  





[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1379:

Fix Version/s: (was: 1.15)
   1.16

> error in Tika().detect for xml files with xades signature
> -
>
> Key: TIKA-1379
> URL: https://issues.apache.org/jira/browse/TIKA-1379
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.4
>Reporter: Alessandro De Angelis
>  Labels: new-parser
> Fix For: 1.16
>
>
> We tried to get the MIME type of an XML file with an embedded XAdES 
> signature. The result is "text/html" and not the expected "text/xml" or 
> "application/xml".
> Here is an example of the XML file:
> {code}
> 
> 
>   00094853 0003 2
>   2013-09-23
>   2013-09-23
>   D69017
>   FILOSOFIA DELLA SCIENZA
>   D69
>   TEATRO E ARTI VISIVE
>   
>   1233456
>   PAOLINO
>   PAPERINO
>   23.0
>   23
>   
>   
>   
>   2012
>   6.0
>   
>   9
>   جامعة البندقية - TEST
>   Verbale_3
>   QUI QUO QUA
>   D69017
>   FILOSOFIA DELLA SCIENZA
>   D69
>   TEATRO E ARTI VISIVE
>   QUI QUO QUA
> 26-09-2013 09:55:53 CEST(+0200)
> 
>   3
>   11.09.03
> 
> http://www.w3.org/2000/09/xmldsig#"; 
> Id="sig08744308748201048377">
> 
>  Algorithm="http://www.w3.org/2006/12/xml-c14n11";>
>  Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256";>
> 
> 
> http://www.w3.org/2002/06/xmldsig-filter2";>
>  xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2"; 
> Filter="subtract">/descendant::ds:Signature
> 
> http://www.w3.org/TR/1999/REC-xslt-19991116";>
> http://www.kion.it/webesse3/multilingua"; 
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"; 
> exclude-result-prefixes="kion" version="1.0">
>   
>   
>   
>select="/VERBALI/VERBALE">
>select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO">
>select="/VERBALI/VERBALE/RAGGRUPPAMENTO">
>select="/VERBALI/VERBALE/COMMISSIONE">
>   
>   
>   
>   
>http-equiv="Content-Type">
>
>test="$sostituzione_root">
>   Dichiarazione 
> conformità Verbale Esame
>   
>   
>   Verbalizzazione 
> esame
>   
>   
>   
>td  {font-family: Arial; font-size:10pt;} 
>div {font-family: Arial; font-size:10pt;}
>pre {font-family: Arial; font-size:10pt;} 
>   
>   
>   
>   
>
>test="$sostituzione_root">
>colspan="2"> select="$verbale_root/ATENEO_DES">
>colspan="2">DICHIARAZIONE DI 
> CONFORMITÀ
>colspan="2">Il sottoscritto  select="$verbale_root/TITOLARE_PROCEDIMENTO">, docente di 
> 
>  
>   
>   
>     
>   
>test="$sostituzione_root/MOTIVAZIONE">
>   
> PREMESSO CHE
>   
>  
>   
>  select="$sostituzione_root/MOTIVAZIONE">
>   
>  
>   
> 
>   
>   
>   
>   
> DICHIARA
>    
> 
>  

[jira] [Updated] (TIKA-1800) MediaType#parse does not decode escaped special characters

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1800:

Fix Version/s: (was: 1.15)
   1.16

> MediaType#parse does not decode escaped special characters
> --
>
> Key: TIKA-1800
> URL: https://issues.apache.org/jira/browse/TIKA-1800
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.11
>Reporter: Roberto Benedetti
> Fix For: 1.16
>
>
> Special characters in parameter value are escaped in canonical string 
> representation but they are not unescaped when the canonical string 
> representation is parsed.
> {code:java}
> MediaType mType = new MediaType(MediaType.APPLICATION_XML, "x-report", 
> "#report@");
> String cType = mType.toString(); // application/xml; x-report="#report\@"
> assertEquals("application/xml; x-report=\"#report\\@\"", cType); // success
> mType = MediaType.parse(cType);
> String report = mType.getParameters().get("x-report"); // #report\@
> assertEquals("#report@", report); // failure
> {code}
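
The round trip the report expects can be sketched in plain Java. This is only an illustration under stated assumptions, not Tika's MediaType implementation: the class ParamEscaping and its escape/unescape helpers are hypothetical, and the set of characters treated as special is assumed, chosen so that "@" is escaped as in the report.

```java
final class ParamEscaping {
    // Assumed set of characters that receive a backslash prefix (illustrative only).
    private static final String SPECIAL = "()<>@,;:\\\"/[]?=";

    // Prefix every special character with a backslash.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Drop each escaping backslash and keep the character that follows it.
    static String unescape(String value) {
        StringBuilder sb = new StringBuilder();
        boolean escaped = false;
        for (char c : value.toCharArray()) {
            if (!escaped && c == '\\') {
                escaped = true;
                continue;
            }
            sb.append(c);
            escaped = false;
        }
        if (escaped) {
            sb.append('\\'); // a trailing lone backslash is kept as-is
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("#report@");
        System.out.println(escaped);           // #report\@
        System.out.println(unescape(escaped)); // #report@
    }
}
```

A parser that applies something like unescape() to each parameter value after splitting the canonical string would make the last assertion in the report pass.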





[jira] [Updated] (TIKA-1808) Head section closed too eager

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1808:

Fix Version/s: (was: 1.15)
   1.16

> Head section closed too eager
> -
>
> Key: TIKA-1808
> URL: https://issues.apache.org/jira/browse/TIKA-1808
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
> Fix For: 1.16
>
>
> XHTMLContentHandler has some logic that closes the head section too early, or 
> this is a problem in TagSoup. In this [1] case a  element appears in the 
> head, causing the head to be closed. Subsequent  elements do not appear 
> in custom ContentHandlers so I cannot read the document's title, or any other 
> meta tags.
> It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. 
> schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't 
> really an elegant solution.
> [1] http://www.aljazeera.com/news/2015/05/150516182251747.html





[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1609:

Fix Version/s: (was: 1.15)
   1.16

> Leverage Google's LibPhonenumber for enhanced phone number extraction and 
> metadata modeling
> ---
>
> Key: TIKA-1609
> URL: https://issues.apache.org/jira/browse/TIKA-1609
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> Google's Libphonenumber can provide us with comprehensive support for 
> modeling Phone number metadata properly in Tika.
> During the development of this patch I realized three things, namely:
>  * This is not a parser as such, because phone numbers are not mapped to any 
> particular MIME type.
>  * In addition, there can be many phone numbers per document, so this is most 
> likely a Content Handler of sorts.
>  * Tika's Metadata support is currently too restrictive to allow us to 
> persist complex objects, e.g. String, Object. We need to expand Metadata 
> support over and above String, String[].
> https://github.com/googlei18n/libphonenumber/





[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1640:

Fix Version/s: (was: 1.15)
   1.16

> Make ExternalParser support aliases for key names in extracted metadata
> ---
>
> Key: TIKA-1640
> URL: https://issues.apache.org/jira/browse/TIKA-1640
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.16
>
>
> Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] 
> did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 
> for this, but one thing Ray's code-based work did that my config oriented 
> work didn't is allow for renaming extracted metadata key names to better 
> support having consistent metadata across parsers.
> Here's one way to do it:
> ExternalParser could have a config section like so:
> {code:xml}
> 
>   
>   
> 
> {code}
> Then this could be used to rename metadata keys.
> I'll implement that in this issue.
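
The renaming step itself can be sketched independently of the config format. This is a hypothetical illustration, not ExternalParser code: the class MetadataKeyAliases, the applyAliases helper, and the example key names are all assumptions made up for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataKeyAliases {
    // Return a copy of the extracted metadata in which every key that has an
    // alias is replaced by that alias; keys without an alias are kept unchanged.
    static Map<String, String> applyAliases(Map<String, String> extracted,
                                            Map<String, String> aliases) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String key = aliases.getOrDefault(entry.getKey(), entry.getKey());
            renamed.put(key, entry.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        // Hypothetical alias table: key as emitted by the external tool -> desired key.
        Map<String, String> aliases = new LinkedHashMap<>();
        aliases.put("Image Width", "tiff:ImageWidth");

        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Image Width", "640");
        extracted.put("File Name", "photo.jpg");

        System.out.println(applyAliases(extracted, aliases));
    }
}
```

If two extracted keys map to the same alias, the later value overwrites the earlier one in this sketch; a real implementation would need to decide how to handle that collision.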





[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1738:

Fix Version/s: (was: 1.15)
   1.16

> ForkClient does not always delete temporary bootstrap jar
> -
>
> Key: TIKA-1738
> URL: https://issues.apache.org/jira/browse/TIKA-1738
> Project: Tika
>  Issue Type: Bug
>  Components: core
> Environment: Windows 10
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.16
>
> Attachments: TIKA-1738.patch
>
>
> ForkClient creates a new temporary bootstrap jar each time it's instantiated, 
> and tries to delete it in the {{close()}} method, after destroying the 
> process.
> Possibly a Windows-specific behavior: the OS seems to still hold a handle to 
> the file for a bit after the process is destroyed, causing the delete() 
> method to do nothing.
> This can be reproduced by simply running ForkParserTest on my machine.
> In a long-running process, this could fill the temp folder with many bootstrap 
> jars that will never be deleted.
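
One possible mitigation, sketched with the standard library only: retry the delete a few times and fall back to deleteOnExit(). This is not the attached patch and not ForkClient code; the class TempJarCleanup, the deleteWithRetry helper, and the retry count and pause are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempJarCleanup {
    // Try to delete the file several times, pausing between attempts so the OS
    // has time to release its handle. Returns true once the file is gone.
    static boolean deleteWithRetry(Path file, int attempts, long pauseMillis) {
        for (int i = 0; i < attempts; i++) {
            try {
                Files.deleteIfExists(file);
                return true; // deleted, or it did not exist in the first place
            } catch (IOException e) {
                if (i == attempts - 1) {
                    break; // no point sleeping after the last attempt
                }
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Last resort: ask the JVM to remove the file when it exits.
        file.toFile().deleteOnExit();
        return false;
    }

    // Small self-check: create a temp file and confirm it is removed.
    static boolean demo() {
        try {
            Path jar = Files.createTempFile("bootstrap-", ".jar");
            return deleteWithRetry(jar, 5, 100) && !Files.exists(jar);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Unlike File.delete(), which only returns false, Files.deleteIfExists() throws when the file cannot be removed, so the failure is visible and can be retried. The deleteOnExit() fallback does not help a long-running process until it terminates.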





[jira] [Updated] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1815:

Fix Version/s: (was: 1.15)
   1.16

> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.16
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, the Tika#parseToString() and other 
> parse() methods produce an empty string.





[jira] [Updated] (TIKA-985) Support for HTML5 elements

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-985:
---
Fix Version/s: (was: 1.15)
   1.16

> Support for HTML5 elements
> --
>
> Key: TIKA-985
> URL: https://issues.apache.org/jira/browse/TIKA-985
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.16
>
> Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
> section). This prevents some custom ContentHandlers from reading expected 
> elements and/or attributes.





[jira] [Resolved] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2016.
-
Resolution: Fixed

this is fixed - thanks to [~thammegowda]!

> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>Assignee: Chris A. Mattmann
>  Labels: analysis, gsoc2016, memex, parser, sentiment
> Fix For: 1.16
>
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---
Fix Version/s: (was: 1.15)
   1.16

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
>Priority: Minor
> Fix For: 1.16
>
> Attachments: TIKA-539_2.patch, TIKA-539.patch
>
>
> If the encoding declared in the meta tag is wrong, that encoding is still 
> detected, even if the right encoding was set in the metadata beforehand 
> (which can come from the HTTP response header).
> Test code to reproduce:
> static String content = "\n"
>   + " content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>   + "Über den Wolken\n";
>   /**
>* @param args
>* @throws IOException
>* @throws TikaException
>* @throws SAXException
>*/
>   public static void main(String[] args) throws IOException, SAXException,
>   TikaException {
>   Metadata metadata = new Metadata();
>   metadata.set(Metadata.CONTENT_TYPE, "text/html");
>   metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   InputStream in = new 
> ByteArrayInputStream(content.getBytes("UTF-8"));
>   AutoDetectParser parser = new AutoDetectParser();
>   BodyContentHandler h = new BodyContentHandler(1);
>   parser.parse(in, h, metadata, new ParseContext());
>   System.out.print(h.toString());
>   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   }
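
The effect of trusting the wrong declaration can be shown without Tika. The sketch below is an assumption-laden illustration, not Tika's detection logic: the class CharsetHintDemo and its choose/decode helpers are hypothetical, and "prefer the caller's hint over the meta tag" is simply the behavior the report asks for.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetHintDemo {
    // Prefer an externally supplied hint (for example from an HTTP header)
    // and fall back to the charset declared in the document's meta tag.
    static Charset choose(Charset hint, Charset declaredInMeta) {
        return hint != null ? hint : declaredInMeta;
    }

    static String decode(byte[] bytes, Charset hint, Charset declaredInMeta) {
        return new String(bytes, choose(hint, declaredInMeta));
    }

    public static void main(String[] args) {
        // "\u00DC" is the letter U with diaeresis, written as an escape so the
        // example does not depend on the source file encoding.
        String title = "\u00DCber den Wolken";
        byte[] bytes = title.getBytes(StandardCharsets.UTF_8);

        // The meta tag wrongly claims ISO-8859-1; the hint says UTF-8.
        String withHint = decode(bytes, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        String metaOnly = decode(bytes, null, StandardCharsets.ISO_8859_1);

        System.out.println(title.equals(withHint)); // true
        System.out.println(title.equals(metaOnly)); // false
    }
}
```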





[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1465:

Fix Version/s: (was: 1.15)
   1.16

> Implement extraction of non-global variables from netCDF3 and netCDF4
> -
>
> Key: TIKA-1465
> URL: https://issues.apache.org/jira/browse/TIKA-1465
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> Speaking to Eric Nienhouse at the ongoing NSF-funded Polar 
> Cyberinfrastructure hackathon in NYC, we became aware that variable 
> parameters contained within netCDF3 and netCDF4 are just as valuable (if not 
> more valuable) as global attribute values. 
> AFAIK, right now we only extract global attributes; however, we could extend 
> the support to cater for the above observations.  





[jira] [Updated] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2016:

Fix Version/s: (was: 1.16)
   1.15

> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>Assignee: Chris A. Mattmann
>  Labels: analysis, gsoc2016, memex, parser, sentiment
> Fix For: 1.15
>
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.





[jira] [Updated] (TIKA-1390) Create tika-example module

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1390:

Fix Version/s: (was: 1.15)
   1.16

> Create tika-example module
> --
>
> Key: TIKA-1390
> URL: https://issues.apache.org/jira/browse/TIKA-1390
> Project: Tika
>  Issue Type: Bug
>  Components: example
>Reporter: Tyler Palsulich
> Fix For: 1.16
>
>
> This issue will track the initial creation of the tika-example module. 
> Subtasks will be used for the first few examples.





[jira] [Updated] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1454:

Fix Version/s: (was: 1.15)
   1.16

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> ---
>
> Key: TIKA-1454
> URL: https://issues.apache.org/jira/browse/TIKA-1454
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
> Environment: RedHat EL5, EL6, EL7
>Reporter: Chris Bryant
>Assignee: Tim Allison
> Fix For: 1.16
>
> Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, 
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for 
> anchor tags to find links to external URLs.  This works fine when looking at 
> some document types, including PDFs, Open Document formats, Microsoft Word 
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it 
> does not work for any Microsoft Powerpoint formats (.ppt or .pptx) and it 
> does not work for the newer Excel .xlsx format.  For the .ppt, .pptx, and 
> .xlsx formats, the text is extracted properly and formatted into HTML, but 
> the link is not converted to an anchor tag.
> I am running tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1616:

Fix Version/s: (was: 1.15)
   1.16

> Tika Parser for GIBS Metadata
> -
>
> Key: TIKA-1616
> URL: https://issues.apache.org/jira/browse/TIKA-1616
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
>  metadata currently consists of simple stuff in the WMTS GetCapabilities 
> request (e.g. 
> http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
> which includes available layers, extents, time ranges, map projections, color 
> maps, etc. We will eventually have more detailed visualization metadata 
> available in ECHO/CMR which will include linkages to data products, 
> provenance, etc. 
> Some investigation and a Tika parser would be excellent to extract and 
> assimilate GIBS Metadata.





[jira] [Updated] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2016:

Fix Version/s: (was: 1.15)
   1.16

> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>Assignee: Chris A. Mattmann
>  Labels: analysis, gsoc2016, memex, parser, sentiment
> Fix For: 1.16
>
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-980:
---
Fix Version/s: (was: 1.15)
   1.16

> MicrodataContentHandler for Apache Tika
> ---
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Ken Krugler
> Fix For: 1.16
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure 
> containing Microdata item scopes and item properties. The Item* classes are 
> borrowed from the Apache Any23 project and are slightly modified to 
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA 
> ApacheCon events and each has a nested property.





[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1295:

Fix Version/s: (was: 1.15)
   1.16

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.16
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.





[jira] [Updated] (TIKA-1952) Access Date is getting modified while capturing the MetaData information using AutoDetectParser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1952:

Fix Version/s: (was: 1.15)
   1.16

> Access Date is getting modified while capturing the MetaData information 
> using AutoDetectParser
> ---
>
> Key: TIKA-1952
> URL: https://issues.apache.org/jira/browse/TIKA-1952
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.12
> Environment: Windows
>Reporter: RameshKalidindi
>  Labels: features
> Fix For: 1.16
>
>
> I have been developing a project in which I capture the metadata 
> information (such as file name, author, file extension, last modified date, 
> and access date) of each file in a folder using Tika's AutoDetectParser. I 
> am able to get the metadata for all files in a given folder, but the value 
> of the access date (a metadata attribute) is changed to the current date and 
> time, because the program accesses every file while extracting the metadata.
> My issue: is there any way that I can get the last access date of the file, 
> or can we stop the access date from changing as a result of using Tika's 
> AutoDetectParser? Please help me in this regard. 
> Note: This access date information is very important for my project; we 
> need to build reports based on it.
> Thanks,
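
One workaround on the caller's side, sketched with java.nio only: record the last access time before the file is opened and write it back afterwards. This is an assumption about how a caller could handle it, not a Tika feature; the class PreserveAccessTime and its helper are hypothetical, and the plain read loop merely stands in for the parse call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.concurrent.TimeUnit;

final class PreserveAccessTime {
    // Read the whole file, then restore the access time it had before the read.
    static long readPreservingAccessTime(Path file) throws IOException {
        FileTime before = Files.readAttributes(file, BasicFileAttributes.class).lastAccessTime();
        long bytesRead = 0;
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) { // stands in for the parse call
                bytesRead += n;
            }
        } finally {
            // null means "leave unchanged" for last-modified and creation time.
            Files.getFileAttributeView(file, BasicFileAttributeView.class)
                    .setTimes(null, before, null);
        }
        return bytesRead;
    }

    // Self-check: give a temp file a known access time and confirm it survives a read.
    static boolean demo() {
        try {
            Path file = Files.createTempFile("atime-", ".txt");
            try {
                Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
                FileTime known = FileTime.from(1_000_000_000L, TimeUnit.SECONDS);
                Files.getFileAttributeView(file, BasicFileAttributeView.class)
                        .setTimes(null, known, null);
                readPreservingAccessTime(file);
                FileTime after = Files.readAttributes(file, BasicFileAttributes.class)
                        .lastAccessTime();
                return after.to(TimeUnit.SECONDS) == known.to(TimeUnit.SECONDS);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Whether the operating system updates the access time on read at all depends on the file system and its mount options, and restoring it requires permission to change the file's attributes.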





[jira] [Updated] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2298:

Fix Version/s: (was: 1.15)
   1.16

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>  Labels: ObjectRecognitionParser
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or python.  Since there was nothing that runs within JVM, we tried
> several ways to glue them to Tika (like CLI, JNI, gRPC, REST).
> However, this game is changing slowly now. Deeplearning4j, the most famous
> neural network library for JVM, now supports importing models that are
> pre-trained in python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (like installation of native libraries or
> starting REST services). Reasons: it is easier to distribute and it cuts the
> IO time.





[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1706:

Fix Version/s: (was: 1.15)
   1.16

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.16
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.





[jira] [Updated] (TIKA-1220) Parser implementation for IFC files

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1220:

Fix Version/s: (was: 1.15)
   1.16

> Parser implementation for IFC files
> 
>
> Key: TIKA-1220
> URL: https://issues.apache.org/jira/browse/TIKA-1220
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
> Fix For: 1.16
>
> Attachments: 2012-03-23-Duplex-Programming.ifc
>
>
> The Industry Foundation Classes (IFC) [0] data model is intended to describe 
> building and construction industry data. For the sake of argument, it can be 
> considered as a more intelligent successor to the .dwg data models used 
> within CAD models.
> I've tracked down a potential 3rd party library [1] which we may be able to 
> wrap and use within Tika; however, the provided software packages are licensed 
> under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently 
> over on legal-discuss@ in an attempt to see if it is possible to wrap some 
> code and contribute it to tika-parsers.
> When I get feedback from legal-discuss, and if this is a go-ahead, I'll need 
> to help the developers package the code as a Maven artifact(s), then I will 
> progress with writing the implementation.  
> [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
> [1] http://www.ifctoolsproject.com/





[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:

Fix Version/s: (was: 1.15)
   1.16

> Represent individual slides in pptx
> ---
>
> Key: TIKA-1108
> URL: https://issues.apache.org/jira/browse/TIKA-1108
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.16
>
>
> When parsing ppt, tika produces for each slide:
> 
> However for pptx these seem to be missing, all the text is directly under 
> .





[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1308:

Fix Version/s: (was: 1.15)
   1.16

> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: jefferyyuan
>  Labels: gae
> Fix For: 1.16
>
>
> I am trying to use Tika in GAE and write a simple servlet to extract meta 
> data info from jpeg:
> {code}
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> {code}
> This fails with exception:
> {code}
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
> {code}
> I checked the code: 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext) creates a temp file from the input stream.
> I can understand why Tika creates a temp file from the stream: so that it can 
> parse it multiple times.
> But as GAE and other cloud servers are getting more popular, is it possible 
> to avoid creating the temp file? Instead, we could copy the original stream 
> to a byte-array stream, so Tika can still parse it multiple times.
> -- This would limit the file size, as Tika keeps the whole file in 
> memory, but it would make Tika work in GAE and maybe other cloud servers.
> We could add a parameter to parser.parse to indicate whether to parse in 
> memory only.
>  
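
The proposal above (copy the stream into memory so it can be read more than once, with a cap on size) can be illustrated independently of Tika. The following is a minimal Python sketch, not Tika code: the function name `buffer_in_memory` and the `max_bytes` parameter are hypothetical, and the 8192-byte chunk size is an arbitrary choice.

```python
import io


def buffer_in_memory(stream, max_bytes):
    """Copy a stream into a rewindable in-memory buffer, refusing oversized input."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(8192)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Fail early instead of holding an arbitrarily large file in memory.
            raise ValueError("input exceeds in-memory limit of %d bytes" % max_bytes)
        chunks.append(chunk)
    return io.BytesIO(b"".join(chunks))
```

The returned buffer supports `seek(0)`, so a consumer can make several passes over the same bytes without a temporary file; the explicit limit reflects the file-size trade-off the reporter mentions.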





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.15)
   1.16

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
>  Labels: newbie
> Fix For: 1.16
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parse
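
The matched-tag check described in this issue amounts to keeping a stack of open elements. The sketch below is a standalone Python illustration of that idea, not the SafeContentHandler code; the class name `TagBalanceChecker` and its methods are hypothetical, and the error messages simply imitate the wording seen in the failures quoted above.

```python
class TagBalanceChecker:
    """Verify that every start_element(name) is closed by a matching end_element(name)."""

    def __init__(self):
        self.open_tags = []  # stack of currently open element names

    def start_element(self, name):
        self.open_tags.append(name)

    def end_element(self, name):
        if not self.open_tags:
            raise AssertionError("end tag=%s with no startElement" % name)
        opened = self.open_tags.pop()
        if opened != name:
            raise AssertionError(
                "mismatched elements open=%s close=%s" % (opened, name)
            )

    def end_document(self):
        # Anything still on the stack was never closed.
        if self.open_tags:
            raise AssertionError("unclosed elements: %s" % ", ".join(self.open_tags))
```

Closing a table while a tr is still open, as in the Keynote failure, trips the second check; checking that tags only appear in valid parents would need extra rules on top of this stack.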

[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1328:

Fix Version/s: (was: 1.15)
   1.16

> Translate Metadata and Content
> --
>
> Key: TIKA-1328
> URL: https://issues.apache.org/jira/browse/TIKA-1328
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Tyler Palsulich
> Fix For: 1.16
>
>
> Right now, Translation is only done on Strings. Ideally, users would be able 
> to "turn on" translation while parsing. I can think of a couple options:
> - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
> it, then translate the content.
> - Make a Context switch. When true, translate the content regardless of the 
> parser used. I'm not sure the best way to go about this method, but I prefer 
> it over another Parser.
> Regardless, we need a black or white list for translation. I think black list 
> would be the way to go -- which fields should not be translated (dates, 
> versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
> other open source translation libraries? If we were really lucky, it wouldn't 
> depend on an online service.
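
The black-list idea above can be sketched as a filter applied before translation. This is a minimal Python illustration under stated assumptions, not Tika's Translator API: `translate_metadata`, the `translate` callable and the `skip_fields` set are all hypothetical names, and metadata is modelled as a plain dict of strings.

```python
def translate_metadata(metadata, translate, skip_fields):
    """Return a copy of metadata with values translated, except for skipped fields."""
    translated = {}
    for key, value in metadata.items():
        if key in skip_fields:
            translated[key] = value  # dates, versions, etc. stay untouched
        else:
            translated[key] = translate(value)
    return translated
```

A white list would invert the membership test; the black list keeps newly added text fields translatable by default, which matches the preference stated in the issue.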





[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---
Fix Version/s: (was: 1.15)
   1.16

> We don't extract a placeholder for a Word document embedded in an Excel 
> document
> 
>
> Key: TIKA-988
> URL: https://issues.apache.org/jira/browse/TIKA-988
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.16
>
> Attachments: bug31373.xls
>
>
> In TIKA-956 we fixed the Word parser so that at the point where an embedded 
> document appears, we output a  tag.
> It would be nice to do this for documents embedded in Excel too.





[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1697:

Fix Version/s: (was: 1.15)
   1.16

> Parser Implementation for AkomaNtoso Legal XML Documents
> 
>
> Key: TIKA-1697
> URL: https://issues.apache.org/jira/browse/TIKA-1697
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal 
> Document XML standard and used pervasively within parliaments and other 
> legislative arenas.
> This issue should utilize the 
> [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and 
> populate Metadata for AkomaNtoso .xml and .akn documents.
> I'll send a PR for this soon.





[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1598:

Fix Version/s: (was: 1.15)
   1.16

> Parser Implementation for Streaming Video
> -
>
> Key: TIKA-1598
> URL: https://issues.apache.org/jira/browse/TIKA-1598
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.16
>
>
> A number of us have been discussing a Tika implementation which could, for 
> example, bind to a live multimedia stream and parse content from the stream 
> until it finished.
> An excellent example would be watching Bonnie Scotland beating R. of Ireland 
> in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 
> 17:00 GMT :)
> I located a JMF Wrapper for ffmpeg which 'may' enable us to do this
> http://sourceforge.net/projects/jffmpeg/
> I am not sure... plus it is not licensed liberally enough for us to include 
> so if there are other implementations then please post them here.
> I 'may' be able to have a crack at implementing this next week.





[jira] [Updated] (TIKA-1518) Docker with Tika Server

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1518:

Fix Version/s: (was: 1.15)
   1.16

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
> Fix For: 1.16
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1953) tika-server NullPointerException while processing rtfs

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1953:

Fix Version/s: (was: 1.15)
   1.16

> tika-server NullPointerException while processing rtfs
> --
>
> Key: TIKA-1953
> URL: https://issues.apache.org/jira/browse/TIKA-1953
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
> Environment: Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
> Red Hat Enterprise Linux Server release 6.7 (Santiago)
> java version "1.7.0_95"
> OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>Reporter: Ravi
>Assignee: Tim Allison
>  Labels: newbie, rtf, tika-python, tika-server, xmlContent,
> Fix For: 1.16
>
> Attachments: officeinstallations3.rtf
>
>
> It looks like the xmlContent=True flag causes tika.py to report "Warn: Tika 
> server returned status: 422".
> I start the Tika server and then run the following code in a Python session 
> started from bash:
> import tika
> from tika import parser
> parsed = parser.from_file('/path/to/file.rtf', 'http://localhost:9003', xmlContent=True)
> I get: tika.py: Warn: Tika server returned status: 422
> Looking at the tika-server log, I get the dump below.
> Note: The parser seems to work fine without the xmlContent=True flag set. I 
> get the right output, but setting this flag triggers the NullPointerException 
> below.
> --
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource 
> logRequest
> INFO: rmeta/xml (autodetecting type)
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: rmeta/xml: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@21f0dbb9
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
> at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:281)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:138)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:119)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:370)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
> at 
> org.eclipse.j

[jira] [Updated] (TIKA-774) ExifTool Parser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---
Fix Version/s: (was: 1.15)
   1.16

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool to be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>Assignee: Chris A. Mattmann
>  Labels: features, new-parser, newbie, patch
> Fix For: 1.16
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.




