date:20150808

[RESULT][VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-08 Thread David Meikle

Hi Everyone,

Thanks everyone for your votes.  The VOTE to release Tika 1.10 RC #1 has
passed with the following tally:

+1:
Dave Meikle*
Sergey Beryozkin*
Tim Allison*
Konstantin Gribov*
Chris Mattmann*
Oleg Tikhonov*
Ken Krugler*
Tyler Palsulich*
Hong-Thai Nguyen*

±0:
None

-1:
None

* = PMC Member

I'll push out the release now.

Cheers,
Dave

[jira] [Updated] (TIKA-776) ExifTool Embedder

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-776:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 ExifTool Embedder
 -

 Key: TIKA-776
 URL: https://issues.apache.org/jira/browse/TIKA-776
 Project: Tika
  Issue Type: New Feature
  Components: metadata
Affects Versions: 1.0
 Environment: ExifTool is required 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: embed, exiftool, patch
 Fix For: 1.11

 Attachments: tika-parsers-exiftool-embed-patch.txt


 This patch adds an ExifTool ExternalEmbedder, which builds upon the work in 
 issues TIKA-774 and TIKA-775.
 In tika-parsers, an ExiftoolExternalEmbedder is added. It extends 
 ExternalEmbedder to programmatically create an Embedder that calls the 
 ExifTool command line to embed Tika metadata into a file stream. An 
 ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC 
 and XMP fields, then parses the resulting file stream to verify the operation.
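
 A rough sketch of how an embedder like this might assemble the ExifTool 
 command line. The class and method names are hypothetical and are not taken 
 from the attached patch; only the -TAG=VALUE write syntax is ExifTool's own. 
 The patch works on a file stream, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Tika or the attached patch. */
final class ExifToolCommand {
    /**
     * Builds an argument list of the form:
     *   exiftool -GROUP:TAG=VALUE ... FILE
     * Tags are emitted in the iteration order of the given map.
     */
    static List<String> buildEmbedCommand(Map<String, String> tags, String filePath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("exiftool"); // assumes the executable is on the PATH
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            cmd.add("-" + tag.getKey() + "=" + tag.getValue());
        }
        cmd.add(filePath);
        return cmd;
    }
}
```

 The resulting list could be handed to java.lang.ProcessBuilder; passing each 
 argument as a separate list element avoids shell quoting problems with 
 metadata values that contain spaces.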



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1435:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Update rome dependency to 1.5
 -

 Key: TIKA-1435
 URL: https://issues.apache.org/jira/browse/TIKA-1435
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Johannes Mockenhaupt
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.11

 Attachments: netcdf-deps-changes.diff


 Rome 1.5 has been released to Sonatype 
 (https://github.com/rometools/rome/issues/183). Though the website 
 (http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
 is mostly maintenance, adopting slf4j and generics as well as moving the 
 namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.




[jira] [Updated] (TIKA-1106) CLAVIN Integration

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1106:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 CLAVIN Integration
 --

 Key: TIKA-1106
 URL: https://issues.apache.org/jira/browse/TIKA-1106
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.3
 Environment: All
Reporter: Adam Estrada
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: entity, geospatial, new-parser
 Fix For: 1.11


 I've been evaluating CLAVIN as a way to extract location information from 
 unstructured text. It seems like meshing it with Tika in some way would make 
 a lot of sense. From CLAVIN website...
 {quote}
 CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
 software package for document geotagging and geoparsing that employs 
 context-based geographic entity resolution. It combines a variety of open 
 source tools with natural language processing techniques to extract location 
 names from unstructured text documents and resolve them against gazetteer 
 records. Importantly, CLAVIN does not simply look up location names; 
 rather, it uses intelligent heuristics in an attempt to identify precisely 
 which Springfield (for example) was intended by the author, based on the 
 context of the document. CLAVIN also employs fuzzy search to handle 
 incorrectly-spelled location names, and it recognizes alternative names 
 (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic 
 entity. By enriching text documents with structured geo data, CLAVIN enables 
 hierarchical geospatial search and advanced geospatial analytics on 
 unstructured data.
 {quote}
 There was only one other instance of the word clavin mentioned in the ASF 
 jira site so I thought it was definitely worth posting here.
 https://github.com/Berico-Technologies/CLAVIN




[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-987:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
 

 Key: TIKA-987
 URL: https://issues.apache.org/jira/browse/TIKA-987
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.11

 Attachments: picture.doc, picture_3.doc


 I have two Word docs, both containing the same drawing, but one has
 text added.
 In one case (picture.doc) the extraction is correct: it contains only
 an embedded image.wmf; when I view the image it's correct.
 In the second case (picture_3.doc) the picture is extracted as image
 (no extension), and is 0 bytes, and there is an invalid character
 (mapped to unicode replacement char) inserted before the image:
 {noformat}
 <title/>
 </head>
 <body><p>�<img src="embedded:image1" alt="image1"/></p>
 <p/>
 <p/>
 <p>vehicle
 </p>
 {noformat}
 (Though, the text vehicle is extracted correctly).
 I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
 MERGEFORMAT} field, which we invoke
 WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
 the 0-byte no-extension image as well as the invalid character.  With
 the first doc there is no field (at least not one that's handled with
 handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
 fix... it could be something is going wrong in how POI parses the
 Pictures from PictureSource.




[jira] [Updated] (TIKA-1672) Integrate tika-java7 component

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1672:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Integrate tika-java7 component
 --

 Key: TIKA-1672
 URL: https://issues.apache.org/jira/browse/TIKA-1672
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
 Fix For: 1.11


 Code requiring Java 7 doesn't need to be in a separate module now that 
 TIKA-1536 (upgrade to Java 7) is done.




[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1379:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 error in Tika().detect for xml files with xades signature
 -

 Key: TIKA-1379
 URL: https://issues.apache.org/jira/browse/TIKA-1379
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.4
Reporter: Alessandro De Angelis
  Labels: new-parser
 Fix For: 1.11


 We tried to get the MIME type of an XML file with an embedded XAdES signature. 
 The result is text/html and not the expected text/xml or 
 application/xml.
 Here is an example of the XML file:
 {code}
 <VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
 <VERBALE Id="1" tipologia="Verbale esame">
   <VERB_NUM>00094853 0003 2</VERB_NUM>
   <DATA_APP>2013-09-23</DATA_APP>
   <DATA_ESA>2013-09-23</DATA_ESA>
   <AD_COD>D69017</AD_COD>
   <AD>FILOSOFIA DELLA SCIENZA</AD>
   <CDS_COD>D69</CDS_COD>
   <CDS>TEATRO E ARTI VISIVE</CDS>
   <TIPO_ESA></TIPO_ESA>
   <MAT>1233456</MAT>
   <NOME>PAOLINO</NOME>
   <COGNOME>PAPERINO</COGNOME>
   <VOTO>23.0</VOTO>
   <VOTODECOD>23</VOTODECOD>
   <CAUSALE></CAUSALE>
   <TIPO_MODULO></TIPO_MODULO>
   <IMG_PATH></IMG_PATH>
   <AA_SES_ID>2012</AA_SES_ID>
   <AD_CFU>6.0</AD_CFU>
   <NOTA></NOTA>
   <ATENEO>9</ATENEO>
   <ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
   <TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
   <TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
   <AD_STU_COD>D69017</AD_STU_COD>
   <AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
   <CDS_STU_COD>D69</CDS_STU_COD>
   <CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
   <DOCENTE>QUI QUO QUA</DOCENTE>
 <DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
 <SOFTWARE_DI_CREAZIONE>
   <NOME>3</NOME>
   <VERSIONE>11.09.03</VERSIONE>
 </SOFTWARE_DI_CREAZIONE>
 </VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="sig08744308748201048377">
 <ds:SignedInfo>
 <ds:CanonicalizationMethod Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
 <ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
 <ds:Reference URI="">
 <ds:Transforms>
 <ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
 <dsig-xpath:XPath xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
 </ds:Transform>
 <ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
 <xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 exclude-result-prefixes="kion" version="1.0">
   <kion:ml module="FirmaDigitale" target="kion"></kion:ml>
   <xsl:output method="xml"></xsl:output>
   <xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
   <xsl:variable name="verbale_root" select="/VERBALI/VERBALE"></xsl:variable>
   <xsl:variable name="sostituzione_root" select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
   <xsl:variable name="RAGG_ROOT" select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
   <xsl:variable name="COMM_ROOT" select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>
   
   <xsl:template match="/">
   <html>
   <head>
   <meta content="text/html;charset=UTF-8" http-equiv="Content-Type"></meta>
   <xsl:choose>
   <xsl:when test="$sostituzione_root">
   <title>Dichiarazione conformità Verbale Esame</title>
   </xsl:when>
   <xsl:otherwise>
   <title>Verbalizzazione esame</title>
   </xsl:otherwise>
   </xsl:choose>
   <style type="text/css">
td  {font-family: Arial; font-size:10pt;} 
div {font-family: Arial; font-size:10pt;}
pre {font-family: Arial; font-size:10pt;} 
   </style>
   </head>
   <body>
   <table>
   <xsl:choose>
   <xsl:when test="$sostituzione_root">
   <tr><td align="center" colspan="2"><big><strong><xsl:value-of select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
   <tr><td align="center"

[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1308:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: jefferyyuan
  Labels: gae
 Fix For: 1.11


 I am trying to use Tika in GAE and write a simple servlet to extract meta 
 data info from jpeg:
 {code}
 String urlStr = req.getParameter("imageUrl");
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 {code}
 This fails with exception:
 {code}
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at 
 org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {code}
 I checked the code: 
 org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
 Metadata, ParseContext) creates a temp file from the input stream.
 I can understand why Tika creates a temp file from the stream: so that it can 
 parse it multiple times.
 But as GAE and other cloud platforms become more popular, would it be possible 
 to avoid creating the temp file? Instead, we could copy the original stream to 
 a byte-array stream, so Tika can still parse it multiple times.
 This would limit the file size, since Tika would keep the whole file in 
 memory, but it would let Tika work in GAE and possibly other cloud platforms.
 We could add a parameter to parser.parse to indicate whether to parse in 
 memory only.
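
 The byte-array idea can be sketched with the JDK alone. This is a minimal 
 illustration, not Tika code: the InMemoryBuffer class and its maxBytes limit 
 are hypothetical names chosen for the example, and wiring it into a parser 
 is left open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of Tika: buffers a stream in memory so it can be re-read. */
final class InMemoryBuffer {
    private final byte[] data;

    private InMemoryBuffer(byte[] data) {
        this.data = data;
    }

    /** Reads the stream fully, failing as soon as more than maxBytes would be held. */
    static InMemoryBuffer of(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return new InMemoryBuffer(out.toByteArray());
    }

    /** Returns a fresh stream positioned at the start; may be called any number of times. */
    InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }
}
```

 Each call to newStream() gives an independent pass over the same bytes, which 
 is the property the temp file provides today. The explicit limit makes the 
 memory trade-off visible instead of failing with an OutOfMemoryError.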
  




[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-894:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Add webapp mode for Tika Server, simplifies deployment
 --

 Key: TIKA-894
 URL: https://issues.apache.org/jira/browse/TIKA-894
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Chris Wilson
  Labels: maven, newbie, patch
 Fix For: 1.11

 Attachments: tika-server-webapp.patch


 For use in production services, Tika Server should really be deployed as a 
 WAR file, under a reliable servlet container that knows how to run as a 
 system service, for example Tomcat or JBoss.
 This is especially important on Windows, where I wasted an entire day trying 
 to make TikaServerCli run as some kind of a service. 
 Maven makes building a webapp pretty trivial. With the attached patch 
 applied, mvn war:war should work. It seems to run fine in Tomcat, which 
 makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
 file into tomcat's webapps directory and you're away.




[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1108:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Represent individual slides in pptx
 ---

 Key: TIKA-1108
 URL: https://issues.apache.org/jira/browse/TIKA-1108
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.11


 When parsing ppt, Tika produces for each slide:
 <div class="slide">
 However, for pptx these seem to be missing; all the text is directly under 
 <body>.




[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1688:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Tika Version in Metadata
 

 Key: TIKA-1688
 URL: https://issues.apache.org/jira/browse/TIKA-1688
 Project: Tika
  Issue Type: Improvement
Reporter: Paul Ramirez
Priority: Minor
 Fix For: 1.11


 Could this be added as X-Tika:version? That way, downstream consumers could 
 trace an extraction back to the Tika version that produced it.




[jira] [Updated] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1696:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Language Identification with Text Processing Toolkit from MITLL
 ---

 Key: TIKA-1696
 URL: https://issues.apache.org/jira/browse/TIKA-1696
 Project: Tika
  Issue Type: New Feature
  Components: languageidentifier
Reporter: Paul Ramirez
 Fix For: 1.11


 The aim here is to extend the methods for language identification within 
 text. MIT Lincoln Labs has an open source library [1] written in Julia. 
 Having spoken with the MITLL guys, there is a possibility that there is a 
 Scala version of this library, which would make it easier to package with 
 Tika. 
 At this point I'm not quite sure how many languages this library supports by 
 default, but it can be extended when provided with some training data.
 [1] https://github.com/mit-nlp/Text.jl




[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1616:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Tika Parser for GIBS Metadata
 -

 Key: TIKA-1616
 URL: https://issues.apache.org/jira/browse/TIKA-1616
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
  metadata currently consists of simple stuff in the WMTS GetCapabilities 
 request (e.g. 
 http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
 which includes available layers, extents, time ranges, map projections, color 
 maps, etc. We will eventually have more detailed visualization metadata 
 available in ECHO/CMR which will include linkages to data products, 
 provenance, etc. 
 Some investigation and a Tika parser would be excellent to extract and 
 assimilate GIBS Metadata.




[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1366:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse 
 

 Key: TIKA-1366
 URL: https://issues.apache.org/jira/browse/TIKA-1366
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Sergey Beryozkin
Priority: Minor
 Fix For: 1.11


 Some of Tika Server services will benefit from optionally supporting JAX-RS 
 2.0 AsyncResponse




[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-891:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.11


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 




[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1425:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Automatic batching of Microsoft service calls
 -

 Key: TIKA-1425
 URL: https://issues.apache.org/jira/browse/TIKA-1425
 Project: Tika
  Issue Type: Improvement
  Components: translation
Affects Versions: 1.6
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 Right now when I use the following code I get the stack trace at the bottom 
 of this description. This seems to be because the Request URI is too large to 
 make the service request. We need a mechanism within the call to 
 Tika.translate which will, on a service-by-service basis, determine the 
 maximum Request URI which can be sent. I believe that this should be on the 
 Tika side, as how else am I meant to know the maximum request size?
 {code:title=translator.java|borderStyle=solid}
 +Translator translate = new MicrosoftTranslator();
 +((MicrosoftTranslator) translate).setId(...);
 +((MicrosoftTranslator) translate).setSecret(...);
  for (java.util.Map.Entry<Text, Parse> entry : parseResult) {
    Parse parse = entry.getValue();
    LOG.info("-\nUrl\n---\n");
 @@ -201,7 +207,7 @@
    System.out.print(parse.getData().toString());
    if (dumpText) {
      LOG.info("-\nParseText\n-\n");
 -System.out.print(parse.getText());
 +System.out.print(translate.translate(parse.getText(), "fr"));
    }
 {code}
 {code:title=stacktrace.log|borderStyle=solid}
 Exception in thread "main" java.lang.Exception: [microsoft-translator-api] Error retrieving translation : Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0...
 ...
   at com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202)
   at com.memetix.mst.translate.Translate.execute(Translate.java:61)
   at com.memetix.mst.translate.Translate.execute(Translate.java:76)
   at org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104)
   at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228)
 Caused by: java.io.IOException: Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0%BE%D1%80%D1%83%D0%B...
 ...
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
   at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
   at java.security.AccessController.doPrivileged(Native Method)
   at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
   at com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178)
   at com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199)
   ... 6 more
 Caused by: java.io.IOException: Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0%BE...
 ...
   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
   at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
   at com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177)
   ... 7 more
 {code}
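
 One way to picture the requested batching is a splitter that keeps every 
 request under a per-service limit. The sketch below is an assumption-laden 
 illustration, not Tika code: TextBatcher and maxChars are hypothetical, and 
 the limit is counted in characters of the raw text, not in bytes of the 
 URL-encoded request.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Tika: splits text so each request stays under a size limit. */
final class TextBatcher {
    /** Splits text into chunks of at most maxChars, breaking at whitespace where possible. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Back up to the last whitespace inside the window, if there is one.
                int ws = lastWhitespace(text, start, end);
                if (ws > start) {
                    end = ws;
                }
            }
            chunks.add(text.substring(start, end));
            start = end; // whitespace stays at the head of the next chunk, so nothing is lost
        }
        return chunks;
    }

    /** Returns the index of the last whitespace character in (from, to), or -1 if none. */
    private static int lastWhitespace(String text, int from, int to) {
        for (int i = to - 1; i > from; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

 Each chunk would be translated separately and the results joined. Because 
 percent-encoding expands non-ASCII text (the Cyrillic input above becomes six 
 characters per letter), a real limit would have to be derived from the 
 encoded length that each service accepts.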




[jira] [Updated] (TIKA-985) Support for HTML5 elements

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-985:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Support for HTML5 elements
 --

 Key: TIKA-985
 URL: https://issues.apache.org/jira/browse/TIKA-985
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
 Fix For: 1.11

 Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
 TIKA-985-1.3-3.patch, TIKA-985-1.5.patch


 TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
 section). This prevents some custom ContentHandlers from reading expected 
 elements and/or attributes.




[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1208:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Migrate Any23 mime contributions to Tika
 

 Key: TIKA-1208
 URL: https://issues.apache.org/jira/browse/TIKA-1208
 Project: Tika
  Issue Type: Sub-task
  Components: mime
Reporter: Lewis John McGibbney
 Fix For: 1.11

 Attachments: TIKA-1208.patch


 We begin with one of the most obvious areas in which there
 is overlap.
 In short, the appeal of this package is the addition of detection 
 for the following types:
  - text/n3
  - text/rdf+n3
  - application/n3
  - text/x-nquads
  - text/rdf+nq
  - text/nq
  - application/nq
  - text/turtle
  - application/x-turtle
  - application/turtle
  - application/trix
  
 Therefore, although both Tika and Any23 perform MIME type-related
 tasks, there is a contribution to be made. This involves the transferral of
 code pertaining to pattern recognition, MIME type XML definitions within 
 tika-mimetypes.xml, and a Purifier implementation that removes any 
 blank characters at the head of a file that might 
 prevent its MIME type detection.  
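
 The purifying step can be illustrated with the JDK alone. This is a hedged 
 sketch of the idea only, not the Any23 Purifier or any Tika class: the name 
 LeadingBlankSkipper and the choice of which bytes count as blank are 
 assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical sketch, not the Any23 or Tika implementation. */
final class LeadingBlankSkipper {
    /** Returns a stream positioned at the first non-blank byte of the input. */
    static InputStream skipLeadingBlanks(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        int b;
        while ((b = pushback.read()) != -1) {
            if (!isBlank(b)) {
                pushback.unread(b); // put the first significant byte back
                break;
            }
        }
        return pushback;
    }

    /** Treats space, tab, carriage return and line feed as blank (an assumption). */
    private static boolean isBlank(int b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }
}
```

 A detector reading from the returned stream would see the magic bytes or 
 leading pattern at offset zero, which is what pattern-based MIME detection 
 usually expects.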




[jira] [Updated] (TIKA-1508) Add uniformity to parser parameter configuration

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1508:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Add uniformity to parser parameter configuration
 

 Key: TIKA-1508
 URL: https://issues.apache.org/jira/browse/TIKA-1508
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
 Fix For: 1.11


 We can currently configure parsers by the following means:
 1) programmatically by direct calls to the parsers or their config objects
 2) sending in a config object through the ParseContext
 3) modifying .properties files for specific parsers (e.g. PDFParser)
 Rather than scattering the landscape with .properties files for each parser, 
 it would be great if we could specify parser parameters in the main config 
 file, something along the lines of this:
 {noformat}
 <parser class="org.apache.tika.parser.audio.AudioParser">
   <params>
     <int name="someparam1">2</int>
     <str name="someOtherParam2">something or other</str>
   </params>
   <mime>audio/basic</mime>
   <mime>audio/x-aiff</mime>
   <mime>audio/x-wav</mime>
 </parser>
 {noformat}




[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1417:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Create Extract Embedded Images from PDFs Example
 

 Key: TIKA-1417
 URL: https://issues.apache.org/jira/browse/TIKA-1417
 Project: Tika
  Issue Type: Improvement
  Components: example
Reporter: Tyler Palsulich
Priority: Minor
 Fix For: 1.11


 Users commonly want to turn on extraction of images embedded in PDFs (e.g. 
 TIKA-1414). Tika has the capability, but it's not clear how to use it.




[jira] [Updated] (TIKA-1516) Downgrade Rome dependency to 0.9 to avoid nasty NPE

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1516:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Downgrade Rome dependency to 0.9 to avoid nasty NPE
 ---

 Key: TIKA-1516
 URL: https://issues.apache.org/jira/browse/TIKA-1516
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: TIKA-1516.patch


 As documented [in this 
 thread|http://www.mail-archive.com/dev%40nutch.apache.org/msg15755.html], 
 Nutch's 
 [parse-tika|https://github.com/apache/nutch/blob/trunk/src/plugin/parse-tika/plugin.xml#L56]
  uses Rome 1.0; this is inherited directly from the Tika pom.xml for the 
 [same 
 dependency|https://github.com/apache/tika/blob/trunk/tika-parsers/pom.xml#L184].
 A downgrade is required.
 {code}
 java.lang.Exception: java.lang.ExceptionInInitializerError
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
 Caused by: java.lang.ExceptionInInitializerError
 at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
 at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
 at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:105)
 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
 at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.NullPointerException
 at java.util.Properties$LineReader.readLine(Properties.java:418)
 at java.util.Properties.load0(Properties.java:337)
 at java.util.Properties.load(Properties.java:325)
 at com.sun.syndication.io.impl.PropertiesLoader.<init>(PropertiesLoader.java:74)
 at com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
 at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:54)
 at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:46)
 at com.sun.syndication.feed.synd.impl.Converters.<init>(Converters.java:40)
 at com.sun.syndication.feed.synd.SyndFeedImpl.<clinit>(SyndFeedImpl.java:59)
 ... 16 more
 {code}
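
 The innermost frames suggest that Properties.load was handed a null stream, 
 which is what getResourceAsStream returns when a resource is missing from the 
 classpath. A minimal sketch of that failure mode and of a guard against it 
 follows; the class name and the resource lookup are hypothetical and are not 
 taken from Rome or Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper, not part of Rome or Tika.
class ResourceProperties {

    // Loads a properties file from the classpath and fails with a clear
    // message, instead of a NullPointerException, when it is absent.
    static Properties load(String resourceName) throws IOException {
        InputStream in = ResourceProperties.class
                .getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + resourceName);
        }
        try {
            Properties props = new Properties();
            props.load(in);
            return props;
        } finally {
            in.close();
        }
    }

    // Reproduces the unguarded pattern: Properties.load(null) throws
    // NullPointerException, matching the last "Caused by" in the trace.
    static boolean unguardedLoadThrowsNpe() {
        try {
            new Properties().load((InputStream) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

 Whether the missing resource here is a Rome properties file that the plugin 
 classloader cannot see is an assumption; the trace alone does not confirm it.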




[jira] [Updated] (TIKA-774) ExifTool Parser

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-774:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 ExifTool Parser
 ---

 Key: TIKA-774
 URL: https://issues.apache.org/jira/browse/TIKA-774
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
 Environment: Requires ExifTool be installed 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: features, new-parser, newbie, patch
 Fix For: 1.11

 Attachments: testJPEG_IPTC_EXT.jpg, 
 tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt


 Adds an external parser that calls ExifTool to extract extended metadata 
 fields from images and other content types.
 In the core project:
 An ExifTool interface is added which contains Property objects that define 
 the metadata fields available.
 An additional Property constructor for internalTextBag type.
 In the parsers project:
 An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
 on the command line and mapping the response to tika metadata fields.  This 
 extractor could be called instead of or in addition to the existing 
 ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
 JpegParser but those have not been changed at this time.
 An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
 An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
 metadata fields to existing tika and Drew Noakes metadata fields if enabled.
 An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
 implementations in XML files.
 An ExifToolParserTest is added which tests several expected XMP and IPTC 
 metadata values in testJPEG_IPTC_EXT.jpg.
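
 The mapping step described above can be sketched as follows, assuming 
 ExifTool's default "Tag Name : value" line output. The class and method names 
 are hypothetical and are not taken from the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for "Tag Name : value" lines; not the patch's code.
class ExifToolOutput {

    // Splits each line at the first colon so that values which themselves
    // contain colons (for example dates) are kept whole.
    static Map<String, String> parse(String output) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (String line : output.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "tag : value" line
            }
            String tag = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!tag.isEmpty()) {
                fields.put(tag, value);
            }
        }
        return fields;
    }
}
```

 A repeated tag overwrites the earlier value in this sketch; multi-valued 
 fields would need a list per key.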




[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1609:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Leverage Google's LibPhonenumber for enhanced phone number extraction and 
 metadata modeling
 ---

 Key: TIKA-1609
 URL: https://issues.apache.org/jira/browse/TIKA-1609
 Project: Tika
  Issue Type: New Feature
  Components: core
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 Google's Libphonenumber can provide us with comprehensive support for 
 modeling Phone number metadata properly in Tika.
 During the development of this patch I realized three things, namely
  * This is not a parser as such as Phone numbers are not mapped to any 
 particular Mimetype
  * In addition, there can be many phone numbers per document, so this is most 
 likely a Content Handler of sorts
  * Tika's Metadata support is currently too restrictive to allow us to 
 persist many complex objects e.g. String, Object. We need to expand Metadata 
 support over and above String, String[].
 https://github.com/googlei18n/libphonenumber/




[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-819:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Make Option to Exclude Embedded Files' Text for Text Content
 

 Key: TIKA-819
 URL: https://issues.apache.org/jira/browse/TIKA-819
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 1.0
 Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
 Fix For: 1.11


 It would be nice to be able to disable text content from embedded files.
 For example, if I have a DOCX with an embedded PPTX, then I would like the 
 option to disable text from the PPTX from showing up when asking for the text 
 content from DOCX.  In other words, it would be nice to have the option to 
 get text content *only* from the DOCX instead of the DOCX+PPTX.




[jira] [Updated] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder

2015-08-08 Thread Dave Meikle (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dave Meikle updated TIKA-1343:
--
Fix Version/s: (was: 1.10)
1.11

* Pushed to 1.11 following 1.10 release

Create a Tika Translator implementation that uses JoshuaDecoder
---

Key: TIKA-1343
URL: https://issues.apache.org/jira/browse/TIKA-1343
Project: Tika
Issue Type: New Feature
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.11

The Joshua Decoder toolkit is a BSD licensed Java-based statistical machine
translation system hosted at Github:
http://joshua-decoder.org/
Joshua takes in corpuses and trains models that can then be used to do
language translation. Currently there is support for e.g., Spanish-English,
Indian dialects-English, Chinese-English, and a few others.
https://github.com/joshua-decoder/joshua/
It would be nice to build a Tika Translator on top of Joshua. There are of
course several issues with this:
* the models are huge - so we'll need a separate package or Maven module,
maybe tika-translate-joshua or something to release the models and we'll need
to build the models. I just went through the process of building the
Spanish-English one, and it still needs to be rebuilt b/c I did it wrong,
but it took over a day
* there is a configuration for Joshua, and so we need some way of passing
that config into the Translator. Not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the
Joshua lists about this:
https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
Anyhoo, I've got a working patch right now with hard code stuff, and a manual
install into my Maven repo for brave souls out there that want to try it.


[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1697:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Parser Implementation for AkomaNtoso Legal XML Documents
 

 Key: TIKA-1697
 URL: https://issues.apache.org/jira/browse/TIKA-1697
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal 
 Document XML standard and used pervasively within parliaments and other 
 legislative arenas.
 This issue should utilize the 
 [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and 
 populate Metadata for AkomaNtoso .xml and .akn documents.
 I'll send a PR for this soon.




[jira] [Updated] (TIKA-1395) Create embedded image extraction example

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1395:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Create embedded image extraction example
 

 Key: TIKA-1395
 URL: https://issues.apache.org/jira/browse/TIKA-1395
 Project: Tika
  Issue Type: Sub-task
  Components: example
Reporter: Tyler Palsulich
Priority: Minor
 Fix For: 1.11


 Create an example of how to do embedded image extraction and parsing.




[jira] [Updated] (TIKA-1390) Create tika-example module

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1390:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Create tika-example module
 --

 Key: TIKA-1390
 URL: https://issues.apache.org/jira/browse/TIKA-1390
 Project: Tika
  Issue Type: Bug
  Components: example
Reporter: Tyler Palsulich
 Fix For: 1.11


 This issue will track the initial creation of the tika-example module. 
 Subtasks will be used for the first few examples.




[jira] [Updated] (TIKA-1540) New Tika plugin for image based feature extraction using computer vision techniques

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1540:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 New Tika plugin for image based feature extraction using computer vision 
 techniques
 ---

 Key: TIKA-1540
 URL: https://issues.apache.org/jira/browse/TIKA-1540
 Project: Tika
  Issue Type: New Feature
 Environment: cross platform
Reporter: Aashish Chaudhary
Assignee: Lewis John McGibbney
  Labels: gsoc2015
 Fix For: 1.11

 Attachments: TIKA-vision.achaudhary.150209.patch.txt


 This will be a web-service client based parser to perform image feature 
 extraction using Computer Vision techniques. 




[jira] [Updated] (TIKA-1513) Add mime detection and parsing for dbf files

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1513:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.11


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?




[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-715:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
  Labels: newbie
 Fix For: 1.11

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  Ie, each startElement(foo) is
 matched by the closing endElement(foo).
 I only did basic nesting test, plus checking that p is never
 embedded inside another p; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
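
 The kind of check described above can be sketched with a stack of open 
 element names. This is an illustration only: the class below is hypothetical, 
 keys on qName, and is not the code that was committed to SafeContentHandler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical checker; this is not the code in SafeContentHandler.
class MatchedTagChecker extends DefaultHandler {

    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<String>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Reject a p that appears anywhere inside another open p.
        if ("p".equals(qName) && open.contains("p")) {
            throw new SAXException("p embedded inside another p");
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + qName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException(
                    "mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element " + open.peek());
        }
    }
}
```

 Checking that each tag only appears under a valid parent would need a table 
 of allowed parent and child names on top of this.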
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec   ERROR!

[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1607:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Introduce new arbitrary object key/values data structure for persistence of 
 Tika Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.11

 Attachments: TIKA-1607v1_rough_rough.patch, 
 TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch


 I am currently working on implementing more comprehensive extraction and 
 enhancement of the Tika support for Phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  Object
 {code}
 Where Object could be a Collection<HashMap<String/Property, 
 HashMap<String/Property, String/Int/Long>>> e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach... 
 additionally it is a fundamental change to the core Metadata API. I hope that 
 the String, Object Mapping however is flexible enough to allow me to model 
 Tika Metadata the way I want.
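
 A minimal sketch of a String to Object mapping that still offers a String[] 
 view for existing callers is shown here. The class is hypothetical; it is not 
 Tika's Metadata class and not the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container; Tika's Metadata class does not work this way.
class ObjectMetadata {

    private final Map<String, List<Object>> values =
            new LinkedHashMap<String, List<Object>>();

    // Appends a value of any type under the given key.
    void add(String key, Object value) {
        List<Object> list = values.get(key);
        if (list == null) {
            list = new ArrayList<Object>();
            values.put(key, list);
        }
        list.add(value);
    }

    // Returns all values for a key, or an empty list if there are none.
    List<Object> getAll(String key) {
        List<Object> list = values.get(key);
        return list == null
                ? Collections.<Object>emptyList()
                : Collections.unmodifiableList(list);
    }

    // Backwards-compatible view: every value rendered as a String.
    String[] getValues(String key) {
        List<Object> list = getAll(key);
        String[] out = new String[list.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = String.valueOf(list.get(i));
        }
        return out;
    }
}
```

 Rendering complex values with String.valueOf is only one possible 
 compatibility strategy and loses structure for callers of the old view.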
 Any comments folks? Thanks
 Lewis




[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1456:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Visual Sentiment API parser
 ---

 Key: TIKA-1456
 URL: https://issues.apache.org/jira/browse/TIKA-1456
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: gsoc2015
 Fix For: 1.11


 Integrate the Visual Sentibank API as a parser for images. We can use 
 Aperture from CMU, it's released under the MIT license:
 https://github.com/d8w/aperture




[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1640:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Make ExternalParser support aliases for key names in extracted metadata
 ---

 Key: TIKA-1640
 URL: https://issues.apache.org/jira/browse/TIKA-1640
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.11


 Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] 
 did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 
 for this, but one thing Ray's code-based work did that my config oriented 
 work didn't is allow for renaming extracted metadata key names to better 
 support having consistent metadata across parsers.
 Here's one way to do it:
 ExternalParser could have a config section like so:
 {code:xml}
 <aliases>
   <metadata key="foo" alias="bar"/>
   <metadata key="foo2" alias="bar2"/>
 </aliases>
 {code}
 Then this could be used to rename metadata keys.
 I'll implement that in this issue.
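
 The renaming step could look roughly like this once the alias pairs have been 
 read from the config. The helper is hypothetical and uses plain maps rather 
 than Tika's Metadata or ExternalParser classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not an existing ExternalParser method.
class MetadataAliases {

    // Returns a copy of the extracted metadata in which every key that has
    // an alias is renamed; keys without an alias are kept unchanged.
    static Map<String, String> apply(Map<String, String> extracted,
                                     Map<String, String> aliases) {
        Map<String, String> renamed = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String alias = aliases.get(e.getKey());
            renamed.put(alias != null ? alias : e.getKey(), e.getValue());
        }
        return renamed;
    }
}
```

 With the aliases above, a value extracted under foo would be stored under 
 bar, and a key with no alias would pass through untouched.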




[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1465:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Implement extraction of non-global variables from netCDF3 and netCDF4
 -

 Key: TIKA-1465
 URL: https://issues.apache.org/jira/browse/TIKA-1465
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 Speaking to Eric Nienhouse at the ongoing NSF funded Polar 
 Cyberinfrastructure hackathon in NYC, we became aware that variable 
 parameters contained within netCDF3 and netCDF4 are just as valuable (if not 
 more valuable) as global attribute values. 
 AFAIK, right now we only extract global attributes; however, we could extend 
 the support to cater for the above observations.  




[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1059:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
 --

 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.11


 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
 {{InterruptedException}} and ignore it.
 The methods should either call {{interrupt()}} on the current thread or 
 re-throw the exception, possibly wrapped in a {{TikaException}}.
 See TIKA-775 for a previous discussion.
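
 Both options can be sketched as below. The class and method names are 
 illustrative and are not taken from ExternalParser or ExternalEmbedder, and a 
 plain Exception stands in for TikaException to keep the sketch free of Tika 
 dependencies.

```java
// Hypothetical wrapper around Process.waitFor(); not Tika code.
class InterruptAwareWait {

    // Option 1: restore the interrupt flag so callers can observe it.
    // Returns the exit code, or -1 if the wait was interrupted.
    static int waitRestoringFlag(Process process) {
        try {
            return process.waitFor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    // Option 2: restore the flag and re-throw wrapped in a checked
    // exception (a real patch would presumably use TikaException).
    static int waitOrThrow(Process process) throws Exception {
        try {
            return process.waitFor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new Exception(
                    "Interrupted while waiting for external process", e);
        }
    }
}
```

 Restoring the flag in the second option as well is a design choice, so that 
 code further up the stack that only checks Thread.isInterrupted() still sees 
 the interruption.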




[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1301:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Establish TikaServer on Apache hosted VM
 

 Key: TIKA-1301
 URL: https://issues.apache.org/jira/browse/TIKA-1301
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 Over in Any23, Infra recently provisioned us with a nice shiny new VM to run 
 our service on
 http://any23.org
 I would like to do the same for Tika. I have some scripts on the Any23 VM 
 which will pull stable nightly tika-server snapshots and deploy them to the 
 VM. This is really nice for both devs and users alike.




[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1577:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 NetCDF Data Extraction
 --

 Key: TIKA-1577
 URL: https://issues.apache.org/jira/browse/TIKA-1577
 Project: Tika
  Issue Type: Improvement
  Components: handler, parser
Affects Versions: 1.7
Reporter: Ann Burgess
Assignee: Ann Burgess
  Labels: features, handler
 Fix For: 1.11

   Original Estimate: 504h
  Remaining Estimate: 504h

 A netCDF classic or 64-bit offset dataset is stored as a single file 
 comprising two parts:
  - a header, containing all the information about dimensions, attributes, and 
 variables except for the variable data;
  - a data part, comprising fixed-size data, containing the data for variables 
 that don't have an unlimited dimension; and variable-size data, containing 
 the data for variables that have an unlimited dimension.
 The NetCDFparser currently extracts the header part.  
  -- text extracts file Dimensions and Variables
  -- metadata extracts Global Attributes
 We want the option to extract the data part of NetCDF files.  
 Let's use the NetCDF test file for our dev testing:  
 tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
  




[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1318:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Use of Deprecated Word6Extractor.getParagraphText() Method
 --

 Key: TIKA-1318
 URL: https://issues.apache.org/jira/browse/TIKA-1318
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tyler Palsulich
Priority: Minor
  Labels: deprecation
 Fix For: 1.11


 org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the 
 deprecated Word6Extractor.getParagraphText() method. getParagraphText() is 
 supposed to return a String[] with an element for each paragraph in the text. 
 The replacement is getText(), which lets paragraph, cell, etc separation be 
 implementation specific. I'm not sure, at this point, how the POI 
 WordExtractor separates them.




[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-988:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 We don't extract a placeholder for a Word document embedded in an Excel 
 document
 

 Key: TIKA-988
 URL: https://issues.apache.org/jira/browse/TIKA-988
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.11

 Attachments: bug31373.xls


 In TIKA-956 we fixed the Word parser so that at the point where an embedded 
 document appears, we output a <div class="embedded" id="_XXX"/> tag.
 It would be nice to do this for documents embedded in Excel too.
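
 Emitting such a placeholder through SAX can be sketched as follows. The 
 helper class is hypothetical and is not the code from TIKA-956; only the 
 org.xml.sax types are real.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper; not the Word parser's actual implementation.
class EmbeddedPlaceholder {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Writes an empty div with class "embedded" and the given id at the
    // current position in the SAX event stream.
    static void emit(ContentHandler handler, String id) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "class", "class", "CDATA", "embedded");
        atts.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement(XHTML, "div", "div", atts);
        handler.endElement(XHTML, "div", "div");
    }
}
```

 An Excel parser would call something like this at the point where it 
 encounters the embedded object, passing the same id it uses when handing the 
 embedded document to the embedded-document extractor.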




[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1367:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Tika documentation should list tika-parsers parser dependencies
 ---

 Key: TIKA-1367
 URL: https://issues.apache.org/jira/browse/TIKA-1367
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Reporter: Sergey Beryozkin
 Fix For: 1.11


 tika-parsers module has many strong transitive parser dependencies. Maven 
 users of tika-parsers have to exclude all the transitive dependencies 
 manually. Documenting the list of the existing transitive dependencies and 
 keeping the list up to date will help developers exclude the libraries not 
 needed for a given project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1328:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Translate Metadata and Content
 --

 Key: TIKA-1328
 URL: https://issues.apache.org/jira/browse/TIKA-1328
 Project: Tika
  Issue Type: New Feature
  Components: translation
Reporter: Tyler Palsulich
 Fix For: 1.11


 Right now, Translation is only done on Strings. Ideally, users would be able 
 to turn on translation while parsing. I can think of a couple options:
 - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
 it, then translate the content.
 - Make a Context switch. When true, translate the content regardless of the 
 parser used. I'm not sure the best way to go about this method, but I prefer 
 it over another Parser.
 Regardless, we need a black or white list for translation. I think black list 
 would be the way to go -- which fields should not be translated (dates, 
 versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
 other open source translation libraries? If we were really lucky, it wouldn't 
 depend on an online service.
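
 The black list idea can be sketched over a plain map of metadata fields. The 
 TextTranslator interface below is a hypothetical stand-in and is not Tika's 
 Translator interface; the field names a caller would black-list are likewise 
 up to the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not part of Tika's translation module.
class SelectiveTranslation {

    // Stand-in for whatever translation backend is configured.
    interface TextTranslator {
        String translate(String text);
    }

    // Translates every field value except those whose key is black-listed
    // (for example dates or version numbers), which are copied as-is.
    static Map<String, String> translate(Map<String, String> metadata,
                                         Set<String> blackList,
                                         TextTranslator translator) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            boolean skip = blackList.contains(e.getKey());
            out.put(e.getKey(),
                    skip ? e.getValue() : translator.translate(e.getValue()));
        }
        return out;
    }
}
```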




[jira] [Updated] (TIKA-1657) Allow easier dumping of TikaConfig file from tika-core

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1657:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Allow easier dumping of TikaConfig file from tika-core
 --

 Key: TIKA-1657
 URL: https://issues.apache.org/jira/browse/TIKA-1657
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.11


 In TIKA-1418, we added an example for how to dump the config file so that 
 users could easily modify it.  I think we should go further and make this an 
 option at the tika-core level with hooks for tika-app and tika-server.  I 
 propose adding a main() to TikaConfig that will print the xml config file 
 that Tika is currently using to stdout.
 I'd like to put this into core so that e.g. Solr's DIH users can get by 
 without having to download tika-app separately.  
 There's every chance that I've not accounted for issues with dynamic loading 
 etc.  Also, I'd be ok with only having this available in tika-app and 
 tika-server if there are good reasons.
 Feedback?




[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1598:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Parser Implementation for Streaming Video
 -

 Key: TIKA-1598
 URL: https://issues.apache.org/jira/browse/TIKA-1598
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11


 A number of us have been discussing a Tika implementation which could, for 
 example, bind to a live multimedia stream and parse content from the stream 
 until it finished.
 An excellent example would be watching Bonnie Scotland beating R. of Ireland 
 in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 
 17:00 GMT :)
 I located a JMF Wrapper for ffmpeg which 'may' enable us to do this
 http://sourceforge.net/projects/jffmpeg/
 I am not sure... plus it is not licensed liberally enough for us to include 
 so if there are other implementations then please post them here.
 I 'may' be able to have a crack at implementing this next week.



[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1674:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Add example to show how to extract embedded files
 -

 Key: TIKA-1674
 URL: https://issues.apache.org/jira/browse/TIKA-1674
 Project: Tika
  Issue Type: New Feature
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.11


 On tika-user, we received a question on how to extract embedded files.  Let's 
 add an example.



[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1505:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 chmparser breaks down when extracting from file of CHM format v3
 

 Key: TIKA-1505
 URL: https://issues.apache.org/jira/browse/TIKA-1505
 Project: Tika
  Issue Type: Bug
Reporter: Bin Hawking
 Fix For: 1.11


 chmparser throws exception or returns faulty text when:
 1. extracting from file of CHM format version 3
 2. chm file with lzx reset interval  2
 3. chm file with 5000 objects
 I am making the fix now.



[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2015-08-08 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-980:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 MicrodataContentHandler for Apache Tika
 ---

 Key: TIKA-980
 URL: https://issues.apache.org/jira/browse/TIKA-980
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Ken Krugler
 Fix For: 1.11

 Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
 TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch


 ContentHandler for Apache Tika capable of building a data structure 
 containing Microdata item scopes and item properties. The Item* classes are 
 borrowed from the Apache Any23 project and are slightly modified to 
 accommodate this SAX-based extractor vs the original DOM-based extractor.
 The provided unit test outputs two item scopes about the Europe and NA 
 ApacheCon events and each has a nested property.



[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-08-08 Thread Ian Williams (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662971#comment-14662971
 ] 

Ian Williams commented on TIKA-894:
---

I am out of the office until Mon 10 Aug 2015.

Regards
Ian



 Add webapp mode for Tika Server, simplifies deployment
 --

 Key: TIKA-894
 URL: https://issues.apache.org/jira/browse/TIKA-894
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Chris Wilson
  Labels: maven, newbie, patch
 Fix For: 1.11

 Attachments: tika-server-webapp.patch


 For use in production services, Tika Server should really be deployed as a 
 WAR file, under a reliable servlet container that knows how to run as a 
 system service, for example Tomcat or JBoss.
 This is especially important on Windows, where I wasted an entire day trying 
 to make TikaServerCli run as some kind of a service. 
 Maven makes building a webapp pretty trivial. With the attached patch 
 applied, mvn war:war should work. It seems to run fine in Tomcat, which 
 makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
 file into Tomcat's webapps directory and you're away.



[ANNOUNCE] Apache Tika 1.10 release

2015-08-08 Thread David Meikle

The Apache Tika project is pleased to announce the release of Apache Tika 1.10. 
The release contents have been pushed out to the main Apache release site and 
to the Central sync, so the releases should be available as soon as the mirrors 
get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured 
text content from various documents using existing parser libraries.

Apache Tika 1.10 contains a number of improvements and bug fixes. Details can 
be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.10.txt 

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip 

Apache Tika is also available in binary form, or for use with Maven 2, from
the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the downloads 
using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc 

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- David Meikle, on behalf of the Apache Tika community

Re: [ANNOUNCE] Apache Tika 1.10 release

2015-08-08 Thread Tyler Palsulich

Thanks, Dave!

On Sat, Aug 8, 2015, 7:01 AM David Meikle dmei...@apache.org wrote:

 The Apache Tika project is pleased to announce the release of Apache Tika
 1.10. The release contents have been pushed out to the main Apache release
 site and to the Central sync, so the releases should be available as soon
 as the mirrors get the syncs.

 Apache Tika is a toolkit for detecting and extracting metadata and
 structured text content from various documents using existing parser
 libraries.

 Apache Tika 1.10 contains a number of improvements and bug fixes. Details
 can be found in the changes file:
 http://www.apache.org/dist/tika/CHANGES-1.10.txt 

 Apache Tika is available in source form from the following download page:
 http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip 

 Apache Tika is also available in binary form, or for use with Maven 2, from
 the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

 In the initial 48 hours, the release may not be available on all mirrors.
 When downloading from a mirror site, please remember to verify the
 downloads using signatures found on the Apache site:
 https://people.apache.org/keys/group/tika.asc 

 For more information on Apache Tika, visit the project home page:
 http://tika.apache.org/

 -- David Meikle, on behalf of the Apache Tika community
