[ANNOUNCE] Apache Science and Healthcare Track @ApacheCon NA 2015

2015-01-08 Thread Lewis John Mcgibbney
Hi Folks,

Apologies for cross posting :(

As some of you may already know, @ApacheCon NA 2015 is happening in Austin,
TX April 13th-16th.

This email is specifically written to attract all folks interested in
Science and Healthcare... this is an official call to arms! I am aware that
there are many Science and Healthcare-type people lingering in the Apache
Semantic Web communities. This one is for you folks.

Over a number of years the Science track has been emerging as an attractive,
exciting and at times mind-blowing non-traditional track running alongside
the resident HTTP server, Big Data, etc. tracks. The Semantic Web Track is
another such emerging track which has proved popular. This year we want to
really get the message out there about how much Apache technology is
actually being used in Science and Healthcare. This is not *only* aimed at
attracting members of the communities below

but also at potentially attracting a brand new breed of conference
participants to ApacheCon and the Foundation, e.g. Scientists who love
Apache. We are looking for exciting, invigorating, obscure, half-baked,
funky, academic, practical and impractical stories, use cases, experiments
and downright successes alike from within the Science domain. The only thing
they need to have in common is that they consume, contribute towards,
advocate, disseminate or even commercialize Apache technology within the
Scientific domain and would be relevant to that audience. Whether this track
should be combined with the proposed *healthcare track* is entirely open: if
there is interest, we can rename this track to Science and Healthcare. In
essence one could argue that they are one and the same, however I digress :)

If you are interested, please check out the scope and intent of the Apache
in Science content curation, which is currently ongoing, and consider
registering your interest.

https://wiki.apache.org/apachecon/ACNA2015ContentCommittee#Apache_in_Science

I would love to see the Science and Healthcare track be THE BIGGEST track
@ApacheCon, and although we have some way to go, I'm sure many previous
track participants will tell you this is not to be missed.

We are looking for content from a wide variety of Scientific use cases all
related to Apache technology.
Thanks in advance and I look forward to seeing you in Austin.
Lewis

-- 
*Lewis*


[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269800#comment-14269800
 ] 

Tyler Palsulich commented on TIKA-1445:
---

Thanks guys! [~tallison], let me know once you finish running against govdocs  
and I'll roll a new RC.

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1509) Create configurable strategies for composite parsers

2015-01-08 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1509:
-

 Summary: Create configurable strategies for composite parsers
 Key: TIKA-1509
 URL: https://issues.apache.org/jira/browse/TIKA-1509
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


Several parsers can handle the same mime type, and we currently choose among 
them (roughly) by the alphabetical order of the parser class name.

Let's allow users to configure strategies for picking parsers.

***NOTE: this description is just a placeholder; I will edit it later.
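
To make the idea concrete, here is a minimal sketch of what a pluggable
selection strategy could look like. It is not Tika code: the class, the
method names and the example parser class names are all hypothetical, and a
strategy is modelled simply as an ordering over candidate class names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: choose one parser class name among several candidates for a mime type. */
final class ParserSelectionSketch {

    /** A strategy is an ordering over candidate class names; the smallest one wins. */
    static Optional<String> pick(List<String> candidates, Comparator<String> strategy) {
        return candidates.stream().min(strategy);
    }

    /** Roughly the current behavior described above: alphabetical order of the class name. */
    static Comparator<String> alphabetical() {
        return Comparator.naturalOrder();
    }

    /**
     * A configurable alternative: names in the preferred list win, in list order;
     * everything else falls back to alphabetical order.
     */
    static Comparator<String> preferred(List<String> preferredOrder) {
        return Comparator
                .comparingInt((String name) -> {
                    int index = preferredOrder.indexOf(name);
                    return index < 0 ? Integer.MAX_VALUE : index;
                })
                .thenComparing(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        // Invented class names, used only to show the two strategies disagreeing.
        List<String> candidates = List.of("example.TextOcrParser", "example.ImageMetadataParser");
        System.out.println(pick(candidates, alphabetical()).orElse("none"));
        System.out.println(pick(candidates, preferred(List.of("example.TextOcrParser"))).orElse("none"));
    }
}
```

With an ordering as the strategy, the alphabetical behavior becomes just one
option among several, and a user-supplied preference list can override it
without touching the parsers themselves.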





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2015-01-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269782#comment-14269782
 ] 

Tim Allison commented on TIKA-1508:
---

TIKA-1508 emerged from a conversation started on TIKA-1445.

> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.8
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}





[jira] [Updated] (TIKA-1508) Add uniformity to parser parameter configuration

2015-01-08 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1508:
--
Summary: Add uniformity to parser parameter configuration  (was: Add 
uniformity to parser configuration)

> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.8
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}





[jira] [Created] (TIKA-1508) Add uniformity to parser configuration

2015-01-08 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1508:
-

 Summary: Add uniformity to parser configuration
 Key: TIKA-1508
 URL: https://issues.apache.org/jira/browse/TIKA-1508
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
 Fix For: 1.8


We can currently configure parsers by the following means:
1) programmatically by direct calls to the parsers or their config objects
2) sending in a config object through the ParseContext
3) modifying .properties files for specific parsers (e.g. PDFParser)

Rather than scattering the landscape with .properties files for each parser, it 
would be great if we could specify parser parameters in the main config file, 
something along the lines of this:
{noformat}

  
2
something or other
  
  audio/basic
  audio/x-aiff
  audio/x-wav

{noformat}
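
The markup of the snippet above did not survive in this archive; only the
values (2, "something or other" and the three audio mime types) remain. The
sketch below shows how per-parser parameters and mime types could be read
from one XML fragment using the JDK's DOM parser. The element and attribute
names (`parser`, `params`, `param`, `mime`, `class`, `name`), the parameter
names and the parser class name are invented for illustration and are not
Tika's config schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: read per-parser parameters and mime types from one XML fragment. */
final class ParserConfigSketch {

    // Invented markup for illustration only; not Tika's actual config format.
    static final String SAMPLE = String.join("\n",
            "<parser class=\"example.AudioParser\">",
            "  <params>",
            "    <param name=\"someParam1\">2</param>",
            "    <param name=\"someParam2\">something or other</param>",
            "  </params>",
            "  <mime>audio/basic</mime>",
            "  <mime>audio/x-aiff</mime>",
            "  <mime>audio/x-wav</mime>",
            "</parser>");

    final String parserClass;
    final Map<String, String> params = new LinkedHashMap<>();
    final List<String> mimeTypes = new ArrayList<>();

    private ParserConfigSketch(String parserClass) {
        this.parserClass = parserClass;
    }

    /** Parses the fragment and collects every param and mime element under the root. */
    static ParserConfigSketch read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            ParserConfigSketch config = new ParserConfigSketch(root.getAttribute("class"));
            NodeList paramNodes = root.getElementsByTagName("param");
            for (int i = 0; i < paramNodes.getLength(); i++) {
                Element param = (Element) paramNodes.item(i);
                config.params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
            NodeList mimeNodes = root.getElementsByTagName("mime");
            for (int i = 0; i < mimeNodes.getLength(); i++) {
                config.mimeTypes.add(mimeNodes.item(i).getTextContent().trim());
            }
            return config;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read parser config", e);
        }
    }

    public static void main(String[] args) {
        ParserConfigSketch config = read(SAMPLE);
        System.out.println(config.parserClass + " " + config.params + " " + config.mimeTypes);
    }
}
```

Keeping the parameters as plain name/value strings leaves type conversion to
each parser, which is one possible design choice; typed parameter elements
would be another.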





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269768#comment-14269768
 ] 

Tim Allison commented on TIKA-1445:
---

Completely agree! Opening new issues now.

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269759#comment-14269759
 ] 

Tim Allison edited comment on TIKA-1445 at 1/8/15 5:57 PM:
---

I think we can call this resolved for now?

Many, many thanks for the collaboration on this one, [~tpalsulich], 
[~chrismattmann], [~gagravarr]!


was (Author: talli...@mitre.org):
I think we can call this resolved for now?

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269765#comment-14269765
 ] 

Nick Burch commented on TIKA-1445:
--

If we're going to close this for 1.7, then we need to pull out the "composite 
parser with strategy of what available parsers / parser combinations to use" as 
a new task for 1.8

Then we need to come up with some better names for the strategies :)

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Resolved] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1445.
---
Resolution: Fixed

I think we can call this resolved for now?

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Priority: Blocker  (was: Major)

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Fix Version/s: (was: 1.8)
   1.7

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269523#comment-14269523
 ] 

Tyler Palsulich commented on TIKA-1445:
---

Works for me!

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269454#comment-14269454
 ] 

Tim Allison commented on TIKA-1445:
---

I'll have time to rerun trunk against govdocs1 and compare with 1.6 by tomorrow 
(January 9) 10am EST.  If the community is willing to wait a day, let's hold 
off.  Another day might also allow others to identify small issues (similar to 
[~davemeikle]'s recent find).

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





Re: [VOTE] Apache Tika 1.7 Release

2015-01-08 Thread Peter Bowyer
+1.

Worked great once I manually edited
tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
and set useNonSequentialParser to true.

Peter
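
For readers unfamiliar with .properties files, the following is a small
sketch of how a boolean flag such as useNonSequentialParser can be read with
the JDK's `java.util.Properties`. The key name comes from the message above;
the class and method names are hypothetical, and this is not a claim about
how PDFParser itself loads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical sketch: read a boolean flag from .properties-formatted text. */
final class PdfPropertiesSketch {

    /** Returns the flag's value, or defaultValue when the key is absent. */
    static boolean readFlag(String propertiesText, String key, boolean defaultValue) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties", e);
        }
        return Boolean.parseBoolean(props.getProperty(key, Boolean.toString(defaultValue)));
    }

    public static void main(String[] args) {
        // The one-line edit described above, as it would appear in the file.
        String edited = "useNonSequentialParser true\n";
        System.out.println(readFlag(edited, "useNonSequentialParser", false));
    }
}
```

Note that the .properties format accepts `key value`, `key=value` and
`key: value` alike, so any of the three spellings sets the flag.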


Re: [VOTE] Apache Tika 1.7 Release

2015-01-08 Thread Hong-Thai Nguyen
Seems fine to me: +1

No big regression on our test corpus of 23K docs:

15-01-07 18:19:27 INFO  (DocumentConversionErrorPlugin.java : 116)
[pool-3-thread-1] Summary of document conversion errors:
- pdf (4)
* (2) org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.ParserDecorator$1@4b0b2006
* (1) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@4b0b2006
* (1) org.apache.tika.exception.TikaException: Unable to extract PDF content
- ps (3)
* (3) org.apache.tika.exception.TikaException: Unable to unpack document
stream
- pptx (10)
* (9) org.apache.tika.exception.TikaException: Error creating OOXML
extractor
* (1) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@45df8db8
- doc (6)
* (6) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- ppt (14)
* (13) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
* (1) org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.ParserDecorator$1@58797499
- xls (9)
* (9) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- vsd (3)
* (3) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- odp (2)
* (2) org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.ParserDecorator$1@753ce4d8
- chm (1)
* (1) org.apache.tika.exception.TikaException: CHM file extract error:
extracted Length is wrong.
- dwg (4)
* (4) org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing
version: AC1014
- pps (2)
* (2) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- chw (1)
* (1) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@a0b8fca
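
As a rough way to put the summary above in proportion, the sketch below adds
up the per-format failure counts from the log and relates them to the corpus
size. The counts are copied from the log; the corpus size of 23,000 is an
assumption based on the "23K" figure, and the class and method names are
hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: total the per-format failures and express them as a share of the corpus. */
final class ConversionErrorTally {

    /** Per-format failure counts as reported in the log above. */
    static Map<String, Integer> reportedFailures() {
        Map<String, Integer> failures = new LinkedHashMap<>();
        failures.put("pdf", 4);
        failures.put("ps", 3);
        failures.put("pptx", 10);
        failures.put("doc", 6);
        failures.put("ppt", 14);
        failures.put("xls", 9);
        failures.put("vsd", 3);
        failures.put("odp", 2);
        failures.put("chm", 1);
        failures.put("dwg", 4);
        failures.put("pps", 2);
        failures.put("chw", 1);
        return failures;
    }

    static int total(Map<String, Integer> failures) {
        return failures.values().stream().mapToInt(Integer::intValue).sum();
    }

    static double percentOf(int failures, int corpusSize) {
        return 100.0 * failures / corpusSize;
    }

    public static void main(String[] args) {
        int failures = total(reportedFailures());
        int corpusSize = 23_000; // assumption: "23K docs" taken as exactly 23,000
        System.out.println(String.format(Locale.ROOT,
                "%d failures, about %.2f%% of the corpus", failures, percentOf(failures, corpusSize)));
    }
}
```

Under that assumption the 59 reported failures come to roughly a quarter of
one percent of the corpus.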

Thanks Tyler,

On Tue, Jan 6, 2015 at 7:59 AM, Tyler Palsulich wrote:

> Hi All,
>
> A candidate for the Tika 1.7 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/
>
> The SHA1 checksum of the archive is
> 0307a8367ae6f8b1103824fd11337fd89e24e6a4.
>
> In addition, a staged maven repository is available here:
>
>
> https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/
>
> Please vote on releasing this package as Apache Tika 1.7.
>
> The vote is open for the next 72 hours and passes if a majority of at least
> three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.7
> [ ] -1 Do not release this package because...
>
> Thanks!
> Tyler
>
> P.S. Count this as my +1!
>
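
For anyone checking the release candidate, here is a minimal sketch of
verifying a downloaded archive against a published SHA1 checksum using the
JDK's `MessageDigest`. The class and method names are hypothetical, and the
archive path and expected checksum are passed in as arguments rather than
assumed. It reads the whole file into memory, which is fine for a sketch but
not for very large archives.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: compare a file's SHA-1 digest with a published hex checksum. */
final class Sha1CheckSketch {

    /** Returns the SHA-1 digest of the data as lowercase hex. */
    static String sha1Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Case-insensitive comparison against the published checksum. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: Sha1CheckSketch <archive> <expected-sha1>");
            return;
        }
        byte[] archive = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(archive, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

A checksum match only shows the download is intact; it does not replace
checking the release signature.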



-- 
--
Hong-Thai