tika-trunk-jdk1.7 - Build # 271 - Still Failing

2014-10-20 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #271)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/271/ to 
view the results.

tika-trunk-jdk1.6 - Build # 251 - Failure

2014-10-20 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.6 (build #251)

Status: Failure

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.6/251/ to 
view the results.

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176892#comment-14176892
 ] 

Andrew Jackson commented on TIKA-1302:
--

At the UK Web Archive we run Apache Tika over all our collections (it's been 
run over about 4 billion resources so far). We record the results in Apache 
Solr, to act as a search facet, and we also collect the Exceptions that are 
thrown when Tika fails. We can't make the content available to you directly, 
but perhaps there are datasets we can produce that would be useful to you? e.g. 
would a list of the exceptions that we've seen (along with the URL to the 
resource that caused the exception) be of interest?

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, general, server
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176900#comment-14176900
 ] 

Ken Krugler commented on TIKA-1302:
---

Andrew - that sounds amazing! Could you provide an example of such an 
exception, so we could see what information is currently being captured? And do 
you have any idea how many (of the 4B) are failing, and thus the size of the 
exception list? Thanks.


[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934
 ] 

Andrew Jackson commented on TIKA-1302:
--

I have 2,358,167 errors from one collection (2 billion resources), but the 
majority are SAXParseExceptions. The collection is made up of UK web archive 
content from 1996-2010, so there's lots of broken HTML/XML in there. If I strip 
out the SAXParseExceptions, there are just 317,548 miscellaneous errors, which 
are perhaps more interesting. 

Here's an example including the SAX exceptions:
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,"org.xml.sax.SAXParseException: Open quote is expected for attribute ""ID"" associated with an  element type  ""COMMENT""."
20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548&map_only=yes&type=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
2006102004,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5&type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=courses&sn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20060926015856,http://www.qca.org.uk/4412.html,"org.xml.sax.SAXParseException: The entity ""nbsp"" was referenced\, but not declared."
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=porter&cov=&mode=buy&o=4854130936&code=9123&cu=&,"org.xml.sax.SAXParseException: The element type ""META"" must be terminated by the matching end-tag ."
20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
{code}
...and for the others...
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
2004112115,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
2005183952,http://freeweb.co.uk:80/show_nw.php?ref=258&target=B&show=aff&PHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054
 ] 

Tim Allison commented on TIKA-1302:
---

That would be a fantastic resource.  Thank you for sharing!  We could do a bit 
of munging to prioritize most common exceptions in dependencies.
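
A minimal sketch of that kind of munging, assuming records shaped like the CSV samples earlier in this thread, where the exception class name opens the last field (optionally behind a quote). The class `ExceptionTally`, its methods, and the regular expression are hypothetical illustrations and not part of Tika; the regex is a heuristic, not a full CSV parser, and the code assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): counts exception classes in error records.
final class ExceptionTally {

    // Heuristic: a comma, an optional opening quote, a lower-case dotted package
    // path, a class name starting with an upper-case letter, then a colon.
    private static final Pattern EXCEPTION_CLASS =
            Pattern.compile(",\"?((?:[a-z_][a-z0-9_]*\\.)+[A-Z][\\w$]*):");

    // Maps each exception class name to the number of records that name it.
    static Map<String, Integer> tally(List<String> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (String record : records) {
            Matcher m = EXCEPTION_CLASS.matcher(record);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Orders the tallies by descending count, breaking ties by class name.
    static List<Map.Entry<String, Integer>> mostCommon(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "20091017143741,http://www.madfun.co.uk:80/-10?ref=31,"
                        + "org.xml.sax.SAXParseException: The markup in the document "
                        + "following the root element must be well-formed.",
                "20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,"
                        + "application/rss+xml,java.lang.NullPointerException: null");
        for (Map.Entry<String, Integer> e : mostCommon(tally(sample))) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Grouping on the class name alone is a deliberately coarse first pass; the sample records carry no stack frames, so attributing an exception to a specific dependency would need more data than is shown here.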

Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on 
the govdocs1 corpus, but in the same ballpark.  Interesting.
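
For reference, the arithmetic behind the 0.1% figure can be sketched as below, assuming the counts quoted earlier in the thread (2,358,167 errors, 317,548 of them not SAXParseExceptions) and treating "2 billion resources" as exactly 2,000,000,000, which is an approximation. The class `ExceptionRate` is a hypothetical name used only for this illustration.

```java
import java.util.Locale;

// Hypothetical helper: turns failure counts into a percentage of the corpus.
final class ExceptionRate {

    // Returns failures as a percentage of total; total must be positive.
    static double percent(long failures, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * failures / total;
    }

    public static void main(String[] args) {
        long total = 2_000_000_000L; // assumed: "2 billion" taken as an exact count
        System.out.printf(Locale.ROOT, "all errors:     %.3f%%%n",
                percent(2_358_167L, total));
        System.out.printf(Locale.ROOT, "non-SAX errors: %.3f%%%n",
                percent(317_548L, total));
    }
}
```

Under that assumption the overall rate comes out at roughly 0.118%, and roughly 0.016% once the SAXParseExceptions are excluded.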

Do you know how many permanent hangs you had and can you identify those files 
easily enough?  I had about 6 in the govdocs1 corpus.

Thank you!


[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177058#comment-14177058
 ] 

William Palmer commented on TIKA-1302:
--


I have left the British Library (as of 20th October 2014).  Please contact 
maureen.penn...@bl.uk if you need to contact someone.

Any FOI requests should be sent to foi-enquir...@bl.uk.




[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054
 ] 

Tim Allison edited comment on TIKA-1302 at 10/20/14 5:22 PM:
-

That would be a fantastic resource.  Thank you for sharing!  We could do a bit 
of munging to prioritize most common exceptions in dependencies.

Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on 
the govdocs1 corpus, but in the same ballpark.  Interesting.

Do you know how many permanent hangs you had and can you identify those files 
easily enough?  I had about 6 in the govdocs1 corpus.

Thank you!

P.S. On the SAXParseExceptions...did those come from the XMLParser or from the 
HtmlParser?  I recently discovered that we hardcode an override in TikaResource 
within tika-server:
{noformat}
 parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
{noformat}

Not sure that we should hardcode that, but it does make sense to use that 
configuration!



[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177877#comment-14177877
 ] 

Chris A. Mattmann commented on TIKA-1302:
-

[~anjackson] thanks for sharing. [~gostep] has been working in this area and is 
currently running Tika in an HPC environment against govdocs (as is 
[~talli...@apache.org]). It would be great to coordinate here in Tika. Thanks 
for sharing this.


Re: Tika 1.6 update in Maven Central?

2014-10-20 Thread Mattmann, Chris A (3980)
Hi Aeham,

We do need to make a 1.7 release. I'd like to get TIKA-1422 fully
working on Windows first.

Do any of the other devs have things we should get into 1.7?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Aeham Abushwashi 
Reply-To: "u...@tika.apache.org" 
Date: Monday, October 20, 2014 at 4:27 PM
To: "u...@tika.apache.org" 
Subject: Tika 1.6 update in Maven Central?

>Hi,
>
>We use Tika 1.6, which is pulled, along with all of its dependencies, via
>maven. We've hit some issues with the conversion of 7z files but I
>believe these issues are addressed by recent changes (r1623593).
>
>Unfortunately, the 1.6 artifacts in the central maven repository are a
>couple of months old and predate the fix.
>
>Any ideas if/when the artifacts would be updated with the latest and
>greatest working code?
>
>
>Any suggestions for workarounds would be greatly appreciated too.
>
>Many thanks,
>Aeham



Re: 1.7 release?

2014-10-20 Thread Mattmann, Chris A (3980)
Hmm any idea why this is failing on Windows? Tyler P. and
I were talking the other day - maybe we shouldn't run the
tests from TIKA-1422 unless Tesseract is installed? Thoughts?
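
One way to act on that idea is to probe for the executable before the OCR tests run. The sketch below only shows the probe, using plain `java.io`; the class `ExecutableCheck` and the executable name `tesseract` are assumptions made for illustration, and this is not necessarily how the issue was resolved in Tika. In a JUnit 4 test the result could be passed to `Assume.assumeTrue`, which skips a test when its condition is false; that call is left out so the example stays dependency-free.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper: checks whether a named executable can be found on a search path.
final class ExecutableCheck {

    // Looks for name (and name + ".exe", for Windows) in each directory of pathValue.
    static boolean isOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        String[] suffixes = {"", ".exe"};
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String suffix : suffixes) {
                File candidate = new File(dir, name + suffix);
                if (candidate.isFile() && candidate.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    // Convenience overload that reads the PATH environment variable.
    static boolean isOnPath(String name) {
        return isOnPath(name, System.getenv("PATH"));
    }

    public static void main(String[] args) {
        // "tesseract" is the assumed name of the OCR command-line tool.
        System.out.println("tesseract on PATH: " + isOnPath("tesseract"));
    }
}
```

Checking the file system avoids launching a process just to find out whether the tool exists, at the cost of missing installations that are configured through an explicit path instead of `PATH`.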

-Original Message-
From: Hong-Thai Nguyen 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, October 16, 2014 at 2:03 AM
To: "dev@tika.apache.org" 
Subject: Re: 1.7 release?

>Hi Andrzej,
>
>We are eager for the 1.7 release too.
>I'm having a problem compiling TIKA-1422 on my side. If anyone can build it
>successfully on Windows, I have no objection to releasing 1.7.
>
>Thanks,
>
>On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki  wrote:
>
>> Hi,
>>
>> Any news on the 1.7 release? Or at least a 1.6.1 release that includes the
>> fix for broken ODF parsing...
>>
>> ---
>> Best regards,
>>
>> Andrzej Bialecki
>>
>>
>
>
>-- 
>--
>Hong-Thai



[jira] [Updated] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-20 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1451:
--
Attachment: integrate_recursive_metadata_wrapper.patch

This is v1 for the patch.  Any and all feedback is welcomed.

I don't like using mark/reset and double parsing for the gui, but I don't see 
an alternative (at least to double parsing).

Any recommendations?
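
For readers unfamiliar with the mechanism, the mark/reset pattern for reading one stream twice can be sketched with plain `java.io` as below. This is not the code in the attached patch and involves no Tika classes; `MarkResetDemo`, `drain`, and `readTwice` are hypothetical names, and `drain` merely stands in for a parse pass. It assumes Java 8 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: consume the same stream twice using mark/reset.
final class MarkResetDemo {

    // Reads the stream to its end; stands in for one parse pass.
    static byte[] drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Marks the stream, reads it fully, rewinds to the mark, and reads it again.
    static byte[][] readTwice(InputStream raw, int readLimit) {
        // Wrap streams that cannot be marked so that mark/reset is available.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(readLimit);
        byte[] first = drain(in);
        try {
            in.reset(); // may fail if more than readLimit bytes were read
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] second = drain(in);
        return new byte[][] {first, second};
    }

    public static void main(String[] args) {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        byte[][] passes = readTwice(new ByteArrayInputStream(data), data.length);
        System.out.println(new String(passes[0], StandardCharsets.UTF_8));
        System.out.println(new String(passes[1], StandardCharsets.UTF_8));
    }
}
```

The trade-off is that everything read after the mark has to be held in memory (up to `readLimit` bytes) so that the second pass can replay it.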

> Add Recursive Metadata Parser Wrapper output to tika-app and gui
> 
>
> Key: TIKA-1451
> URL: https://issues.apache.org/jira/browse/TIKA-1451
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: integrate_recursive_metadata_wrapper.patch
>
>
> It would be helpful to expose the output of the recursive metadata parser 
> wrapper in the gui and in the command line for tika-app.





[jira] [Created] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-20 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1451:
-

 Summary: Add Recursive Metadata Parser Wrapper output to tika-app 
and gui
 Key: TIKA-1451
 URL: https://issues.apache.org/jira/browse/TIKA-1451
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.7


It would be helpful to expose the output of the recursive metadata parser 
wrapper in the gui and in the command line for tika-app.





Re: 1.7 release?

2014-10-20 Thread Oleg Tikhonov
Hi, I can try this out.
What is "trunk"?


Thanks,
Oleg


Re: 1.7 release?

2014-10-20 Thread Mattmann, Chris A (3980)
Trunk is the current checkout/branch:

http://svn.apache.org/repos/asf/tika/trunk




Re: 1.7 release?

2014-10-20 Thread Oleg Tikhonov
Taken, thanks. In progress...


[jira] [Updated] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-20 Thread Oleg Tikhonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Tikhonov updated TIKA-1422:

Attachment: TIKA-1422.oleg.20141021.patch

Imports of the image parsers were missing in the TesseractOCRParser unit test.

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
> TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
> TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-20 Thread Oleg Tikhonov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178018#comment-14178018
 ] 

Oleg Tikhonov edited comment on TIKA-1422 at 10/21/14 6:19 AM:
---

Imports of the image parsers were missing in the TesseractOCRParser unit test.

Env:
Windows 7, PE, x64. 
java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

Output:
After importing the image parsers:
[INFO] 
[INFO] Building Apache Tika 1.7-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ tika ---
[INFO] Deleting E:\work_dir\tika\tika-site\target
[INFO]
[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---
[INFO]
[INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ tika ---
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika ---
[INFO] Installing E:\work_dir\tika\tika-site\pom.xml to \.m2\repository\org\apache\tika\tika\1.7-SNAPSHOT\tika-1.7-SNAPSHOT.pom
[INFO] 
[INFO] Reactor Summary:
[INFO]
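[INFO] Apache Tika parent ............................... SUCCESS [1.093s]
[INFO] Apache Tika core ................................. SUCCESS [14.594s]
[INFO] Apache Tika parsers .............................. SUCCESS [49.359s]
[INFO] Apache Tika XMP .................................. SUCCESS [1.161s]
[INFO] Apache Tika serialization ........................ SUCCESS [1.311s]
[INFO] Apache Tika application .......................... SUCCESS [11.725s]
[INFO] Apache Tika OSGi bundle .......................... SUCCESS [19.826s]
[INFO] Apache Tika server ............................... SUCCESS [15.705s]
[INFO] Apache Tika translate ............................ SUCCESS [1.476s]
[INFO] Apache Tika examples ............................. SUCCESS [2.231s]
[INFO] Apache Tika Java-7 Components .................... SUCCESS [1.429s]
[INFO] Apache Tika ...................................... SUCCESS [0.029s]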
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 2:00.578s
[INFO] Finished at: Tue Oct 21 08:12:17 IST 2014
[INFO] Final Memory: 67M/1156M
[INFO] 



was (Author: olegt):
Were missing imports of image parsers in the TesseractOCRParser unit test.

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
> TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
> TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more 
> tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< 
> FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLCont

Re: 1.7 release?

2014-10-20 Thread Oleg Tikhonov
Please give the newest patch a try.
Cheers,
Oleg

On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov 
wrote:

> Taken. Thanks. in progress ...
>
> On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Trunk is the current checkout/branch:
>>
>> http://svn.apache.org/repos/asf/tika/trunk
>>
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Oleg Tikhonov 
>> Reply-To: "dev@tika.apache.org" 
>> Date: Monday, October 20, 2014 at 10:16 PM
>> To: "dev@tika.apache.org" 
>> Subject: Re: 1.7 release?
>>
>> >Hi, I can try this on.
>> >What is a trunk?
>> >
>> >
>> >Thanks,
>> >Oleg
>> >
>> >On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) <
>> >chris.a.mattm...@jpl.nasa.gov> wrote:
>> >
>> >> Hmm any idea why this is failing on Windows? Tyler P. and
>> >> I were talking the other day - maybe we shouldn't run the
>> >> tests from TIKA-1422 unless Tesseract is installed? Thoughts?
>> >>
>> >> ++
>> >> Chris Mattmann, Ph.D.
>> >> Chief Architect
>> >> Instrument Software and Science Data Systems Section (398)
>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >> Office: 168-519, Mailstop: 168-527
>> >> Email: chris.a.mattm...@nasa.gov
>> >> WWW:  http://sunset.usc.edu/~mattmann/
>> >> ++
>> >> Adjunct Associate Professor, Computer Science Department
>> >> University of Southern California, Los Angeles, CA 90089 USA
>> >> ++
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -Original Message-
>> >> From: Hong-Thai Nguyen 
>> >> Reply-To: "dev@tika.apache.org" 
>> >> Date: Thursday, October 16, 2014 at 2:03 AM
>> >> To: "dev@tika.apache.org" 
>> >> Subject: Re: 1.7 release?
>> >>
>> >> >Hi Andrzej,
>> >> >
>> >> >We are impatient for the 1.7 release too.
>> >> >I'm having a problem compiling TIKA-1422 on my machine. If anyone can build
>> >> >successfully on Windows, I have no objection to releasing 1.7.
>> >> >
>> >> >Thanks,
>> >> >
>> >> >On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> Any news on the 1.7 release? or at least a 1.6.1 release that includes
>> >> >> the fix for broken ODF parsing...
>> >> >>
>> >> >> ---
>> >> >> Best regards,
>> >> >>
>> >> >> Andrzej Bialecki
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >--
>> >> >--
>> >> >Hong-Thai
>> >>
>> >>
>>
>>
>
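
The idea raised in the quoted thread, running the TIKA-1422 tests only when Tesseract is installed, could be approached roughly as follows. This is a minimal sketch, not the change that went into Tika: the class name `TesseractAvailability` and both method names are hypothetical, and it assumes the executable is called `tesseract` (or `tesseract.exe` on Windows) and is reachable through the `PATH` environment variable.

```java
import java.io.File;

// Hypothetical helper, not part of Tika: checks whether a Tesseract
// executable can be found on the PATH before OCR-dependent tests run.
public class TesseractAvailability {

    /**
     * Returns true if an executable file with the given base name exists in
     * one of the directories listed in the PATH-style string.
     */
    public static boolean isOnPath(String baseName, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        // Assumption: Windows installs carry an .exe suffix, so try both names.
        String[] candidates = { baseName, baseName + ".exe" };
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidates) {
                File f = new File(dir, name);
                if (f.isFile() && f.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Convenience wrapper that inspects the PATH of the current process. */
    public static boolean hasTesseract() {
        return isOnPath("tesseract", System.getenv("PATH"));
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + hasTesseract());
    }
}
```

In a JUnit 4 test, the result could be passed to `org.junit.Assume.assumeTrue(...)` at the start of the test method, so that the test is reported as skipped rather than failed on machines without Tesseract.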


Re: 1.7 release?

2014-10-20 Thread Mattmann, Chris A (3980)
Thanks Oleg, will try tomorrow, Los Angeles time!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++









Re: 1.7 release?

2014-10-20 Thread Oleg Tikhonov
Sorry!!!



[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178036#comment-14178036
 ] 

Lewis John McGibbney commented on TIKA-1423:


Hi [~vinegh] how is this coming on? Would you like a hand? It would be great to 
get this in to Tika 1.7

> Build a parser to extract data from GRIB formats
> 
>
> Key: TIKA-1423
> URL: https://issues.apache.org/jira/browse/TIKA-1423
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata, mime, parser
>Affects Versions: 1.6
>Reporter: Vineet Ghatge
>Priority: Critical
>  Labels: features, newbie
> Fix For: 1.7
>
> Attachments: GribParser.java, 
> NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2
>
>
> The Arctic dataset contains a MIME format called GRIB (General
> Regularly-distributed Information in Binary form):
> http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format
> used in meteorology to store historical and forecast weather data.
> There are 2 different types of the format: GRIB 0 and GRIB 2.
> The focus will be on GRIB 2, which is the most prevalent. Each GRIB record
> intended for either transmission or storage contains a single parameter with
> values located at an array of grid points, or represented as a set of
> spectral coefficients, for a single level (or layer), encoded as a continuous
> bit stream. Logical divisions of the record are designated as "sections",
> each of which provides control information and/or data. A GRIB record
> consists of six sections, two of which are optional:
>
> (0) Indicator Section
> (1) Product Definition Section (PDS)
> (2) Grid Description Section (GDS) - optional
> (3) Bit Map Section (BMS) - optional
> (4) Binary Data Section (BDS)
> (5) '' (ASCII Characters)
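
As an illustration of the Indicator Section listed above, here is a minimal sketch of how a detector might recognise a GRIB message from its first bytes. It is not the attached GribParser.java. The class and method names are hypothetical, and it assumes the commonly documented layout in which a message starts with the ASCII characters "GRIB" and stores the edition number in octet 8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the attached GribParser.java: inspects the first
// eight bytes of a file to see whether they look like a GRIB Indicator Section.
public class GribHeaderSniffer {

    /**
     * Returns the edition number found in the Indicator Section, or -1 if
     * the bytes do not start with the ASCII marker "GRIB".
     */
    public static int gribEdition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        // Assumption: octet 8 (index 7) carries the edition number.
        return header[7] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = { 'G', 'R', 'I', 'B', 0, 0, 0, 2 };
        System.out.println("edition: " + gribEdition(sample));
    }
}
```

A check like this only covers the first section; reading the remaining sections would require following the length fields defined for each edition.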



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Tika 1.6 update in Maven Central?

2014-10-20 Thread Lewis John Mcgibbney
Hi Chris,

On Mon, Oct 20, 2014 at 11:37 PM,  wrote:

>
> We do need to make a 1.7 release. I'd like to get TIKA-1422 fully
> working on Windows first.
>
> Any one of the other devs having things we should get into 1.7?
>
>
I would very much like to see
https://issues.apache.org/jira/browse/TIKA-1423 get into 1.7. We are nearly
there, we merely need to write unit tests, document methods, build this
into a patch and submit it to Jira for review. I will work with Vineet to
get this straightened out.
Thanks
Lewis


Re: Tika 1.6 update in Maven Central?

2014-10-20 Thread Mattmann, Chris A (3980)
Thanks Lewis!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++









[jira] [Assigned] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned TIKA-1423:
--

Assignee: Lewis John McGibbney

> Build a parser to extract data from GRIB formats
> 
>
> Key: TIKA-1423
> URL: https://issues.apache.org/jira/browse/TIKA-1423
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata, mime, parser
>Affects Versions: 1.6
>Reporter: Vineet Ghatge
>Assignee: Lewis John McGibbney
>Priority: Critical
>  Labels: features, newbie
> Fix For: 1.7
>
> Attachments: GribParser.java, 
> NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2
>
>
> The Arctic dataset contains a MIME format called GRIB (General
> Regularly-distributed Information in Binary form):
> http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format
> used in meteorology to store historical and forecast weather data.
> There are 2 different types of the format: GRIB 0 and GRIB 2.
> The focus will be on GRIB 2, which is the most prevalent. Each GRIB record
> intended for either transmission or storage contains a single parameter with
> values located at an array of grid points, or represented as a set of
> spectral coefficients, for a single level (or layer), encoded as a continuous
> bit stream. Logical divisions of the record are designated as "sections",
> each of which provides control information and/or data. A GRIB record
> consists of six sections, two of which are optional:
>
> (0) Indicator Section
> (1) Product Definition Section (PDS)
> (2) Grid Description Section (GDS) - optional
> (3) Bit Map Section (BMS) - optional
> (4) Binary Data Section (BDS)
> (5) '' (ASCII Characters)


