Re: including refactored docs from govdocs1 in test suite

2015-03-31 Thread Mattmann, Chris A (3980)
+1 to including the modified docs.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, March 30, 2015 at 6:51 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

I think this is an open question within Tika.  Some parsers prefer one
thing over another.  And there are different levels of corruption.

In the two cases where govdocs1 docs might be useful in tests, the
hyperlinks in .doc files do not appear to be standard, but  MSWord
opens them without a problem.  In cases where an application can open and
correctly process the content, I think we ought to try to extract content
without throwing exceptions.

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:39 AM
To: dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, Allison, Timothy B. talli...@mitre.org wrote:

 Unfortunately, no.  MSOffice fixes the document when I do that.

 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Monday, March 30, 2015 9:24 AM
 To: dev@tika.apache.org
 Subject: Re: including refactored docs from govdocs1 in test suite

 Can you copy the hyperlink into a new doc and change the URL? I have no
 idea about including the modified version.

 Tyler
 On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote:

  All,

    As part of TIKA-1512, I found that I can delete all of the contents,
  including the metadata, except for one hyperlink in two documents from
  govdocs1 and still get the proper behavior -- fail before fix, work after
  fix.

    These documents are in the public domain.

    Is it ok to include these modified documents in our test suite or should
  I avoid inclusion?

    Happy to avoid inclusion for the sake of a quick release of 1.8 and then
  we have time to discuss/determine way ahead... unless the answer is obvious.

   Best,

   Tim
 
  -Original Message-
  From: Allison, Timothy B. [mailto:talli...@mitre.org]
  Sent: Monday, March 30, 2015 7:03 AM
  To: dev@tika.apache.org
  Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
 
  Unless there are objections, I'd like these to be resolved before 1.8:
 
  TIKA-1584 -- I'll fix
  TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
  TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
  I'll leave this open and do some more digging to see if we need to open a
  ticket at the POI level
  TIKA-1511 -- I'll remove provided for xerial
 
  TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
 
  I'll have these fixes completed by noon EDT.  Should I run against
  govdocs1 before or after the RC?
 
  My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
  before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
  build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
  README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
  jars.
 
  Best,
 
Tim
 
 
 
  -Original Message-
  From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
  Sent: Sunday, March 29, 2015 9:13 AM
  To: dev@tika.apache.org
  Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
 
  Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
  something else pops up).
 
  Thank you everyone.
 
  Tyler
  On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote:
 
   +1 for 1.8
  
   Hong-Thai
  
    On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:

    Hi Folks,

    Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
    release a new version of Tika. I'll volunteer to be the release manager
    again.

    Should we release this as 1.8 or 1.7.1?

    Does anyone have any last minute issues they'd like to finish and see in
    Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
    TIKA-1586). Any others?

    Have a good weekend,
    Tyler
  
 



Re: including refactored docs from govdocs1 in test suite

2015-03-30 Thread Tyler Palsulich
Can you copy the hyperlink into a new doc and change the URL? I have no
idea about including the modified version.

Tyler
On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote:

 All,

   As part of TIKA-1512, I found that I can delete all of the contents,
 including the metadata, except for one hyperlink in two documents from
 govdocs1 and still get the proper behavior -- fail before fix, work after
 fix.

   These documents are in the public domain.

   Is it ok to include these modified documents in our test suite or should
 I avoid inclusion?

   Happy to avoid inclusion for the sake of a quick release of 1.8 and then
 we have time to discuss/determine way ahead... unless the answer is obvious.

  Best,

  Tim




RE: including refactored docs from govdocs1 in test suite

2015-03-30 Thread Allison, Timothy B.
Unfortunately, no.  MSOffice fixes the document when I do that.

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Monday, March 30, 2015 9:24 AM
To: dev@tika.apache.org
Subject: Re: including refactored docs from govdocs1 in test suite

Can you copy the hyperlink into a new doc and change the URL? I have no
idea about including the modified version.

Tyler



RE: including refactored docs from govdocs1 in test suite

2015-03-30 Thread Tyler Palsulich
Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, Allison, Timothy B. talli...@mitre.org wrote:

 Unfortunately, no.  MSOffice fixes the document when I do that.



Re: including refactored docs from govdocs1 in test suite

2015-03-30 Thread Konstantin Gribov
At the very least, a parser should not hang while processing a corrupted
document. IMHO, cases where parser code hangs should be considered blocker
issues.

Personally, I prefer the variant that returns a partial result plus some
metadata saying that document parsing failed somehow. But that can be hard
to do.
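
A minimal sketch of the "partial result plus failure metadata" idea, written
in Python rather than Tika's Java and deliberately not using any Tika or POI
API. The names ParseResult, parse_leniently, strict_utf8, parse_status and
parse_error are hypothetical and exist only for this illustration. The
"document" is assumed to be a sequence of byte chunks, which is a
simplification of a real file format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class ParseResult:
    """Text recovered so far plus metadata describing how parsing ended."""
    text: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_leniently(chunks: Iterable[bytes],
                    decode: Callable[[bytes], str]) -> ParseResult:
    """Decode chunks in order. On the first failure, keep what was already
    extracted and record the failure in metadata instead of raising."""
    result = ParseResult()
    try:
        for chunk in chunks:
            result.text.append(decode(chunk))
        result.metadata["parse_status"] = "ok"
    except Exception as exc:  # broad on purpose: any parser error is recorded
        result.metadata["parse_status"] = "partial"
        result.metadata["parse_error"] = f"{type(exc).__name__}: {exc}"
    return result


def strict_utf8(chunk: bytes) -> str:
    """Stand-in for a real parser step; raises on invalid UTF-8."""
    return chunk.decode("utf-8")


if __name__ == "__main__":
    # The third chunk is not valid UTF-8, so extraction stops there.
    doc = [b"first paragraph", b"second paragraph", b"\xff", b"never reached"]
    result = parse_leniently(doc, strict_utf8)
    print(result.text)
    print(result.metadata)
```

The caller gets whatever was extracted before the failure and can check
parse_status to decide whether to trust it. This does not address hanging
parsers, which would need a separate timeout or watchdog.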

-- 
Best regards,
Konstantin Gribov

Mon, 30 Mar 2015 at 16:52, Allison, Timothy B. talli...@mitre.org:

 I think this is an open question within Tika.  Some parsers prefer one
 thing over another.  And there are different levels of corruption.

 In the two cases where govdocs1 docs might be useful in tests, the
 hyperlinks in .doc files do not appear to be standard, but  MSWord opens
 them without a problem.  In cases where an application can open and
 correctly process the content, I think we ought to try to extract content
 without throwing exceptions.
