Re: including refactored docs from govdocs1 in test suite
+1 to including the modified docs.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, March 30, 2015 at 6:51 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

I think this is an open question within Tika. Some parsers prefer one thing over another. And there are different levels of corruption. In the two cases where govdocs1 docs might be useful in tests, the hyperlinks in .doc files do not appear to be standard, but MSWord opens them without a problem. In cases where an application can open and correctly process the content, I think we ought to try to extract content without throwing exceptions.

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:39 AM
To: dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see. In general, what is the goal with handling corrupted files? Extract as much as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, Allison, Timothy B. talli...@mitre.org wrote:

Unfortunately, no. MSOffice fixes the document when I do that.

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:24 AM
To: dev@tika.apache.org
Subject: Re: including refactored docs from govdocs1 in test suite

Can you copy the hyperlink into a new doc and change the URL? I have no idea about including the modified version.

Tyler

On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote:

All,

As part of TIKA-1512, I found that I can delete all of the contents, including the metadata, except for one hyperlink in two documents from govdocs1 and still get the proper behavior -- fail before fix, work after fix. These documents are in the public domain. Is it ok to include these modified documents in our test suite or should I avoid inclusion? Happy to avoid inclusion for the sake of a quick release of 1.8 and then we have time to discuss/determine way ahead... unless the answer is obvious.

Best,
Tim

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, March 30, 2015 7:03 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1

Unless there are objections, I'd like these to be resolved before 1.8:

TIKA-1584 -- I'll fix
TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but I'll leave this open and do some more digging to see if we need to open a ticket at the POI level
TIKA-1511 -- I'll remove provided for xerial
TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?

I'll have these fixes completed by noon EDT. Should I run against govdocs1 before or after the RC?

My last build of Tika app (a few days ago) ballooned to ~43MB, and that's before I add ~3MB for xerial. Tika server is now ~48MB. As of my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server jars.

Best,
Tim

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Sunday, March 29, 2015 9:13 AM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless something else pops up). Thank you everyone.

Tyler

On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote:

+1 for 1.8

Hong-Thai

On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:

Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others?

Have a good weekend,
Tyler
Re: including refactored docs from govdocs1 in test suite
Can you copy the hyperlink into a new doc and change the URL? I have no idea about including the modified version.

Tyler
RE: including refactored docs from govdocs1 in test suite
Unfortunately, no. MSOffice fixes the document when I do that.
RE: including refactored docs from govdocs1 in test suite
Ah. I see. In general, what is the goal with handling corrupted files? Extract as much as possible and fail gracefully?

Tyler
Re: including refactored docs from govdocs1 in test suite
At the very least, a parser should not hang when processing a corrupted document. IMHO, cases where parser code hangs should be considered blocker issues. Personally, I prefer the variant with a partial result and some metadata saying that document parsing failed somehow. But it can be hard to do.

--
Best regards,
Konstantin Gribov
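
The "partial result plus some metadata saying that parsing failed" preference voiced in this thread could be sketched roughly as follows. This is plain Java, not Tika or POI code: `LenientExtractDemo`, `Result`, `decode`, the `parse-status` and `parse-error` keys, and the `T:` record format are all hypothetical names invented for illustration. The sketch assumes a document is a list of records and that a malformed record triggers an `IndexOutOfBoundsException`, loosely mirroring the IOOBEs mentioned for TIKA-1512.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names are Tika or POI APIs.
final class LenientExtractDemo {

    /** Text recovered so far, plus metadata describing how parsing went. */
    static final class Result {
        final String text;
        final Map<String, String> metadata;

        Result(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /**
     * Stand-in for a format-specific decoding step. Records look like
     * "T:some text"; a truncated record such as "T" makes substring(2)
     * throw StringIndexOutOfBoundsException.
     */
    static String decode(String record) {
        return record.substring(2);
    }

    /**
     * Decodes records in order. On the first malformed record it stops,
     * keeps the text extracted so far, and records the failure in the
     * metadata instead of propagating the exception to the caller.
     */
    static Result extract(List<String> records) {
        StringBuilder text = new StringBuilder();
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("parse-status", "ok");
        for (int i = 0; i < records.size(); i++) {
            try {
                text.append(decode(records.get(i))).append('\n');
            } catch (IndexOutOfBoundsException e) {
                metadata.put("parse-status", "partial");
                metadata.put("parse-error",
                        "record " + i + ": " + e.getClass().getSimpleName());
                break; // stop here; later records are not attempted
            }
        }
        return new Result(text.toString(), metadata);
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        records.add("T:first paragraph");
        records.add("T"); // truncated record
        records.add("T:never reached");

        Result result = extract(records);
        System.out.print(result.text);
        System.out.println(result.metadata);
    }
}
```

Stopping at the first bad record is only one possible choice; a parser could instead skip the record and continue, at the cost of having to judge whether the remaining structure can still be trusted.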