[ 
https://issues.apache.org/jira/browse/LUCENE-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229275#comment-14229275
 ] 

Ramkumar Aiyengar commented on LUCENE-6073:
-------------------------------------------

bq. I'm confused about what looks like leniency in extract(). Does 
ExtractWikipedia do this too? Is there a good reason to ignore exceptions?

I haven't looked at ExtractWikipedia yet; it might be affected by the same 
directory deletion issue, so I will check. The only "good reason" was that the 
particular download I had happened to have bad data on one line, and it seemed 
reasonable to continue with the other files in that case: this is only 
benchmark data, so at worst we would end up with a few fewer docs.
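
To make the leniency concrete, here is a rough sketch of the idea, not the code 
from the patch. The class, the {{FileHandler}} interface and the {{*.sgm}} glob 
are hypothetical names picked for illustration. A failure in one file is logged 
and recorded, and the loop carries on with the remaining files:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of per-file error recovery; not the patch itself. */
class LenientExtractSketch {

  /** Stand-in for the real per-file extraction work (assumed shape). */
  interface FileHandler {
    void handle(Path file) throws IOException;
  }

  /**
   * Runs the handler over every *.sgm file in dir. A failure in one file is
   * logged and recorded, then processing continues with the next file.
   * Returns the files that could not be processed.
   */
  static List<Path> extractAll(Path dir, FileHandler handler) throws IOException {
    List<Path> failed = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.sgm")) {
      for (Path file : stream) {
        try {
          handler.handle(file);
        } catch (IOException | RuntimeException e) {
          // Log which file was skipped and why, instead of aborting the run.
          System.err.println("Skipping " + file + ": " + e);
          failed.add(file);
        }
      }
    }
    return failed;
  }
}
```

Returning the list of failed files would let the caller decide whether a few 
skipped files are acceptable, rather than swallowing the errors silently.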

bq. extractFile should just use java.io.LineNumberReader

Will check..
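
For reference, a minimal sketch of how {{java.io.LineNumberReader}} could report 
where bad data sits in a file. The class and method names are made up for 
illustration and this is not what extractFile currently does:

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;

/** Hypothetical sketch; only LineNumberReader itself is a real JDK class here. */
class LineNumberSketch {

  /** Returns the 1-based number of the first line containing marker, or -1 if none does. */
  static int firstLineContaining(Reader in, String marker) throws IOException {
    try (LineNumberReader reader = new LineNumberReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.contains(marker)) {
          // getLineNumber() counts the lines read so far, so after readLine()
          // it is the number of the line just returned.
          return reader.getLineNumber();
        }
      }
    }
    return -1;
  }
}
```

The reader keeps the count itself, so an error message can include the line 
number without a hand-maintained counter.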

bq. is there any way to test this thing? there is a 20-line testfile in 
o.a.l.benchmark.byTask

I just checked this by running {{ant get-files}} in the benchmark module 
(eventually called by {{ant run-task}}). Before this change it failed while 
trying to extract files on a clean checkout; with this change it no longer 
does. But did you mean through Jenkins, as a proper test suite? It could 
probably use one.
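
For anyone following along, the ordering problem from the issue description can 
be sketched with plain {{java.nio.file}} calls. This is an illustration with 
hypothetical names, not the patch, and it deliberately avoids {{IOUtils.rm}} so 
that it stands alone. Any old output is removed first and the directory is 
created last, so it still exists when extraction starts:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical sketch of "delete first, then create"; not the patch itself. */
class OutputDirSketch {

  /** Leaves outputDir as an existing, empty directory. */
  static void resetOutputDir(Path outputDir) throws IOException {
    if (Files.exists(outputDir)) {
      // Recursively delete old contents, then the directory itself.
      Files.walkFileTree(outputDir, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
          Files.delete(file);
          return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
          if (exc != null) {
            throw exc;
          }
          Files.delete(dir);
          return FileVisitResult.CONTINUE;
        }
      });
    }
    // Create the directory only after the cleanup, so it is not removed again.
    Files.createDirectories(outputDir);
  }
}
```

Doing the two steps in the opposite order, create and then recursively remove, 
is what leaves the task without an output directory.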

> Fix directory deletion in ExtractReuters, recover from errors
> -------------------------------------------------------------
>
>                 Key: LUCENE-6073
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6073
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/benchmark
>            Reporter: Ramkumar Aiyengar
>            Priority: Minor
>
> ExtractReuters in the benchmark module currently fails because it creates 
> the output directory and then calls {{IOUtils.rm}} on it, which removes all 
> files in it as well as the output directory itself. This issue is to fix 
> that behaviour.
> While I was at it, I also added a bit more logging in case of file errors 
> (the download I had contained some bad data) and made the task recover in 
> case of issues with one file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

