[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407034#comment-13407034 ]
Robert Muir commented on LUCENE-4190:
-------------------------------------

The problem with a global filter is that which files a codec uses is an implementation detail of the codec. Today a codec can name its files pretty much whatever it wants (it must avoid _seg.cfs, segments_seg, and segments.gen, of course).

Outside of exceptional cases, we know which files a codec owns because the codec writes the list of files it uses for a segment into the SegmentInfo (http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoFormat.html).

The problem is the exceptional cases: how can IndexFileDeleter distinguish leftover, partially written index files for a segment from the user's own files, when it may not have the SegmentInfo (.si) for that segment?

Previous attempts at this didn't work well:

* Listing the extensions() in the codec is not great, e.g. the Sep codec uses the .doc extension for documents!
* Having the codec list the files it uses for a segment isn't easy and causes a mess: previously files() had to be symmetric at read and write time, and we often had bugs in this, because the files a codec uses often depend on things like the options the user chooses (e.g. whether they enabled term vectors, payloads, etc.). I will do *anything* to prevent this from coming back!

So in my opinion, the only real third option is to restrict what file names a codec can use, in a way that's not a huge imposition on the codec. My patch on this issue (which people weren't happy with) did just this: it restricted file names to begin with an underscore.
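To make the file-name restriction concrete, here is a minimal sketch of what such a check could look like. It is not the attached patch and not a Lucene API: the class name, method names, and regular expressions are all hypothetical. It assumes segment files follow the `_<base36>(_X).Y` shape suggested in the issue description, and that commit files are named `segments_<gen>` or `segments.gen`.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; this is not Lucene code.
final class IndexFileNameCheck {

    // Assumed shape: "_" + base-36 segment name, optional "_suffix", then ".ext".
    private static final Pattern SEGMENT_FILE =
        Pattern.compile("_[0-9a-z]+(_[0-9A-Za-z_]+)?\\.[0-9A-Za-z]+");

    // Assumed shape of commit files: "segments_<gen>" or "segments.gen".
    private static final Pattern COMMIT_FILE =
        Pattern.compile("segments(_[0-9a-z]+|\\.gen)");

    // The looser rule from the comment: the name only has to begin with "_".
    static boolean startsWithUnderscore(String name) {
        return name.startsWith("_");
    }

    // The stricter rule: the name must match the assumed segment-file pattern.
    static boolean matchesSegmentPattern(String name) {
        return SEGMENT_FILE.matcher(name).matches();
    }

    // A deleter would only consider a file its own if it passes this check;
    // anything else in the directory is left alone.
    static boolean looksLikeIndexFile(String name) {
        return matchesSegmentPattern(name) || COMMIT_FILE.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "_0.fdt", "_a1_Lucene40_0.frq", "segments_2", "segments.gen",
            "notes.doc", "boot.ini"
        };
        for (String name : names) {
            System.out.println(name + " -> " + looksLikeIndexFile(name));
        }
    }
}
```

A pattern like this narrows the set of names the deleter may touch, but it does not eliminate the risk: a user file named `_notes.doc` would still match, which is why the restriction is a mitigation rather than a guarantee.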
> IndexWriter deletes non-Lucene files
> ------------------------------------
>
>                 Key: LUCENE-4190
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4190
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Robert Muir
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4190.patch, LUCENE-4190.patch
>
>
> Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog
> post:
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
> IndexWriter will now (as of 4.0) delete all foreign files from the index
> directory. We made this change because Codecs are free to write to any files
> now, so the space of filenames is hard to "bound".
> But if the user accidentally uses the wrong directory (eg c:/) then we will
> in fact delete important stuff.
> I think we can at least use some simple criteria (must start with _, maybe
> must fit certain pattern eg _<base36>(_X).Y), so we are much less likely to
> delete a non-Lucene file....