[
https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407034#comment-13407034
]
Robert Muir commented on LUCENE-4190:
-------------------------------------
The problem with a global filter is that the files a codec uses are an
implementation detail of the codec. Currently, a codec can name its files
pretty much whatever it wants (it must avoid _seg.cfs, segments_seg, and
segments.gen, of course).
Outside of exceptional cases, we know which files a codec owns because the
codec writes the list of files that it uses for a segment into the
SegmentInfo
(http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoFormat.html).
The problem is these exceptional cases: how can IndexFileDeleter distinguish
between leftover partially written index files for a segment and some files of
the user, since it may not have the SegmentInfo (.si) for that segment?
Previous attempts at this still didn't work well:
* listing the extensions() in the codec is not great, e.g. Sep codec uses .doc
extension for documents!
* having the codec list the files it uses for a segment isn't easy and causes a
mess: previously files() had to be symmetric at read and write time and we
often had bugs in this, because the files used by the codec often depend upon
various things like options the user chooses (e.g. did they enable term
vectors, payloads, etc.). I will do *anything* to prevent this from coming
back!
So in my opinion, the only real third option is to restrict what file names a
codec can use, in a way that's not a huge imposition on the codec. My patch on
this issue (which people weren't happy with) did just this: it restricted file
names to begin with an underscore.
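To make the idea concrete, here is a rough sketch of what such a restriction
buys a deleter. This is not the attached patch and not Lucene API: the class
UnderscoreFileFilter and its methods are hypothetical names, and the file
names in main() are only illustrative. It models the codec rule alone; the
segments_N and segments.gen files would need separate handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene and not the LUCENE-4190 patch.
public class UnderscoreFileFilter {

    // Assumed rule: a codec may only create files whose names begin with '_'.
    // Anything else in the directory cannot belong to a codec.
    static boolean mayBeCodecFile(String fileName) {
        return fileName.startsWith("_");
    }

    // Returns the files a deleter would even be allowed to consider.
    // Files that fail the check are never touched, whatever their extension.
    static List<String> deletionCandidates(List<String> directoryListing) {
        List<String> candidates = new ArrayList<>();
        for (String name : directoryListing) {
            if (mayBeCodecFile(name)) {
                candidates.add(name);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Illustrative listing: some index-like names mixed with user files.
        List<String> listing = Arrays.asList(
            "_0.cfs", "_1.doc", "report.doc", "notes.txt", "segments_2");
        System.out.println(deletionCandidates(listing));
    }
}
```

Note that under this rule report.doc is left alone even though .doc is also an
extension a codec might use, which is exactly where the extensions() approach
fell down. A file that merely starts with an underscore but is not a Lucene
file would still be a candidate, so this narrows the risk rather than
eliminating it.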
> IndexWriter deletes non-Lucene files
> ------------------------------------
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Robert Muir
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-4190.patch, LUCENE-4190.patch
>
>
> Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog
> post:
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
> IndexWriter will now (as of 4.0) delete all foreign files from the index
> directory. We made this change because Codecs are free to write to any files
> now, so the space of filenames is hard to "bound".
> But if the user accidentally uses the wrong directory (eg c:/) then we will
> in fact delete important stuff.
> I think we can at least use some simple criteria (must start with _, maybe
> must fit certain pattern eg _<base36>(_X).Y), so we are much less likely to
> delete a non-Lucene file....