[ 
https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407034#comment-13407034
 ] 

Robert Muir commented on LUCENE-4190:
-------------------------------------

The problem with a global filter is that which files a codec uses is an 
implementation detail of the codec. Currently, a codec can name its files 
pretty much whatever it wants (it must avoid _seg.cfs, segments_seg, and 
segments.gen, of course).
 
In general, exceptional cases aside, we know which files a codec owns because 
the codec writes the list of files it uses for a segment into the SegmentInfo 
(http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40SegmentInfoFormat.html).

The problem is these exceptional cases: how can IndexFileDeleter distinguish 
between leftover partially written index files for a segment and some files of 
the user, since it may not have the SegmentInfo (.si) for that segment?

Previous attempts at this still didn't work well:
* listing the extensions() in the codec is not great, e.g. Sep codec uses .doc 
extension for documents!
* having the codec list the files it uses for a segment isn't easy and causes a 
mess: previously files() had to be symmetric at read and write time and we 
often had bugs in this, because the files used by the codec often depend upon 
various things like options the user chooses (e.g. did they enable term 
vectors, payloads, etc.). I will do *anything* to prevent this from coming 
back!
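
To make the first point concrete, here is a minimal sketch of why an 
extension-based ownership check is fragile. The class and method names are 
hypothetical and this is not Lucene's API; only the "doc" extension comes from 
the discussion above, and "pos" is an illustrative placeholder.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of Lucene: decide ownership by extension only.
final class ExtensionOwnershipSketch {
    // Assumed extension list; "doc" is the example from the comment,
    // "pos" is a placeholder added for illustration.
    private static final Set<String> CODEC_EXTENSIONS =
            new HashSet<>(Arrays.asList("doc", "pos"));

    // Returns true if the file's extension is one the codec claims.
    static boolean claimedByExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        return CODEC_EXTENSIONS.contains(fileName.substring(dot + 1));
    }

    public static void main(String[] args) {
        // A segment file and an unrelated user file are indistinguishable.
        System.out.println(claimedByExtension("_0.doc"));
        System.out.println(claimedByExtension("thesis.doc"));
        System.out.println(claimedByExtension("notes.txt"));
    }
}
```

Under this check a user's thesis.doc is claimed just like _0.doc, so a deleter 
relying on it alone could remove files that have nothing to do with the index.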

So in my opinion, the only real option left, a third one, is to restrict what 
file names a codec can use, in a way that's not a huge imposition on the codec. 
My patch on this issue (which people weren't happy with) did just this: it 
restricted file names to begin with an underscore.
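
As a rough illustration of the naming restriction (a sketch only, not the 
attached patch), the check below treats a file as a deletion candidate only if 
its name starts with an underscore. The stricter regex is one assumed reading 
of the "_<base36>(_X).Y" pattern from the issue description; the class and 
method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual LUCENE-4190 patch.
final class IndexFileNameSketch {
    // Assumed interpretation of "_<base36>(_X).Y": underscore, base-36
    // segment name, optional "_suffix" parts, then a dot and an extension.
    private static final Pattern STRICT_SEGMENT_FILE =
            Pattern.compile("_[0-9a-z]+(_[0-9A-Za-z]+)*\\.[0-9A-Za-z]+");

    // Simple rule: only names beginning with '_' can belong to a codec.
    static boolean startsWithUnderscore(String fileName) {
        return fileName.startsWith("_");
    }

    // Stricter rule: the name must also fit the assumed segment-file pattern.
    static boolean matchesStrictPattern(String fileName) {
        return STRICT_SEGMENT_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"_0.fdt", "_0_Lucene40_0.frq", "boot.ini", "_My File.doc"};
        for (String name : names) {
            System.out.println(name + " underscore=" + startsWithUnderscore(name)
                    + " strict=" + matchesStrictPattern(name));
        }
    }
}
```

Neither rule can prove that a file belongs to Lucene; each only narrows the set 
of names that are ever considered for deletion, so a file such as boot.ini in a 
mistakenly chosen directory would be left alone.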

                
> IndexWriter deletes non-Lucene files
> ------------------------------------
>
>                 Key: LUCENE-4190
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4190
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Robert Muir
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4190.patch, LUCENE-4190.patch
>
>
> Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog 
> post: 
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
> IndexWriter will now (as of 4.0) delete all foreign files from the index 
> directory.  We made this change because Codecs are free to write to any files 
> now, so the space of filenames is hard to "bound".
> But if the user accidentally uses the wrong directory (eg c:/) then we will 
> in fact delete important stuff.
> I think we can at least use some simple criteria (must start with _, maybe 
> must fit certain pattern eg _<base36>(_X).Y), so we are much less likely to 
> delete a non-Lucene file....
