[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

Lance Norskog (JIRA) Thu, 29 Apr 2010 14:07:20 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862407#action_12862407
 ]


Lance Norskog commented on LUCENE-2373:
---------------------------------------

bq. Lance: yes. The original use case I had in mind was HDFS (Hadoop File 
System) which already implements on-the-fly checksums. If we go the way that 
Mike suggested, i.e. implementing a separate codec, then this should be a 
simple addition. We could also perhaps structure this as a codec wrapper so 
that this capability can be applied to other codecs too.

+1 for doing this in Lucene itself. Many large installations don't use HDFS 
to move shards around. Also, the HDFS checksum only takes effect once the file 
has landed at the HDFS portal: before that point there are error rates in 
local RAM, local hard disk, shared file systems and network I/O. Computing the 
checksum at the origin is more useful.
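
To illustrate what "checksum at the origin" could look like, here is a minimal 
sketch that uses only java.util.zip.CRC32 and java.nio.ByteBuffer, not any 
Lucene codec API. The class and method names (OriginChecksumSketch, seal, 
verify) and the 8-byte footer layout are assumptions made for illustration; 
they are not taken from the issue or from Lucene.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper: names and footer layout are illustrative assumptions.
class OriginChecksumSketch {

    /** Returns payload followed by an 8-byte big-endian CRC32 footer. */
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /** Recomputes the CRC32 over the payload and compares it with the footer. */
    static boolean verify(byte[] sealed) {
        if (sealed.length < Long.BYTES) {
            return false; // too short to contain a footer
        }
        int payloadLength = sealed.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(sealed, 0, payloadLength);
        long stored = ByteBuffer.wrap(sealed, payloadLength, Long.BYTES).getLong();
        return stored == crc.getValue();
    }
}
```

Because the checksum is computed where the bytes are produced, a later call to 
verify(...) covers every hop the file takes afterwards, not only the part that 
HDFS sees. Appending the checksum as a footer also keeps the write append-only.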

> Change StandardTermsDictWriter to work with streaming and append-only 
> filesystems
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-2373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2373
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
>
>
> Since the early 2.x days, Lucene has used a skip/seek/write trick to patch 
> the length of the terms dict into a place near the start of the output data 
> file. This, however, makes it impossible to use Lucene with append-only 
> filesystems such as HDFS.
> In the post-flex trunk the following code in StandardTermsDictWriter 
> initiates this:
> {code}
>     // Count indexed fields up front
>     CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT); 
>     out.writeLong(0); // leave space for end index pointer
> {code}
> and completes this in close():
> {code}
>       out.seek(CodecUtil.headerLength(CODEC_NAME));
>       out.writeLong(dirStart);
> {code}
> I propose to change this layout so that this pointer is stored simply at the 
> end of the file. It's always 8 bytes long, and we know the final length of 
> the file from Directory, so it's a single additional seek(length - 8) to read 
> it, which is not much considering the benefits.
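
The layout proposed in the description can be sketched as follows. This is a 
rough illustration using plain java.io and java.nio on an in-memory byte 
array, not Lucene's Directory API; the class and method names 
(TrailerPointerSketch, writeAppendOnly, readDirStart) and the three-part file 
layout are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch: names and layout are illustrative, not Lucene code.
class TrailerPointerSketch {

    static final int POINTER_BYTES = Long.BYTES; // the pointer is always 8 bytes

    /**
     * Writes header, terms data, directory and finally the directory start
     * offset, strictly in order. No seek back into the file is needed.
     */
    static byte[] writeAppendOnly(byte[] header, byte[] terms, byte[] directory) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, header.length);
        out.write(terms, 0, terms.length);
        long dirStart = out.size(); // offset where the directory begins
        out.write(directory, 0, directory.length);
        byte[] pointer = ByteBuffer.allocate(POINTER_BYTES).putLong(dirStart).array();
        out.write(pointer, 0, pointer.length); // trailer: last 8 bytes of the file
        return out.toByteArray();
    }

    /** Reads the pointer from the last 8 bytes, i.e. at offset (length - 8). */
    static long readDirStart(byte[] file) {
        return ByteBuffer.wrap(file, file.length - POINTER_BYTES, POINTER_BYTES).getLong();
    }
}
```

The reader pays one extra positioning step to (length - 8), which matches the 
cost estimate in the description, while the writer never revisits bytes it has 
already emitted.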

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


