[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-05-06 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864957#action_12864957
 ] 

Lance Norskog commented on LUCENE-2373:
---

bq. HDFS uses 64 or 128 _Mega_Byte blocks. 
Yet another reason to manage memory carefully. 

It should be possible to hit this watermark by using the NoMergePolicy and a 
RAM buffer size of 64 MB or 128 MB. Hitting the RAM buffer size causes a 
segment to flush to a file with little breakage (unused disk space), and the 
segment will never be merged again, which cuts HDFS overhead. This should give 
a predictable and consistent segment-writing overhead, right?


 Change StandardTermsDictWriter to work with streaming and append-only 
 filesystems
 -

 Key: LUCENE-2373
 URL: https://issues.apache.org/jira/browse/LUCENE-2373
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Andrzej Bialecki 
 Fix For: 4.0


 Since early 2.x times Lucene used a skip/seek/write trick to patch the length 
 of the terms dict into a place near the start of the output data file. This 
 however made it impossible to use Lucene with append-only filesystems such as 
 HDFS.
 In the post-flex trunk the following code in StandardTermsDictWriter 
 initiates this:
 {code}
 // Count indexed fields up front
 CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT); 
 out.writeLong(0); // leave space for end index pointer
 {code}
 and completes this in close():
 {code}
   out.seek(CodecUtil.headerLength(CODEC_NAME));
   out.writeLong(dirStart);
 {code}
 I propose to change this layout so that this pointer is stored simply at the 
 end of the file. It's always 8 bytes long, and we know the final length of 
 the file from Directory, so it's a single additional seek(length - 8) to read 
 it, which is not much considering the benefits.
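
The proposed layout can be illustrated with a minimal sketch. It uses only 
java.io and java.nio rather than Lucene's IndexOutput, and the class name, the 
magic number and the byte arrays are made up for illustration; the point is 
only that the writer never seeks backwards and the reader finds the pointer 
with one positioned read at (length - 8).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical illustration only; this is not Lucene code. */
final class TrailerPointerSketch {

    /** Writes header, body, directory, then the 8-byte pointer, strictly in append order. */
    static byte[] write(byte[] body, byte[] directory) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(0x534B4554);     // made-up magic number standing in for the codec header
            out.write(body);              // stand-in for the term data
            long dirStart = out.size();   // offset at which the directory begins
            out.write(directory);
            out.writeLong(dirStart);      // trailer: always the last 8 bytes of the file
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the pointer back from the last 8 bytes of the file. */
    static long readDirStart(byte[] file) {
        return ByteBuffer.wrap(file).getLong(file.length - Long.BYTES);
    }

    public static void main(String[] args) {
        byte[] file = write(new byte[] {1, 2, 3}, new byte[] {9, 9});
        System.out.println("dirStart = " + readDirStart(file)); // prints dirStart = 7
    }
}
```

The reader depends on knowing the exact file length, which is the assumption 
questioned in the comments on this issue.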

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862568#action_12862568
 ] 

Andrzej Bialecki  commented on LUCENE-2373:
---

HDFS uses 64 or 128 _Mega_Byte blocks.




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861361#action_12861361
 ] 

Andrzej Bialecki  commented on LUCENE-2373:
---

bq. I think that's a good plan - abstract the header write/read methods so that 
another codec can easily subclass to change how/where these are written. I 
think Lucene's default (standard) codec should continue to do what it does now? 
And then HDFS can take the standard codec, and subclass 
StandardTermsDictWriter/Reader to put the header at the end.

Assuming we add writeHeader/writeTrailer methods, the standard codec would 
write the header as it does today using writeHeader(), and in writeTrailer() it 
would just patch it the same way it does today.
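
A rough sketch of how such hooks might look, assuming a template-method 
design. All class and method names below are hypothetical and an in-memory 
ByteBuffer stands in for IndexOutput, so this shows only the shape of the 
idea, not the actual StandardTermsDictWriter code.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch; the names are illustrative, not Lucene's API. */
abstract class DictWriterSketch {
    static final int MAGIC = 0x54444943; // made-up header marker
    // Small fixed buffer standing in for the output file; enough for the demo only.
    protected final ByteBuffer out = ByteBuffer.allocate(64);

    /** Called once before any term data is written. */
    protected abstract void writeHeader();

    /** Called once from finish() with the offset of the directory. */
    protected abstract void writeTrailer(long dirStart);

    final byte[] finish(byte[] body) {
        writeHeader();
        out.put(body);
        long dirStart = out.position();
        out.put((byte) 0x7F);              // stand-in for the directory contents
        writeTrailer(dirStart);
        byte[] file = new byte[out.position()];
        out.flip();
        out.get(file);
        return file;
    }
}

/** Seek-and-patch behaviour: reserve a slot after the header, fill it in at the end. */
final class PatchingWriter extends DictWriterSketch {
    private int slot;

    @Override protected void writeHeader() {
        out.putInt(MAGIC);
        slot = out.position();
        out.putLong(0L);                   // leave space for the pointer
    }

    @Override protected void writeTrailer(long dirStart) {
        out.putLong(slot, dirStart);       // absolute put: the "seek back and patch" step
    }
}

/** Append-only variant: nothing is reserved and the pointer is written last. */
final class AppendingWriter extends DictWriterSketch {
    @Override protected void writeHeader() { out.putInt(MAGIC); }

    @Override protected void writeTrailer(long dirStart) { out.putLong(dirStart); }
}

final class DictWriterDemo {
    static long patchedPointer(byte[] body) {
        byte[] f = new PatchingWriter().finish(body);
        return ByteBuffer.wrap(f).getLong(Integer.BYTES);          // right after the header
    }

    static long appendedPointer(byte[] body) {
        byte[] f = new AppendingWriter().finish(body);
        return ByteBuffer.wrap(f).getLong(f.length - Long.BYTES);  // last 8 bytes
    }

    public static void main(String[] args) {
        byte[] body = {1, 2, 3};
        System.out.println(patchedPointer(body) + " " + appendedPointer(body));
    }
}
```

A matching reader would need the same pair of hooks, so that each subclass 
knows where to look for the pointer it wrote.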

{quote}bq. Codecs that operate on filesystems with unreliable fileLength 
could write a sync marker before the trailer, and there could be a 
back-tracking mechanism that starts from the reported fileLength and then tries 
to find the sync marker (reading back, and/or ahead).

Can't we just use the current standard codec's approach by default? 
Back-tracking seems dangerous. Eg what if .fileLength() is too small on such 
filesystems?
{quote}

Yes, of course, I was just dreaming up a filesystem that is both append-only 
and with unreliable fileLength ... not that I know of any off-hand :)




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-26 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861214#action_12861214
 ] 

Lance Norskog commented on LUCENE-2373:
---

Does this make it possible to add a good checksum? 

The Cloud and NRT architectures involve copying lots of segment files around, 
and disk, RAM, and network bandwidth all have error rates. It would be great if the 
process of making an index file included, on the fly, the creation of a solid 
checksum that is then baked into the file at the last moment. It should also be 
in the segments.gen file, but it is more important that the file should have 
the checksum embedded such that walking the whole file gives a fixed value.
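
As a rough illustration of the idea, here is a sketch that computes a CRC32 
while the data is written and appends it as an 8-byte trailer. It uses only 
java.util.zip and java.io; the class and method names are made up, and the 
choice of CRC32 is an assumption. It is also a simpler variant than the one 
described above: verification recomputes the checksum over the payload and 
compares it with the stored value, rather than walking the whole file to a 
fixed value.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

/** Hypothetical illustration only; this is not Lucene code. */
final class ChecksumTrailerSketch {

    /** Streams the payload through a CRC32 and appends the checksum as the last 8 bytes. */
    static byte[] writeWithChecksum(byte[] payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            CheckedOutputStream checked = new CheckedOutputStream(bytes, new CRC32());
            checked.write(payload);                       // checksum is updated on the fly
            checked.flush();
            long crc = checked.getChecksum().getValue();
            // Write the trailer to the underlying stream so it is not folded into the CRC.
            DataOutputStream trailer = new DataOutputStream(bytes);
            trailer.writeLong(crc);
            trailer.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Recomputes the CRC32 over everything except the trailer and compares. */
    static boolean verify(byte[] file) {
        if (file.length < Long.BYTES) {
            return false;                                 // too short to hold a trailer
        }
        int payloadLength = file.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);
        long stored = ByteBuffer.wrap(file).getLong(payloadLength);
        return crc.getValue() == stored;
    }

    public static void main(String[] args) {
        byte[] file = writeWithChecksum(new byte[] {1, 2, 3, 4});
        System.out.println("intact: " + verify(file));
        file[0] ^= 0x01;                                  // simulate a single flipped bit
        System.out.println("after corruption: " + verify(file));
    }
}
```

Because the checksum is appended rather than patched into a header, this 
fits the append-only layout discussed in this issue.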




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855877#action_12855877
 ] 

Shai Erera commented on LUCENE-2373:


I'd rather not count on file length as well ... so a put/getTermDictSize method 
on Codec will allow one to implement it however one wants, if running on HDFS 
for example?




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854409#action_12854409
 ] 

Michael McCandless commented on LUCENE-2373:


I would love to make Lucene truly write once (and remove IndexOutput.seek), 
but... this approach makes me a little nervous...

In some environments, relying on the length of the file to be accurate might be 
risky: it's metadata, that can be subject to different client-side caching than 
the file's contents.  EG on NFS I've seen issues where the file length was 
stale yet the file contents were not.

Maybe we could offer a separate codec that takes this approach, for use on 
filesystems like HDFS that can't seek during write?  We should refactor 
standard codec so that where this long gets stored can be easily overridden 
by a subclass.

Or, alternatively, we could write this index of the index to a separate file?




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854240#action_12854240
 ] 

Andrzej Bialecki  commented on LUCENE-2373:
---

Just noticed that the same problem exists in SimpleStandardTermsIndexWriter, 
and I propose the same solution there.




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-06 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854254#action_12854254
 ] 

Earwin Burrfoot commented on LUCENE-2373:
-

And then IndexOutput.seek() can be deleted. Cool.
