We considered patching this code when we ran into a data consistency
bug with our file system. It wasn't too difficult to patch
CompoundFileWriter to output lengths instead of offsets.

Naturally, I can't seem to find my implementation of this, but as I
recall it was straightforward. I think it also required adding a format
version number at the beginning of the .cfs file so that
CompoundFileReader knew which data format it was reading.

Anyway, I just want to point out that there was an easy enough
workaround for this that didn't require creating a new set of files.
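
A minimal sketch of that workaround, assuming the Lucene 2.x-era
Directory/IndexOutput API (the class name, FORMAT_VERSION constant,
and layout here are illustrative, not the actual patch):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    class LengthBasedCompoundWriter {
      // Hypothetical marker so CompoundFileReader can tell the new
      // length-based format from the old offset-based one.
      static final int FORMAT_VERSION = -1;

      static void write(Directory dir, String cfsName, String[] files)
          throws IOException {
        IndexOutput out = dir.createOutput(cfsName);
        try {
          out.writeInt(FORMAT_VERSION);
          out.writeVInt(files.length);
          // Lengths, unlike offsets, are known before any data is
          // copied, so the directory table needs no later patching.
          for (String f : files) {
            out.writeString(f);
            out.writeLong(dir.fileLength(f));
          }
          byte[] buf = new byte[4096];
          for (String f : files) {
            IndexInput in = dir.openInput(f);
            try {
              long remaining = in.length();
              while (remaining > 0) {
                int chunk = (int) Math.min(buf.length, remaining);
                in.readBytes(buf, 0, chunk);
                out.writeBytes(buf, chunk);
                remaining -= chunk;
              }
            } finally {
              in.close();
            }
          }
        } finally {
          out.close();
        }
      }
    }

The reader would rebuild each offset as a running sum of the lengths,
so no seek() is ever needed on the write path.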

-----Original Message-----
From: Michael McCandless (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Saturday, November 11, 2006 8:04 AM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop
distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448989 ]
            
Michael McCandless commented on LUCENE-532:
-------------------------------------------

Alas, in trying to change the CFS format so that file offsets are stored
at the end of the file, I discovered while implementing the
corresponding changes to CompoundFileReader that this approach isn't
viable.  I had been thinking the reader would look at the file length,
subtract numEntry*sizeof(long), seek to there, and then read the offsets
(longs).  The problem is: we can't know sizeof(long), since that depends
on the actual storage implementation, i.e., for the same reasoning
above: we can't assume a byte always equals 1 file position.
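
Concretely, the reader logic I had in mind was roughly this (a sketch,
not actual code; numEntries would have to come from somewhere, e.g. a
header field):

    static long[] readOffsets(Directory dir, String cfsName,
                              int numEntries) throws IOException {
      IndexInput in = dir.openInput(cfsName);
      try {
        long[] offsets = new long[numEntries];
        // Flawed step: the subtraction silently assumes
        // sizeof(long) == 8 file positions, i.e. that one byte always
        // advances the file pointer by exactly one position, which the
        // Directory abstraction does not guarantee.
        in.seek(in.length() - (long) numEntries * 8);
        for (int i = 0; i < numEntries; i++) {
          offsets[i] = in.readLong();
        }
        return offsets;
      } finally {
        in.close();
      }
    }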

So, then, the only solution I can think of (to avoid seek during write)
would be to write a separate file, for each *.cfs file, that contains
the file offsets corresponding to the cfs file.  E.g., if we have _1.cfs
we would also have _1.cfsx holding the file offsets.  This is somewhat
costly if we care about the number of files (it doubles the file count
in the simple case of a bunch of segments with no deletes/separate
norms).
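
Something like this sketch, say (the .cfsx layout is just an assumption
for illustration; the offsets would be recorded via getFilePointer()
while copying data into the .cfs file):

    // Hypothetical companion-file writer: stream the (name, offset)
    // pairs out to _1.cfsx.  Both files are written strictly front to
    // back, so no seek() is ever needed during write.
    static void writeOffsetFile(Directory dir, String segment,
                                String[] names, long[] offsets)
        throws IOException {
      IndexOutput idx = dir.createOutput(segment + ".cfsx");
      try {
        idx.writeVInt(names.length);
        for (int i = 0; i < names.length; i++) {
          idx.writeString(names[i]);
          idx.writeLong(offsets[i]);
        }
      } finally {
        idx.close();
      }
    }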

Yonik had actually mentioned in LUCENE-704 that fixing CFS writing to
not use seek was not very important, i.e., it would be OK not to use
compound files with HDFS as the store.

Does anyone see a better approach?

> [PATCH] Indexing on Hadoop distributed file system
> --------------------------------------------------
>
>                 Key: LUCENE-532
>                 URL: http://issues.apache.org/jira/browse/LUCENE-532
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9
>            Reporter: Igor Bolotin
>            Priority: Minor
>         Attachments: indexOnDFS.patch, SegmentTermEnum.patch, TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene
> indexes on the Hadoop distributed file system. When we tried to do it
> directly on DFS using the Nutch FsDirectory class, we immediately
> found that indexing fails because the DfsIndexOutput.seek() method
> throws UnsupportedOperationException. The reason for this behavior is
> clear: DFS does not support random updates, so the seek() method
> can't be supported (at least not easily).
>  
> Well, if we can't support random updates, the question is: do we
> really need them? A search of the Lucene code revealed 2 places that
> call the IndexOutput.seek() method: one in TermInfosWriter and
> another in CompoundFileWriter. As we weren't planning to use
> CompoundFileWriter, the only place that concerned us was
> TermInfosWriter.
>  
> TermInfosWriter uses IndexOutput.seek() in its close() method to
> write the total number of terms in the file back into the beginning
> of the file. It was very simple to change the file format a little
> and write the number of terms into the last 8 bytes of the file
> instead of the beginning. The only other place that needs fixing for
> this to work is the SegmentTermEnum constructor, which should read
> this value at position = file length - 8.
>  
> With this format hack we were able to use FsDirectory to write an
> index directly to DFS without any problems. Well, we still don't
> index directly to DFS for performance reasons, but at least we can
> build small local indexes and merge them into the main index on DFS
> without copying the big main index back and forth.
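
For reference, the trailer trick described above amounts to something
like this (a sketch against the Lucene 1.9-era IndexInput/IndexOutput
API; the class and method names are illustrative, not the patch
itself):

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    class TermCountTrailer {
      // Writer side: on close, append the term count as the final 8
      // bytes instead of seeking back to patch the file header.
      static void writeTrailer(IndexOutput out, long termCount)
          throws IOException {
        out.writeLong(termCount); // no seek(): file written front to back
        out.close();
      }

      // Reader side (as in the SegmentTermEnum constructor): seek to
      // length - 8 and read the count.  Seeking is fine here because
      // DFS only disallows random updates on the write path, not reads.
      static long readTrailer(IndexInput in) throws IOException {
        in.seek(in.length() - 8);
        return in.readLong();
      }
    }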

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]