[jira] [Commented] (AVRO-1090) DataFileWriter should expose "sync marker" to allow concurrent writes to same .avro file

Catalin Alexandru Zamfir (JIRA) Thu, 17 May 2012 12:46:33 -0700

    [ https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278157#comment-13278157 ]


Catalin Alexandru Zamfir commented on AVRO-1090:
------------------------------------------------

Confirmation: it works. "Traversed" in the log below means the rows were 
read back. Our tests wrote about 700 GB of data and were able to read it back.
{code}
22232086 [Main] INFO net.RnD.Hadoop.App - ## Traversed 1142M rows at: 2:35:59.897
22239836 [Main] INFO net.RnD.Hadoop.App - ## Traversed 1143M rows at: 2:36:07.647
22248107 [Main] INFO net.RnD.Hadoop.App - ## Traversed 1144M rows at: 2:36:15.918
{code}
                
> DataFileWriter should expose "sync marker" to allow concurrent writes to same 
> .avro file
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-1090
>                 URL: https://issues.apache.org/jira/browse/AVRO-1090
>             Project: Avro
>          Issue Type: Bug
>    Affects Versions: 1.6.3
>            Reporter: Catalin Alexandru Zamfir
>            Assignee: Doug Cutting
>             Fix For: 1.7.0
>
>         Attachments: AVRO-1090.patch, AVRO-1090.patch
>
>
> We're writing to Hadoop via DataFileWriter (FSDataOutputStream), with two 
> threads per node on 8 nodes. Some of the nodes share the same path. For 
> example, our TimestampedWriter class takes a path argument and appends the 
> timestamp to it (e.g. SomePath/2012/05/14), so two threads or two nodes can 
> access the same path. The race between these writers is resolved by checking 
> whether the file already exists (has been created by a faster thread). If 
> so, the writer appends instead of creating the file on HDFS.
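
The create-or-append decision described above could be sketched roughly as
follows. This is only an analogy on the local filesystem using java.nio, not
the HDFS API the reporter used; the class name CreateOrAppend and the method
write are hypothetical. It relies on Files.createFile being atomic, so exactly
one caller creates the file and every later caller appends.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: local-filesystem analogy, NOT the HDFS API.
final class CreateOrAppend {

    // Returns true if this call created the file, false if it only appended.
    static boolean write(Path file, String line) throws IOException {
        boolean created;
        try {
            // Existence check and creation happen as one atomic operation.
            Files.createFile(file);
            created = true;
        } catch (FileAlreadyExistsException e) {
            // A faster writer created the file first; fall through to append.
            created = false;
        }
        byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
        Files.write(file, bytes, StandardOpenOption.APPEND);
        return created;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shards");
        Path file = dir.resolve("Shard.avro");
        System.out.println("first writer created file: " + write(file, "row 1"));
        System.out.println("second writer created file: " + write(file, "row 2"));
    }
}
```

Deciding who creates the file is only half of the problem: the bytes appended
by the second writer still have to be consistent with what the first writer
put in the file.
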
> The problem is that DataFileWriter generates a random 16-byte sync marker 
> for each instance, so two threads with two different writer instances use 
> different sync markers. Reading such a file back then fails with an 
> IOException ("Invalid sync!").
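
To illustrate the failure, here is a minimal sketch using a deliberately
simplified file layout: a 16-byte marker as the header, then blocks of
length + payload + marker. It is not the real Avro container format and does
not use the Avro API; SyncMarkerDemo and all of its methods are hypothetical
names. The reader compares the marker after each block with the one in the
header, which is the check that a second writer with its own marker breaks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Simplified model for illustration only, NOT the real Avro format or API.
final class SyncMarkerDemo {
    static final int SYNC_SIZE = 16;

    // Each writer instance draws its own random 16-byte marker.
    static byte[] newMarker() {
        byte[] marker = new byte[SYNC_SIZE];
        new SecureRandom().nextBytes(marker);
        return marker;
    }

    // A block is: payload length, payload bytes, then the writer's marker.
    static void writeBlock(DataOutputStream out, byte[] marker, String payload)
            throws IOException {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        out.writeInt(data.length);
        out.write(data);
        out.write(marker);
    }

    // Builds a file written by two writers, sharing a marker or not.
    static byte[] buildFile(boolean shareMarker) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        byte[] first = newMarker();
        byte[] second = shareMarker ? first : newMarker();
        out.write(first);                       // header holds writer 1's marker
        writeBlock(out, first, "rows from writer 1");
        writeBlock(out, second, "rows from writer 2");
        out.flush();
        return bytes.toByteArray();
    }

    // Reads every block and checks its trailing marker against the header.
    static int countBlocks(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        byte[] expected = new byte[SYNC_SIZE];
        in.readFully(expected);
        byte[] actual = new byte[SYNC_SIZE];
        int blocks = 0;
        while (in.available() > 0) {
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            in.readFully(actual);
            if (!Arrays.equals(expected, actual)) {
                throw new IOException("Invalid sync!");
            }
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("shared marker: " + countBlocks(buildFile(true)) + " blocks read");
        try {
            countBlocks(buildFile(false));
        } catch (IOException e) {
            System.out.println("different markers: " + e.getMessage());
        }
    }
}
```

In this model the file reads back cleanly only when the appending writer
reuses the marker already stored in the header.
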
> There's a big performance penalty here. Because only one writer at a time 
> can write to a given path, it becomes a bottleneck. For 1B (billion) rows, 
> it took us 4 hours to generate and load. With 20 concurrent threads, it took 
> only 12.5 minutes.
> If DataFileWriter exposed the sync marker, a developer could read it and 
> make sure that the next thread appending to the file uses the same one. I 
> don't know whether it's even possible to expose the sync marker so that other 
> DataFileWriter instances can share the marker from the file.
> We have a workaround for this: we make sure each writer is a unique instance 
> and generate its path based on that uniqueness. But instead of having 
> "SomePath/2012/05/14/Shard.avro" we'd now have 
> "SomePath/2012/05/14/Shard-some-random-UUID.avro" for each of the writers 
> that write the data in.
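
A minimal sketch of that per-writer path scheme, assuming a random UUID is
what makes each writer unique. The class ShardPaths and the method
uniqueShardPath are hypothetical names, not the reporter's actual code.

```java
import java.util.UUID;

// Hypothetical helper illustrating the one-file-per-writer workaround.
final class ShardPaths {

    // Builds a shard path that no other writer instance will share.
    static String uniqueShardPath(String datedDir) {
        return datedDir + "/Shard-" + UUID.randomUUID() + ".avro";
    }

    public static void main(String[] args) {
        // Two writers targeting the same dated directory get distinct files.
        System.out.println(uniqueShardPath("SomePath/2012/05/14"));
        System.out.println(uniqueShardPath("SomePath/2012/05/14"));
    }
}
```

Each writer then owns its file and its sync marker, at the cost of many shard
files per dated directory instead of one.
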
> If it can be done, it would be a huge fix for a bottleneck problem: only a 
> single writer can write to a single path.
> This has also been requested on the avro-user thread: 
> http://grokbase.com/t/avro/user/122m4sjm1y/is-it-possible-to-append-to-an-already-existing-avro-file
> I just could not find the JIRA ticket for this request.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
