Joep,

You're right - I missed in my quick scan that he was actually replacing those files there. Sorry for the confusion, Koert!
On Fri, Oct 12, 2012 at 9:37 PM, J. Rottinghuis <jrottingh...@gmail.com> wrote:
> Hi Harsh, morning Koert,
>
> If Koert's problem is similar to what I have been thinking about, where we
> want to consolidate and re-compress older datasets, then _SUCCESS does
> not really help. _SUCCESS tells you whether a new dataset has been completely
> written; what is needed here is to replace an existing dataset.
>
> Naive approach:
> The new set can be generated in parallel. The old directory is moved out of
> the way (rm, and therefore moved to Trash) and then the new directory is
> renamed into place.
> I think the problem Koert is describing is how not to mess up map-reduce
> jobs that have already started and may have read some, but not all, of the
> files in the directory. If you're lucky, you'll try to read a file that is
> no longer there; but if you're unlucky, you read a new file with the same
> name and you will never know that you have inconsistent results.
>
> Trying-to-be-clever approach:
> Every query puts a "lock" file with its job-id in the directory it reads.
> Only when there are no locks is the dataset replaced, as described in the
> naive approach. This reduces the odds of problems, but is rife with race
> conditions. Also, if the data is read-heavy, you may never get to replace
> the directory; now you need a write lock to prevent new reads from starting.
>
> Would hardlinks solve this problem?
> Simply create a set of (temporary) hardlinks to the files in the directory
> you want to read. Then, if the old set is moved out of the way, the
> hardlinks should still point to the original files. The reading job reads
> from the hardlinks and cleans them up when done. If the hardlinks are
> placed in a directory named after the reading job-id, then garbage
> collection for crashed jobs should be possible if normal cleanup fails.
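[Editor's note: Joep's "naive approach" above can be sketched as follows. This is a minimal Python sketch against a local filesystem standing in for HDFS: the two `os.rename` calls correspond to `FileSystem.rename()` operations (each atomic on HDFS, but not atomic as a pair), and the trash directory stands in for the HDFS Trash. The function name `swap_dataset` and the directory layout are illustrative, not from the thread.]

```python
import os
import tempfile

def swap_dataset(data_dir: str, staged_dir: str, trash_dir: str) -> None:
    """Replace data_dir with staged_dir: move the old directory to trash,
    then rename the new one into place. On HDFS both moves would be
    FileSystem.rename() calls."""
    os.makedirs(trash_dir, exist_ok=True)
    old = os.path.join(trash_dir, os.path.basename(data_dir))
    if os.path.exists(data_dir):
        os.rename(data_dir, old)        # "rm" -> moved to Trash
    os.rename(staged_dir, data_dir)     # new directory renamed into place

# Demo: build a staged replacement and swap it in.
root = tempfile.mkdtemp()
data = os.path.join(root, "data")
staged = os.path.join(root, "data.staging")
os.makedirs(data)
with open(os.path.join(data, "part-00000"), "w") as f:
    f.write("old")
os.makedirs(staged)
with open(os.path.join(staged, "part-00000"), "w") as f:
    f.write("new")

swap_dataset(data, staged, os.path.join(root, ".Trash"))
print(open(os.path.join(data, "part-00000")).read())  # -> new
```

Note the window between the two renames: a reader listing the path in that instant sees nothing, and a job that already listed the old files may re-open a same-named new file after the swap; this is exactly the inconsistency Joep describes.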
>
> Cheers,
> Joep
>
> On Fri, Oct 12, 2012 at 8:35 AM, Harsh J <ha...@cloudera.com> wrote:
>> Hey Koert,
>>
>> Yes, the _SUCCESS file (created on the successful commit at the end of a
>> job) may be checked for existence before firing the new job with the
>> chosen input directory. This is consistent with what Oozie does as well.
>>
>> Since the listing of files happens after the submit() call, doing this
>> will "just work" :)
>>
>> On Fri, Oct 12, 2012 at 8:00 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> We have a dataset that is heavily partitioned, like this:
>>> /data
>>>   partition1/
>>>     _SUCCESS
>>>     part-00000
>>>     part-00001
>>>     ...
>>>   partition2/
>>>     _SUCCESS
>>>     part-00000
>>>     part-00001
>>>     ...
>>>   ...
>>>
>>> We have loaders that use map-red jobs to add new partitions to this
>>> dataset at a regular interval (so they write to new sub-directories).
>>>
>>> We also have map-red queries that read from the entire dataset (/data/*).
>>> My worry here is concurrency. It will happen that a query job runs while
>>> a loader job is adding a new partition at the same time. Is there a risk
>>> that the query could read incomplete or corrupt files? Is there a way to
>>> use the _SUCCESS files to prevent this from happening?
>>> Thanks for your time!
>>> Best,
>>> Koert
>>
>> --
>> Harsh J

--
Harsh J
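[Editor's note: Harsh's suggestion, checking for _SUCCESS before choosing input directories, can be sketched as below. This is a hedged local-filesystem sketch in Python; on HDFS the existence check would be `FileSystem.exists()` on the `_SUCCESS` path inside each partition directory, and the helper name `complete_partitions` is illustrative.]

```python
import os
import tempfile

SUCCESS_MARKER = "_SUCCESS"

def complete_partitions(data_dir: str) -> list:
    """Return only the partition directories whose _SUCCESS marker exists,
    i.e. partitions a loader job has fully committed."""
    parts = []
    for name in sorted(os.listdir(data_dir)):
        part = os.path.join(data_dir, name)
        if os.path.isdir(part) and os.path.exists(os.path.join(part, SUCCESS_MARKER)):
            parts.append(part)
    return parts

# Demo: partition1 is committed; partition2 is still being written.
root = tempfile.mkdtemp()
for name, done in [("partition1", True), ("partition2", False)]:
    d = os.path.join(root, name)
    os.makedirs(d)
    open(os.path.join(d, "part-00000"), "w").close()
    if done:
        open(os.path.join(d, SUCCESS_MARKER), "w").close()

print([os.path.basename(p) for p in complete_partitions(root)])
# -> ['partition1']
```

The query job would then pass only the returned directories as its input paths, instead of the bare /data/* glob, so a partition that is still being loaded is never picked up.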