[ 
https://issues.apache.org/jira/browse/TEZ-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984592#comment-14984592
 ] 

Rajesh Balamohan edited comment on TEZ-2918 at 11/2/15 3:22 AM:
----------------------------------------------------------------

Patch introduces volatile updates in inner loops of the runtime.  Is it 
possible to avoid that as it would hurt performance.  For instance, 
pipelinedsorter does not have any sync or volatile in its core loop. In other 
places like merge, this is one of the reasons for introducing 
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS so that progress is updated every 10000 
records or so.  TEZ-1491, HADOOP-11029 provide details on the impact on counter 
stats with volatile and sync. 

Places where this can hurt are
- DefaultSorter's collect code path as it would get executed for every record.
- DefaultSorter's sortAndSpill, as quick sort would internally invoke 
progress() lots of times. Instead "progressable.progress()" can be invoked just 
before calling sort in sortAndSpill?
- MRReaderMapred.next(), MRReaderMapReduce.next(), 
MROutput.KeyValueWriter.write(), OrderedGroupedKVInput.getReader().next(), 
PipelinedSorter.collect(), UnorderedKVInput.getReader().next(), 
UnorderedKVReader.next(). If needed, should it make use of 
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS as they are hot code paths?

>> In UnorderedPartitionedKVWriter, {code} "writer.append(key, value); // ???? 
>> Why is outputrecordscounter not updated here?" {code} 
- Writer already has reference to outputRecordsCounter. So when it gets closed, 
outputRecordsCounter is automatically adjusted.

>> In PipelinedSorter, {code} // ??? why are events not being sent here? {code} 
- Final event is sent out by OrderedPartitionedKVOutput itself. So in cases 
where only one output is available, there is no need to send any events from 
within sorter itself.



was (Author: rajesh.balamohan):
Patch introduces volatile updates in inner loops of the runtime.  Is it 
possible to avoid that as it would hurt performance.  For instance, 
pipelinedsorter does not have any sync or volatile in its core loop. In other 
places like merge, this is one of the reasons for introducing 
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS so that progress is updated every 10000 
records or so.  TEZ-1491, HADOOP-11029 refer to counter stats on volatile and 
sync. Places where this can hurt are

- DefaultSorter's collect code path as it would get executed for every record.
- DefaultSorter's sortAndSpill, as quick sort would internally invoke 
progress() lots of times. Instead "progressable.progress()" can be invoked just 
before calling sort in sortAndSpill?
- MRReaderMapred.next(), MRReaderMapReduce.next(), 
MROutput.KeyValueWriter.write(), OrderedGroupedKVInput.getReader().next(), 
PipelinedSorter.collect(), UnorderedKVInput.getReader().next(), 
UnorderedKVReader.next(). If needed, should it make use of 
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS as they are hot code paths?

>> In UnorderedPartitionedKVWriter, {code} "writer.append(key, value); // ???? 
>> Why is outputrecordscounter not updated here?" {code} 
- Writer already has reference to outputRecordsCounter. So when it gets closed, 
outputRecordsCounter is automatically adjusted.

>> In PipelinedSorter, {code} // ??? why are events not being sent here? {code} 
- Final event is sent out by OrderedPartitionedKVOutput itself. So in cases 
where only one output is available, there is no need to send any events from 
within sorter itself.


> Make progress notifications in IOs
> ----------------------------------
>
>                 Key: TEZ-2918
>                 URL: https://issues.apache.org/jira/browse/TEZ-2918
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-2918.1.patch, TEZ-2918.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to