[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush

2024-01-04 Thread Shilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802725#comment-17802725
 ] 

Shilun Fan commented on YARN-6382:
--

Bulk update: moved all 3.4.0 non-blocker issues; please move this back if it is 
a blocker. Retargeted to 3.5.0.

> Address race condition on TimelineWriter.flush() caused by buffer-sized flush
> -
>
> Key: YARN-6382
> URL: https://issues.apache.org/jira/browse/YARN-6382
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Yousef Abu-Salah
> Priority: Major
>
> YARN-6376 fixes the race condition between putEntities() and periodical 
> flush() by WriterFlushThread in TimelineCollectorManager, or between 
> putEntities() in different threads.
> However, BufferedMutator can have internal size-based flush as well. We need 
> to address the resulting race condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush

2017-04-06 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960241#comment-15960241
 ] 

Haibo Chen commented on YARN-6382:
--

Nice catch! I have filed YARN-6455 to improve it.

> Address race condition on TimelineWriter.flush() caused by buffer-sized flush
> -
>
> Key: YARN-6382
> URL: https://issues.apache.org/jira/browse/YARN-6382
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
>
> YARN-6376 fixes the race condition between putEntities() and periodical 
> flush() by WriterFlushThread in TimelineCollectorManager, or between 
> putEntities() in different threads.
> However, BufferedMutator can have internal size-based flush as well. We need 
> to address the resulting race condition.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush

2017-04-06 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960175#comment-15960175
 ] 

Joep Rottinghuis commented on YARN-6382:


Clarified with [~haibochen] that he meant that the race conditions for the 
latter two cases are solved in YARN-6376.
That makes sense.

Synchronizing on the writer is still a little brittle there, because there is a 
getWriter method which lets callers access the writer without synchronizing on 
it.
AppLevelTimelineCollector#AppLevelAggregator#aggregate() does this in line 152: 
getWriter().write(...
In this case it doesn't flush, but if that were to be added, that would 
re-introduce the race fixed in YARN-6376.
Instead of exposing the writer, perhaps it would be better to have the 
sub-classes call #putEntities instead. It defers to the private 
writeTimelineEntities which does the same work to get the context:
TimelineCollectorContext context = getTimelineEntityContext();
Should we open a separate bug for that to enhance the fix in YARN-6376?
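
To illustrate the suggestion above, here is a minimal sketch of a collector that keeps its writer private, so that sub-classes have to go through putEntities(). ToyWriter, ToyCollector, ToyAppLevelCollector and CollectorDemo are hypothetical stand-ins invented for this example; they are not the actual TimelineWriter or TimelineCollector classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy writer standing in for TimelineWriter; not the real YARN interface. */
class ToyWriter {
  private final List<String> pending = new ArrayList<>();
  private final List<String> persisted = new ArrayList<>();

  void write(String entity) {
    pending.add(entity);
  }

  void flush() {
    persisted.addAll(pending);
    pending.clear();
  }

  int persistedCount() {
    return persisted.size();
  }
}

/** Hypothetical collector that keeps its writer private (no getWriter()). */
class ToyCollector {
  private final ToyWriter writer = new ToyWriter();

  /** Single entry point: write and flush happen under the writer's lock. */
  public void putEntities(List<String> entities) {
    synchronized (writer) {
      for (String entity : entities) {
        writer.write(entity);
      }
      writer.flush();
    }
  }

  public int persistedCount() {
    synchronized (writer) {
      return writer.persistedCount();
    }
  }
}

/** Subclass that aggregates through putEntities() rather than the raw writer. */
class ToyAppLevelCollector extends ToyCollector {
  void aggregate() {
    putEntities(Collections.singletonList("aggregated-entity"));
  }
}

class CollectorDemo {
  /** Runs two aggregating threads and returns the number of persisted entities. */
  static int run() {
    ToyAppLevelCollector collector = new ToyAppLevelCollector();
    Runnable task = () -> {
      for (int i = 0; i < 1000; i++) {
        collector.aggregate();
      }
    };
    Thread first = new Thread(task);
    Thread second = new Thread(task);
    first.start();
    second.start();
    try {
      first.join();
      second.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return collector.persistedCount();
  }

  public static void main(String[] args) {
    System.out.println("persisted: " + run());
  }
}
```

Because the writer is never exposed, a flush added to aggregate() later would still run under the same lock as every other write.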







[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush

2017-04-03 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954358#comment-15954358
 ] 

Haibo Chen commented on YARN-6382:
--

Thanks for the nice summary [~jrottinghuis]! 
bq. This write causes the buffer to be full, or perhaps thread B calls flush, 
or a timer calls flush.
The latter two cases have been fixed by YARN-6357, so we only need to concern 
ourselves with the case where the buffer becomes full.

I believe that what I was mostly concerned about, losing data due to 
intermittent connection issues combined with this race condition, is only an 
issue if there is no spooling support. 
Assuming most data/entities are not problematic, that is, a flush will not fail 
because of the data itself and subsequent retries will eventually write the 
data successfully to HBase, we can provide a sufficient guarantee that all good 
entities will eventually be persisted in HBase. 
Given that most of what b) solves will go away when we have the spooling 
writer, I agree that we could just document the issue for now. Once we get the 
spooling writer, we can come back and revisit this to address what we want to 
do with malformed/problematic entities.

> Address race condition on TimelineWriter.flush() caused by buffer-sized flush
> -
>
> Key: YARN-6382
> URL: https://issues.apache.org/jira/browse/YARN-6382
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Labels: yarn-5355-merge-blocker
>
> YARN-6376 fixes the race condition between putEntities() and periodical 
> flush() by WriterFlushThread in TimelineCollectorManager, or between 
> putEntities() in different threads.
> However, BufferedMutator can have internal size-based flush as well. We need 
> to address the resulting race condition.






[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush

2017-04-03 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954009#comment-15954009
 ] 

Joep Rottinghuis commented on YARN-6382:


Thanks for pointing this out [~haibochen]. Yes, with asynchronous buffering and 
size-based flush this can happen.
The periodic flush can cause the same issue.

Here is the scenario:
* Internal buffer in buffered mutator is almost full
* Thread A does a write (which we know will cause issues later down the road)
* Thread B does a write.
** This write causes the buffer to be full, or perhaps thread B calls flush, or 
a timer calls flush.
** The earlier put from A caused an issue
** Thread B gets an error back; not knowing exactly which put failed, it can 
retry its write later
* The buffer is now empty
* Thread A does a flush to confirm that its previous write made it through
* Thread A receives a success status, because there are no further issues
* Thread A incorrectly assumes that its writes were successfully written
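
The interleaving above can be replayed deterministically with a toy model. This is only a sketch: ToyBufferedWriter and FlushRaceDemo are hypothetical names, the record "bad" stands in for whatever makes a put fail, and the way errors surface here (an exception thrown from the flush) is an assumption rather than the HBase BufferedMutator API. The two threads are simulated by ordering the calls by hand.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a buffered writer; NOT the HBase BufferedMutator API. */
class ToyBufferedWriter {
  private final int capacity;
  private final List<String> buffer = new ArrayList<>();

  ToyBufferedWriter(int capacity) {
    this.capacity = capacity;
  }

  /** Buffers one record; a full buffer triggers an internal size-based flush. */
  synchronized void put(String record) {
    buffer.add(record);
    if (buffer.size() >= capacity) {
      flush();
    }
  }

  /** Drains the buffer; fails if any drained record is "bad". */
  synchronized void flush() {
    List<String> batch = new ArrayList<>(buffer);
    buffer.clear();
    if (batch.contains("bad")) {
      throw new IllegalStateException("flush failed for batch " + batch);
    }
  }
}

class FlushRaceDemo {
  /** Replays the interleaving; returns {bSawError, aFlushSucceeded}. */
  static boolean[] replay() {
    ToyBufferedWriter writer = new ToyBufferedWriter(2);
    boolean bSawError = false;
    boolean aFlushSucceeded = false;

    writer.put("bad");          // thread A: problematic write, only buffered
    try {
      writer.put("good");       // thread B: fills the buffer, triggers a flush
    } catch (IllegalStateException e) {
      bSawError = true;         // B receives the error caused by A's record
    }
    try {
      writer.flush();           // thread A: the buffer is already empty
      aFlushSucceeded = true;   // A wrongly concludes that its write landed
    } catch (IllegalStateException e) {
      aFlushSucceeded = false;
    }
    return new boolean[] {bSawError, aFlushSucceeded};
  }

  public static void main(String[] args) {
    boolean[] result = replay();
    System.out.println("B saw error: " + result[0]);
    System.out.println("A's flush succeeded: " + result[1]);
  }
}
```

Both flags come back true: B is handed the failure that A caused, and A's own flush reports success.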

There seem to be three options to deal with this:
a) Make writes synchronous, i.e. for important writes do not use a buffered 
Mutator. The APIs would have to change, and performance might be significantly 
impacted, as we saw in tests early on in the application timeline service 
development.
b) Modify the API for the BufferedMutator (or do not use the public API that 
comes along from instantiating one from the connection, i.e. hackery required). 
For a put we would return the batch ID (see work on HBASE-17018) to indicate 
which batch of writes a put went into. Then for the flush, we'd change the API 
as well to take a batch ID as an input argument. The (Spooling)BufferedMutator 
would then have to keep track of a limited list of recently failed batches for 
failed flushes. When threads ask whether their batch failed, we can check the 
earliest entry in the failed list against the requested batch and return 
whether it was successful, failed, or if we don't know for sure (due to the 
limit on the number of failed batches we want to keep).
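
A rough sketch of what option b) could look like, under heavy assumptions: BatchTrackingWriter, its Status enum, and the put/flush signatures are invented for illustration and do not reflect HBASE-17018 or any existing BufferedMutator API. The record "bad" again stands in for a write that fails on flush.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical batch-tracking writer; names and semantics are assumptions. */
class BatchTrackingWriter {
  enum Status { PENDING, SUCCEEDED, FAILED, UNKNOWN }

  private final int capacity;
  private final int maxFailedKept;
  private final List<String> buffer = new ArrayList<>();
  private final Deque<Long> failedBatches = new ArrayDeque<>();
  private long currentBatch = 0;
  private long evictedUpTo = -1;  // failures at or below this id were forgotten

  BatchTrackingWriter(int capacity, int maxFailedKept) {
    this.capacity = capacity;
    this.maxFailedKept = maxFailedKept;
  }

  /** Buffers a record and returns the id of the batch it went into. */
  synchronized long put(String record) {
    long batch = currentBatch;
    buffer.add(record);
    if (buffer.size() >= capacity) {
      flushCurrentBatch();  // internal size-based flush
    }
    return batch;
  }

  /** Flushes the given batch if it is still buffered, then reports its fate. */
  synchronized Status flush(long batchId) {
    if (batchId == currentBatch && !buffer.isEmpty()) {
      flushCurrentBatch();
    }
    if (batchId >= currentBatch) {
      return Status.PENDING;
    }
    if (batchId <= evictedUpTo) {
      return Status.UNKNOWN;  // older than the oldest failure we still remember
    }
    return failedBatches.contains(batchId) ? Status.FAILED : Status.SUCCEEDED;
  }

  private void flushCurrentBatch() {
    boolean failed = buffer.contains("bad");  // stand-in for a real write error
    buffer.clear();
    if (failed) {
      failedBatches.addLast(currentBatch);
      if (failedBatches.size() > maxFailedKept) {
        evictedUpTo = failedBatches.removeFirst();
      }
    }
    currentBatch++;
  }
}

class BatchTrackingDemo {
  public static void main(String[] args) {
    BatchTrackingWriter writer = new BatchTrackingWriter(2, 1);
    long batchOfA = writer.put("bad");  // thread A
    writer.put("good");                 // thread B fills the buffer
    System.out.println("A's batch: " + writer.flush(batchOfA));
  }
}
```

With the batch ID in hand, thread A's later flush reports FAILED instead of a misleading success, and UNKNOWN covers batches older than the bounded failure list can vouch for.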

This all becomes more complicated when we start considering spooling, because 
the error can happen much later. In the presence of spooling, all we really 
"guarantee" is that puts are persisted to a (distributed) filesystem, and that 
we'll do our utmost to replay. Of course operators of a particular 
installation may choose to spool after an infinite amount of time, essentially 
blocking writes until they can be pushed into HBase.

This leads us to the third option to deal with these race conditions:
c) Document the conditions in JavaDoc and/or the external documentation, and 
move on for now. Language could be something like:
{noformat}
Under rare circumstances, some race conditions can exist between writers and 
internal buffer flushing that make it appear that a flush succeeds after a 
problematic write.
{noformat}



