[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511129#comment-15511129
 ] 

Vrushali C edited comment on YARN-5585 at 9/21/16 8:51 PM:
-----------------------------------------------------------

Catching up on this thread. I have tried to read through all the comments and 
discussions on this jira but please correct me if I am mistaken.

Two objectives here:
1) We are looking for a way to paginate results/response. 
2) Ability to return sorted results in the rest response (sorted on something 
other than row key)

Thoughts on these:
- We are looking for a way to paginate results/response. 
This pagination requirement is independent of any particular framework like 
Tez. With hRaven, our experience has been that more often than not, we end up 
enabling pagination support for most APIs.  So, in general, our rest api calls 
should support pagination. 

Pagination via the REST query:
This involves, in a generic fashion, being able to send in a “startFromRowKey” 
in the rest query. If we extend our rest apis to accept such a parameter, it 
becomes generic enough to fetch N rows after this particular “startFromRowKey” 
value. The first rest api call will not send in anything, but each rest 
response will return “lastRowKey” to the client so that the client can use this 
in the next rest call. I have also found this useful for debugging the rest 
output in the browser.
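
A minimal sketch of that paging contract, assuming an in-memory sorted list 
stands in for an HBase scan. The function name fetch_page and the response 
shape are hypothetical; only “startFromRowKey” and “lastRowKey” come from the 
text above.

```python
from bisect import bisect_right


def fetch_page(sorted_keys, limit, start_from_row_key=None):
    """Return up to `limit` keys strictly after `start_from_row_key`.

    `sorted_keys` must already be in row-key order, as a scan would return them.
    The response carries `lastRowKey` so the caller can request the next page.
    """
    if start_from_row_key is None:
        start = 0  # first call: no cursor was sent
    else:
        # first position strictly after the cursor key
        start = bisect_right(sorted_keys, start_from_row_key)
    page = sorted_keys[start:start + limit]
    return {"entities": page, "lastRowKey": page[-1] if page else None}


def fetch_all(sorted_keys, limit):
    """Client-side loop: feed each lastRowKey back in until a page is empty."""
    results, cursor = [], None
    while True:
        response = fetch_page(sorted_keys, limit, cursor)
        if not response["entities"]:
            return results
        results.extend(response["entities"])
        cursor = response["lastRowKey"]
```

The cursor is exclusive here, so a page never repeats the last row of the 
previous page; an inclusive cursor would work too as long as client and server 
agree.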

- For Tez in particular, we need the ability to return sorted results in the 
rest response. In this case, results sorted based on “creation_time”.  The 
current row key in the entity table does not easily allow retrieval in 
creation-time order. 

So here is a proposal which incorporates some aspects of both of your 
proposals, Varun. 

I think we should expose a way for frameworks like Tez to store data sorted as 
per their criteria. And also allow them to specify when they want to query this 
specially sorted data. 

Today, Tez wants it sorted by entity creation time. Tomorrow, that could 
change. Also, some other framework like Spark might want entities sorted 
based on something else. So putting it in the entity table's row key becomes a 
tough decision.

I propose we allow for auxiliary tables to be created for entities via cluster 
configuration settings. The auxiliary table name etc. will be set in config 
just like the timeline entity table name is set. This auxiliary table is 
specifically for entities, so it has the same structure. 

Now, when Tez’s timeline client creates a timeline entity, it will create it as 
it does right now but in addition, it will populate two new members of the 
TimelineEntity object:
- auxiliaryTableName, which contains the desired table name
- auxiliaryEncodedKey, which contains a byte array value of {code} 
“Inv(creation_time)!entity_id” {code}. This is to be used as the row key 
suffix in the auxiliary table. Timeline service does not know what this byte 
value is, and it does not care. It only appends it after the regular row key 
prefix, giving 
{code} “user!cluster!flow!Inv(flow run id)!application!entitytype!<bytes_from_client>”
{code}
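
To make the key layout concrete, here is a rough sketch of how a client might 
build those bytes. It assumes Inv(x) means “Long.MAX_VALUE - x” encoded as 8 
big-endian bytes, which is one common way to get newest-first ordering; the 
comment above does not specify the encoding. All function names are 
hypothetical, and the sketch ignores escaping of “!” inside components, which a 
real implementation would need.

```python
import struct

LONG_MAX = 2**63 - 1  # assumption: Inv(x) = Long.MAX_VALUE - x
SEP = b"!"


def invert_long(value):
    """Encode LONG_MAX - value as 8 big-endian bytes.

    For non-negative values, larger inputs produce byte strings that sort
    earlier, so a scan in row-key order returns the newest entries first.
    """
    return struct.pack(">q", LONG_MAX - value)


def auxiliary_encoded_key(creation_time, entity_id):
    """Client-side suffix: Inv(creation_time)!entity_id."""
    return invert_long(creation_time) + SEP + entity_id.encode("utf-8")


def auxiliary_row_key(user, cluster, flow, flow_run_id, app_id, entity_type,
                      bytes_from_client):
    """Regular prefix followed by the opaque client-supplied bytes."""
    prefix = SEP.join([
        user.encode("utf-8"),
        cluster.encode("utf-8"),
        flow.encode("utf-8"),
        invert_long(flow_run_id),
        app_id.encode("utf-8"),
        entity_type.encode("utf-8"),
    ])
    return prefix + SEP + bytes_from_client
```

Because the inverted timestamp is fixed width, two keys under the same prefix 
compare on creation time first and fall back to entity id only on ties.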

Now it sends this write to timeline service. At the hbase writer side, we 
notice that the auxiliary table and auxiliary key are populated in the timeline 
entity object, so we do two writes. One write goes to our regular entity table 
with the existing row key structure and the other write goes to the auxiliary 
table with the row key of 
{code} “user!cluster!flow!Inv(flow run id)!application!entitytype!<bytes_from_client>”{code}.
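
The writer-side behaviour could look roughly like the following. This is only 
an illustration with dictionaries standing in for HBase tables; the class and 
function names, the table name “entity”, and the exact shape of the regular 
row key are assumptions, not the actual timeline service code.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ENTITY_TABLE = "entity"  # placeholder for the configured entity table name


@dataclass
class TimelineEntity:
    entity_id: str
    payload: dict
    # The two proposed optional members; both unset means "regular write only".
    auxiliary_table_name: Optional[str] = None
    auxiliary_encoded_key: Optional[bytes] = None


class InMemoryStore:
    """Stand-in for HBase: table name -> {row key -> value}."""

    def __init__(self):
        self.tables: Dict[str, Dict[bytes, dict]] = {}

    def put(self, table, row_key, value):
        self.tables.setdefault(table, {})[row_key] = value


def write_entity(store, key_prefix, entity):
    """Always write the regular row; also write the auxiliary row if requested.

    `key_prefix` is the user!cluster!flow!Inv(flow run id)!application!entitytype
    part, already encoded as bytes. The writer treats the client key as opaque.
    """
    regular_key = key_prefix + b"!" + entity.entity_id.encode("utf-8")
    store.put(ENTITY_TABLE, regular_key, entity.payload)
    if entity.auxiliary_table_name and entity.auxiliary_encoded_key is not None:
        aux_key = key_prefix + b"!" + entity.auxiliary_encoded_key
        store.put(entity.auxiliary_table_name, aux_key, entity.payload)
```

Entities that leave both members unset are written exactly as before, so 
frameworks that do not need a custom sort order pay no extra write cost.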

On the reader side, we allow the rest api to specify explicitly whether the 
client wants to read from the auxiliary table. Otherwise, reads go to the 
regular entity table. Frameworks like Tez know when they need data sorted by 
creation time, perhaps in their UI, so they can specify as part of the query 
params in their rest query that the read is for the auxiliary table.  
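
A small sketch of that reader-side routing, again with dictionaries in place of 
HBase tables. The parameter name auxiliary_table, the table name “entity”, and 
the inv helper (Long.MAX_VALUE - x as 8 big-endian bytes) are assumptions made 
for the example.

```python
import struct

ENTITY_TABLE = "entity"  # placeholder for the configured entity table name


def inv(value):
    """Assumed inversion: larger values sort earlier in byte order."""
    return struct.pack(">q", (2**63 - 1) - value)


def read_entities(tables, key_prefix, limit, auxiliary_table=None):
    """Prefix scan in row-key order against the table the query asked for.

    With no `auxiliary_table`, the read goes to the regular entity table and
    comes back in entity-id order. With one, the order is whatever the
    framework encoded into its auxiliary keys.
    """
    table = tables.get(auxiliary_table or ENTITY_TABLE, {})
    keys = sorted(k for k in table if k.startswith(key_prefix))
    return [table[k] for k in keys[:limit]]
```

For example, if the auxiliary keys were built as prefix + inv(creation_time) + 
b"!" + entity_id, the same prefix scan returns the most recently created 
entities first without any server-side sorting step.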

This way, we give frameworks a way to store data in whichever sorted order 
they want, and let them decide which queries need that sorted data. 





> [Atsv2] Add a new filter fromId in REST endpoints
> -------------------------------------------------
>
>                 Key: YARN-5585
>                 URL: https://issues.apache.org/jira/browse/YARN-5585
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelinereader
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>         Attachments: YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. 
> fromId, so that entities can be retrieved after the fromId. 
> Current behavior: the default limit is set to 100. If there are 1000 entities 
> then the REST call gives the first/last 100 entities. How do we retrieve the 
> next set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example: if applications app-1, app-2 ... app-10 are stored in the database, 
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the 
> next 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS is targeting storage of a large number of entities, it is a very 
> common use case to get the next set of entities using fromId rather than 
> querying all the entities. This is very useful for pagination in the web UI.


