[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593826#comment-14593826 ]

Joep Rottinghuis commented on YARN-3051:
----------------------------------------

Not all arguments are equally selective. For example, relatesTo entities are 
not stored in individual cells that can be used as push-down predicates for 
the HBase tables. We'd have to select all entities that match the other 
criteria, read the relatesTo string, parse it into individual fields, and do 
set operations on them.
{code}
  Set<TimelineEntity> getEntities(String userId, String clusterId, String flowId,
      String flowRunId, String appId, String entityType, Long limit,
      Long createdTimeBegin, Long createdTimeEnd, Long modifiedTimeBegin,
      Long modifiedTimeEnd, Set<TimelineEntity.Identifier> relatesTo,
      Set<TimelineEntity.Identifier> isRelatedTo, Set<KeyValuePair> info,
      Set<KeyValuePair> configs, Set<String> events, Set<String> metrics,
      EnumSet<Field> fieldsToRetrieve) throws IOException;
}
{code}
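As a rough illustration of that reader-side fallback, here is a pure-Java
sketch. The serialized relatesTo format and the class name are hypothetical
assumptions for illustration, not what any patch actually uses:

```java
import java.util.*;

public class RelatesToFilterSketch {
    // Hypothetical serialized form: "type1:id1,type2:id2". The real storage
    // encoding is an assumption here, not what the timeline service stores.
    static Set<String> parseRelatesTo(String stored) {
        Set<String> ids = new HashSet<>();
        for (String part : stored.split(",")) {
            ids.add(part.trim());
        }
        return ids;
    }

    // Reader-side filtering: keep the entity only if it relates to every
    // identifier the caller asked for. This set containment happens on the
    // client, after the full relatesTo string was already pulled from HBase.
    static boolean matches(String storedRelatesTo, Set<String> queried) {
        return parseRelatesTo(storedRelatesTo).containsAll(queried);
    }

    public static void main(String[] args) {
        String stored = "YARN_APPLICATION:app_1,YARN_CONTAINER:c_7";
        System.out.println(matches(stored, Set.of("YARN_APPLICATION:app_1"))); // true
        System.out.println(matches(stored, Set.of("YARN_FLOW:f_9")));          // false
    }
}
```

The point of the sketch is the cost model: every candidate entity's relatesTo
blob must cross the wire before this check can run, which is exactly why the
argument is not selective.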

If we defer being able to effectively select a subset of columns, what does it 
actually mean to specify a Set<KeyValuePair>?
Can the value be null to indicate that we don't care what the value is, and 
that we simply want the column back in the result?

I think we should separate out predicates (give me all X where Y=Z) from 
selectors (give me all X...).
It is also not clear from the latest patch whether fully populated entities 
will be returned.
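To make the predicate/selector distinction concrete, a minimal pure-Java
sketch (hypothetical helper names, operating on a plain config map rather
than the real TimelineEntity classes):

```java
import java.util.*;

public class PredicateVsSelectorSketch {
    // Predicate: narrows WHICH entities come back ("all X where Y=Z").
    static boolean predicate(Map<String, String> configs, String key, String value) {
        return value.equals(configs.get(key));
    }

    // Selector: names WHAT to return for each matching entity
    // ("give me these columns"), with no influence on the match itself.
    static Map<String, String> select(Map<String, String> configs, Set<String> keys) {
        Map<String, String> out = new HashMap<>();
        for (String k : keys) {
            if (configs.containsKey(k)) {
                out.put(k, configs.get(k));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> configs = Map.of(
            "mapreduce.map.memory.mb", "2048",
            "mapreduce.job.queuename", "default");
        // Filter first (predicate), then project (selector).
        if (predicate(configs, "mapreduce.job.queuename", "default")) {
            System.out.println(select(configs, Set.of("mapreduce.map.memory.mb")));
        }
    }
}
```

Keeping the two roles in separate parameters (rather than overloading one
Set<KeyValuePair> with null-value conventions) is exactly the API ambiguity
being raised above.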

Wrt.
{quote}
Makes sense. We could use a regex or club different configs into different 
groups and let the user query that group. But then the problem will be how we 
specify those groups. So, as you say, let's defer it and discuss it at length 
when we take it up.
{quote}
and
{quote}
One thing though, along the lines of the patch submitted earlier, I can 
include something like Map<String, NameValueRelations> for metrics in the 
interface for specifying relational operations. It will support things like 
metricA>val1 and metricA<val2 as well (i.e., two conditions on the same metric 
to specify a range). Thoughts?
{quote}

Before we invent our own way to specify which columns (metrics, configs, etc.) 
we'll retrieve, let's make sure that what we come up with can be mapped 
efficiently to our backing store.
Since we've selected HBase as the main implementation to handle queries at 
scale, we need to think about how to make effective use of filters 
(https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterBase.html)
 to aggressively reduce what we pull back from HBase. A ColumnPrefixFilter, 
for example, is a good way to express which config columns to retrieve. A 
regex is a poor way, as it results in having to pull back every column and 
then drop values from the retrieved result.
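To illustrate why a prefix pushes down so much better than a regex, here is a
pure-Java sketch of the two matching strategies. The column names and the
"c!"/"m!" prefixes are hypothetical, and this only emulates the idea; it is
not the HBase filter API itself:

```java
import java.util.*;

public class PrefixVsRegexSketch {
    // Server-side style: over a sorted column map, a prefix bounds the scan
    // to a contiguous range and stops early. This is the property a
    // ColumnPrefixFilter exploits against HBase's sorted column qualifiers.
    static SortedMap<String, String> byPrefix(TreeMap<String, String> cols, String prefix) {
        // subMap bounds the scan to [prefix, prefix + '\uffff'): no full pass.
        return cols.subMap(prefix, prefix + '\uffff');
    }

    // Client-side style: a regex must examine every column, which in the real
    // system means every column was already pulled back over the wire.
    static Map<String, String> byRegex(Map<String, String> cols, String regex) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : cols.entrySet()) {
            if (e.getKey().matches(regex)) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> cols = new TreeMap<>(Map.of(
            "c!yarn.app.priority", "1",
            "c!yarn.app.queue", "default",
            "m!allocatedMB", "4096"));
        System.out.println(byPrefix(cols, "c!").keySet());
        System.out.println(byRegex(cols, ".*queue.*").keySet());
    }
}
```

Both return the right columns; the difference is that only the prefix form
can be evaluated without touching the non-matching entries.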

Similarly, if our rowkeys are prefixed by user, then an API that takes only 
the cluster (not the user) forces a full table scan, albeit with skip filters 
that let us skip over users we're not interested in.
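Conversely, when the user is available, the reader can turn the query into a
bounded scan range instead of a full scan. A sketch assuming a hypothetical
user!cluster!... rowkey layout (the layout and separator are assumptions for
illustration):

```java
public class RowKeyRangeSketch {
    // Hypothetical rowkey layout: user!cluster!flow!... -- an assumption,
    // not the actual timeline service schema.
    static String[] scanRange(String user, String cluster) {
        String start = user + "!" + cluster + "!";
        // Stop key: increment the last character so the scan covers exactly
        // the rows sharing this prefix (the equivalent of setting start/stop
        // rows on a scan instead of scanning the whole table).
        String stop = start.substring(0, start.length() - 1)
                + (char) (start.charAt(start.length() - 1) + 1);
        return new String[] { start, stop };
    }

    public static void main(String[] args) {
        String[] range = scanRange("jrottinghuis", "clusterA");
        System.out.println(range[0]);
        System.out.println(range[1]);
    }
}
```

Without the user, no such start/stop bound exists, and skip filters only
reduce what is returned, not what the region servers have to read.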

In an earlier patch I saw a NameValueRelation that was able to perform these 
operations. That again assumes that all values will be retrieved from the 
backing store and then filtered in the reader before being returned to the 
user. It will be more efficient to make sure we can easily map this to 
operations we can push into HBase itself (through a column value filter) via 
the available compare operations 
(https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html).
I'm certainly not arguing for exposing these HBase-specific classes in our 
API, but our methods should closely match what can be done, which I don't 
think will be overly restrictive or unreasonable.

If we're going to have two types of tables in the backing store:
a) HBase native tables, specifically structured for efficient storage and 
retrieval
and 
b) Phoenix tables (mainly time-based aggregates and aggregates over 
non-primary-key prefixes), specifically structured for flexible querying
would it make sense to break these into two separate query families?
Or are we thinking that, based on which arguments are passed in, we decide 
which tables to query with which mechanism?


> [Storage abstraction] Create backing storage read interface for ATS readers
> ---------------------------------------------------------------------------
>
>                 Key: YARN-3051
>                 URL: https://issues.apache.org/jira/browse/YARN-3051
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Varun Saxena
>         Attachments: YARN-3051-YARN-2928.003.patch, 
> YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, 
> YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, 
> YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be 
> implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)