Re: Review Request 71619: ATLAS-3464: Define Entities stored in Atlas for ML Governance

2019-10-22 Thread Anand Patil via Review Board


> On Oct. 22, 2019, 6:31 p.m., Sridhar K wrote:
> > addons/models/4000-MachineLearning/4010-ml_model.json
> > Lines 123 (patched)
> > 
> >
> > I am not the right person for this comment. But, this name looks 
> > confusingis it a tag or image url?
> 
> Na Li wrote:
> Anand is very experienced in ML, and he said this name is commonly used 
> for that purpose. It contains where to get the container image of a model

"Image tag" is a standard Docker term: 
https://docs.docker.com/engine/reference/commandline/tag/#tag-an-image-for-a-private-repository
 It does include a sort of URL. I agree that the term is confusing, though.


- Anand


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71619/#review218332
---


On Oct. 22, 2019, 6:17 p.m., Na Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71619/
> ---
> 
> (Updated Oct. 22, 2019, 6:17 p.m.)
> 
> 
> Review request for atlas, Austin Nobis, Ashutosh Mestry, Karthik Manamcheri, 
> Sridhar K, Madhan Neethiraj, and Sarath Subramanian.
> 
> 
> Bugs: atlas-3464
> https://issues.apache.org/jira/browse/atlas-3464
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Define entities used for Machine Learning Governance
> 
> 
> Diffs
> -
> 
>   addons/models/4000-MachineLearning/4010-ml_model.json PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/71619/diff/7/
> 
> 
> Testing
> ---
> 
> verified it is valid json file
> 
> 
> Thanks,
> 
> Na Li
> 
>



Re: Review Request 71619: ATLAS-3464: Define Entities stored in Atlas for ML Governance

2019-10-21 Thread Anand Patil via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71619/#review218304
---


Ship it!




Ship It!

- Anand Patil


On Oct. 21, 2019, 4:13 a.m., Na Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71619/
> ---
> 
> (Updated Oct. 21, 2019, 4:13 a.m.)
> 
> 
> Review request for atlas, Austin Nobis, Ashutosh Mestry, Karthik Manamcheri, 
> Sridhar K, Madhan Neethiraj, and Sarath Subramanian.
> 
> 
> Bugs: atlas-3464
> https://issues.apache.org/jira/browse/atlas-3464
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Define entities used for Machine Learning Governance
> 
> 
> Diffs
> -
> 
>   addons/models/1000-Hadoop/-ml_model.json PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/71619/diff/4/
> 
> 
> Testing
> ---
> 
> verified it is valid json file
> 
> 
> Thanks,
> 
> Na Li
> 
>



Re: Review Request 71619: ATLAS-3464: Define Entities stored in Atlas for ML Governance

2019-10-18 Thread Anand Patil via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71619/#review218292
---




addons/models/1000-Hadoop/-ml_model.json
Lines 27 (patched)


Suggest aligning with the CML model deployment statuses for now: 
https://github.infra.cloudera.com/Sense/cloudera-sense/blob/master/services/proto/common.proto#L55


- Anand Patil


On Oct. 18, 2019, 2 a.m., Na Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71619/
> ---
> 
> (Updated Oct. 18, 2019, 2 a.m.)
> 
> 
> Review request for atlas, Austin Nobis, Ashutosh Mestry, Karthik Manamcheri, 
> Sridhar K, Madhan Neethiraj, and Sarath Subramanian.
> 
> 
> Bugs: atlas-3464
> https://issues.apache.org/jira/browse/atlas-3464
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Define entities used for Machine Learning Governance
> 
> 
> Diffs
> -
> 
>   addons/models/1000-Hadoop/-ml_model.json PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/71619/diff/3/
> 
> 
> Testing
> ---
> 
> verified it is valid json file
> 
> 
> Thanks,
> 
> Na Li
> 
>



Re: Review Request 71619: ATLAS-3464: Define Entities stored in Atlas for ML Governance

2019-10-16 Thread Anand Patil via Review Board


> On Oct. 16, 2019, 7:36 p.m., Na Li wrote:
> > addons/models/1000-Hadoop/-ml_model.json
> > Lines 10 (patched)
> > 
> >
> > "DataSet" in Atlas has different meaning from the meaning in ML.  
> > 
> > DataSet: This type extends Referenceable and Asset. Conceptually, it 
> > can be used to represent an type that stores data. In Atlas, hive tables, 
> > Sqoop RDBMS tables etc are all types that extend from DataSet. Types that 
> > extend DataSet can be expected to have a Schema in the sense that they 
> > would have an attribute that defines attributes of that dataset. For e.g. 
> > the columns attribute in a hive_table. Also entities of types that extend 
> > DataSet participate in data transformation and this transformation can be 
> > captured by Atlas via lineage (or provenance) graphs. 
> > https://atlas.apache.org/0.8.1/TypeSystem.html

Hmm, it still feels like an odd fit to me, eg model builds and users don't 
really have a schema. What would you think about instead making our types 
inherit from Referenceable and Asset?


> On Oct. 16, 2019, 7:36 p.m., Na Li wrote:
> > addons/models/1000-Hadoop/-ml_model.json
> > Lines 18 (patched)
> > 
> >
> > I don't want this to be optional. However, right now, some data source 
> > does not provide this info.

Could you give an example?


> On Oct. 16, 2019, 7:36 p.m., Na Li wrote:
> > addons/models/1000-Hadoop/-ml_model.json
> > Lines 43 (patched)
> > 
> >
> > One use case I can think of is when a user cannot access a project, it 
> > would be useful to know that the project is a private, and then the user 
> > should be a collaborator. If the project is public, then something else 
> > could go wrong.
> > 
> > For some business, it may be desirable for the project to be private.

OK.


> On Oct. 16, 2019, 7:36 p.m., Na Li wrote:
> > addons/models/1000-Hadoop/-ml_model.json
> > Lines 51 (patched)
> > 
> >
> > Those fields are optional. So if some projects cannot provide those 
> > fields, that is OK. The attributes I put here are used in several projects.
> > 
> > If we put several "key:value" pairs in a single string, we risk losing 
> > some info. 
> > 
> > I think the big drawback of putting several key:value pairs in a single 
> > json string is that:some can be lost by accident. For example the 
> > customAttributes = "{githubRepoURL:https://host1/project1 }". Then later 
> > on, an update comes in with customAttributes = "{mlFramework:tensorFlow}" 
> > because that application has no idea someone is tracking githubRepoURL, or 
> > did not set githubRepoURL because only mlFramework has changed. Then the 
> > previous info on where the source code is got lost by accident. It is hard 
> > to debug. Besides, each time someone wants to use the info, has to parse 
> > the string.
> > 
> > The counter argument can be how about putting these attributes here. If 
> > they are not popular, we just don't use them.
> > 
> > The following info is from Ashutosh. 
> > Having an attribute that contains json blob is not searchable at Atlas 
> > in general. 
> > 
> > Other benefit of adding an attribute to the model is that entities 
> > created will get validated for the type
> > because of validation, we can be certain about the data present in that 
> > field
> > 
> > A hacky approach to add json blob and make it searchable is to set them 
> > in customAttributes, which is system property. We cannot see what keys are 
> > used in customAttributes at model files. it is set by whoever creates the 
> > Atlas entity.

Instead of a string type containing json, what about a map? I 
see that type used in some other attributes.


- Anand


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71619/#review218237
---


On Oct. 16, 2019, 12:30 a.m., Na Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71619/
> ---
> 
> (Updated Oct. 16, 2019, 12:30 a.m.)
> 
> 
> Review request for atlas, Austin Nobis, Ashutosh Mestry, Karthik Manamcheri, 
> Sridhar K, Madhan Neethiraj, and Sarath Subramanian.
> 
> 
> Bugs: atlas-3464
> https://issues.apache.org/jira/browse/atlas-3464
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Define entities used for Machine Learning Governance
> 
> 
> Diffs
> -
> 
>   addons/models/1000-Hadoop/-ml_model.json 

Re: Review Request 71619: ATLAS-3464: Define Entities stored in Atlas for ML Governance

2019-10-16 Thread Anand Patil via Review Board


> On Oct. 16, 2019, 8:10 p.m., Na Li wrote:
> > addons/models/1000-Hadoop/-ml_model.json
> > Lines 175 (patched)
> > 
> >
> > do you want to define types that can be used for hyperparameters?
> > 
> > Can you give me some examples that we can define general purpose types 
> > to contain hyper parameters?

Floats would be most common, but ints, bools, categoricals and arrays are all 
possible. @mlw at cloudera could give you some more color.


> On Oct. 16, 2019, 8:10 p.m., Na Li wrote:
> > addons/models/1000-Hadoop/-ml_model.json
> > Lines 197 (patched)
> > 
> >
> > the status can be "deprecated" for example. We need to look from ML 
> > operation point of view, not from how model build is created.
> > 
> > If operation engineer found some issue on a model build, which will 
> > then be marked as "deprecated", and some model may be in "dev" stage, and 
> > some are approved to be in "production". We need those info to help decide 
> > what model build to deploy.
> > 
> > For different deployment situations, the status can be of different 
> > values. Making it as enum may be too restrictive.

Hmmm. If status can be any user-chosen string, as opposed to an enum, then we 
aren't going to be able to do much with it in UI or tooling. For example, some 
users may choose 'deprecated' and some may choose 'stale'. In that case we 
could leave it to a flexible metadata attribute and keep the type definition 
small at this stage. We will definitely want to add attributes as we start to 
get feedback from users.


- Anand


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71619/#review218238
---


On Oct. 16, 2019, 12:30 a.m., Na Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71619/
> ---
> 
> (Updated Oct. 16, 2019, 12:30 a.m.)
> 
> 
> Review request for atlas, Austin Nobis, Ashutosh Mestry, Karthik Manamcheri, 
> Sridhar K, Madhan Neethiraj, and Sarath Subramanian.
> 
> 
> Bugs: atlas-3464
> https://issues.apache.org/jira/browse/atlas-3464
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Define entities used for Machine Learning Governance
> 
> 
> Diffs
> -
> 
>   addons/models/1000-Hadoop/-ml_model.json PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/71619/diff/1/
> 
> 
> Testing
> ---
> 
> verified it is valid json file
> 
> 
> Thanks,
> 
> Na Li
> 
>



Re: Review Request 71619: ATLAS-3464: Define Entities stored in Atlas for ML Governance

2019-10-16 Thread Anand Patil via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71619/#review218231
---




addons/models/1000-Hadoop/-ml_model.json
Lines 10 (patched)


Why make project a subtype of dataset?



addons/models/1000-Hadoop/-ml_model.json
Lines 18 (patched)


Can this really be optional?



addons/models/1000-Hadoop/-ml_model.json
Lines 43 (patched)


Do we have a use case for project visibility, or is it redundant with 
permissions? In other words, if user X has read access to project Y, do we care 
whether the reason is that the user is a collaborator to a private project or 
that the project is public?



addons/models/1000-Hadoop/-ml_model.json
Lines 51 (patched)


businessUseCase, modelFramework, modelAlgorithms, githubRepoURL, 
notebookURL and resourceURL feel prematurely opinionated to me. Some projects 
will not use all these fields, and for some projects fields other than these 
will probably be more important.

Would it be possible in Atlas to move these to a single "metadata" 
attribute whose value is key-value pairs? We can then move to stronger typing 
as common patterns emerge.



addons/models/1000-Hadoop/-ml_model.json
Lines 135 (patched)


Same question about why dataset is the supertype.



addons/models/1000-Hadoop/-ml_model.json
Lines 165 (patched)


Same comment about metadata: gitCommitId, hyperParameters, 
outcomeTypeDescription feel prematurely standardized. I feel that a 
map metadata field would be better suited to where we are now.



addons/models/1000-Hadoop/-ml_model.json
Lines 175 (patched)


It's slightly awkward to have to stringify all hyperparameter values. Does 
Atlas have option types?



addons/models/1000-Hadoop/-ml_model.json
Lines 197 (patched)


If status is non-optional, it should probably be an enum or Altus 
equivalent.

Also, I'm not sure this field is required for model builds as opposed to 
model deployments. Builds' status is very simple - building, built or errored I 
suppose. We don't necessarily need to represent building or errored builds in 
Atlas.



addons/models/1000-Hadoop/-ml_model.json
Lines 204 (patched)


In addition to defaultCpuCores and defaultMemoryMb, we should probably have 
a field indicating whether the model utilizes GPU's.



addons/models/1000-Hadoop/-ml_model.json
Lines 220 (patched)


I'm not sure defaultReplicas belongs on the model build. The correct number 
of replicas depends on usage, it's not really a property of the model itself.



addons/models/1000-Hadoop/-ml_model.json
Lines 228 (patched)


"imageTag" would be more standard terminology. The tag typically includes 
the entire image URL.

Consider also adding the image hash, as tags/url's are not necessarily 
unique.



addons/models/1000-Hadoop/-ml_model.json
Lines 248 (patched)


Should we also have exampleResponse?



addons/models/1000-Hadoop/-ml_model.json
Lines 264 (patched)


Same question about supertype



addons/models/1000-Hadoop/-ml_model.json
Lines 268 (patched)


We don't need a uniqueID for model deployments?



addons/models/1000-Hadoop/-ml_model.json
Lines 306 (patched)


I don't think we shoud expose serviceIP. It may not be distinct from 
modelEndpointURL in non-Kubernetes serving environments and when it is, we may 
not want to encourage use of it. For example, when a model is being called from 
outside the k8s cluster where it is running, the service IP is not usable.



addons/models/1000-Hadoop/-ml_model.json
Lines 319 (patched)


If this is required I think it should be an enum or equivalent in Atlas.



addons/models/1000-Hadoop/-ml_model.json
Lines 323 (patched)


Use the same units for CPU and memory here as in the defaults for model 
build. I would go with millicores and bytes.



addons/models/1000-Hadoop/-ml_model.json
Lines 336 (patched)