[jira] [Work logged] (GOBBLIN-1339) Add cluster name to dataset descriptor

ASF GitHub Bot (Jira) Mon, 04 Jan 2021 18:05:40 -0800


     [ 
https://issues.apache.org/jira/browse/GOBBLIN-1339?focusedWorklogId=530996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530996
 ]


ASF GitHub Bot logged work on GOBBLIN-1339:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Jan/21 02:04
            Start Date: 05/Jan/21 02:04
    Worklog Time Spent: 10m 
      Work Description: aplex commented on a change in pull request #3178:
URL: https://github.com/apache/incubator-gobblin/pull/3178#discussion_r551674731



##########
File path: 
gobblin-api/src/main/java/org/apache/gobblin/dataset/DatasetDescriptor.java
##########
@@ -17,34 +17,52 @@
 
 package org.apache.gobblin.dataset;
 
-import java.util.Map;
-
 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.Maps;
-
 import lombok.Getter;
 
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
 
 /**
  * A {@link Descriptor} identifies and provides metadata to describe a dataset
  */
 public class DatasetDescriptor extends Descriptor {
   private static final String PLATFORM_KEY = "platform";
   private static final String NAME_KEY = "name";
+  private static final String CLUSTER_NAME_KEY = "clusterName";
 
   /**
    * which platform the dataset is stored, for example: local, hdfs, oracle, 
mysql, kafka
    */
   @Getter
   private final String platform;
 
+  /**
+   * Human-readeable cluster name.
+   *
+   * @see org.apache.gobblin.util.ClustersNames
+   */
+  @Getter
+  @Nullable
+  private final String clusterName;

Review comment:
       Initially, I was thinking about putting a well-known cluster name like 
"snap" into this field, and for unknown clusters put a DNS name like 
"datalake.corp.com". Then lineage events will include this name as-is, and it 
can be displayed to users directly.  We already use "cluster name" concept in 
org.apache.gobblin.util.ClustersNames, for getting nice names from yarn/hdfs 
urls.
   
   However, this well-know name is not easily convertible back to domain name. 
Single well-known name can map to different urls for hdfs/hive. So, there could 
be a benefit in keeping a hostname instead of clustername, and converting it to 
human-readable name before reporting lineage. 
   
   DatasetDescriptor already looks like a dataset URL with non-standard field 
names:
   - platform is like url.protocol - "hdfs", "kafka", "mysql"
   - name is like url.path - "table1" or "/data/tracking/test123"
   - clusterName is like a url.host - "my-cluster" or "my-cluster.company.com"
   
   Regarding the fields you suggested:
   - I don't think we have availability zone information anywhere.
   - We can get well-known nice name from hostname using 
org.apache.gobblin.util.ClustersNames , so don't have to store it separately.
   - Keeping port information is discussable. I would assume that any 
production deployments will have separate DNS names for different clusters. 
However, for small scale, temporary deployments it's possible to have multiple 
Hives or namenodes on a single host, running on different ports.
   
   

##########
File path: 
gobblin-api/src/main/java/org/apache/gobblin/dataset/DatasetDescriptor.java
##########
@@ -17,34 +17,52 @@
 
 package org.apache.gobblin.dataset;
 
-import java.util.Map;
-
 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.Maps;
-
 import lombok.Getter;
 
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
 
 /**
  * A {@link Descriptor} identifies and provides metadata to describe a dataset
  */
 public class DatasetDescriptor extends Descriptor {

Review comment:
       I think inside Gobblin codebase, we work only with dataset instances. 
When we copy or ingest stuff, we always do it on a specific hdfs/kafka/other 
cluster, rather than on abstract multi-cluster dataset.  
   
   In practice, the DatasetDescriptor  is used for lineage reporting.
   
   It's not mandatory for backward-compatibility. In the OSS code base, there 
are a bunch of Salesforce/MySql connectors that don't report cluster names at 
the moment, although they can do it as well. There also could be usages in 
other libraries. 
   
   I think in the final state, cluster name should be mandatory in this class. 
For transitional period, we can get there gradually with marking old 
constructor obsolete and clustername optional.
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 530996)
    Time Spent: 50m  (was: 40m)

> Add cluster name to dataset descriptor
> --------------------------------------
>
>                 Key: GOBBLIN-1339
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1339
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-core
>            Reporter: Alex Prokofiev
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is going to be used for more detailed lineage tracking



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (GOBBLIN-1339) Add cluster name to dataset descriptor

Reply via email to