[
https://issues.apache.org/jira/browse/GOBBLIN-1739?focusedWorklogId=824780&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-824780
]
ASF GitHub Bot logged work on GOBBLIN-1739:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 09/Nov/22 23:05
Start Date: 09/Nov/22 23:05
Worklog Time Spent: 10m
Work Description: phet commented on code in PR #3596:
URL: https://github.com/apache/gobblin/pull/3596#discussion_r1018477670
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/dataset/IcebergDatasetDescriptor.java:
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.dataset;
+
+import java.io.IOException;
+import java.util.List;
+
+import com.google.common.base.Splitter;
+import com.typesafe.config.Config;
+
+import lombok.EqualsAndHashCode;
+
+
+@EqualsAndHashCode(callSuper = true)
+public class IcebergDatasetDescriptor extends SqlDatasetDescriptor {
Review Comment:
deserves javadoc.
(I'm not against it per se from a pragmatic standpoint, but maybe
contextualize the decision to model this as a SQL dataset.) Looking again at
the base class, we mostly adopt it here to follow hive; but whereas hive
actually is a SQL DB, iceberg is not inherently one. I also notice that the
defaults there, e.g.:
```
this.databaseName = ConfigUtils.getString(config,
DatasetDescriptorConfigKeys.DATABASE_KEY, ".*");
```
are not currently valid for iceberg. So the question is: do we gain enough
from the derivation, or should we fall back to extending `BaseDatasetDescriptor`?
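A rough sketch of the `BaseDatasetDescriptor` alternative, using toy stand-ins for the Gobblin classes (a plain `Map` in place of the typesafe `Config`; all class and key names here are illustrative, not the real APIs). The point is that extending the base directly lets iceberg require explicit, literal identifiers instead of inheriting the `.*` regex defaults:

```java
import java.util.Map;
import java.util.Objects;

// Toy stand-in for BaseDatasetDescriptor; NOT the real Gobblin API.
abstract class BaseDescriptorSketch {
  protected final String platform;

  BaseDescriptorSketch(Map<String, String> config) {
    this.platform = config.get("platform");
    if (!isPlatformValid()) {
      throw new IllegalArgumentException("invalid platform: " + platform);
    }
  }

  protected abstract boolean isPlatformValid();
}

// Requires literal db/table values rather than defaulting to the ".*"
// regex used by the SQL descriptor, which is not a valid iceberg identifier.
class IcebergDescriptorSketch extends BaseDescriptorSketch {
  final String databaseName;
  final String tableName;

  IcebergDescriptorSketch(Map<String, String> config) {
    super(config);
    this.databaseName = Objects.requireNonNull(config.get("database"), "database is required");
    this.tableName = Objects.requireNonNull(config.get("table"), "table is required");
  }

  @Override
  protected boolean isPlatformValid() {
    return "iceberg".equalsIgnoreCase(platform);
  }
}
```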
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/dataset/IcebergDatasetDescriptor.java:
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.dataset;
+
+import java.io.IOException;
+import java.util.List;
+
+import com.google.common.base.Splitter;
+import com.typesafe.config.Config;
+
+import lombok.EqualsAndHashCode;
+
+
+@EqualsAndHashCode(callSuper = true)
+public class IcebergDatasetDescriptor extends SqlDatasetDescriptor {
+ public IcebergDatasetDescriptor(Config config)
+ throws IOException {
+ super(config);
+ }
+
+ @Override
+ protected boolean isPlatformValid() {
+ return "iceberg".equalsIgnoreCase(getPlatform());
+ }
+
+ @Override
+ protected boolean isPathContaining(DatasetDescriptor other) {
+ String otherPath = other.getPath();
+ if (otherPath == null) {
+ return false;
+ }
+
+ //Extract the dbName and tableName from otherPath
+ List<String> parts = Splitter.on(SEPARATION_CHAR).splitToList(otherPath);
+ if (parts.size() != 2) {
Review Comment:
let's avoid an *upper* bound on the size. AFAIU the iceberg spec allows table
identifiers to have more than two levels.
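To make the suggestion concrete, a minimal sketch of the relaxed check (plain `String.split` here, where the actual code uses Guava's `Splitter` on `SEPARATION_CHAR`; the class name is illustrative):

```java
import java.util.Arrays;
import java.util.List;

class TableIdParser {
  // Require at least db.table, but permit deeper namespaces:
  // iceberg allows multi-level table identifiers.
  static List<String> parseTableId(String path) {
    List<String> parts = Arrays.asList(path.split("\\."));
    if (parts.size() < 2) {
      throw new IllegalArgumentException("expected at least db.table, got: " + path);
    }
    return parts;
  }
}
```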
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/flowgraph/datanodes/iceberg/IcebergDataNode.java:
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.flowgraph.datanodes.iceberg;
+
+import com.typesafe.config.Config;
+
+import lombok.EqualsAndHashCode;
+
+import org.apache.gobblin.annotation.Alpha;
+import org.apache.gobblin.data.management.copy.iceberg.IcebergHiveCatalog;
+import org.apache.gobblin.service.modules.dataset.IcebergDatasetDescriptor;
+import org.apache.gobblin.service.modules.flowgraph.datanodes.hive.HiveDataNode;
+
+@Alpha
+@EqualsAndHashCode(callSuper = true)
+public class IcebergDataNode extends HiveDataNode {
Review Comment:
I'm suspicious of this derivation, because it's not clear that iceberg passes
the "can-substitute-for" test with hive. In practice, violating that test can
make RTTI brittle and order-dependent (i.e. when using `instanceof`). I suspect
you derived from `HiveDataNode` for the metastore URI, but maybe that should
instead motivate breaking the commonality into a shared base class for both
this and `HiveDataNode` (separately).
That said, the same HMS URI betrays the fact that hive is just one possible
catalog for iceberg tables. Thus we'd likely want each datanode type to
represent the combo of iceberg + a specific catalog type; as the community one
day moves away from iceberg-riding-atop-HMS, that will capture the difference.
I therefore suggest renaming to `IcebergOnHiveDataNode`.
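The refactor I have in mind looks roughly like this (toy classes with illustrative names, not the real Gobblin types, and a deliberately simplified URI check standing in for the real validation):

```java
// Hypothetical shared base holding the HMS-URI commonality.
abstract class MetastoreUriDataNodeSketch {
  protected final String metastoreUri;

  MetastoreUriDataNodeSketch(String metastoreUri) {
    if (!isMetastoreUriValid(metastoreUri)) {
      throw new IllegalArgumentException("bad metastore URI: " + metastoreUri);
    }
    this.metastoreUri = metastoreUri;
  }

  // Simplified stand-in for the real scheme validation.
  static boolean isMetastoreUriValid(String uri) {
    return uri != null && uri.startsWith("thrift://");
  }
}

class HiveDataNodeSketch extends MetastoreUriDataNodeSketch {
  HiveDataNodeSketch(String uri) { super(uri); }
}

// A sibling of the hive node rather than a subtype: an iceberg node
// backed specifically by a hive catalog, hence the "OnHive" in the name.
class IcebergOnHiveDataNodeSketch extends MetastoreUriDataNodeSketch {
  IcebergOnHiveDataNodeSketch(String uri) { super(uri); }
}
```

With this shape, `instanceof HiveDataNodeSketch` no longer matches the iceberg node, so RTTI-based dispatch stays unambiguous.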
##########
gobblin-service/src/test/java/org/apache/gobblin/service/modules/flowgraph/datanodes/iceberg/IcebergDataNodeTest.java:
##########
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.flowgraph.datanodes.iceberg;
+
+import java.net.URI;
+import java.net.URISyntaxException;
+
+import org.testng.Assert;
+import org.testng.annotations.AfterMethod;
+import org.testng.annotations.BeforeMethod;
+import org.testng.annotations.Test;
+
+import com.typesafe.config.Config;
+import com.typesafe.config.ConfigFactory;
+import com.typesafe.config.ConfigValueFactory;
+
+import org.apache.gobblin.service.modules.flowgraph.DataNode;
+import org.apache.gobblin.service.modules.flowgraph.FlowGraphConfigurationKeys;
+
+
+public class IcebergDataNodeTest {
+
+ Config config = null;
+
+ @BeforeMethod
+ public void setUp() {
+ String expectedNodeId = "some-iceberg-node-id";
+ String expectedAdlFsUri = "hdfs://data.hdfs.core.windows.net";
+ String expectedHiveMetastoreUri = "thrift://hcat.company.com:7552";
Review Comment:
nit: these don't seem to be "expected" values so much as what you actually use
to initialize
##########
gobblin-service/src/test/java/org/apache/gobblin/service/modules/dataset/IcebergDatasetDescriptorTest.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.dataset;
+
+import java.io.IOException;
+
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+import com.typesafe.config.Config;
+import com.typesafe.config.ConfigFactory;
+import com.typesafe.config.ConfigValueFactory;
+
+import org.apache.gobblin.service.modules.flowgraph.DatasetDescriptorConfigKeys;
+
+
+public class IcebergDatasetDescriptorTest {
+  Config baseConfig = ConfigFactory.empty().withValue(DatasetDescriptorConfigKeys.PLATFORM_KEY, ConfigValueFactory.fromAnyRef("iceberg"))
+      .withValue(DatasetDescriptorConfigKeys.DATABASE_KEY, ConfigValueFactory.fromAnyRef("testDb_Db1"))
+      .withValue(DatasetDescriptorConfigKeys.TABLE_KEY, ConfigValueFactory.fromAnyRef("testTable_Table1"));
+
+ @Test
+ public void testIsPathContaining() throws IOException {
+    Config config1 = ConfigFactory.empty().withValue(DatasetDescriptorConfigKeys.PLATFORM_KEY, ConfigValueFactory.fromAnyRef("iceberg"))
+        .withValue(DatasetDescriptorConfigKeys.DATABASE_KEY, ConfigValueFactory.fromAnyRef("testDb_Db1"))
+        .withValue(DatasetDescriptorConfigKeys.TABLE_KEY, ConfigValueFactory.fromAnyRef("testTable_Table1"));
+
+    Config config2 = ConfigFactory.empty().withValue(DatasetDescriptorConfigKeys.PLATFORM_KEY, ConfigValueFactory.fromAnyRef("iceberg"))
+        .withValue(DatasetDescriptorConfigKeys.DATABASE_KEY, ConfigValueFactory.fromAnyRef("testDb_Db2"))
+        .withValue(DatasetDescriptorConfigKeys.TABLE_KEY, ConfigValueFactory.fromAnyRef("testTable_Table2"));
Review Comment:
might be more meaningful to test same DB, but different table name
##########
gobblin-service/src/test/java/org/apache/gobblin/service/modules/flowgraph/datanodes/iceberg/IcebergDataNodeTest.java:
##########
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.flowgraph.datanodes.iceberg;
+
+import java.net.URI;
+import java.net.URISyntaxException;
+
+import org.testng.Assert;
+import org.testng.annotations.AfterMethod;
+import org.testng.annotations.BeforeMethod;
+import org.testng.annotations.Test;
+
+import com.typesafe.config.Config;
+import com.typesafe.config.ConfigFactory;
+import com.typesafe.config.ConfigValueFactory;
+
+import org.apache.gobblin.service.modules.flowgraph.DataNode;
+import org.apache.gobblin.service.modules.flowgraph.FlowGraphConfigurationKeys;
+
+
+public class IcebergDataNodeTest {
+
+ Config config = null;
+
+ @BeforeMethod
+ public void setUp() {
+ String expectedNodeId = "some-iceberg-node-id";
+ String expectedAdlFsUri = "hdfs://data.hdfs.core.windows.net";
+ String expectedHiveMetastoreUri = "thrift://hcat.company.com:7552";
+
+ config = ConfigFactory.empty()
+        .withValue(FlowGraphConfigurationKeys.DATA_NODE_ID_KEY, ConfigValueFactory.fromAnyRef(expectedNodeId))
+        .withValue(FlowGraphConfigurationKeys.DATA_NODE_PREFIX + "fs.uri", ConfigValueFactory.fromAnyRef(expectedAdlFsUri))
+        .withValue(FlowGraphConfigurationKeys.DATA_NODE_PREFIX + "hive.metastore.uri", ConfigValueFactory.fromAnyRef(expectedHiveMetastoreUri));
+ }
+
+ @AfterMethod
+ public void tearDown() {
+ }
+
+ @Test
+  public void testIcebergDataNodeWithValidMetastoreUri() throws DataNode.DataNodeCreationException, URISyntaxException {
+ IcebergDataNode icebergDataNode = new IcebergDataNode(config);
+    URI uri = new URI(config.getString(FlowGraphConfigurationKeys.DATA_NODE_PREFIX + "hive.metastore.uri"));
+ Assert.assertTrue(icebergDataNode.isMetastoreUriValid(uri));
+ }
+
+ @Test(expectedExceptions = DataNode.DataNodeCreationException.class)
+  public void testIcebergDataNodeWithInvalidMetastoreUri() throws DataNode.DataNodeCreationException, URISyntaxException {
+ String expectedHiveMetastoreUri = "thrift-1://hcat.company.com:7552";
Review Comment:
maybe `badHiveMetastoreUri` or `bogusHM...`
##########
gobblin-service/src/test/java/org/apache/gobblin/service/modules/dataset/IcebergDatasetDescriptorTest.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.dataset;
+
+import java.io.IOException;
+
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+import com.typesafe.config.Config;
+import com.typesafe.config.ConfigFactory;
+import com.typesafe.config.ConfigValueFactory;
+
+import org.apache.gobblin.service.modules.flowgraph.DatasetDescriptorConfigKeys;
+
+
+public class IcebergDatasetDescriptorTest {
+  Config baseConfig = ConfigFactory.empty().withValue(DatasetDescriptorConfigKeys.PLATFORM_KEY, ConfigValueFactory.fromAnyRef("iceberg"))
+      .withValue(DatasetDescriptorConfigKeys.DATABASE_KEY, ConfigValueFactory.fromAnyRef("testDb_Db1"))
+      .withValue(DatasetDescriptorConfigKeys.TABLE_KEY, ConfigValueFactory.fromAnyRef("testTable_Table1"));
Review Comment:
how about a helper method taking `(dbName, tableName)`?
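Something like the following shape (sketched with a plain `Map` standing in for the typesafe `Config`, and illustrative key names; in the real test it would return a `Config` built via `ConfigFactory.empty().withValue(...)` with the `DatasetDescriptorConfigKeys` constants):

```java
import java.util.HashMap;
import java.util.Map;

class DescriptorConfigs {
  // Helper to cut the repetition across baseConfig/config1/config2:
  // only the db and table vary, so only they are parameters.
  static Map<String, String> icebergDescriptorConfig(String dbName, String tableName) {
    Map<String, String> config = new HashMap<>();
    config.put("platform", "iceberg");
    config.put("database", dbName);
    config.put("table", tableName);
    return config;
  }
}
```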
Issue Time Tracking
-------------------
Worklog Id: (was: 824780)
Time Spent: 0.5h (was: 20m)
> Define Datanodes and Dataset Descriptor for Iceberg
> ---------------------------------------------------
>
> Key: GOBBLIN-1739
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1739
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Meeth Gala
> Priority: Minor
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)