[ 
https://issues.apache.org/jira/browse/NIFI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Kawamura updated NIFI-4818:
--------------------------------
    Description: 
ReportLineageToAtlas parses Hive JDBC connection URLs to get database names. 
This works when a connection URL has no parameters (e.g. 
jdbc:hive2://host:port/dbName), but it reports the wrong database name when 
parameters are present. E.g. with 
jdbc:hive2://host:port/dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2,
 the reported database name will be 
dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2, including 
the connection parameters.

Also, if more than one host:port is defined, it cannot derive the cluster 
name from the hostnames correctly.
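
To illustrate the expected behavior, here is a minimal sketch of the parsing 
described above. It is not the actual NiFi code: the class name 
HiveJdbcUrlSketch and its methods are hypothetical, and it assumes the usual 
Hive2 URL layout of jdbc:hive2://host1:port1,host2:port2/dbName followed by 
optional ';', '?' or '#' sections. Falling back to "default" when no database 
is given is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only, not the NiFi implementation. */
final class HiveJdbcUrlSketch {
    private static final String PREFIX = "jdbc:hive2://";

    /** Returns the database name with any connection parameters removed. */
    static String databaseName(String url) {
        String rest = stripPrefix(url);
        int slash = rest.indexOf('/');
        if (slash < 0) {
            return "default"; // assumption: no path means the default database
        }
        String path = rest.substring(slash + 1);
        // The database name ends at the first ';', '?' or '#', whichever comes first.
        int end = path.length();
        for (char c : new char[] {';', '?', '#'}) {
            int i = path.indexOf(c);
            if (i >= 0 && i < end) {
                end = i;
            }
        }
        String db = path.substring(0, end);
        return db.isEmpty() ? "default" : db;
    }

    /** Returns every hostname from a comma-separated host:port list. */
    static List<String> hostNames(String url) {
        String rest = stripPrefix(url);
        int slash = rest.indexOf('/');
        String authority = slash < 0 ? rest : rest.substring(0, slash);
        List<String> hosts = new ArrayList<>();
        for (String hostPort : authority.split(",")) {
            String trimmed = hostPort.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            hosts.add(colon < 0 ? trimmed : trimmed.substring(0, colon));
        }
        return hosts;
    }

    private static String stripPrefix(String url) {
        if (!url.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a Hive2 JDBC URL: " + url);
        }
        return url.substring(PREFIX.length());
    }
}
```

With the URL from the example, databaseName(...) would yield just dbName, and 
hostNames(...) would return all hosts so that each one can be checked when 
deriving a cluster name. URLs that carry parameters but no '/' are not handled 
by this sketch.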

Similarly for Kafka topics, the reporting task uses transit URIs to analyze 
hostnames and topic names. It does handle multiple host:port definitions within 
a URI; however, the current logic only uses the first hostname entry even if 
there are multiple ones. For example, with a transit URI 
"PLAINTEXT://0.example.com:6667,1.example.com:6667/topicA", it uses 
"0.example.com" to match against the configured regular expressions to derive a 
cluster name. If none of the regexes match, it uses the default cluster name 
without looping through the remaining hostnames. It never uses the second or 
later hostnames to derive a cluster name.
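
The intended lookup could be sketched as follows. This is an illustration 
under assumptions, not the reporting task's real code: the class 
KafkaClusterNameSketch, its method names, and the cluster-name-to-pattern map 
are all hypothetical. The point is only that every hostname in the transit 
URI is tried before falling back to the default cluster name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only, not the NiFi implementation. */
final class KafkaClusterNameSketch {

    /** Extracts every hostname from e.g. "PLAINTEXT://h0:6667,h1:6667/topicA". */
    static List<String> hostNames(String transitUri) {
        int schemeEnd = transitUri.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("No scheme in URI: " + transitUri);
        }
        String rest = transitUri.substring(schemeEnd + 3);
        int slash = rest.indexOf('/');
        String authority = slash < 0 ? rest : rest.substring(0, slash);
        List<String> hosts = new ArrayList<>();
        for (String hostPort : authority.split(",")) {
            String trimmed = hostPort.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            hosts.add(colon < 0 ? trimmed : trimmed.substring(0, colon));
        }
        return hosts;
    }

    /**
     * Tries every hostname against every configured pattern and returns the
     * first matching cluster name; the default is used only if nothing matches.
     */
    static String resolve(List<String> hostNames,
                          Map<String, Pattern> patternsByCluster,
                          String defaultClusterName) {
        for (String host : hostNames) {
            for (Map.Entry<String, Pattern> entry : patternsByCluster.entrySet()) {
                if (entry.getValue().matcher(host).matches()) {
                    return entry.getKey();
                }
            }
        }
        return defaultClusterName;
    }
}
```

For the transit URI above, a pattern that only matches "1.example.com" would 
still resolve to its cluster, because the loop does not stop after the first 
hostname.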

  was:
ReportLineageToAtlas parses Hive JDBC connection URLs to get database names. It 
works if a connection URL does not have parameters. (e.g. 
jdbc:hive2://host:port/dbName) But it reports wrong database name if there are 
parameters. (e.g. 
jdbc:hive2://host.port/dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2)

Also, if there are more than one host:port defined, it will not be able to 
analyze cluster name from hostnames correctly.

Similarly for Kafka topic, the reporting task uses transit URIs to analyze 
hostnames and topic names. It does handle multiple host:port definitions within 
a URI, however, current logic only uses the first hostname entry even if there 
are multiple ones. For example, with a transit URI, 
"PLAINTEXT://0.example.com:6667,1.example.com:6667/topicA", it uses 
"0.example.com" to match configured regular expressions to derive a cluster 
name. If none of regex matches, then it uses the default cluster name without 
looping through all hostnames. It never uses the 2nd or later hostnames to 
derive a cluster name.


> Fix transit URL parsing at Hive2JDBC and KafkaTopic for ReportLineageToAtlas
> ----------------------------------------------------------------------------
>
>                 Key: NIFI-4818
>                 URL: https://issues.apache.org/jira/browse/NIFI-4818
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.5.0
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>
> ReportLineageToAtlas parses Hive JDBC connection URLs to get database names. 
> This works when a connection URL has no parameters (e.g. 
> jdbc:hive2://host:port/dbName), but it reports the wrong database name when 
> parameters are present. E.g. with 
> jdbc:hive2://host:port/dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2,
>  the reported database name will be 
> dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2, 
> including the connection parameters.
> Also, if more than one host:port is defined, it cannot derive the cluster 
> name from the hostnames correctly.
> Similarly for Kafka topics, the reporting task uses transit URIs to analyze 
> hostnames and topic names. It does handle multiple host:port definitions 
> within a URI; however, the current logic only uses the first hostname entry 
> even if there are multiple ones. For example, with a transit URI 
> "PLAINTEXT://0.example.com:6667,1.example.com:6667/topicA", it uses 
> "0.example.com" to match against the configured regular expressions to derive 
> a cluster name. If none of the regexes match, it uses the default cluster 
> name without looping through the remaining hostnames. It never uses the 
> second or later hostnames to derive a cluster name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
