[ 
https://issues.apache.org/jira/browse/NIFI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340426#comment-16340426
 ] 

ASF GitHub Bot commented on NIFI-4818:
--------------------------------------

Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2435#discussion_r164017503
  
    --- Diff: nifi-nar-bundles/nifi-atlas-bundle/nifi-atlas-reporting-task/src/main/java/org/apache/nifi/atlas/NiFiAtlasHook.java ---
    @@ -255,7 +255,11 @@ public void commitMessages() {
                 }
                 return new Tuple<>(refQualifiedName, typedQualifiedNameToRef.get(toTypedQualifiedName(typeName, refQualifiedName)));
             }).filter(Objects::nonNull).filter(tuple -> tuple.getValue() != null)
    -                .collect(Collectors.toMap(Tuple::getKey, Tuple::getValue));
    +                // If duplication happens, use new value.
    +                .collect(Collectors.toMap(Tuple::getKey, Tuple::getValue, (oldValue, newValue) -> {
    +                    logger.warn("Duplicated qualified name was found, use the new one. oldValue={}, newValue={}", new Object[]{oldValue, newValue});
    +                    return newValue;
    +                }));
    --- End diff --
    
    While I was testing, I got the following exception:
    ```
    2018-01-25 05:06:41,430 ERROR [Timer-Driven Process Thread-1] o.a.n.a.reporting.ReportLineageToAtlas ReportLineageToAtlas[id=057986ae-0161-1000-d0b0-1b890a17f5aa] Error running task ReportLineageToAtlas[id=057986ae-0161-1000-d0b0-1b890a17f5aa] due to java.lang.IllegalStateException: Duplicate key {Id='(type: fs_path, id: 69be7a40-4ff8-4c4e-b714-2d394c14398d)', traits=[], values={}} NiFiAtlasHook.258
    ```
    The exception means that an existing nifi_flow_path entity has more than
    one entry in its inputs or outputs attribute pointing to the same entity,
    i.e. entries with an identical qualified name. This happened because I was
    using an old test environment whose data was created before the Atlas
    integration implemented de-duplication logic. However, it would be safer to
    handle such duplication in case it occurs for some other reason.
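
    As a minimal, self-contained sketch of the behaviour discussed above (not
    the NiFi code itself): the two-argument Collectors.toMap throws
    IllegalStateException when two stream elements map to the same key, while
    the three-argument overload takes a merge function that decides which value
    to keep. The class name, method names, and sample keys and values below are
    hypothetical and only serve the illustration.

    ```java
    import java.util.AbstractMap.SimpleEntry;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Hypothetical demo class; it stands in for the Tuple stream in commitMessages().
    class DuplicateKeyDemo {

        // Two-argument toMap: throws IllegalStateException on a duplicate key.
        static Map<String, String> strict(List<SimpleEntry<String, String>> refs) {
            return refs.stream()
                    .collect(Collectors.toMap(SimpleEntry::getKey, SimpleEntry::getValue));
        }

        // Three-argument toMap: the merge function resolves duplicates, here by keeping the newer value.
        static Map<String, String> keepNewest(List<SimpleEntry<String, String>> refs) {
            return refs.stream()
                    .collect(Collectors.toMap(SimpleEntry::getKey, SimpleEntry::getValue,
                            (oldValue, newValue) -> newValue));
        }

        public static void main(String[] args) {
            // Two references sharing one (made-up) qualified name.
            List<SimpleEntry<String, String>> refs = Arrays.asList(
                    new SimpleEntry<>("/tmp/a@cluster", "ref-1"),
                    new SimpleEntry<>("/tmp/a@cluster", "ref-2"));

            System.out.println(keepNewest(refs));
            try {
                strict(refs);
            } catch (IllegalStateException e) {
                System.out.println("strict variant failed: " + e.getMessage());
            }
        }
    }
    ```

    Keeping the newer value matches the choice made in the diff; keeping the
    older one would only require returning oldValue from the merge function.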


> Fix transit URL parsing at Hive2JDBC and KafkaTopic for ReportLineageToAtlas
> ----------------------------------------------------------------------------
>
>                 Key: NIFI-4818
>                 URL: https://issues.apache.org/jira/browse/NIFI-4818
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.5.0
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>
> ReportLineageToAtlas parses Hive JDBC connection URLs to get database names.
> This works if a connection URL has no parameters (e.g.
> jdbc:hive2://host:port/dbName), but it reports the wrong database name if
> there are parameters. For example, with
> jdbc:hive2://host:port/dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2,
> the reported database name will be
> dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2,
> including the connection parameters.
> Also, if more than one host:port is defined, it cannot derive the cluster
> name from the hostnames correctly.
> Similarly, for Kafka topics, the reporting task uses transit URIs to derive
> hostnames and topic names. It does handle multiple host:port definitions
> within a URI; however, the current logic only uses the first hostname entry
> even if there are multiple ones. For example, with the transit URI
> "PLAINTEXT://0.example.com:6667,1.example.com:6667/topicA", it uses
> "0.example.com" to match the configured regular expressions to derive a
> cluster name. If none of the regular expressions match, it uses the default
> cluster name without looping through all hostnames. It never uses the second
> or later hostnames to derive a cluster name.
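
The parsing behaviour the description asks for can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the pull request: the class TransitUriSketch and all of its methods are hypothetical names, the fallback database name "default" is an assumption, the cluster names and patterns in main are made up, and IPv6 host literals are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from NiFi.
class TransitUriSketch {

    // Text between "://" and the next '/', e.g. "h1:10000,h2:10000".
    static String authority(String uri) {
        int start = uri.indexOf("://");
        String rest = start < 0 ? uri : uri.substring(start + 3);
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    // Text after the first '/' that follows the authority, or "" if there is none.
    static String pathAfterAuthority(String uri) {
        int start = uri.indexOf("://");
        String rest = start < 0 ? uri : uri.substring(start + 3);
        int slash = rest.indexOf('/');
        return slash < 0 ? "" : rest.substring(slash + 1);
    }

    // Database name without the ";key=value", "?..." or "#..." parts of a Hive JDBC URL.
    static String hiveDatabaseName(String jdbcUrl) {
        String db = pathAfterAuthority(jdbcUrl).split("[;?#]", 2)[0];
        return db.isEmpty() ? "default" : db; // assumption: fall back to "default"
    }

    // Every hostname in a comma-separated host:port list, with the ports removed.
    static List<String> hostnames(String authority) {
        List<String> hosts = new ArrayList<>();
        for (String hostPort : authority.split(",")) {
            String trimmed = hostPort.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            hosts.add(colon < 0 ? trimmed : trimmed.substring(0, colon));
        }
        return hosts;
    }

    // Tries every hostname against every pattern before falling back to the default name.
    static String resolveClusterName(List<String> hosts, Map<String, Pattern> clusterPatterns,
                                     String defaultName) {
        for (String host : hosts) {
            for (Map.Entry<String, Pattern> entry : clusterPatterns.entrySet()) {
                if (entry.getValue().matcher(host).matches()) {
                    return entry.getKey();
                }
            }
        }
        return defaultName;
    }

    public static void main(String[] args) {
        String hiveUrl = "jdbc:hive2://host:10000/dbName;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";
        System.out.println(hiveDatabaseName(hiveUrl));

        String kafkaUri = "PLAINTEXT://0.example.com:6667,1.example.com:6667/topicA";
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("clusterB", Pattern.compile("1\\.example\\.com")); // made-up mapping
        System.out.println(pathAfterAuthority(kafkaUri));
        System.out.println(resolveClusterName(hostnames(authority(kafkaUri)), patterns, "defaultCluster"));
    }
}
```

With the made-up pattern above, only the second hostname matches, which is exactly the case where checking only the first hostname would fall back to the default cluster name.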



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
