This is interesting (and troubling). Thanks for the detailed analysis,
Kevin; this really helps narrow down the problem. I will try to take a
look at the auth failures and the concurrency issue that are still to be
investigated.

Best,
Sandeep

On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <[email protected]> wrote:

> Responding to myself to add some more detail after digging deeper:
>
> More evidence of duplicate correlation ids in the following cases:
>
>    - this affects different services (same correlation id, different
>    services being called)
>    - multiple successful individual requests with same correlation id
>
> I'll try to anonymize them and share them on this thread in the next few days.
>
> We have a few theories that we are slowly working through:
>
>    - Correlation id generation is returning duplicates?
>    - Status: Basically disproven
>       - https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37
>       - Wrote a small Java program to test this (a rough sketch follows
>       after this list).
>          - UUID.randomUUID().toString() in multiple threads
>          - No duplicates returned for 100+ threads and 10000 ids
>          generated quickly
>       - Pretty sure the random number generator is ok
>    - Requests from same IP address = same correlation id?
>       - Status: Disproven
>       - This doesn't explain the duplicates seen across different services
>       - Different users would be hitting different services (Hive vs.
>       HBase)
>    - If it only occurs with authentication failures, could the system be
>    recycling the ID?
>       - Status: Not explored yet
>       - Active conversations require a unique ID, so it might be that an
>       auth failure (401) ends the conversation and the UUID gets recycled.
>       - Not sure how to test this right now
>    - CorrelationContext, CorrelationService, or CorrelationHandler is
>    getting confused? (a hypothetical illustration follows after this list)
>       - The filter is adding the same id multiple times?
>       - Asynchronous calls are being handled differently?
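>
> A rough sketch of the kind of test program mentioned above (representative
> only; the class name and the exact thread/id counts are assumptions, not the
> exact code we ran):
>
>     import java.util.Set;
>     import java.util.UUID;
>     import java.util.concurrent.ConcurrentHashMap;
>     import java.util.concurrent.CountDownLatch;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.TimeUnit;
>
>     public class UuidDuplicateTest {
>         public static void main(String[] args) throws InterruptedException {
>             final int threads = 100;          // "100+ threads"
>             final int idsPerThread = 10_000;  // "10000 ids generated quickly"
>             final Set<String> seen = ConcurrentHashMap.newKeySet();
>             final CountDownLatch start = new CountDownLatch(1);
>             ExecutorService pool = Executors.newFixedThreadPool(threads);
>
>             for (int t = 0; t < threads; t++) {
>                 pool.submit(() -> {
>                     try {
>                         start.await(); // fire all threads at roughly the same time
>                     } catch (InterruptedException e) {
>                         Thread.currentThread().interrupt();
>                         return;
>                     }
>                     for (int i = 0; i < idsPerThread; i++) {
>                         String id = UUID.randomUUID().toString();
>                         if (!seen.add(id)) {
>                             System.out.println("DUPLICATE: " + id);
>                         }
>                     }
>                 });
>             }
>
>             start.countDown();
>             pool.shutdown();
>             pool.awaitTermination(5, TimeUnit.MINUTES);
>             System.out.println("Unique ids generated: " + seen.size());
>         }
>     }
>
> A clean run (no DUPLICATE lines printed) is consistent with the result
> described above, which is why we think the generator itself is fine.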
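>
> To make the last theory concrete, here is a purely hypothetical illustration
> (not Knox's actual code; the class and method names are made up) of how a
> ThreadLocal-based correlation context could hand the same id to a later
> request on the same pooled thread if a filter returns early, e.g. on a 401,
> without clearing the context:
>
>     import java.util.UUID;
>
>     // Hypothetical example only -- NOT taken from the Knox code base.
>     public class CorrelationIdLeakExample {
>
>         private static final ThreadLocal<String> CORRELATION_ID = new ThreadLocal<>();
>
>         // Filter-style logic: assign an id only when none is already present.
>         static String getOrCreateId() {
>             String id = CORRELATION_ID.get();
>             if (id == null) {
>                 id = UUID.randomUUID().toString();
>                 CORRELATION_ID.set(id);
>             }
>             return id;
>         }
>
>         // Cleanup that must run at the end of every request.
>         static void clearId() {
>             CORRELATION_ID.remove();
>         }
>
>         public static void main(String[] args) {
>             // Request 1 on this thread: auth fails and the early return skips cleanup.
>             System.out.println("request 1 id: " + getOrCreateId());
>             // clearId() intentionally not called here to simulate the bug.
>
>             // Request 2 handled by the same pooled thread sees the stale id.
>             System.out.println("request 2 id: " + getOrCreateId());
>         }
>     }
>
> If something along these lines were happening, the duplicates would line up
> with the auth-failure pattern we are seeing, but this is only an illustration
> of the theory, not a confirmed cause.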
>
> As of right now we don't have a smoking gun for what is happening, but I
> would guess there is something in Knox causing the duplicates.
>
> Any ideas on where to look next would be appreciated.
>
>
>
> Kevin Risden
>
> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]> wrote:
>
>> From the Knox user list thread "Multiple topology audit logging", it
>> came to my attention that Knox seems to be logging duplicate correlation
>> ids. I'm splitting the topic out here to dig a bit deeper.
>>
>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5), the
>> "correlation id" doesn't seem to be unique across requests. Is this to be
>> expected? Here is a snippet (anonymized):
>>
>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log
>>
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hbase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups: []
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw|unavailable|Request method: GET
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw|success|Response status: 200
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response status: 200
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP authentication failed.
>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hbase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401
>>
>> The things to highlight here for the same correlation id:
>> * different topologies are being used
>> * different uris are being used
>> * different users are being used
>>
>> Some of the things that we have configured that could impact results:
>> * authentication caching
>> * multiple Knox servers
>> * load balancer in front of Knox
>>
>> I would be surprised if those things could make a single Knox node
>> produce duplicate correlation ids (they could affect our other Knox
>> nodes too; I haven't looked at those yet).
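>>
>> To check this across logs and nodes, here is a rough Java sketch that flags
>> correlation ids appearing with more than one user or more than one
>> /gateway/<topology>/ prefix. The class name is made up and the field
>> positions are only assumed from the pipe-delimited snippet above (correlation
>> id in the third field, user in the sixth), so they may need adjusting:
>>
>>     import java.io.IOException;
>>     import java.nio.file.Files;
>>     import java.nio.file.Paths;
>>     import java.util.HashMap;
>>     import java.util.HashSet;
>>     import java.util.Map;
>>     import java.util.Set;
>>
>>     public class DuplicateCorrelationIdScan {
>>         public static void main(String[] args) throws IOException {
>>             String logPath = args.length > 0 ? args[0] : "/var/knox/gateway-audit.log";
>>             Map<String, Set<String>> usersById = new HashMap<>();
>>             Map<String, Set<String>> topologiesById = new HashMap<>();
>>
>>             for (String line : Files.readAllLines(Paths.get(logPath))) {
>>                 String[] fields = line.split("\\|", -1);
>>                 if (fields.length < 6) {
>>                     continue; // not an audit record in the expected layout
>>                 }
>>                 String correlationId = fields[2];
>>                 String user = fields[5];
>>                 usersById.computeIfAbsent(correlationId, k -> new HashSet<>());
>>                 if (!user.isEmpty()) {
>>                     usersById.get(correlationId).add(user);
>>                 }
>>                 // Pull the topology name out of /gateway/<topology>/... if present.
>>                 int idx = line.indexOf("/gateway/");
>>                 if (idx >= 0) {
>>                     String rest = line.substring(idx + "/gateway/".length());
>>                     int slash = rest.indexOf('/');
>>                     if (slash > 0) {
>>                         topologiesById.computeIfAbsent(correlationId, k -> new HashSet<>())
>>                                 .add(rest.substring(0, slash));
>>                     }
>>                 }
>>             }
>>
>>             for (Map.Entry<String, Set<String>> e : topologiesById.entrySet()) {
>>                 Set<String> users = usersById.getOrDefault(e.getKey(), new HashSet<>());
>>                 if (e.getValue().size() > 1 || users.size() > 1) {
>>                     System.out.println(e.getKey() + " -> topologies=" + e.getValue()
>>                             + ", users=" + users);
>>                 }
>>             }
>>         }
>>     }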
>>
>> From Larry McCay:
>>
>>> Interesting...
>>>
>>> I have not seen this before at all.
>>> I don't think there is any way for those things to affect it either -
>>> especially the HA-related things.
>>> Those entries are all from a single request that will not go to different
>>> hosts - unless there are redirects involved, and then it should become a
>>> separate correlation id.
>>>
>>> We should dig into the creation of the id to see what could cause it.
>>>
>>
>> From myself:
>>
>>> Yeah, that is what I was afraid of :) I looked a bit further and it isn't
>>> every request that has duplicates. The ones I have found so far are from
>>> LDAP authentication failures. There could very well be others since I've
>>> only been looking at authentication failures.
>>> I'll see if my team can dig some more information up. I'm hoping to get
>>> a few more people involved with Apache Knox. It will probably take a few
>>> days to dig further.
>>
>>
>> Kevin Risden
>>
>
>
