Sandeep - Any progress here? Any more ideas? We haven't had a chance to
look much closer yet but if you need more logs let me know.

Kevin Risden

On Wed, Oct 18, 2017 at 8:42 AM, Sandeep More <[email protected]> wrote:

> This is interesting (and troubling). Thanks for the detailed analysis
> Kevin, this really helps to narrow down the problem. I will try to take a
> look at Auth failures and Concurrency issue that are still to be
> investigated.
>
> Best,
> Sandeep
>
> On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <[email protected]> wrote:
>
>> Responding to myself to add some more detail after digging deeper:
>>
>> More evidence from duplicate correlation ids for the following cases:
>>
>>    - this affects different services (same correlation id, different
>>    services being called)
>>    - multiple successful individual requests with same correlation id
>>
>> I'll try anonymize them and share to this thread in the next few days.
>>
>> We had a few theories that we are slowly working through
>>
>>    - Correlation id generation is returning duplicates?
>>    - Status: Basically disproven
>>       - https://github.com/apache/knox/blob/master/gateway-server/
>>       src/main/java/org/apache/hadoop/gateway/filter/Correlat
>>       ionHandler.java#L37
>>       
>> <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37>
>>       - Wrote a small Java program to test this.
>>          - UUID.randomUUID().toString in multiple threads
>>          - No duplicates returned for 100+ threads and 10000 ids
>>          generated quickly
>>       - Pretty sure the random number generator is ok
>>    - Requests from same IP address = same correlation id?
>>       - Status: Disproven
>>       - This doesn't make sense for the different services
>>       - Different users would be hitting different services (hive vs
>>       hbase)
>>    - If it only occurs with authentication failures could be the system
>>    recycling the ID?
>>       - Status: Not explored yet
>>       - Active conversations require a unique ID, so it might be that
>>       auth fail 401 = conversation over and UUID recycling.
>>       - Not sure how to test this right now
>>    - CorrelationContext or CorrelationService or CorrelationHandler is
>>    getting confused?
>>       - The filter is adding the same id multiple times?
>>       - Asynchronous calls are being handled differently?
>>
>> As of right now we don't have a smoking gun as to what is happening, but
>> I would guess there is something in Knox causing the duplicates to happen.
>>
>> Any ideas on where to look next would be appreciated.
>>
>>
>>
>> Kevin Risden
>>
>> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]> wrote:
>>
>>> From the Knox User list thread: "Multiple topology audit logging", it
>>> came to my attention that Knox seems to be logging duplicate correlation
>>> ids. Separating out the topic specifically here to dig a bit deeper.
>>>
>>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the
>>> "correlation id" doesn't seem to be unique across requests. Is this to be
>>> expected? Here is a snippet (anonymized):
>>>
>>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log
>>>>
>>>
>>>
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb
>>>> ase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups:
>>>> []
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=
>>>> ID2%2Fd%3Araw|unavailable|Request
>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest>
>>>>  method:
>>>> GET
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=
>>>> ID2%2Fd%3Araw|success|Response
>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse>
>>>>  status:
>>>> 200
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR
>>>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response
>>>> status: 200
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP
>>>> authentication failed.
>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb
>>>> ase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401
>>>
>>>
>>> The things to highlight here for the same correlation id:
>>> * different topologies are being used
>>> * different uris are being used
>>> * different users are being used
>>>
>>> Some of the things that we have configured that could impact results:
>>> * authentication caching
>>> * multiple Knox servers
>>> * load balancer in front of Knox
>>>
>>> I would be surprised if those things would make one Knox node (it could
>>> affect our other Knox nodes haven't looked yet) have duplicate correlation
>>> ids.
>>>
>>> From Larry Mccay:
>>>
>>>> Interesting...
>>>>
>>>> I have not seen this before at all.
>>>> I don't think there is anyway for those things to affect it either -
>>>> especially the HA related things.
>>>> Those entries are all a single request that will not go to different
>>>> hosts - unless there is are redirects involved and then it should become a
>>>> separate correlation id.
>>>>
>>>> We should dig into the creation of the id to see what could cause it.
>>>>
>>>
>>> From myself:
>>>
>>>> Yea that is what I was afraid of :) I looked a bit further and it isn't
>>>> every request that has duplicates. The ones I have found so far are from
>>>> LDAP authentication failures. There could very well be others since I've
>>>> only been looking at authentication failures.
>>>> I'll see if my team can dig some more information up. I'm hoping to get
>>>> a few more people involved with Apache Knox. It will probably take a few
>>>> days to dig further.
>>>
>>>
>>> Kevin Risden
>>>
>>
>>
>

Reply via email to