Hello Kevin,

So I tried a bit but could not reproduce the issue. I tried with two
separate topologies and tried sending concurrent requests (loop of 100).
I tried with Auth caching and with auth failure requests still could not
reproduce the issue.

I am suspecting this has something to do with the load balancer (somehow
sticky requests).
For now I'll open a JIRA for it and target it to 0.15.0

Best,
Sandeep

On Tue, Oct 31, 2017 at 10:23 AM, Kevin Risden <[email protected]> wrote:

> Sandeep - Any progress here? Any more ideas? We haven't had a chance to
> look much closer yet but if you need more logs let me know.
>
> Kevin Risden
>
> On Wed, Oct 18, 2017 at 8:42 AM, Sandeep More <[email protected]>
> wrote:
>
>> This is interesting (and troubling). Thanks for the detailed analysis
>> Kevin, this really helps to narrow down the problem. I will try to take a
>> look at Auth failures and Concurrency issue that are still to be
>> investigated.
>>
>> Best,
>> Sandeep
>>
>> On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <[email protected]> wrote:
>>
>>> Responding to myself to add some more detail after digging deeper:
>>>
>>> More evidence from duplicate correlation ids for the following cases:
>>>
>>>    - this affects different services (same correlation id, different
>>>    services being called)
>>>    - multiple successful individual requests with same correlation id
>>>
>>> I'll try anonymize them and share to this thread in the next few days.
>>>
>>> We had a few theories that we are slowly working through
>>>
>>>    - Correlation id generation is returning duplicates?
>>>    - Status: Basically disproven
>>>       - https://github.com/apache/knox/blob/master/gateway-server/sr
>>>       c/main/java/org/apache/hadoop/gateway/filter/CorrelationHand
>>>       ler.java#L37
>>>       
>>> <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37>
>>>       - Wrote a small Java program to test this.
>>>          - UUID.randomUUID().toString in multiple threads
>>>          - No duplicates returned for 100+ threads and 10000 ids
>>>          generated quickly
>>>       - Pretty sure the random number generator is ok
>>>    - Requests from same IP address = same correlation id?
>>>       - Status: Disproven
>>>       - This doesn't make sense for the different services
>>>       - Different users would be hitting different services (hive vs
>>>       hbase)
>>>    - If it only occurs with authentication failures could be the system
>>>    recycling the ID?
>>>       - Status: Not explored yet
>>>       - Active conversations require a unique ID, so it might be that
>>>       auth fail 401 = conversation over and UUID recycling.
>>>       - Not sure how to test this right now
>>>    - CorrelationContext or CorrelationService or CorrelationHandler is
>>>    getting confused?
>>>       - The filter is adding the same id multiple times?
>>>       - Asynchronous calls are being handled differently?
>>>
>>> As of right now we don't have a smoking gun as to what is happening, but
>>> I would guess there is something in Knox causing the duplicates to happen.
>>>
>>> Any ideas on where to look next would be appreciated.
>>>
>>>
>>>
>>> Kevin Risden
>>>
>>> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]>
>>> wrote:
>>>
>>>> From the Knox User list thread: "Multiple topology audit logging", it
>>>> came to my attention that Knox seems to be logging duplicate correlation
>>>> ids. Separating out the topic specifically here to dig a bit deeper.
>>>>
>>>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the
>>>> "correlation id" doesn't seem to be unique across requests. Is this to be
>>>> expected? Here is a snippet (anonymized):
>>>>
>>>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log
>>>>>
>>>>
>>>>
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb
>>>>> ase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3ara
>>>>> w&|success|
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups:
>>>>> []
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=
>>>>> ID2%2Fd%3Araw|unavailable|Request
>>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest>
>>>>>  method:
>>>>> GET
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=
>>>>> ID2%2Fd%3Araw|success|Response
>>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse>
>>>>>  status:
>>>>> 200
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR
>>>>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response
>>>>> status: 200
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP
>>>>> authentication failed.
>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb
>>>>> ase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401
>>>>
>>>>
>>>> The things to highlight here for the same correlation id:
>>>> * different topologies are being used
>>>> * different uris are being used
>>>> * different users are being used
>>>>
>>>> Some of the things that we have configured that could impact results:
>>>> * authentication caching
>>>> * multiple Knox servers
>>>> * load balancer in front of Knox
>>>>
>>>> I would be surprised if those things would make one Knox node (it could
>>>> affect our other Knox nodes haven't looked yet) have duplicate correlation
>>>> ids.
>>>>
>>>> From Larry Mccay:
>>>>
>>>>> Interesting...
>>>>>
>>>>> I have not seen this before at all.
>>>>> I don't think there is anyway for those things to affect it either -
>>>>> especially the HA related things.
>>>>> Those entries are all a single request that will not go to different
>>>>> hosts - unless there is are redirects involved and then it should become a
>>>>> separate correlation id.
>>>>>
>>>>> We should dig into the creation of the id to see what could cause it.
>>>>>
>>>>
>>>> From myself:
>>>>
>>>>> Yea that is what I was afraid of :) I looked a bit further and it
>>>>> isn't every request that has duplicates. The ones I have found so far are
>>>>> from LDAP authentication failures. There could very well be others since
>>>>> I've only been looking at authentication failures.
>>>>> I'll see if my team can dig some more information up. I'm hoping to
>>>>> get a few more people involved with Apache Knox. It will probably take a
>>>>> few days to dig further.
>>>>
>>>>
>>>> Kevin Risden
>>>>
>>>
>>>
>>
>

Reply via email to