Hello Kevin, So I tried a bit but could not reproduce the issue. I tried with two separate topologies and tried sending concurrent requests (loop of 100). I tried with Auth caching and with auth failure requests still could not reproduce the issue.
I am suspecting this has something to do with the load balancer (somehow sticky requests). For now I'll open a JIRA for it and target it to 0.15.0 Best, Sandeep On Tue, Oct 31, 2017 at 10:23 AM, Kevin Risden <[email protected]> wrote: > Sandeep - Any progress here? Any more ideas? We haven't had a chance to > look much closer yet but if you need more logs let me know. > > Kevin Risden > > On Wed, Oct 18, 2017 at 8:42 AM, Sandeep More <[email protected]> > wrote: > >> This is interesting (and troubling). Thanks for the detailed analysis >> Kevin, this really helps to narrow down the problem. I will try to take a >> look at Auth failures and Concurrency issue that are still to be >> investigated. >> >> Best, >> Sandeep >> >> On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <[email protected]> wrote: >> >>> Responding to myself to add some more detail after digging deeper: >>> >>> More evidence from duplicate correlation ids for the following cases: >>> >>> - this affects different services (same correlation id, different >>> services being called) >>> - multiple successful individual requests with same correlation id >>> >>> I'll try anonymize them and share to this thread in the next few days. >>> >>> We had a few theories that we are slowly working through >>> >>> - Correlation id generation is returning duplicates? >>> - Status: Basically disproven >>> - https://github.com/apache/knox/blob/master/gateway-server/sr >>> c/main/java/org/apache/hadoop/gateway/filter/CorrelationHand >>> ler.java#L37 >>> >>> <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37> >>> - Wrote a small Java program to test this. >>> - UUID.randomUUID().toString in multiple threads >>> - No duplicates returned for 100+ threads and 10000 ids >>> generated quickly >>> - Pretty sure the random number generator is ok >>> - Requests from same IP address = same correlation id? >>> - Status: Disproven >>> - This doesn't make sense for the different services >>> - Different users would be hitting different services (hive vs >>> hbase) >>> - If it only occurs with authentication failures could be the system >>> recycling the ID? >>> - Status: Not explored yet >>> - Active conversations require a unique ID, so it might be that >>> auth fail 401 = conversation over and UUID recycling. >>> - Not sure how to test this right now >>> - CorrelationContext or CorrelationService or CorrelationHandler is >>> getting confused? >>> - The filter is adding the same id multiple times? >>> - Asynchronous calls are being handled differently? >>> >>> As of right now we don't have a smoking gun as to what is happening, but >>> I would guess there is something in Knox causing the duplicates to happen. >>> >>> Any ideas on where to look next would be appreciated. >>> >>> >>> >>> Kevin Risden >>> >>> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]> >>> wrote: >>> >>>> From the Knox User list thread: "Multiple topology audit logging", it >>>> came to my attention that Knox seems to be logging duplicate correlation >>>> ids. Separating out the topic specifically here to dig a bit deeper. >>>> >>>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the >>>> "correlation id" doesn't seem to be unique across requests. Is this to be >>>> expected? Here is a snippet (anonymized): >>>> >>>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log >>>>> >>>> >>>> >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb >>>>> ase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3ara >>>>> w&|success| >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups: >>>>> [] >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>>>> ID2%2Fd%3Araw|unavailable|Request >>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest> >>>>> method: >>>>> GET >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>>>> ID2%2Fd%3Araw|success|Response >>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse> >>>>> status: >>>>> 200 >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR >>>>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response >>>>> status: 200 >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP >>>>> authentication failed. >>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb >>>>> ase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401 >>>> >>>> >>>> The things to highlight here for the same correlation id: >>>> * different topologies are being used >>>> * different uris are being used >>>> * different users are being used >>>> >>>> Some of the things that we have configured that could impact results: >>>> * authentication caching >>>> * multiple Knox servers >>>> * load balancer in front of Knox >>>> >>>> I would be surprised if those things would make one Knox node (it could >>>> affect our other Knox nodes haven't looked yet) have duplicate correlation >>>> ids. >>>> >>>> From Larry Mccay: >>>> >>>>> Interesting... >>>>> >>>>> I have not seen this before at all. >>>>> I don't think there is anyway for those things to affect it either - >>>>> especially the HA related things. >>>>> Those entries are all a single request that will not go to different >>>>> hosts - unless there is are redirects involved and then it should become a >>>>> separate correlation id. >>>>> >>>>> We should dig into the creation of the id to see what could cause it. >>>>> >>>> >>>> From myself: >>>> >>>>> Yea that is what I was afraid of :) I looked a bit further and it >>>>> isn't every request that has duplicates. The ones I have found so far are >>>>> from LDAP authentication failures. There could very well be others since >>>>> I've only been looking at authentication failures. >>>>> I'll see if my team can dig some more information up. I'm hoping to >>>>> get a few more people involved with Apache Knox. It will probably take a >>>>> few days to dig further. >>>> >>>> >>>> Kevin Risden >>>> >>> >>> >> >
