This is interesting (and troubling). Thanks for the detailed analysis Kevin, this really helps to narrow down the problem. I will try to take a look at Auth failures and Concurrency issue that are still to be investigated.
Best, Sandeep On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <[email protected]> wrote: > Responding to myself to add some more detail after digging deeper: > > More evidence from duplicate correlation ids for the following cases: > > - this affects different services (same correlation id, different > services being called) > - multiple successful individual requests with same correlation id > > I'll try anonymize them and share to this thread in the next few days. > > We had a few theories that we are slowly working through > > - Correlation id generation is returning duplicates? > - Status: Basically disproven > - https://github.com/apache/knox/blob/master/gateway- > server/src/main/java/org/apache/hadoop/gateway/filter/ > CorrelationHandler.java#L37 > > <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37> > - Wrote a small Java program to test this. > - UUID.randomUUID().toString in multiple threads > - No duplicates returned for 100+ threads and 10000 ids > generated quickly > - Pretty sure the random number generator is ok > - Requests from same IP address = same correlation id? > - Status: Disproven > - This doesn't make sense for the different services > - Different users would be hitting different services (hive vs > hbase) > - If it only occurs with authentication failures could be the system > recycling the ID? > - Status: Not explored yet > - Active conversations require a unique ID, so it might be that > auth fail 401 = conversation over and UUID recycling. > - Not sure how to test this right now > - CorrelationContext or CorrelationService or CorrelationHandler is > getting confused? > - The filter is adding the same id multiple times? > - Asynchronous calls are being handled differently? > > As of right now we don't have a smoking gun as to what is happening, but I > would guess there is something in Knox causing the duplicates to happen. > > Any ideas on where to look next would be appreciated. > > > > Kevin Risden > > On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]> wrote: > >> From the Knox User list thread: "Multiple topology audit logging", it >> came to my attention that Knox seems to be logging duplicate correlation >> ids. Separating out the topic specifically here to dig a bit deeper. >> >> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the >> "correlation id" doesn't seem to be unique across requests. Is this to be >> expected? Here is a snippet (anonymized): >> >> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log >>> >> >> >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/ >>> hbase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success| >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups: >>> [] >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>> ID2%2Fd%3Araw|unavailable|Request >>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest> >>> method: >>> GET >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>> ID2%2Fd%3Araw|success|Response >>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse> >>> status: >>> 200 >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR >>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response >>> status: 200 >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP >>> authentication failed. >>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/ >>> hbase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401 >> >> >> The things to highlight here for the same correlation id: >> * different topologies are being used >> * different uris are being used >> * different users are being used >> >> Some of the things that we have configured that could impact results: >> * authentication caching >> * multiple Knox servers >> * load balancer in front of Knox >> >> I would be surprised if those things would make one Knox node (it could >> affect our other Knox nodes haven't looked yet) have duplicate correlation >> ids. >> >> From Larry Mccay: >> >>> Interesting... >>> >>> I have not seen this before at all. >>> I don't think there is anyway for those things to affect it either - >>> especially the HA related things. >>> Those entries are all a single request that will not go to different >>> hosts - unless there is are redirects involved and then it should become a >>> separate correlation id. >>> >>> We should dig into the creation of the id to see what could cause it. >>> >> >> From myself: >> >>> Yea that is what I was afraid of :) I looked a bit further and it isn't >>> every request that has duplicates. The ones I have found so far are from >>> LDAP authentication failures. There could very well be others since I've >>> only been looking at authentication failures. >>> I'll see if my team can dig some more information up. I'm hoping to get >>> a few more people involved with Apache Knox. It will probably take a few >>> days to dig further. >> >> >> Kevin Risden >> > >
