Sandeep - Any progress here? Any more ideas? We haven't had a chance to look much closer yet but if you need more logs let me know.
Kevin Risden On Wed, Oct 18, 2017 at 8:42 AM, Sandeep More <[email protected]> wrote: > This is interesting (and troubling). Thanks for the detailed analysis > Kevin, this really helps to narrow down the problem. I will try to take a > look at Auth failures and Concurrency issue that are still to be > investigated. > > Best, > Sandeep > > On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <[email protected]> wrote: > >> Responding to myself to add some more detail after digging deeper: >> >> More evidence from duplicate correlation ids for the following cases: >> >> - this affects different services (same correlation id, different >> services being called) >> - multiple successful individual requests with same correlation id >> >> I'll try anonymize them and share to this thread in the next few days. >> >> We had a few theories that we are slowly working through >> >> - Correlation id generation is returning duplicates? >> - Status: Basically disproven >> - https://github.com/apache/knox/blob/master/gateway-server/ >> src/main/java/org/apache/hadoop/gateway/filter/Correlat >> ionHandler.java#L37 >> >> <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37> >> - Wrote a small Java program to test this. >> - UUID.randomUUID().toString in multiple threads >> - No duplicates returned for 100+ threads and 10000 ids >> generated quickly >> - Pretty sure the random number generator is ok >> - Requests from same IP address = same correlation id? >> - Status: Disproven >> - This doesn't make sense for the different services >> - Different users would be hitting different services (hive vs >> hbase) >> - If it only occurs with authentication failures could be the system >> recycling the ID? >> - Status: Not explored yet >> - Active conversations require a unique ID, so it might be that >> auth fail 401 = conversation over and UUID recycling. >> - Not sure how to test this right now >> - CorrelationContext or CorrelationService or CorrelationHandler is >> getting confused? >> - The filter is adding the same id multiple times? >> - Asynchronous calls are being handled differently? >> >> As of right now we don't have a smoking gun as to what is happening, but >> I would guess there is something in Knox causing the duplicates to happen. >> >> Any ideas on where to look next would be appreciated. >> >> >> >> Kevin Risden >> >> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]> wrote: >> >>> From the Knox User list thread: "Multiple topology audit logging", it >>> came to my attention that Knox seems to be logging duplicate correlation >>> ids. Separating out the topic specifically here to dig a bit deeper. >>> >>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the >>> "correlation id" doesn't seem to be unique across requests. Is this to be >>> expected? Here is a snippet (anonymized): >>> >>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log >>>> >>> >>> >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb >>>> ase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success| >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups: >>>> [] >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>>> ID2%2Fd%3Araw|unavailable|Request >>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest> >>>> method: >>>> GET >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>>> ID2%2Fd%3Araw|success|Response >>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse> >>>> status: >>>> 200 >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR >>>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response >>>> status: 200 >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP >>>> authentication failed. >>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb >>>> ase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401 >>> >>> >>> The things to highlight here for the same correlation id: >>> * different topologies are being used >>> * different uris are being used >>> * different users are being used >>> >>> Some of the things that we have configured that could impact results: >>> * authentication caching >>> * multiple Knox servers >>> * load balancer in front of Knox >>> >>> I would be surprised if those things would make one Knox node (it could >>> affect our other Knox nodes haven't looked yet) have duplicate correlation >>> ids. >>> >>> From Larry Mccay: >>> >>>> Interesting... >>>> >>>> I have not seen this before at all. >>>> I don't think there is anyway for those things to affect it either - >>>> especially the HA related things. >>>> Those entries are all a single request that will not go to different >>>> hosts - unless there is are redirects involved and then it should become a >>>> separate correlation id. >>>> >>>> We should dig into the creation of the id to see what could cause it. >>>> >>> >>> From myself: >>> >>>> Yea that is what I was afraid of :) I looked a bit further and it isn't >>>> every request that has duplicates. The ones I have found so far are from >>>> LDAP authentication failures. There could very well be others since I've >>>> only been looking at authentication failures. >>>> I'll see if my team can dig some more information up. I'm hoping to get >>>> a few more people involved with Apache Knox. It will probably take a few >>>> days to dig further. >>> >>> >>> Kevin Risden >>> >> >> >
