Responding to myself to add some more detail after digging deeper:

More evidence from duplicate correlation ids for the following cases:

   - this affects different services (same correlation id, different
   services being called)
   - multiple successful individual requests with same correlation id

I'll try anonymize them and share to this thread in the next few days.

We had a few theories that we are slowly working through

   - Correlation id generation is returning duplicates?
   - Status: Basically disproven
      -
      
https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37
      - Wrote a small Java program to test this.
         - UUID.randomUUID().toString in multiple threads
         - No duplicates returned for 100+ threads and 10000 ids generated
         quickly
      - Pretty sure the random number generator is ok
   - Requests from same IP address = same correlation id?
      - Status: Disproven
      - This doesn't make sense for the different services
      - Different users would be hitting different services (hive vs hbase)
   - If it only occurs with authentication failures could be the system
   recycling the ID?
      - Status: Not explored yet
      - Active conversations require a unique ID, so it might be that auth
      fail 401 = conversation over and UUID recycling.
      - Not sure how to test this right now
   - CorrelationContext or CorrelationService or CorrelationHandler is
   getting confused?
      - The filter is adding the same id multiple times?
      - Asynchronous calls are being handled differently?

As of right now we don't have a smoking gun as to what is happening, but I
would guess there is something in Knox causing the duplicates to happen.

Any ideas on where to look next would be appreciated.



Kevin Risden

On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <[email protected]> wrote:

> From the Knox User list thread: "Multiple topology audit logging", it came
> to my attention that Knox seems to be logging duplicate correlation ids.
> Separating out the topic specifically here to dig a bit deeper.
>
> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the
> "correlation id" doesn't seem to be unique across requests. Is this to be
> expected? Here is a snippet (anonymized):
>
> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log
>>
>
>
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST
>> /hbase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups:
>> []
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&
>> row=ID2%2Fd%3Araw|unavailable|Request
>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest>
>>  method:
>> GET
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&
>> row=ID2%2Fd%3Araw|success|Response
>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse>
>>  status:
>> 200
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR
>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response
>> status: 200
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP
>> authentication failed.
>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST
>> /hbase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401
>
>
> The things to highlight here for the same correlation id:
> * different topologies are being used
> * different uris are being used
> * different users are being used
>
> Some of the things that we have configured that could impact results:
> * authentication caching
> * multiple Knox servers
> * load balancer in front of Knox
>
> I would be surprised if those things would make one Knox node (it could
> affect our other Knox nodes haven't looked yet) have duplicate correlation
> ids.
>
> From Larry Mccay:
>
>> Interesting...
>>
>> I have not seen this before at all.
>> I don't think there is anyway for those things to affect it either -
>> especially the HA related things.
>> Those entries are all a single request that will not go to different
>> hosts - unless there is are redirects involved and then it should become a
>> separate correlation id.
>>
>> We should dig into the creation of the id to see what could cause it.
>>
>
> From myself:
>
>> Yea that is what I was afraid of :) I looked a bit further and it isn't
>> every request that has duplicates. The ones I have found so far are from
>> LDAP authentication failures. There could very well be others since I've
>> only been looking at authentication failures.
>> I'll see if my team can dig some more information up. I'm hoping to get a
>> few more people involved with Apache Knox. It will probably take a few days
>> to dig further.
>
>
> Kevin Risden
>

Reply via email to