Over the last months I have been working on an improved low-latency version of Kraken. This work has been carried out more or less in secret, and has not been announced until now. The point of this version is to determine how quickly a well funded attacker can break A5/1. The code is reasonable well tested, but doesn't have any big performance gains except if you are using SSD disks, or have a slower CPU.
What I have done is to write a new GPU A5/1 kernel that can both calculate chains and relieve the CPU from the final search operation. This kernel is slower than the bitsliced kernel when it comes to bulk processing A5/1 chains, but since each GPU thread only calculates 1 or 2 A5/1 instances, the latency (time it takes to complete a single chain) is noticeably lower. On my machine with 2 GPUs I can (re)crack a burst in under 5 seconds (with all look-ups cached in RAM). It is reasonable to assume that with the same code given enough GPU / SSD storage, it should be possible to perform the cracking operation in less than 2 seconds. I have made the code available for evaluation now, but it isn't properly integrated into the mainline (yet) http://traxme.net/a5/a5_ilx/ Steps to build: 1) Untar and build 2) Place A5IlStubs.cpp & A5IlStubs.h in the Kraken build folder 3) Delete A5Cpu.so and A5Ati.so from the Kraken directory if present. Copy A5Il.so this location instead. 4) Insert #include "A5IlStubs.h" prior to both includes of "../a5_cpu/A5CpuStubs.h" 5) Add A5IlStubs.cpp to the build command (build.sh) 7) Rebuild Kraken & run This procedure replaces the A5Cpu.so library with A5Il.so which contains the new code. The new GPU code uses the "bit population count" instruction, introduced with DX11 to quickly count the number of bits set in the taps of the 3 LFSRs. This instruction is somtimes jokingly refered to as the "canonical NSA instruction", since said agency had a habit of procuring computers that had such a CPU instruction. Also there is a neat assembly level optimization using ulerp4 (The only instruction that has 3 distinct sources) to shave of some cycles. The A5/1 clocking routine is as follows: ; Clock r100 - output to r110 func 11 ushr r101.xyz,r100.xyz,l12.xyz and r102.xyz,r101.xyz,l14.xyz u4lerp r102.w,r102.x,r102.y,r102.z ixor r102.w,r102.w,l1.w ixor r100.w,r102.w,r102.z and r101.xyz,r101.xyz,l1.xyz ixor r102.x,r101.x,r102.w ixor r102.y,r101.y,r102.w ixor r102.z,r101.z,r102.w and r104,r100,l13 icbits r106,r104 and r106.xyz,r106.xyz,r102.xyz ishl r100.xyz,r100.xyz,r102.xyz ior r100.xyz,r100.xyz,r106.xyz and r106.w,r106.w,l1.w ishl r110.x,r110.x,l1.x ior r110.x,r110.x,r106.w ret The library automatically switches between 1 and 2 instances pr thread, depending on load (A debug message is left in place). A further improvement may be to have a version with 4 instances pr thread, but that is partly outside the scope of just making a low-latency version. Summary: A well funded adversary can break A5/1 in near realtime (1-2 seconds) using only publicly available tables and programs. This will be most useful for intercepting frequency hopping voice calls on large and crowded cells. cheers, Frank _______________________________________________ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51