Over the last months I have been working on an improved low-latency
version of Kraken. This work has been carried out more or less in secret,
and has not been announced until now. The point of this version is to
determine how quickly a well funded attacker can break A5/1. The code is
reasonable well tested, but doesn't have any big performance gains except
if you are using SSD disks, or have a slower CPU.

What I have done is to write a new GPU A5/1 kernel that can both calculate
chains and relieve the CPU from the final search operation. This kernel is
slower than the bitsliced kernel when it comes to bulk processing A5/1
chains, but since each GPU thread only calculates 1 or 2 A5/1 instances,
the latency (time it takes to complete a single chain) is noticeably
lower. On my machine with 2 GPUs I can (re)crack a burst in under 5
seconds (with all look-ups cached in RAM). It is reasonable to assume that
with the same code given enough GPU / SSD storage, it should be possible
to perform the cracking operation in less than 2 seconds.

I have made the code available for evaluation now, but it isn't properly
integrated into the mainline (yet)

http://traxme.net/a5/a5_ilx/

Steps to build:
1) Untar and build
2) Place A5IlStubs.cpp & A5IlStubs.h in the Kraken build folder
3) Delete A5Cpu.so and A5Ati.so from the Kraken directory if present. Copy
A5Il.so this location instead.
4) Insert #include "A5IlStubs.h" prior to both includes of
"../a5_cpu/A5CpuStubs.h"
5) Add A5IlStubs.cpp to the build command (build.sh)
7) Rebuild Kraken & run

This procedure replaces the A5Cpu.so library with A5Il.so which contains
the new code. The new GPU code uses the "bit population count"
instruction, introduced with DX11 to quickly count the number of bits set
in the taps of the 3 LFSRs. This instruction is somtimes jokingly refered
to as the "canonical NSA instruction", since said agency had a habit of
procuring computers that had such a CPU instruction. Also there is a neat
assembly level optimization using ulerp4 (The only instruction that has 3
distinct sources) to shave of some cycles. The A5/1 clocking routine is as
follows:

; Clock r100  - output to r110
func 11
ushr r101.xyz,r100.xyz,l12.xyz
and  r102.xyz,r101.xyz,l14.xyz
u4lerp r102.w,r102.x,r102.y,r102.z
ixor r102.w,r102.w,l1.w
ixor r100.w,r102.w,r102.z
and  r101.xyz,r101.xyz,l1.xyz
ixor r102.x,r101.x,r102.w
ixor r102.y,r101.y,r102.w
ixor r102.z,r101.z,r102.w
and  r104,r100,l13
icbits r106,r104
and  r106.xyz,r106.xyz,r102.xyz
ishl r100.xyz,r100.xyz,r102.xyz
ior  r100.xyz,r100.xyz,r106.xyz
and  r106.w,r106.w,l1.w
ishl r110.x,r110.x,l1.x
ior  r110.x,r110.x,r106.w
ret

The library automatically switches between 1 and 2 instances pr thread,
depending on load (A debug message is left in place). A further
improvement may be to have a version with 4 instances pr thread, but that
is partly outside the scope of just making a low-latency version.

Summary:
A well funded adversary can break A5/1 in near realtime (1-2 seconds)
using only publicly available tables and programs. This will be most
useful for intercepting frequency hopping voice calls on large and crowded
cells.

cheers,
Frank
_______________________________________________
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51

Reply via email to