Thanks to everyone for the replies.

I am using a CASPERized board that uses the transport layer in 
transport_katcp.py.  Using the same CASPER methodology described above 
(with the ss_r snapshot intermediate), I modified the code in several 
places (snap.py, casperfpga.py, transport_katcp.py), removing endian swaps 
and other non-optimal practices, and increased the depth of ss_r.  Those 
changes resulted in a data rate of 7.2 megabytes/sec.

I then attempted to split the ss_r block into two blocks, fill them both, 
then retrieve from both of them concurrently.  Since the Python GIL does 
not allow threads to spread across cores, I used multiprocessing instead. 
That was a catastrophe: all manner of runtime errors inside the katcp 
transport code.  (Presumably the katcp transport objects inherited by the 
forked processes no longer hold a usable connection to the board, or end 
up sharing a single socket between processes.)

I have since heard that the GIL is released during blocking I/O, so 
threading could actually help here even though the threads share cores.  I 
will try this, but I have my doubts.
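
The kind of thing I plan to try looks roughly like the sketch below.  The 
hostname, block names, and read size are placeholders, and opening one 
katcp connection per thread is my own guess at how to avoid two threads 
sharing a socket.

    # Rough sketch only: read two snapshot BRAMs concurrently using threads.
    # CPython releases the GIL while a thread blocks on the katcp socket, so
    # the two network reads can overlap even though Python bytecode does not
    # execute in parallel.  Hostname, block names, and sizes are placeholders.
    import threading
    import casperfpga

    HOST = 'zcu216'            # placeholder hostname
    READ_BYTES = 128 * 1024    # placeholder snapshot BRAM size in bytes

    results = {}

    def fetch(block_name):
        # One katcp connection per thread, so the threads never share a socket.
        fpga = casperfpga.CasperFpga(HOST)
        results[block_name] = fpga.read(block_name + '_bram', READ_BYTES)

    threads = [threading.Thread(target=fetch, args=(name,))
               for name in ('ss_r0', 'ss_r1')]
    for t in threads:
        t.start()
    for t in threads:
        t.join()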

On Monday, July 1, 2024 at 9:08:28 AM UTC-4 Matthew Schiller wrote:

> I second Jack's approach of exposing the DRAM memory.  You'll likely get 
> much faster access that way, but you'll need to write enough software to 
> access the memory and handle any flow control necessary.
>
>  
>
> There's Xilinx IP for this that could be cobbled together. Really, all 
> you have to do on the "FPGA" side is connect the DDR4 to the Zynq's 
> AXI-Full interface, and the Block Diagram tool should then allow software 
> to read/write the memory directly (Zynq addressing is fun to decipher, 
> though).  On the software side, there should be reference designs for 
> reading/writing memory out there.  Python, for example, can do this with 
> pydevmem, since the AXI memory space is probably already set up at the 
> kernel level to be exposed as /dev/mem.  Fundamentally I'd expect faster 
> performance reading/writing directly to the DDR4 than going through 
> another programmed-I/O indirect addressing scheme, which seems to be what 
> your "snapshot block" is.  Using the full memory, you can wait until you 
> have 1+ MByte of data to read and then burst it out over the much faster 
> AXI-Full interface in one very large chunk.
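>
> As a rough illustration of the /dev/mem route (the addresses here are 
> made up; the real base and span come from your Vivado address editor or 
> device tree):
>
>     # Rough sketch only: map a window of the FPGA-visible DDR4 through
>     # /dev/mem and pull it into ordinary Python bytes.  DDR_BASE and
>     # DDR_SPAN are placeholders, not real ZCU216 addresses.
>     import mmap
>     import os
>
>     DDR_BASE = 0x400000000        # placeholder physical base address
>     DDR_SPAN = 16 * 1024 * 1024   # map 16 MB at a time, for example
>
>     fd = os.open('/dev/mem', os.O_RDONLY | os.O_SYNC)
>     try:
>         mem = mmap.mmap(fd, DDR_SPAN, mmap.MAP_SHARED,
>                         mmap.PROT_READ, offset=DDR_BASE)
>         chunk = mem[:]            # one big copy out of the mapping
>         mem.close()
>     finally:
>         os.close(fd)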
>
>  
>
> Once in Python (or C++ if you prefer) you should be able to form ethernet 
> packets in whatever format you'd like and stream them over the network. 
> I'd expect to be able to get at least 100 Mbps (12.5 MBps) using this 
> approach, though your CPU utilization on the Zynq is going to spike to do 
> it that way, because you're still going to be using software to read from 
> the FPGA-connected DDR4 into the Zynq's ARM processor's memory space 
> before forming packets and sending them out again.  Fundamentally there 
> are a lot of memory copies going on, but that shouldn't matter at <1 Gbps 
> type speeds, unless you're doing a lot of other stuff on the Zynq's ARM 
> processor.
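>
> For example, the packet-forming side could start out as simple as the 
> sketch below (the destination and payload size are placeholders, and 
> sequencing / flow control are left out):
>
>     # Placeholder sketch of the streaming side: chop a buffer into UDP
>     # payloads and send them to the capture server.  No sequence numbers
>     # or flow control here -- add them for real use.
>     import socket
>
>     DEST = ('192.168.1.100', 60000)   # placeholder server address
>     PAYLOAD = 8192                    # bytes per datagram, fits mtu=9000
>
>     def stream(buf):
>         sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
>         for i in range(0, len(buf), PAYLOAD):
>             sock.sendto(buf[i:i + PAYLOAD], DEST)
>         sock.close()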
>
>  
>
> There are faster ways to do what you describe here, but they probably 
> aren't necessary at the rates you suggest. For future use:
>
>    - Data can be DMA'ed between the FPGA and the CPU memory
>       - This could be done using an FPGA-based Scatter-Gather Direct 
>       Memory Access (SGDMA) block or the DMA blocks built into the ARM 
>       processor
>       - Problem: you'll generally need to consider writing device 
>       drivers to make any of that happen, so software complexity is 
>       higher
>       - This will probably get you to >500 Mbit/sec, but I'd be 
>       surprised if you could get above about 2 Gbit/sec on a Zynq 
>       processing ethernet packets in software
>          - If memory serves, I've seen folks get to about 4 Gbit/sec 
>          with a DMA-based 10G ethernet port/driver connected to software
>       - You might be able to skip using FPGA memory at all, since you 
>       can SGDMA AXI-Stream data (think ADC or processed data) directly 
>       into CPU memory
>          - This is efficient, but means a lot of memory utilization in 
>          your application
>    - Ethernet packets could be generated in the FPGA and sent over an 
>    FPGA-connected Ethernet port
>       - Basically limited either by DDR4 bandwidth or Ethernet 
>       bandwidth (e.g. this approach could saturate a 100G ethernet on 
>       your QSFP28 port, no problem)
>  
>
> Matthew Schiller
>
> ngVLA Digital Backend Lead
>
> NRAO
>
> msch...@nrao.edu
>
> 315-316-2032
>
>
> *From:* cas...@lists.berkeley.edu <cas...@lists.berkeley.edu> *On Behalf 
> Of *Jack Hickish
> *Sent:* Monday, July 1, 2024 6:30 AM
> *To:* cas...@lists.berkeley.edu
> *Subject:* Re: [casper] How could this DRAM fetch be made faster? 
> {External}
>
>  
>
> Hi Ken,
>
>  
>
> That is surprisingly slow - I assume triggers are coming to the snapshot 
> regularly and this is not slowing things down(?). 
>
> I'd guess that the bigger you make the snapshot block, the higher the 
> throughput you would get. I would also guess that if you dispense with 
> the snapshot block control software and just manipulate the command 
> register yourself, followed by reading the embedded software bram, this 
> might be faster, depending on what messing around casperfpga is doing 
> with status registers, etc. Certainly not checking that the bram has been 
> written (nowait=True) will speed things up, but you need to be sure that 
> what you are reading is valid.
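>
> Something along the lines of the sketch below is what I mean; it assumes 
> the usual <name>_ctrl / <name>_bram register names that the snapshot 
> block generates, and it is untested:
>
>     # Untested sketch: drive the snapshot block by hand rather than via
>     # casperfpga.snap.Snap, skipping the status-register polling.  Register
>     # names assume the standard snapshot yellow block conventions.
>     import casperfpga
>
>     fpga = casperfpga.CasperFpga('zcu216')   # placeholder hostname
>
>     def grab(nbytes):
>         fpga.write_int('ss_r_ctrl', 0)   # put the capture logic in reset
>         fpga.write_int('ss_r_ctrl', 1)   # re-arm; capture starts on trigger
>         # No status check (the nowait=True idea): you must know the capture
>         # has finished before this read, or the data will be stale/partial.
>         return fpga.read('ss_r_bram', nbytes)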
>
>  
>
> All that said, is it difficult to directly expose an AXI interface to 
> the DRAM memory to the Zynq CPU, so that you can read it directly without 
> involving casperfpga? I would assume that there are many Xilinx reference 
> block diagrams which do something like this.
>
> Certainly doing something akin to "bulk read" would probably speed things 
> up by issuing one command and getting loads of data back, but if you can 
> just make the whole DRAM addressable from the Zynq that would probably be 
> easier / better.
>
>  
>
> Hope that waffle is somewhere between neutral and helpful,
>
>  
>
> Jack
>
>  
>
> On Thu, 27 Jun 2024 at 22:21, Ken Semanov <shapki...@gmail.com> wrote:
>
> We need to fetch the entire contents of the 4GB DDR4 C0 clamshell and 
> transmit the data to a nearby server as efficiently as possible.  The 
> method attempted uses an intermediate snapshot block of 16KB. The block 
> is repeatedly armed and fetched until the entire contents of the DDR4 is 
> retrieved. The server-side receive code uses Snap.read_raw() to avoid 
> post-processing to int, as shown in the pseudocode on the right in the 
> diagram below.
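>
> In outline, the loop looks something like the sketch below (simplified, 
> not the exact code; the hostname, .fpg file, and snapshot name are 
> placeholders):
>
>     # Simplified outline of the arm-and-fetch loop, not the exact code.
>     import casperfpga
>
>     SNAP_BYTES = 16 * 1024
>     DDR_BYTES = 4 * 1024 ** 3
>
>     fpga = casperfpga.CasperFpga('zcu216')     # placeholder hostname
>     fpga.get_system_information('design.fpg')  # placeholder .fpg file
>     snap = fpga.snapshots.ss_r_ss              # placeholder snapshot name
>
>     chunks = []
>     for _ in range(DDR_BYTES // SNAP_BYTES):
>         snap.arm()                       # re-arm for the next 16KB capture
>         chunks.append(snap.read_raw())   # raw bytes, no unpacking to ints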
>
>  
>
> This method performs at a paltry *608 kBps (kilobytes/sec).*  
>
>  
>
> Our use-case requires this to be a factor of 4 faster. Data rates of 
> 2400 kBps and faster are very likely supported by this network 
> configuration.  Ports g0/1 and g0/8 are configured for 100M full duplex. 
> Both the TX and RX ethernet connections use mtu=9000. In previous 
> (unrelated) tests performed here on the ZCU216, data rates of up to 1789 
> kBps were observed via the same ethernet port.
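>
> As a sanity check of the raw link, independent of casperfpga, a plain TCP 
> push from the board to a netcat listener on the server can be timed with 
> something like the sketch below (the address is a placeholder):
>
>     # Placeholder sanity check of raw TCP throughput from the Zynq to the
>     # server; start a listener there first (e.g. "nc -l 60001 > /dev/null").
>     import socket
>     import time
>
>     DEST = ('192.168.1.100', 60001)   # placeholder server address
>     TOTAL = 64 * 1024 * 1024          # push 64 MB of zeros
>     CHUNK = b'\x00' * 65536
>
>     sock = socket.create_connection(DEST)
>     start = time.time()
>     sent = 0
>     while sent < TOTAL:
>         sock.sendall(CHUNK)
>         sent += len(CHUNK)
>     sock.close()
>     elapsed = time.time() - start
>     print('%.1f kBps' % (TOTAL / 1024.0 / elapsed))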
>
>  
>
> Which portion of casperfpga is the likely culprit for the low data rate? 
>
>  
>
> The reported timing of 608 kBps includes any preparatory stages before 
> the bytes are sent through the network (e.g. the busy wait on 
> ['status']['register'] in class Snap, line 279 of 
> https://github.com/casper-astro/casperfpga/blob/master/src/snap.py). Is 
> there anything to optimize in transport_katcp.py?
>
>  
>
> Should the method of an intermediate snapshot be abandoned in favor of a 
> different approach altogether?  For example, something like a "DRAM bulk 
> read" per line 414 of 
> https://github.com/casper-astro/casperfpga/blob/master/src/casperfpga.py
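>
> If that path works against the DDR4-mapped device, I imagine using it 
> roughly as in the sketch below; the method name/signature and the 'dram' 
> device name are guesses to be checked against that file:
>
>     # Hypothetical use of the "bulk read" path mentioned above.  The
>     # bulkread() name/signature and the 'dram' device name are guesses and
>     # need to be checked against casperfpga.py / transport_katcp.py.
>     import casperfpga
>
>     fpga = casperfpga.CasperFpga('zcu216')   # placeholder hostname
>     CHUNK = 1024 * 1024                      # 1 MB per request, for example
>     DDR_BYTES = 4 * 1024 ** 3
>
>     with open('ddr4_dump.bin', 'wb') as fh:
>         for offset in range(0, DDR_BYTES, CHUNK):
>             fh.write(fpga.bulkread('dram', CHUNK, offset))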
>
>  
>
> Your thoughts? 
>
>  
>
> [image: fasterxthernetfetch.png]
>
