karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > there was some energy around making a network-based inference engine,
> > maybe by modifying deepseek.cpp (don't quite recall why not staying in
> > python, some concern arose)
> > task got weak, found cinatra as a benchmark leader for c++ web engines
> > (although pico.v was the top! (surprised all c++ http engines were beaten
> > by java O_O very curious about this, wondering if it's a high-end test
> > system) never heard of V language but is interesting it won a leaderboard)
> > inhibition ended up discovering a concern somewhat like ... on this 4GB ram
> > system it might take 15-33GB of network transfer for each forward pass of
> > the model ... [multi-token passes ^^]
> > karl3@writeme.com wrote:
> > the concern resonates with difficulty making the implementation, and some
> > form of inhibition or concern around using python. notably, i've made
> > offloading python hooks a lot and they never last due to the underlying
> > interfaces changing (although those interfaces have stabilized much more
> > now that hf made accelerate their official implementation) (also i think
> > the issue is more severe dissociative associations than the interface, if
> > one considers the possibility of personal maintenance and use rather than
> > usability for others). don't immediately recall the python concern
> > seem to be off task or taking a break, but it would make sense to do disk
> > caching too. there is also the option of quantizing. basically, LLMs and AI in
> > general place r&d effort between the user and ease, smallness, cheapness,
> > power, etc
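For context, a rough back-of-envelope for how a figure like 15-33GB per pass could arise. The ~37B active-parameter count is my assumption (a DeepSeek-V3-style MoE streams only the active expert weights per token); the email doesn't state which model or quantization width:

```python
# Hedged back-of-envelope sketch, not a measurement: if the weights live
# on the network, each forward pass streams roughly the active weights.
ACTIVE_PARAMS = 37e9  # assumption: ~37B parameters active per token

def transfer_per_pass_gb(bytes_per_param: float) -> float:
    """GB of weight data pulled over the network per forward pass."""
    return ACTIVE_PARAMS * bytes_per_param / 1e9

print(transfer_per_pass_gb(1.0))  # 8-bit weights -> 37.0 GB
print(transfer_per_pass_gb(0.5))  # 4-bit weights -> 18.5 GB
```

which lands in the same ballpark as the 15-33GB concern above.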
I poked at python again. The existing implementations of the 8-bit quantization
used by the model all require an NVIDIA GPU, which I do not presently have. It
is fun to imagine making it work; maybe I can upcast the weights to float32 or something
>)
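The upcast idea could look something like this on CPU — a minimal sketch assuming a simple blockwise int8 scheme with per-block scales (the model's actual 8-bit format may differ, e.g. fp8 or LLM.int8(); names here are illustrative):

```python
import numpy as np

def dequantize_int8(qweight: np.ndarray, scale: np.ndarray,
                    block: int = 64) -> np.ndarray:
    """Upcast blockwise-quantized int8 weights to float32 on CPU.

    qweight: int8 array whose size is a multiple of `block`
    scale:   one float32 scale per block of `block` values
    """
    w = qweight.astype(np.float32).reshape(-1, block)
    return (w * scale.reshape(-1, 1)).reshape(qweight.shape)

# e.g. 128 int8 ones with per-block scales 0.5 and 2.0:
q = np.ones(128, dtype=np.int8)
s = np.array([0.5, 2.0], dtype=np.float32)
w = dequantize_int8(q, s)  # first 64 values are 0.5, last 64 are 2.0
```

Once the weights are plain float32, the forward pass runs on any CPU backend, at the cost of 4x the memory of the int8 form.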