I've never heard of or done that kind of processing on a large dataset purely in Python, so I don't know what the practical cap is on what Python can handle given the memory available. Now, if I understand what you are trying to do, you can achieve it by leveraging Apache Spark and invoking "pyspark", where you can keep data in memory and/or on disk. Also, if you are working with Hadoop, you can use Spark to move/transfer data back and forth.
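Something along these lines, just as a rough sketch (the file path and column name below are placeholders, not anything specific to your data), lets Spark partition the dataset and spill to disk when it no longer fits in memory:

    # Minimal PySpark sketch -- the CSV path and "value" column are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("big-aggregation").getOrCreate()

    # Spark reads the file in partitions rather than loading it all at once.
    df = spark.read.csv("hdfs:///data/measurements.csv",
                        header=True, inferSchema=True)

    # Keep partitions in memory, spilling to disk when memory runs out.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # Aggregations run distributed across the partitions.
    stats = df.agg(
        F.min("value").alias("min"),
        F.max("value").alias("max"),
        F.avg("value").alias("mean"),
        F.count("value").alias("n"),
    ).collect()[0]
    print(stats)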
Thank You,

Irving Duran

On Tue, Jan 2, 2018 at 12:06 PM, <ja...@apkudo.com> wrote:
> I'm not sure if I'll be laughed at, but a statistical sampling of a
> randomized sample should resemble the whole.
>
> If you need the min/max, then take min( min(each node) ).
> If you need the average, then you need sum( sum(each node) ) / sum( count(each node) )*
>
> *You'll likely need to use log here, as you'll probably overflow.
>
> It doesn't really matter what numpy can handle; you just need to collate the
> data properly and defer the actual calculation until the per-node
> calculations are complete.
>
> Also, numpy should store values more densely than Python itself.
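For what it's worth, a minimal pure-Python sketch of the combine-the-partials idea described above (the per-node tuples here are made up purely for illustration; in practice each node would compute its own from its shard):

    # Each node reports (count, total, minimum, maximum) for its slice of data.
    node_partials = [
        (1_000_000, 4_987_654.0, 0.01, 9.99),
        (2_000_000, 10_123_456.0, 0.02, 9.97),
        (1_500_000, 7_501_200.0, 0.00, 10.00),
    ]

    # Collate the partials; the actual calculation is deferred until all
    # per-node results are in.
    global_count = sum(c for c, _, _, _ in node_partials)
    global_total = sum(t for _, t, _, _ in node_partials)
    global_min = min(lo for _, _, lo, _ in node_partials)
    global_max = max(hi for _, _, _, hi in node_partials)
    global_mean = global_total / global_count

    print(global_count, global_min, global_max, global_mean)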