Hi Corey, I'm not familiar with Arrow or Plasma. However, I recently read an article about running Spark on a standalone machine (your case). It sounds like you could benefit from PySpark "as-is":
https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

Regards,

2018-05-23 22:30 GMT+02:00 Corey Nolet <cjno...@gmail.com>:
> Please forgive me if this question has been asked already.
>
> I'm working in Python with Arrow+Plasma+Pandas DataFrames. I'm curious if
> anyone knows of any efforts to implement the PySpark API on top of Apache
> Arrow directly. In my case, I'm doing data science on a machine with 288
> cores and 1TB of RAM.
>
> It would make life much easier if I were able to use the flexibility of the
> PySpark API (rather than being tied to the operations in Pandas). It
> seems like an implementation would be fairly straightforward using the
> Plasma server and object_ids.
>
> If you have not heard of an effort underway to accomplish this, are there
> any reasons why it would be a bad idea?
>
> Thanks!