Spark in local mode (which is different from standalone mode) is a solution for many use cases. I use it in conjunction with (and sometimes instead of) pandas/pandasql because of its much broader ETL capabilities. On the JVM side it is an even more obvious choice: there is no equivalent to pandas there, and Spark's performance is better anyway.
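As an illustration of how little ceremony local mode needs, here is a minimal Java sketch; it assumes spark-sql is on the classpath, and the file name and column names (`input.csv`, `id`, `grp`, `value`) are invented for the example:

```java
// Sketch only: requires the spark-sql dependency; the input file and
// schema (id, grp, value) are hypothetical.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LocalEtl {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("local-etl")
                .master("local[*]")  // local mode: one JVM, one worker thread per core
                .getOrCreate();

        Dataset<Row> df = spark.read().option("header", "true").csv("input.csv");
        df.createOrReplaceTempView("t");

        // Windowing/analytic functions are available in plain Spark SQL:
        spark.sql("SELECT id, value, "
                + "rank() OVER (PARTITION BY grp ORDER BY value DESC) AS rnk "
                + "FROM t").show();

        spark.stop();
    }
}
```

The couple of seconds mentioned below is roughly the cost of the `getOrCreate()` call; everything after it runs in-process.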
It is also a strong candidate because of the expressiveness of its SQL dialect, including support for analytic/windowing functions. There is a latency hit (on the order of a couple of seconds to start the SparkContext), but pandas is not a high-performance tool in any case. I see that OpenRefine is implemented in Java, so Spark in local mode should be a very good complement to it.

On Sat, 4 Jul 2020 at 08:17, Antonin Delpeuch (lists) <li...@antonin.delpeuch.eu> wrote:

> Hi,
>
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
>
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
>
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read in a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300), so
> it seems dangerous to base a production system on standalone Spark.
>
> So, we cannot use Spark as the default runner in the tool. Do you know of any
> alternative which would be designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
>
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing so does not fit our bill for the same reasons.
>
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
>
> Cheers,
> Antonin
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
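Regarding the homegrown option mentioned above (Java streams with partitioning), a minimal sketch could look like the following; the class name, chunking scheme, and use of the common fork-join pool via `parallelStream()` are all illustrative, not a proposal for OpenRefine's actual design:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative sketch of "Java streams with partitioning": split a list into
// roughly equal contiguous chunks, then process the chunks in parallel within
// one JVM, loosely mimicking an RDD's mapPartitions. All names hypothetical.
public class PartitionedStreams {

    /** Split data into at most numPartitions contiguous chunks. */
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        int size = (data.size() + numPartitions - 1) / numPartitions; // ceiling
        return IntStream.range(0, numPartitions)
                .mapToObj(i -> {
                    int from = Math.min(i * size, data.size());
                    int to = Math.min(from + size, data.size());
                    return data.subList(from, to);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 10).boxed()
                .collect(Collectors.toList());
        // One task per partition, executed on the common fork-join pool:
        int total = partition(data, 4).parallelStream()
                .mapToInt(chunk -> chunk.stream().mapToInt(Integer::intValue).sum())
                .sum();
        System.out.println(total); // 55
    }
}
```

Unlike an RDD this has no lazy evaluation or spill-to-disk, which is exactly the gap a purpose-built local runner would need to fill.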