Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi Juan, Of course! My prototype is here: https://github.com/OpenRefine/OpenRefine/tree/spark-prototype I suspect it may be quite hard for you to jump into the code at this stage of the project, but here are some concise pointers: The or-spark module contains the Spark-based implementation of our

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Juan Martín Guillén
Would you be able to send the code you are running? It would be great if you could include some sample data. Is that possible? On Saturday, July 4, 2020, 13:09:23 ART, Antonin Delpeuch (lists) wrote: Hi Stephen and Juan, Thanks both for your replies - you are right, I used the

Re: Cassandra raw deletion

2020-07-04 Thread Alex Ott
Yes, you can do it using the RDD API of the Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/5_saving.md#deleting-rows-and-columns Depending on whether you're deleting only specific columns or full rows, it's recommended to look at the keyColumns

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi Stephen and Juan, Thanks both for your replies - you are right, I used the wrong terminology! The local mode is what fits our needs best (and what I have been benchmarking so far). That being said, the problems I mention are still applicable to this context. There is still a serialization overhead

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Juan Martín Guillén
Hi Antonin. It seems you are confusing Standalone with Local mode. They are 2 different modes. From the Spark in Action book: "In local mode, there is only one executor in the same client JVM as the driver, but this executor can spawn several threads to run tasks. In local mode, Spark uses your

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Stephen Boesch
Spark in local mode (which is different from standalone) is a solution for many use cases. I use it in conjunction with (and sometimes instead of) pandas/pandasql due to its much wider ETL-related capabilities. On the JVM side it is an even more obvious choice - given there is no equivalent to

RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi, I am working on revamping the architecture of OpenRefine, an ETL tool, to execute workflows on datasets which do not fit in RAM. Spark's RDD API is a great fit for the tool's operations, and provides everything we need: partitioning and lazy evaluation. However, OpenRefine is a lightweight
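
[Illustration, not part of the original message] The two properties named above, partitioning and lazy evaluation, can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions: it does not use Spark, it is not OpenRefine's code, and the class name `LazyPartitionedDataset` and its methods are hypothetical names chosen for this sketch.

```python
from itertools import islice


class LazyPartitionedDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark): a list of
    partitions plus a chain of transformations that run only on demand."""

    def __init__(self, partitions, transforms=()):
        self._partitions = partitions
        self._transforms = tuple(transforms)

    @classmethod
    def from_rows(cls, rows, num_partitions):
        rows = list(rows)
        # Ceiling division, so at most num_partitions chunks are created.
        size = max(1, -(-len(rows) // num_partitions))
        return cls([rows[i:i + size] for i in range(0, len(rows), size)])

    @property
    def num_partitions(self):
        return len(self._partitions)

    def map(self, fn):
        # Record the transformation; nothing is computed here.
        step = lambda rows: (fn(row) for row in rows)
        return LazyPartitionedDataset(self._partitions, self._transforms + (step,))

    def filter(self, predicate):
        step = lambda rows: (row for row in rows if predicate(row))
        return LazyPartitionedDataset(self._partitions, self._transforms + (step,))

    def _compute(self, partition):
        # Chain the recorded generators over a single partition.
        rows = iter(partition)
        for step in self._transforms:
            rows = step(rows)
        return rows

    def _iter_rows(self):
        for partition in self._partitions:
            yield from self._compute(partition)

    def collect(self):
        # Action: evaluates every partition.
        return list(self._iter_rows())

    def take(self, n):
        # Action: evaluates only as many rows as needed to produce n results.
        return list(islice(self._iter_rows(), n))


evaluated = []  # records which inputs were actually processed


def square(x):
    evaluated.append(x)
    return x * x


dataset = (
    LazyPartitionedDataset.from_rows(range(10), num_partitions=3)
    .map(square)
    .filter(lambda value: value % 2 == 0)
)
print(dataset.num_partitions, evaluated)  # no row has been processed yet
print(dataset.take(2), evaluated)         # only the rows needed were processed
```

In this sketch `map` and `filter` only record work, while `take` and `collect` trigger it one partition at a time, so a full dataset never has to be materialised at once. A real engine such as Spark adds scheduling, spilling to disk and serialization on top of this idea.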

Cassandra raw deletion

2020-07-04 Thread Amit Sharma
Hi, I have to delete certain rows from Cassandra during my Spark batch process. Is there any way to delete rows using the Spark Cassandra Connector? Thanks, Amit