Hi Juan,
Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype
I suspect it can be quite hard for you to jump into the code at this stage
of the project, but here are some concise pointers:
The or-spark module contains the Spark-based implementation of our
Would you be able to send the code you are running? It would be great if you
could include some sample data.
Is that possible?
On Saturday, July 4, 2020 at 13:09:23 ART, Antonin Delpeuch (lists)
wrote:
Yes, you can do it using the RDD API of Spark Cassandra Connector:
https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/5_saving.md#deleting-rows-and-columns
Depending on whether you're deleting only specific columns or full rows,
it's recommended to look at the keyColumns
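For reference, here is a minimal sketch of what that looks like with the connector's RDD API as described in the linked page. It assumes the spark-cassandra-connector dependency, a Cassandra node reachable on 127.0.0.1, and a hypothetical test.users table keyed by id (keyspace, table, and column names are made up for illustration):

```scala
// Sketch only: requires spark-cassandra-connector on the classpath and a
// live Cassandra cluster; keyspace/table/column names are hypothetical.
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("delete-rows-demo")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()
val sc = spark.sparkContext

// An RDD of primary-key values identifying the rows to delete.
val keys = sc.parallelize(Seq(Tuple1(42), Tuple1(43)))

// Delete the full rows matching those keys...
keys.deleteFromCassandra("test", "users", keyColumns = SomeColumns("id"))

// ...or delete only one column from those rows, keeping the rest.
keys.deleteFromCassandra("test", "users",
  deleteColumns = SomeColumns("email"),
  keyColumns = SomeColumns("id"))

spark.stop()
```

Passing keyColumns restricts the delete to rows matched by those key values; deleteColumns narrows it further to specific columns instead of whole rows.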
Hi Stephen and Juan,
Thanks both for your replies - you are right, I used the wrong
terminology! The local mode is what fits our needs best (and what I have
been benchmarking so far).
That being said, the problems I mention are still applicable to this
context. There is still a serialization overhead
Hi Antonin.
It seems you are confusing standalone mode with local mode. They are two
different modes.
From the Spark in Action book: "In local mode, there is only one executor in the
same client JVM as the driver, but this executor can spawn several threads to
run tasks."
In local mode, Spark uses your
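To illustrate, a minimal sketch of starting Spark in local mode with several task threads (the master URL and app name here are arbitrary choices):

```scala
import org.apache.spark.sql.SparkSession

// "local[4]": the driver and the single executor share this JVM,
// and the executor runs tasks on 4 threads.
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("local-mode-demo")
  .getOrCreate()

val master = spark.sparkContext.master                  // "local[4]"
val parallelism = spark.sparkContext.defaultParallelism // 4
println(s"$master, $parallelism task threads")
spark.stop()
```

Using "local[*]" instead would spawn one thread per available core.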
Spark in local mode (which is different from standalone) is a solution for
many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql due to its much wider ETL-related capabilities. On the JVM
side it is an even more obvious choice - given there is no equivalent to
Hi,
I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.
Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.
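As a concrete sketch of those two properties (the data and partition count below are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("lazy-rdd-demo")
  .getOrCreate()

// parallelize splits the data across partitions (2 here, chosen arbitrarily).
val rows = spark.sparkContext.parallelize(1 to 10, numSlices = 2)

// map is a transformation: it is recorded lazily, nothing executes yet.
val doubled = rows.map(_ * 2)
val nParts = doubled.getNumPartitions // 2

// sum is an action: only now does the recorded pipeline actually run.
val total = doubled.sum() // 110.0
println(s"$nParts partitions, total = $total")
spark.stop()
```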
However, OpenRefine is a lightweight
Hi, I have to delete certain rows from Cassandra during my Spark batch
process. Is there any way to delete rows using the Spark Cassandra connector?
Thanks
Amit