I asked this question on StackOverflow:
<https://stackoverflow.com/questions/58759138/apache-ignite-analogue-of-spark-vector-udf-and-distributed-compute-in-general/58766331#58766331>

However, I probably put too much weight on Spark there.

My question really is: how can I load a large CSV file into the cache and
send compute actions to the nodes that work in a similar way to a Pandas
UDF? That is, each action works on a subset of the data (rows).

In Ignite I imagine I could load the CSV into a cache in PARTITIONED mode
and then, using affinity compute, send functions to the nodes where the
data is, so each node processes only the data that exists on it. This seems
like a nice way to go: each node is always processing locally, and the
results of those actions would be added back to the cache, so presumably
they would only be added locally as well.
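To make my mental model concrete, here is an Ignite-free sketch of how I
understand the partitioning: each key hashes to a partition, each partition
has one primary node, so all entries of a partition are colocated. The
hash-modulo scheme, partition count, and node list below are my own
assumptions purely for illustration; Ignite's real RendezvousAffinityFunction
is more sophisticated.

```java
import java.util.*;

// Hypothetical stand-in for an affinity function: key -> partition -> node.
// This simple hash-modulo scheme only illustrates why all entries of one
// partition sit on one primary node; it is NOT Ignite's actual algorithm.
class AffinitySketch {
    static final int PARTITIONS = 8;
    static final String[] NODES = {"node0", "node1", "node2"};

    static int partition(int key) {
        return Math.abs(Integer.hashCode(key)) % PARTITIONS;
    }

    static String primaryNode(int partition) {
        // Round-robin partition-to-node mapping, again just for the sketch.
        return NODES[partition % NODES.length];
    }

    public static void main(String[] args) {
        // Group keys by the node that would hold them.
        Map<String, List<Integer>> byNode = new TreeMap<>();
        for (int key = 0; key < 20; key++) {
            byNode.computeIfAbsent(primaryNode(partition(key)), n -> new ArrayList<>())
                  .add(key);
        }
        byNode.forEach((node, keys) -> System.out.println(node + " holds " + keys));
    }
}
```

The point being: affinity compute routes a job to the node that
`primaryNode(partition(key))` resolves to, so the job and the data meet.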

However, I am not entirely sure how the partitioning works. The affinity
examples all show compute against a single key value.

Is there a way to load a CSV into a cache in PARTITIONED mode, so that
Ignite distributes it evenly across the grid, but then run a compute job on
every node that works ONLY with the data in that node's own cache, so that
I won't need to care about keys?

For example, imagine a CSV file that is a matrix of numbers. My distributed
cache would really be a dataframe representation of that file. For
argument's sake, let's say my cache is keyed by an incrementing ID, with
each value being an array of doubles, and the column names are A, B and C.

That ID key is really pretty irrelevant; it is meaningless to my
application.

Now let's say I wanted to perform the same maths on every row in that
dataframe, with the results becoming a new column in the cache.

If that formula was D = A * B * C, then D becomes a new column.
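To show exactly what I mean by "every node processes only its own rows",
here is a plain-Java simulation of the broadcast-style job I am hoping for.
No Ignite is involved: the "nodes" are just maps holding disjoint slices of
the CSV, and the job body is what I would hope to run on each node against
its node-local entries.

```java
import java.util.*;

// Simulation of the pattern I am after: each "node" holds its own slice of
// rows (key -> [A, B, C]) and a broadcast job computes D = A * B * C using
// only node-local data, producing the widened row [A, B, C, D] locally.
class LocalComputeSketch {
    static Map<Integer, double[]> runLocalJob(Map<Integer, double[]> localRows) {
        Map<Integer, double[]> out = new HashMap<>();
        for (Map.Entry<Integer, double[]> e : localRows.entrySet()) {
            double[] r = e.getValue();           // [A, B, C]
            double d = r[0] * r[1] * r[2];       // D = A * B * C
            out.put(e.getKey(), new double[] {r[0], r[1], r[2], d});
        }
        return out;
    }

    public static void main(String[] args) {
        // Two "nodes", each with a disjoint slice of the CSV; the keys
        // matter only for storage, not for the computation itself.
        Map<Integer, double[]> node0 = Map.of(1, new double[] {2, 3, 4});
        Map<Integer, double[]> node1 = Map.of(2, new double[] {1, 5, 6});

        // "Broadcast": the same job runs on every node, touching only local rows.
        Map<Integer, double[]> r0 = runLocalJob(node0);
        Map<Integer, double[]> r1 = runLocalJob(node1);

        System.out.println("node0 D for key 1 = " + r0.get(1)[3]); // 24.0
        System.out.println("node1 D for key 2 = " + r1.get(2)[3]); // 30.0
    }
}
```

My question is essentially whether Ignite can run `runLocalJob` on every
node over that node's locally-held entries, without me enumerating keys.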

Ignoring Spark SQL, in Spark I could easily write a UDF that creates column
D from columns [A,B,C]. Spark doesn't care about keys or ID columns in this
instance; it just gives you a vector of data and you return a vector of
results.
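Restated as a plain function signature (Java here purely for illustration;
this is not any Spark API), the behaviour I want is: whole columns in, a
whole column out, with no keys or IDs anywhere in sight.

```java
// The "vector UDF" shape: given whole columns A, B and C, return column D.
// Keys never appear in the signature, which is exactly the point.
class VectorUdfSketch {
    static double[] computeD(double[] a, double[] b, double[] c) {
        double[] d = new double[a.length];
        for (int i = 0; i < a.length; i++)
            d[i] = a[i] * b[i] * c[i];
        return d;
    }

    public static void main(String[] args) {
        double[] d = computeD(new double[] {1, 2}, new double[] {3, 4}, new double[] {5, 6});
        System.out.println(java.util.Arrays.toString(d)); // [15.0, 48.0]
    }
}
```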

So in Ignite, how can I replicate that behaviour most elegantly in code
(.NET): send compute to the grid that collectively processes all rows,
without caring about the keys?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
