On Fri, Mar 11, 2016 at 5:09 PM, Kim Liu <k...@edgewaternetworks.com> wrote:

> Just for sake of clarification, then, what is the use-case for having UDFs
> in an UPDATE?
>

Honestly, it's merely there for convenience when you use something like cqlsh,
for instance.
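
As a contrived illustration of that convenience, borrowing the bitwise_or
function and the test_table schema that appear further down in this thread,
something like:
  UPDATE test_table SET data = bitwise_or(1, 2) WHERE idx = 'abc';
works, because every argument is a literal supplied by the client; the server
merely computes the value for you, which can be handy from cqlsh even though
the client could have computed it itself.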


>
> If they cannot read data from the data store, then all of the parameters
> to the UDF must be supplied by the client, correct?
>

Correct (for UPDATE at least).


>
> If the client has all the parameters, the client could perform the
> equivalent of the UDF on the client side, first, then send the results to
> the server, instead of pushing the computation work onto the server.
>

Absolutely.


>  So I am curious as to what one is supposed to use a UDF in an UPDATE for.
>

Again, mainly a convenience.

The main end goal for UDFs is use in SELECT. They're already potentially
slightly useful for saving server->client bandwidth, as you could imagine doing:
  SELECT compute_md5(image) FROM images WHERE ...;
(assuming of course that trading server-side CPU for bandwidth is a good
idea)
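
For completeness, here is a rough sketch of what the CREATE FUNCTION behind
that call could look like. The name, the blob/text types and the Java body are
just assumptions for the example, and whether java.security.MessageDigest is
usable from UDF code depends on your version and UDF settings:
  CREATE FUNCTION compute_md5 (image blob)
    CALLED ON NULL INPUT RETURNS text
    LANGUAGE java AS '
      // blob arguments are passed to Java UDFs as java.nio.ByteBuffer
      if (image == null) return null;
      try {
        byte[] bytes = new byte[image.remaining()];
        image.duplicate().get(bytes);
        byte[] digest = java.security.MessageDigest.getInstance("MD5").digest(bytes);
        // hex-encode the 16-byte digest
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
      } catch (java.security.NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
      }';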

Though their most useful application will probably be in the WHERE clause, for
things like:
  SELECT * FROM foo WHERE sqrt(x) = 3;
and we also plan to have functional indexes to go with that.
However, I'll note right away that those last use cases are not yet supported,
but they will be eventually. Adding UDFs was more of a first incremental step,
and their most interesting use cases are arguably not yet supported. But as
far as UPDATE is concerned, we'll probably never support them, as I said,
since that would require a read-before-write (except possibly for LWT, which
does that read-before-write anyway, at great cost; we'll see).
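
To make that concrete, here is a rough sketch of the client-side
read-before-write you would do instead, using the test_table example from the
mail quoted below and an LWT condition to guard against concurrent writers
(the <old_value>/<new_value> placeholders and the retry loop are up to the
client):
  SELECT data FROM test_table WHERE idx = 'abc';
  -- the client computes new_value = max(old_value, 5), then:
  UPDATE test_table SET data = <new_value> WHERE idx = 'abc' IF data = <old_value>;
  -- if the result shows [applied] = false, re-read and retry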


>
>
>
> Long-winded explanation of the use-case I was poking at with UPDATE UDFs is
> below, for the morbidly curious.
>
>
>
>
> That morbidly curious, huh?
>
> The scenario is, roughly, that the application receives a set of data
> which is broken up over, say, four messages (A,B,C,D).  However, the
> messages can arrive in any order, possibly with duplicates, and the data
> set is not complete until all four messages are received.  There are
> multiple message receivers in order to scale to the volume of messages
> coming in, so each of the four messages per data set could arrive at any
> receiver (in any chronological pattern), and each receiving station would
> then insert the partial data into Cassandra.
>
> I looked at the Cassandra SET implementation, thinking that I could just
> add ‘A’, ‘B’, ‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index.  Then
> periodically search for where the set had all elements to spot which rows
> had a complete data set ready for processing.  However, there does not
> appear to be an equality check for SETs.  (Adding elements to a set is
> another place where UPDATE appears to allow for the “x = x <operator>
> <data>” pattern which added to my confusion about using a UDF in the
> UPDATE.)
>
> So instead of using sets, the idea was to have a UDF perform a bit-wise OR
> operation.  Roughly:
>   CREATE FUNCTION bitwise_or( a int, b int ) CALLED ON NULL INPUT RETURNS
> int LANGUAGE java AS 'return Integer.valueOf((a == null ? 0 : a)|(b == null
> ? 0 : b));';
>
> Then as each message segment came in, I had intended, roughly:
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2),
> data2=… ;
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1),
> data1=… ;
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8),
> data4=… ;
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4),
> data3=… ;
>
> Then, with a secondary index on ‘messageComplete’, periodically scrape out
> all rows where messageComplete was equal to 15.  (At most, sixteen unique
> values in the secondary index.)  (And use a TTL to expire messages that did
> not eventually complete, etc.  Boilerplate infrastructure, etc.)
>
> This was based upon my incorrect assumption about UPDATE UDFs, since this
> looked like an optimal way to avoid having all the clients perform
> read-update patterns and worrying about the clients stepping on each
> other's data, as well as handling cases where duplicate messages were
> received by different receivers.  So it’s starting to look like I might
> need to use something else to perform the correlation between messages.
>
> —Kim
>
> From: Sylvain Lebresne <sylv...@datastax.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, March 11, 2016 at 00:35
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Using User Defined Functions in UPDATE queries
>
> UDFs are usable in UPDATE statements, as actually trying them shows; it's
> just the documented grammar that needs fixing.
>
> But as far as doing something like:
>   UPDATE test_table SET data=max_int(data,5) WHERE idx='abc';
> this is indeed *not* supported and likely never will. One big pillar of C*
> design is that normal writes like this don't do a read-before-write, both
> for performance and because of consistency constraints, so we can't have
> update depend on the previous value in any way.
> I'll note that maybe that makes UDFs useless for you and if so, I'm sorry,
> but you just can't use UDFs in C* for that and you'd have to do a manual
> read-before-write client side to achieve this.
>
> For the sake of avoiding confusion, I will note that we do allow:
>   UPDATE test_table SET c = c + 1 WHERE idx='abc';
> if c is a counter, but that's a very special case. Counters have a
> completely separate path and implementation and do have a read-before-write
> (and are slower than normal updates as a result).
>
>
>
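
For reference, the counter special case mentioned above looks roughly like this
(the table is purely illustrative; counter columns must live in dedicated
tables and can only be incremented or decremented, never set directly):
  CREATE TABLE page_views (idx text PRIMARY KEY, c counter);
  UPDATE page_views SET c = c + 1 WHERE idx = 'abc';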
