Re: Using User Defined Functions in UPDATE queries
Thank you for the clarification. —Kim From: Sylvain Lebresne <sylv...@datastax.com<mailto:sylv...@datastax.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Date: Friday, March 11, 2016 at 10:05 To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Subject: Re: Using User Defined Functions in UPDATE queries On Fri, Mar 11, 2016 at 5:09 PM, Kim Liu <k...@edgewaternetworks.com<mailto:k...@edgewaternetworks.com>> wrote: Just for sake of clarification, then, what is the use-case for having UDFs in an UPDATE? Honestly, it's merely there for convenience when you use things like cqlsh for instance.
Re: Using User Defined Functions in UPDATE queries
On Fri, Mar 11, 2016 at 5:09 PM, Kim Liu <k...@edgewaternetworks.com> wrote: > Just for sake of clarification, then, what is the use-case for having UDFs > in an UPDATE? > Honestly, it's merely there for convenience when you use things like cqlsh for instance. > > If they cannot read data from the data store, then all of the parameters > to the UDF must be supplied by the client, correct? > Correct (for UPDATE at least). > > If the client has all the parameters, the client could perform the > equivalent of the UDF on the client side, first, then send the results to > the server, instead of pushing the computation work onto the server. > Absolutely. > So I am curious as to what one is supposed to use a UDF in an UPDATE for. > Again, mainly a convenience. The main end goal for UDF is for use in SELECT. It's already potentially slightly useful to save server->client bandwidth as you could imagine to do: SELECT compute_md5(image) FROM images WHERE ...; (assuming of course that trading cpu server side for bandwidth is a good idea) Though their most useful use will probably be in WHERE clause, for things like: SELECT * FROM foo WHERE sqrt(x) = 3; and we also plan to have functional indexes to go with that. However, I'll not right away that those last use case are not yet supported, but they will be eventually. Adding UDF was more of the first incremental step, but their most interesting use case is arguably not yet supported. But as far as UPDATE is concerned, we'll probably never support them as I said since that would require a read-before-write (except possibly for LWT which do that read-before-write (at great cost) anyway, we'll see). > > > > Long-winded explanation of the use-case I was poking at using UPDATE UDFs > for below for the morbidly curious. > > > > > That morbidly curious, huh? > > The scenario is, roughly, that the application receives a set of data > which is broken up over, say, four messages (A,B,C,D). However, the > messages can arrive in any order, possibly with duplicates, and the data > set is not complete until the all four messages are received. There are > multiple message receivers in order to scale to the volume of messages > coming in, so each of the four messages per data set could arrive at any > receiver (in any chronological pattern), and each receiving station would > then insert the partial data into Cassandra. > > I looked at the Cassandra SET implementation, thinking that I could just > add ‘A’, ‘B’, ‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index. Then > periodically search for where the set had all elements to spot which rows > had a complete data set ready for processing. However, there does not > appear to be an equality check for SETs. (Adding elements to a set is > another place where UPDATE appears to allow for the “x = x > ” pattern which added to my confusion about using a UDF in the > UPDATE.) > > So instead of using sets, the idea was to have a UDF perform a bit-wise OR > operation. Roughly: > CREATE FUNCTION bitwise_or( a int, b int ) CALLED ON NULL INPUT RETURNS > int LANGUAGE java AS 'return Integer.valueOf((a == null ? 0 : a)|(b == null > ? 0 : b));'; > > Then as each message segment came in, I had intended, roughly: > UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2), > data2=… ; > UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1), > data1=… ; > UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8), > data4=… ; > UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4), > data3=… ; > > Then, with a secondary index on ‘messageComplete’, periodically scrape out > all rows where messageComplete was equal to 15. (At most, sixteen unique > values in the secondary index.) (And use a TTL to expire messages that did > not eventually complete, etc. Boilerplate infrastructure, etc.) > > This was based upon my incorrect assumption about UPDATE UDFs, since this > looked like an optimal way to avoid having all the clients perform > read-updates patterns and worrying about the clients stepping on each > others data, as well as handling cases where duplicate messages were > received by different receivers. So it’s starting to look like I might > need to use something else to perform the correlation between messages. > > —Kim > > From: Sylvain Lebresne <sylv...@datastax.com> > Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org> > Date: Friday, March 11, 2016 at 00:35 > To: "user@cassandra.apache.org" <user@cassandra.apache.org> > Subject: Re: Using User Defined Functions in UPDATE queries > > UDF are usable in UPDATE statement as actually trying them show
Re: Using User Defined Functions in UPDATE queries
Just for sake of clarification, then, what is the use-case for having UDFs in an UPDATE? If they cannot read data from the data store, then all of the parameters to the UDF must be supplied by the client, correct? If the client has all the parameters, the client could perform the equivalent of the UDF on the client side, first, then send the results to the server, instead of pushing the computation work onto the server. So I am curious as to what one is supposed to use a UDF in an UPDATE for. Long-winded explanation of the use-case I was poking at using UPDATE UDFs for below for the morbidly curious. That morbidly curious, huh? The scenario is, roughly, that the application receives a set of data which is broken up over, say, four messages (A,B,C,D). However, the messages can arrive in any order, possibly with duplicates, and the data set is not complete until the all four messages are received. There are multiple message receivers in order to scale to the volume of messages coming in, so each of the four messages per data set could arrive at any receiver (in any chronological pattern), and each receiving station would then insert the partial data into Cassandra. I looked at the Cassandra SET implementation, thinking that I could just add ‘A’, ‘B’, ‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index. Then periodically search for where the set had all elements to spot which rows had a complete data set ready for processing. However, there does not appear to be an equality check for SETs. (Adding elements to a set is another place where UPDATE appears to allow for the “x = x ” pattern which added to my confusion about using a UDF in the UPDATE.) So instead of using sets, the idea was to have a UDF perform a bit-wise OR operation. Roughly: CREATE FUNCTION bitwise_or( a int, b int ) CALLED ON NULL INPUT RETURNS int LANGUAGE java AS 'return Integer.valueOf((a == null ? 0 : a)|(b == null ? 0 : b));'; Then as each message segment came in, I had intended, roughly: UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2), data2=… ; UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1), data1=… ; UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8), data4=… ; UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4), data3=… ; Then, with a secondary index on ‘messageComplete’, periodically scrape out all rows where messageComplete was equal to 15. (At most, sixteen unique values in the secondary index.) (And use a TTL to expire messages that did not eventually complete, etc. Boilerplate infrastructure, etc.) This was based upon my incorrect assumption about UPDATE UDFs, since this looked like an optimal way to avoid having all the clients perform read-updates patterns and worrying about the clients stepping on each others data, as well as handling cases where duplicate messages were received by different receivers. So it’s starting to look like I might need to use something else to perform the correlation between messages. —Kim From: Sylvain Lebresne <sylv...@datastax.com<mailto:sylv...@datastax.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Date: Friday, March 11, 2016 at 00:35 To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Subject: Re: Using User Defined Functions in UPDATE queries UDF are usable in UPDATE statement as actually trying them shows, it's just the documented grammar that needs fixing. But as far as doing something like: UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’; this is indeed *not* supported and likely never will. One big pillar of C* design is that normal writes like this don't do a read-before-write, both for performance and because of consistency constraints, so we can't have update depend on the previous value in any way. I'll note that maybe that make UDF useless for you and if so, I'm sorry, but you just can't use UDF in C* for that and you'd have to do a manual read-before-write client side to achieve this. For the sake of avoiding confusion, I will not that we do allow: UPDATE test_table SET c = c + 1 WHERE idx='abc'; if c is a counter, but that's a very special case. Counters have a completely separate path and implementation and do have a read-before-write (and are slower than normal update as a result).
Re: Using User Defined Functions in UPDATE queries
UDF are usable in UPDATE statement as actually trying them shows, it's just the documented grammar that needs fixing. But as far as doing something like: UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’; this is indeed *not* supported and likely never will. One big pillar of C* design is that normal writes like this don't do a read-before-write, both for performance and because of consistency constraints, so we can't have update depend on the previous value in any way. I'll note that maybe that make UDF useless for you and if so, I'm sorry, but you just can't use UDF in C* for that and you'd have to do a manual read-before-write client side to achieve this. For the sake of avoiding confusion, I will not that we do allow: UPDATE test_table SET c = c + 1 WHERE idx='abc'; if c is a counter, but that's a very special case. Counters have a completely separate path and implementation and do have a read-before-write (and are slower than normal update as a result). On Thu, Mar 10, 2016 at 11:11 PM, Kim Liu <k...@edgewaternetworks.com> wrote: > It does sounds like the use of UDF in UPDATE is in an ambiguous state at > the moment, then. The document grammar says they can’t be used, but the > document examples say they can, and the server will execute them, but it > can’t execute them in a useful way (i.e. no row supplied data.) > > So essentially not useable at the moment, regardless of intent. > > Thanks, > —Kim > > From: DuyHai Doan <doanduy...@gmail.com> > Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org> > Date: Thursday, March 10, 2016 at 14:03 > To: "user@cassandra.apache.org" <user@cassandra.apache.org> > Subject: Re: Using User Defined Functions in UPDATE queries > > Surely an error because the grammar definition for UPDATE does not mention > any function call: > [SNIPPED] > > Unless the grammar in the doc is itself not up-to-date > >> >>>
Re: Using User Defined Functions in UPDATE queries
It does sounds like the use of UDF in UPDATE is in an ambiguous state at the moment, then. The document grammar says they can’t be used, but the document examples say they can, and the server will execute them, but it can’t execute them in a useful way (i.e. no row supplied data.) So essentially not useable at the moment, regardless of intent. Thanks, —Kim From: DuyHai Doan <doanduy...@gmail.com<mailto:doanduy...@gmail.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Date: Thursday, March 10, 2016 at 14:03 To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Subject: Re: Using User Defined Functions in UPDATE queries Surely an error because the grammar definition for UPDATE does not mention any function call: [SNIPPED] Unless the grammar in the doc is itself not up-to-date
Re: Using User Defined Functions in UPDATE queries
Surely an error because the grammar definition for UPDATE does not mention any function call: ::= UPDATE ( USING ( AND )* )? SET ( ',' )* WHERE ( IF ( AND condition )* )? ::= '=' | '=' ('+' | '-') ( | | ) | '=' '+' | '[' ']' '=' ::= '=' | '[' ']' '=' ::= ( AND )* ::= '=' | IN '(' ( ( ',' )* )? ')' | IN '?' ::= TIMESTAMP | TTL Unless the grammar in the doc is itself not up-to-date Le 10 mars 2016 22:31, "Kim Liu" <k...@edgewaternetworks.com> a écrit : > Um, I’m not entirely sure how I misread it, since this was copy-pasted > from the document: > UPDATE atable SET col = some_function(?) …; > > So the document examples certainly seem to support the use of UDF in > UPDATE. I suppose the document may be more erroneous in its writing than > I in its misreading. > > Additionally, this statement works in cqlsh (presuming max_int() is a UDF): > UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’; > > So, if the grammar is not supposed to allow this, then there is a bug > somewhere because in 3.3 it certainly seems to be parsed and executed > without complaint. > > —Kim > > > From: DuyHai Doan <doanduy...@gmail.com> > Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org> > Date: Thursday, March 10, 2016 at 13:21 > To: "user@cassandra.apache.org" <user@cassandra.apache.org> > Subject: Re: Using User Defined Functions in UPDATE queries > > You have misread the CQL doc given in the link. According to CQL update > grammar it's not possible to use UDF. I see UDF only allowed in select > clause... > >> >>
Re: Using User Defined Functions in UPDATE queries
Um, I’m not entirely sure how I misread it, since this was copy-pasted from the document: UPDATE atable SET col = some_function(?) …; So the document examples certainly seem to support the use of UDF in UPDATE. I suppose the document may be more erroneous in its writing than I in its misreading. Additionally, this statement works in cqlsh (presuming max_int() is a UDF): UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’; So, if the grammar is not supposed to allow this, then there is a bug somewhere because in 3.3 it certainly seems to be parsed and executed without complaint. —Kim From: DuyHai Doan <doanduy...@gmail.com<mailto:doanduy...@gmail.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Date: Thursday, March 10, 2016 at 13:21 To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>> Subject: Re: Using User Defined Functions in UPDATE queries You have misread the CQL doc given in the link. According to CQL update grammar it's not possible to use UDF. I see UDF only allowed in select clause...
Re: Using User Defined Functions in UPDATE queries
You have misread the CQL doc given in the link. According to CQL update grammar it's not possible to use UDF. I see UDF only allowed in select clause... Le 10 mars 2016 22:07, "Kim Liu"a écrit : > Hello - > I am experimenting with User Defined Functions in Cassandra (3.3) and I am > a bit puzzled by a problem I am having when testing them with cqlsh. I > have tried to find the answers online, but have not had any luck so far. > > According to http://cassandra.apache.org/doc/cql3/CQL.html it looks like > a UDF (User Defined Function) should be useable in an update: > UPDATE atable SET col = some_function(?) ...; > > Current test environment: > cassandra@cqlsh:test_space> show version; > [cqlsh 5.0.1 | Cassandra 3.3 | CQL spec 3.4.0 | Native protocol v4] > > > However, using a value from the row in question as an argument to the UDF > does not seem to work. Here is how I am testing: > CREATE KEYSPACE test_space WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': '1'}; > USE test_space; > CREATE TABLE test_table (idx text, data int, PRIMARY KEY( idx )); > INSERT INTO test_table(idx,data) VALUES( 'abc', 1 ); > CREATE FUNCTION max_int( a int, b int ) CALLED ON NULL INPUT RETURNS int > LANGUAGE java AS 'return (a == null || b == null) ? 0 : ( a > b ? a : b);'; > SELECT * FROM test_table; > UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’; This works > SELECT * FROM test_table; > UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’; This > does not work > > Results: > SyntaxException: message="line 1:39 no viable alternative at input ',' (UPDATE test_table > SET data=max_int([data],...)"> > > > > So I am wondering if I am missing something here - or if this is just a > problem with testing with cqlsh. If a UDF is not allowed to be passed a > value from the row that is being updated, then what exactly is the use case > for having a UDF in an UPDATE? If all arguments to the UDF have to be > supplied by the client, then the client might as well perform the > function. Using a UDF in an UPDATE would seem to make the most sense only > if the row data could be accessed, since it could be used to bypass the > need for a SELECT by a client to read the values and perform the operation > on them before doing the UPDATE. > > > Thank you, > Kim Liu > > > Kim Liu > Sr. Software Engineer > k...@edgewaternetworks.com > ___ > “Nothing in the world is more dangerous than sincere ignorance and > conscientious stupidity.” -- Martin Luther King Jr. > > > > >