On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:

> This isn't about using the same UUID though. It's about the timestamp bits
> in the UUID.
>
> What is the use case for generating multiple UUIDs in a single row? Why do
> you need to extract the timestamp out of both?
> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>>
>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>
>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>>
>>>> I am not sure you saw my reply on the thread, but I believe everyone's
>>>> needs can be met. I will copy it here:
>>>>
>>>
>>> I saw it, but the real problem that was raised initially was not that of
>>> UDF and of allowing both behavior. It's a matter of people being confused
>>> by the behavior of a non-UDF function, now(), and suggesting it should be
>>> changed.
>>>
>>> The Hive idea is interesting, I guess, and we can switch to discussing
>>> that, but it's a different problem really and I'm not fond of derailing
>>> threads. I will just note, though, that if we're not talking about a
>>> confusion issue but rather about how to get a timeuuid to be fixed within
>>> a statement, then there is a much more trivial solution: generate it
>>> client side. The `now()` function is a small convenience, but there is
>>> nothing you cannot do without it client side, and that holds for almost
>>> any use of a (non-aggregate) function in Cassandra currently.
>>>
>>>
>>>>
>>>>
>>>> "Food for thought: Hive's UDFs introduced an annotation
>>>> @UDFType(deterministic = false)
>>>>
>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>
>>>> The effect is the query planner can see when such a UDF is in use and
>>>> determine the value once at the start of a very long query."
>>>>
>>>> Essentially, Hive had a similar if not identical problem: during a
>>>> long-running distributed process like map/reduce, some users wanted the
>>>> semantics of:
>>>>
>>>> 1) Each call should produce a new timestamp
>>>>
>>>> While other users wanted the semantics of:
>>>>
>>>> 2) Each call should generate the same timestamp
>>>>
>>>> The solution implemented was to add an annotation to the UDF such that
>>>> the query planner would pick up the annotation and act accordingly.
>>>>
>>>> (Here is a related issue: https://issues.apache.org/jira/browse/HIVE-1986)
>>>>
>>>> As a result you can essentially implement two UDFs:
>>>>
>>>> @UDFType(deterministic = false)
>>>> public class UDFNow
>>>>
>>>> and for the other people
>>>>
>>>> @UDFType(deterministic = true)
>>>> public class UDFNowOnce extends UDFNow
>>>>
>>>> Both use cases are met in a sensible way.
>>>>
>>>
>>>
>> The `now()` function is a small convenience, but there is nothing you
>> cannot do without it client side, and that holds for almost any use of a
>> (non-aggregate) function in Cassandra currently.
>>
>> Cassandra's changing philosophy over which entity (client, server, or
>> driver) should create such information does not make this problem easy.
>>
>> If you take into account that you have users who do not understand all
>> the intricacies of UUIDs, the problem is compounded. I.e., how does one
>> generate a UUID in each of C#, Python, Java, etc., with the 47 random
>> bits and so on? That is not super easy information to find. Maybe you
>> find a Stack Overflow post that actually gives bad advice.
>>
>> Many times in Cassandra you are using a UUID because you do not have a
>> unique key in the insert and you wish to create one. If you are inserting
>> more than a single record using that same UUID and you do not want the
>> burden of doing it yourself, you would have to do write>>read>>write,
>> which is an anti-pattern.
>>
>
Not multiple ids for a single row. The same id for multiple inserts in a
batch.

For example, let's say I have an application where my data has no unique key.

Table poke:
poker, pokee, time

Suppose I consume pokes from Kafka, build a batch of 30k, and insert them.
You probably want to denormalize into two tables:
Primary key (poker, time)
Primary key (pokee, time)

It makes sense that they all have the same UUID if you want it to be the
UUID of the batch. This would make it easy to correlate all the events, and
easy to delete them all as well.
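As a sketch of what that looks like with one client-generated id shared by the whole batch (the table names pokes_by_poker and pokes_by_pokee are invented for this example; in a real app the time column would be a version-1 timeuuid, e.g. from the Java driver's UUIDs.timeBased(), rather than a placeholder):

```java
import java.util.UUID;

public class PokeBatch {
    // Build the two denormalized INSERTs with one id generated once on the
    // client and reused by every statement in the batch. Because the same
    // value lands in both tables, correlating the events is a single
    // equality predicate, and deleting them is two statements keyed by
    // the same value.
    static String pokeBatch(String poker, String pokee, UUID time) {
        return "BEGIN BATCH\n"
             + "  INSERT INTO pokes_by_poker (poker, pokee, time) VALUES ('"
             + poker + "', '" + pokee + "', " + time + ");\n"
             + "  INSERT INTO pokes_by_pokee (pokee, poker, time) VALUES ('"
             + pokee + "', '" + poker + "', " + time + ");\n"
             + "APPLY BATCH;";
    }
}
```

The point is only that the id is created once, outside the statements, so the server-side semantics of now() never enter into it.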

The "do it client side" argument is totally valid, but it has been a
justification for not adding features, many of which are eventually added
anyway.
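On the "not super easy information to find" point above, here is a minimal, driver-free sketch of what generating a version-1 (time-based) UUID client side involves in plain Java. The class and method names are invented for this example, and it is deliberately simplified: a production generator would also keep a clock sequence to guard against clock regression and sub-millisecond collisions.

```java
import java.security.SecureRandom;
import java.util.UUID;

public class ClientTimeUuid {
    // 100-ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
    static final long UUID_EPOCH_TICKS = 0x01B21DD213814000L;

    // Pack a Unix-millis timestamp into the RFC 4122 version-1 layout.
    // The least significant half carries the clock sequence and node; per
    // RFC 4122, a randomly generated node must set its multicast bit so it
    // can never collide with a real MAC address.
    static UUID fromMillis(long millis, SecureRandom rnd) {
        long ticks = millis * 10_000L + UUID_EPOCH_TICKS;
        long msb = (ticks << 32)                   // time_low
                 | ((ticks >>> 16) & 0xFFFF0000L)  // time_mid
                 | 0x1000L                         // version 1
                 | ((ticks >>> 48) & 0x0FFFL);     // time_hi
        long lsb = (rnd.nextLong() & 0x3FFFFFFFFFFFFFFFL)
                 | 0x8000000000000000L             // RFC 4122 variant
                 | (1L << 40);                     // multicast bit of the node
        return new UUID(msb, lsb);
    }

    // Recover Unix millis from any version-1 UUID.
    static long toMillis(UUID u) {
        return (u.timestamp() - UUID_EPOCH_TICKS) / 10_000L;
    }
}
</imports>
```

Extracting the timestamp back out (toMillis above, built on java.util.UUID.timestamp()) is what makes a shared timeuuid useful for correlating a batch after the fact.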




-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
