Should be pretty much the same code for Scala:

    import java.util.UUID
    UUID.randomUUID

If you need it as a UDF, just wrap it accordingly.

Mike

On Fri, Aug 5, 2016 at 11:38 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> On the same token, can one generate a UUID like the below in Hive?
>
>     hive> select reflect("java.util.UUID", "randomUUID");
>     OK
>     587b1665-b578-4124-8bf9-8b17ccb01fe7
>
> thx
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 5 August 2016 at 17:34, Mike Metzger <m...@flexiblecreations.com> wrote:
>
>> Tony -
>>
>> From my testing, this is built with performance in mind. It's a 64-bit
>> value split between the partition id (upper 31 bits, ~2 billion) and the
>> id counter within a partition (lower 33 bits, ~8 billion). There shouldn't
>> be any added communication between the executors and the driver for that.
>>
>> I've been toying with an implementation that lets you specify the split
>> for better control, along with a start value.
>>
>> Thanks
>>
>> Mike
>>
>> On Aug 5, 2016, at 11:07 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>
>> Mike,
>>
>> I have figured out how to do this. Thanks for the suggestion. It works
>> great. I am trying to figure out the performance impact of this.
>>
>> thanks again
>>
>> On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>
>>> @mike - this looks great. How can I do this in Java? What is the
>>> performance implication on a large dataset?
>>>
>>> @sonal - I can't have a collision in the values.
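The UUID one-liner at the top of the thread can be sketched as a self-contained Java program; it is the same `UUID.randomUUID` call Mike suggests for Scala and Mich invokes via Hive's `reflect`. Wrapping it as a Spark UDF is left out here, since the exact registration API varies across Spark versions.

```java
import java.util.UUID;

public class RandomUuid {
    public static void main(String[] args) {
        // Same call as the Scala snippet and Hive's
        // reflect("java.util.UUID", "randomUUID"): a random, version-4 UUID.
        String id = UUID.randomUUID().toString();
        System.out.println(id);
        // The textual form is always 36 chars: 8-4-4-4-12 hex groups.
        System.out.println(id.length());
    }
}
```

Note that a UUID is 128 bits wide, so it does not fit a single long column; that is one reason the rest of the thread turns to monotonically_increasing_id for a numeric id.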
>>> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com> wrote:
>>>
>>>> You can use the monotonically_increasing_id method to generate
>>>> guaranteed-unique (but not necessarily consecutive) IDs, by calling
>>>> something like:
>>>>
>>>>     df.withColumn("id", monotonically_increasing_id())
>>>>
>>>> You don't mention which language you're using, but you'll need to pull
>>>> in the sql.functions library.
>>>>
>>>> Mike
>>>>
>>>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>>>
>>>> Ayan - basically I have a dataset with the structure below, where the
>>>> bids are unique string values:
>>>>
>>>>     bid: String
>>>>     val: Integer
>>>>
>>>> I need unique int values for these string bids to do some processing in
>>>> the dataset, like:
>>>>
>>>>     id: Int (unique integer id for each bid)
>>>>     bid: String
>>>>     val: Integer
>>>>
>>>> -Tony
>>>>
>>>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Can you explain a little further?
>>>>>
>>>>> best
>>>>> Ayan
>>>>>
>>>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>>>>
>>>>>> I have a row with a structure like:
>>>>>>
>>>>>>     identifier: String
>>>>>>     value: Int
>>>>>>
>>>>>> All identifiers are unique, and I want to generate a unique long id
>>>>>> for the data and get a Row object back for further processing.
>>>>>>
>>>>>> I understand using the zipWithUniqueId function on an RDD, but that
>>>>>> would mean first converting to an RDD and then joining the RDD and the
>>>>>> Dataset back together.
>>>>>>
>>>>>> What is the best way to do this?
>>>>>>
>>>>>> -Tony
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
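The bit split Mike describes for monotonically_increasing_id can be checked with plain arithmetic. The sketch below assumes the layout he gives (partition id in the upper 31 bits, per-partition counter in the lower 33 bits); `makeId` is an illustration of that layout, not Spark's own code.

```java
public class MonotonicIdLayout {
    // Pack a partition id and a per-partition record counter into one
    // 64-bit value: partition id in the upper 31 bits, counter in the
    // lower 33 bits -- the layout described in the thread.
    static long makeId(long partitionId, long recordNumber) {
        return (partitionId << 33) | recordNumber;
    }

    public static void main(String[] args) {
        long id = makeId(5, 42);                    // partition 5, record 42
        System.out.println(id);                     // 42949673002
        System.out.println(id >>> 33);              // 5  (partition id back)
        System.out.println(id & ((1L << 33) - 1));  // 42 (counter back)
    }
}
```

Ids built this way are unique across partitions but not consecutive: partition 0's ids start at 0, partition 1's at 2^33, and so on, which matches Mike's "guaranteed unique but not necessarily consecutive" caveat.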