If I understand your question correctly, the current implementation doesn't
allow a starting value, but it's easy enough to pull off with something
like:

val startval = 1
df.withColumn("id", monotonically_increasing_id() + startval)

Two points - first, your test shows what happens with a single partition.
With multiple partitions, the id values will inherently be much higher,
because the partition id occupies the upper 31 bits of the value (the
record number within the partition fills the lower 33 bits).
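
As a rough illustration of that layout (hypothetical variable names; the
31/33-bit split is from the function's documentation):

// The first record of partition 1 already gets a very large id.
val partitionId = 1L                          // upper 31 bits
val rowInPartition = 0L                       // lower 33 bits
val id = (partitionId << 33) + rowInPartition // 8589934592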

The other note is that the startval in this case would need to be
communicated along with the job.  It may be worth defining it as a
broadcast variable and referencing it that way so there's less cluster
communication involved.  Honestly, I doubt there's much overhead with a
value this small, but it's a good habit to get into.
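
A minimal, untested sketch of that broadcast approach (assumes a
SparkSession named spark is in scope):

import org.apache.spark.sql.functions._

val startBc = spark.sparkContext.broadcast(1L)
// The UDF closure carries the broadcast handle to the executors, where
// .value resolves locally.
val addStart = udf((id: Long) => id + startBc.value)
val withIds = df.withColumn("id", addStart(monotonically_increasing_id()))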

Thanks

Mike

On Fri, Aug 5, 2016 at 11:33 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Mike for this.
>
> This is Scala. As expected it adds the id column to the end of the column
> list starting from 0
>
> scala> val df = ll_18740868.withColumn("id",
> monotonically_increasing_id()).show(2)
> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
> |transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
> |     2009-12-31|            CPT|'30-64-72|     18740868|  LTSB STH KENSINGT...|       90.0|        null|  400.0|  0|
> |     2009-12-31|            CPT|'30-64-72|     18740868|  LTSB CHELSEA (309...|       10.0|        null|  490.0|  1|
> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>
> Can one provide the starting value, say 1?
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 August 2016 at 16:45, Mike Metzger <m...@flexiblecreations.com>
> wrote:
>
>> You can use the monotonically_increasing_id method to generate guaranteed
>> unique (but not necessarily consecutive) IDs.  Call something like:
>>
>> df.withColumn("id", monotonically_increasing_id())
>>
>> You don't mention which language you're using, but you'll need to pull in
>> the sql.functions library.
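>>
>> In Scala, for example, that would look something like (a minimal sketch):
>>
>> import org.apache.spark.sql.functions.monotonically_increasing_id
>>
>> val withIds = df.withColumn("id", monotonically_increasing_id())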
>>
>> Mike
>>
>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>
>> Ayan - basically I have a dataset with the following structure, where the
>> bid values are unique strings
>>
>> bid: String
>> val : integer
>>
>> I need unique int values for these string bids to do some processing in
>> the dataset
>>
>> like
>>
>> id:int   (unique integer id for each bid)
>> bid:String
>> val:integer
>>
>>
>>
>> -Tony
>>
>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Can you explain a little further?
>>>
>>> best
>>> Ayan
>>>
>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane....@gmail.com>
>>> wrote:
>>>
>>>> I have a row with structure like
>>>>
>>>> identifier: String
>>>> value: int
>>>>
>>>> All identifiers are unique and I want to generate a unique long id for
>>>> the data and get a row object back for further processing.
>>>>
>>>> I understand I could use the zipWithUniqueId function on an RDD, but
>>>> that would mean first converting to an RDD and then joining it back to
>>>> the dataset.
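>>>>
>>>> Roughly what I mean (a sketch; assumes spark.implicits._ is in scope
>>>> and ds holds the (identifier, value) pairs):
>>>>
>>>> val withIds = ds.rdd
>>>>   .zipWithUniqueId()
>>>>   .map { case ((identifier, value), id) => (id, identifier, value) }
>>>>   .toDF("id", "identifier", "value")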
>>>>
>>>> What is the best way to do this?
>>>>
>>>> -Tony
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
