Re: Generating unique id for a column in Row without breaking into RDD and joining back

Mike Metzger Fri, 05 Aug 2016 09:35:40 -0700

Tony -

   From my testing, this is built with performance in mind.  It's a 64-bit 
value split between the partition id (upper 31 bits, ~1 billion) and the id 
counter within a partition (lower 33 bits, ~8 billion).  Each executor can 
compute its ids locally from those two numbers, so there shouldn't be any 
added communication between the executors and the driver for that.
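
To make the layout concrete, here is a minimal sketch of the arithmetic in plain Python (no Spark needed). The helper names compose_id and split_id are my own for illustration and are not part of Spark's API; the sketch only assumes the 31/33 split described above.

```python
# Sketch of the 64-bit layout described above: partition id in the upper
# 31 bits, per-partition record counter in the lower 33 bits.
# compose_id / split_id are hypothetical helper names, not Spark functions.
COUNTER_BITS = 33
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def compose_id(partition_id, row_in_partition):
    """Pack a partition id and a per-partition counter into one integer."""
    return (partition_id << COUNTER_BITS) | row_in_partition


def split_id(generated_id):
    """Recover (partition_id, row_in_partition) from a packed id."""
    return generated_id >> COUNTER_BITS, generated_id & COUNTER_MASK


print(compose_id(0, 0))       # first row of partition 0 -> 0
print(compose_id(1, 0))       # first row of partition 1 -> 8589934592
print(split_id(8589934594))   # -> (1, 2)
```

This is also why the ids are unique but not consecutive: every new partition jumps ahead by 2^33, regardless of how many rows the previous partition held.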


I've been toying with an implementation that allows you to specify the split 
for better control along with a start value. 
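
Roughly the idea, as a hypothetical sketch in plain Python rather than the actual implementation: the function name make_id_generator and its parameters partition_bits, counter_bits and start are assumptions made up for this example, and the 63-bit limit is my choice to keep the packed part non-negative in a signed 64-bit long.

```python
# Hypothetical sketch of a configurable split with a start value.
# All names here are illustrative assumptions, not an existing Spark API.
def make_id_generator(partition_bits, counter_bits, start=0):
    # Keep the packed value within 63 bits so it stays non-negative
    # when stored in a signed 64-bit long.
    if partition_bits + counter_bits > 63:
        raise ValueError("partition_bits + counter_bits must be <= 63")
    max_partitions = 1 << partition_bits
    max_rows = 1 << counter_bits

    def generate(partition_id, row_in_partition):
        if not 0 <= partition_id < max_partitions:
            raise ValueError("partition_id does not fit in partition_bits")
        if not 0 <= row_in_partition < max_rows:
            raise ValueError("row_in_partition does not fit in counter_bits")
        # Same packing as before, shifted by the caller's start value.
        return start + ((partition_id << counter_bits) | row_in_partition)

    return generate


gen = make_id_generator(partition_bits=10, counter_bits=20, start=1000)
print(gen(0, 0))  # 1000
print(gen(1, 0))  # 1049576
```

The range checks matter here: with a smaller counter field, a partition that overflows its counter would otherwise spill into the next partition's id range and produce collisions.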

Thanks

Mike

> On Aug 5, 2016, at 11:07 AM, Tony Lane <tonylane....@gmail.com> wrote:
> 
> Mike.
> 
> I have figured out how to do this.  Thanks for the suggestion. It works great.  
> I am now trying to figure out the performance impact of this. 
> 
> thanks again
> 
> 
>> On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane....@gmail.com> wrote:
>> @mike  - this looks great. How can I do this in Java?  What is the 
>> performance implication on a large dataset? 
>> 
>> @sonal  - I can't have a collision in the values. 
>> 
>>> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com> 
>>> wrote:
>>> You can use the monotonically_increasing_id method to generate guaranteed 
>>> unique (but not necessarily consecutive) IDs.  Calling something like:
>>> 
>>> df.withColumn("id", monotonically_increasing_id())
>>> 
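>>> You don't mention which language you're using, but you'll need to import 
>>> the function from the sql.functions library (org.apache.spark.sql.functions 
>>> in Scala and Java, pyspark.sql.functions in Python).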
>>> 
>>> Mike
>>> 
>>>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>>> 
>>>> Ayan - basically I have a dataset with the following structure, where 
>>>> the bid values are unique strings:
>>>> 
>>>> bid: String
>>>> val: integer
>>>> 
>>>> I need unique int values for these string bids to do some processing in 
>>>> the dataset
>>>> 
>>>> like 
>>>> 
>>>> id:int   (unique integer id for each bid)
>>>> bid:String
>>>> val:integer
>>>> 
>>>> 
>>>> 
>>>> -Tony
>>>> 
>>>>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>> Hi
>>>>> 
>>>>> Can you explain a little further? 
>>>>> 
>>>>> best
>>>>> Ayan
>>>>> 
>>>>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane....@gmail.com> 
>>>>>> wrote:
>>>>>> I have a row with structure like
>>>>>> 
>>>>>> identifier: String
>>>>>> value: int
>>>>>> 
>>>>>> All identifiers are unique, and I want to generate a unique long id for 
>>>>>> the data and get a row object back for further processing. 
>>>>>> 
>>>>>> I understand I could use the zipWithUniqueId function on RDD, but that 
>>>>>> would mean first converting to an RDD and then joining the RDD back to 
>>>>>> the dataset.
>>>>>> 
>>>>>> What is the best way to do this? 
>>>>>> 
>>>>>> -Tony 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Best Regards,
>>>>> Ayan Guha
>
