This is a UDF written for Hive to monotonically increment a column by 1

http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFRowSequence.java


package org.apache.hadoop.hive.contrib.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;
/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();
  public UDFRowSequence() {
    result.set(0);
  }
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}
// End UDFRowSequence.java

Is there an equivalent of this for Spark, written in Scala as well?

Thanks
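For what it's worth, a rough Spark/Scala sketch (assumptions: a SparkSession named spark and an existing DataFrame df; note that monotonically_increasing_id gives unique but *not* consecutive values, so an RDD zipWithIndex variant is shown for a true 1-based sequence like the Hive UDF's):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.types.{LongType, StructField}

// Unique (but not consecutive) ids, computed fully in parallel:
val withIds = df.withColumn("id", monotonically_increasing_id())

// A true 1, 2, 3, ... sequence via the RDD API (more expensive):
val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))   // 1-based, like the Hive UDF
}
val schema = df.schema.add(StructField("id", LongType, nullable = false))
val withSeq = spark.createDataFrame(indexed, schema)
```

The first form scales better; the second pays for a global ordering pass but matches the Hive semantics.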



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 August 2016 at 17:38, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> On the same token can one generate  a UUID like below in Hive
>
> hive> select reflect("java.util.UUID", "randomUUID");
> OK
> 587b1665-b578-4124-8bf9-8b17ccb01fe7
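(A hedged Spark analogue: Spark SQL also exposes a reflect function from 2.0 on, so the same trick should carry over, or a plain Scala UDF does the job; df is assumed to be an existing DataFrame:)

```scala
import org.apache.spark.sql.functions.{expr, udf}

// Same reflection trick as the Hive query, via Spark SQL's reflect:
val withUuid1 = df.withColumn("uuid",
  expr("""reflect("java.util.UUID", "randomUUID")"""))

// Equivalent with an ordinary Scala UDF:
val uuidUdf = udf(() => java.util.UUID.randomUUID().toString)
val withUuid2 = df.withColumn("uuid", uuidUdf())
```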
>
> thx
>
> Dr Mich Talebzadeh
>
>
>
> On 5 August 2016 at 17:34, Mike Metzger <m...@flexiblecreations.com>
> wrote:
>
>> Tony -
>>
>>    From my testing this is built with performance in mind.  It's a 64-bit
>> value split between the partition id (upper 31 bits, ~1 billion values)
>> and the id counter within a partition (lower 33 bits, ~8 billion values).
>> There shouldn't be any added communication between the executors and the
>> driver for that.
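(The bit split above can be sketched with plain Scala arithmetic, assuming the upper-31-bit / lower-33-bit layout Mike describes:)

```scala
// Recover the pieces of a monotonically_increasing_id value, assuming
// the upper-31-bit partition id / lower-33-bit counter layout:
val id = (1L << 33) + 7L            // e.g. 2nd partition, 8th record
val partitionId = (id >> 33).toInt  // 1
val offset = id & ((1L << 33) - 1)  // 7
```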
>>
>> I've been toying with an implementation that allows you to specify the
>> split for better control along with a start value.
>>
>> Thanks
>>
>> Mike
>>
>> On Aug 5, 2016, at 11:07 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>
>> Mike.
>>
>> I have figured out how to do this.  Thanks for the suggestion. It works
>> great.  I am trying to figure out the performance impact of this.
>>
>> thanks again
>>
>>
>> On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>
>>> @mike  - this looks great. How can I do this in Java?  What is the
>>> performance implication on a large dataset?
>>>
>>> @sonal  - I can't have a collision in the values.
>>>
>>> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com
>>> > wrote:
>>>
>>>> You can use the monotonically_increasing_id method to generate
>>>> guaranteed unique (but not necessarily consecutive) IDs.  Calling something
>>>> like:
>>>>
>>>> df.withColumn("id", monotonically_increasing_id())
>>>>
>>>> You don't mention which language you're using but you'll need to pull
>>>> in the sql.functions library.
>>>>
>>>> Mike
>>>>
>>>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>>>
>>>> Ayan - basically I have a dataset with the structure below, where the
>>>> bid values are unique strings
>>>>
>>>> bid: String
>>>> val : integer
>>>>
>>>> I need unique int values for these string bids to do some processing
>>>> in the dataset
>>>>
>>>> like
>>>>
>>>> id:int   (unique integer id for each bid)
>>>> bid:String
>>>> val:integer
>>>>
>>>>
>>>>
>>>> -Tony
>>>>
>>>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Can you explain a little further?
>>>>>
>>>>> best
>>>>> Ayan
>>>>>
>>>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have a row with structure like
>>>>>>
>>>>>> identifier: String
>>>>>> value: int
>>>>>>
>>>>>> All identifiers are unique and I want to generate a unique long id for
>>>>>> the data and get a row object back for further processing.
>>>>>>
>>>>>> I understand I could use the zipWithUniqueId function on an RDD, but
>>>>>> that would mean first converting to an RDD and then joining the result
>>>>>> back to the dataset
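(That convert-and-join approach might look roughly like this; a sketch only, assuming a case class Record(identifier: String, value: Int), a Dataset[Record] named ds, and spark.implicits in scope:)

```scala
import spark.implicits._

// ds: Dataset[Record] with fields identifier: String, value: Int
val ids = ds.rdd.zipWithUniqueId()
  .map { case (rec, id) => (rec.identifier, id) }
  .toDF("identifier", "id")

// Join the generated long ids back onto the original dataset:
val withIds = ds.join(ids, "identifier")
```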
>>>>>>
>>>>>> What is the best way to do this ?
>>>>>>
>>>>>> -Tony
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>
>
