Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
If I understand your question correctly, the current implementation doesn't allow a starting value, but it's easy enough to pull off with something like:

    val startval = 1
    df.withColumn("id", monotonicallyIncreasingId + startval)

Two points - your test shows what happens with a single partition.

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Should be pretty much the same code for Scala:

    import java.util.UUID
    UUID.randomUUID

If you need it as a UDF, just wrap it accordingly.

Mike

On Fri, Aug 5, 2016 at 11:38 AM, Mich Talebzadeh wrote:
> By the same token, can one generate a UUID like below in Hive
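
For comparison, a minimal Python sketch of the same idea. java.util.UUID.randomUUID produces a random (version 4) UUID, and Python's uuid.uuid4() is the closest standard-library counterpart. The helper name new_row_id is made up for this illustration, and wrapping it as a Spark UDF is left out.

```python
import uuid


def new_row_id() -> str:
    """Return a random (version 4) UUID as a 36-character string.

    new_row_id is a hypothetical helper name used only in this sketch.
    """
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Each call yields a different value, so no coordination between
    # workers is needed to keep the ids distinct in practice.
    print(new_row_id())
    print(new_row_id())
```

Because the value is random, it is a string-like key rather than a numeric one, and re-evaluating the same computation yields different ids, so persisting the result once is a reasonable precaution.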

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mich Talebzadeh
This is a UDF written for Hive to monotonically increment a column by 1:

http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFRowSequence.java

    package org.apache.hadoop.hive.contrib.udf;
    import org.apache.hadoop.hive.ql.exec.Description;
    import

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mich Talebzadeh
By the same token, can one generate a UUID like below in Hive?

    hive> select reflect("java.util.UUID", "randomUUID");
    OK
    587b1665-b578-4124-8bf9-8b17ccb01fe7

thx

Dr Mich Talebzadeh

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Not that I've seen, at least not in any worker-independent way. To guarantee consecutive values you'd have to create a UDF or some such that provided a new row id. This probably isn't an issue on small data sets but would cause a lot of added communication on larger clusters / datasets.

Mike
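
To make the "added communication" concrete, here is a rough pure-Python model of one way to get consecutive ids: count the rows in every partition first, then let each partition number its rows from an offset. This is a sketch of the general two-pass idea, not Spark code; the function name consecutive_ids and the list-of-lists stand-in for partitions are assumptions made for this example.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def consecutive_ids(
    partitions: Sequence[Sequence[str]], start: int = 0
) -> List[List[Tuple[int, str]]]:
    """Assign gap-free ids across partitions (hypothetical helper)."""
    # Pass 1: every partition reports its row count. On a cluster this is
    # the extra round trip that a worker-independent scheme avoids.
    counts = [len(part) for part in partitions]
    # A partition's offset is start plus the rows in all earlier partitions.
    offsets = [start] + [start + c for c in accumulate(counts)][:-1]
    # Pass 2: each partition numbers its own rows from its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


if __name__ == "__main__":
    for part in consecutive_ids([["a", "b"], [], ["c", "d", "e"]], start=1):
        print(part)
```

The first pass is the cost: nothing can be numbered until every partition's size is known, which is why this scales worse than ids that each worker can compute on its own.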

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Tony - From my testing this is built with performance in mind. It's a 64-bit value split between the partition id (upper 31 bits, ~1 billion) and the id counter within a partition (lower 33 bits, ~8 billion). There shouldn't be any added communication between the executors and the driver for
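
The layout described above can be modelled in a few lines of plain Python. This is only an illustration of the bit split (31 bits of partition id on top of a 33-bit per-partition counter), not Spark's implementation; compose_id and split_id are names invented for this sketch.

```python
from typing import Tuple

COUNTER_BITS = 33                        # lower 33 bits: row counter within a partition
COUNTER_MASK = (1 << COUNTER_BITS) - 1   # mask that selects those lower bits


def compose_id(partition_id: int, row_in_partition: int) -> int:
    """Pack a partition id and a per-partition counter into one 64-bit value."""
    if not 0 <= partition_id < (1 << 31):
        raise ValueError("partition id must fit in 31 bits")
    if not 0 <= row_in_partition <= COUNTER_MASK:
        raise ValueError("row counter must fit in 33 bits")
    return (partition_id << COUNTER_BITS) | row_in_partition


def split_id(value: int) -> Tuple[int, int]:
    """Recover (partition_id, row_in_partition) from a packed id."""
    return value >> COUNTER_BITS, value & COUNTER_MASK


if __name__ == "__main__":
    print(compose_id(0, 0), compose_id(0, 1))  # first two rows of partition 0
    print(compose_id(1, 0))                    # first row of partition 1
```

Ids are consecutive within a partition and jump by 2^33 = 8589934592 from one partition to the next, which is why a single-partition test shows 0, 1, 2, ... and a multi-partition one does not. Each worker only needs its own partition id and a local counter, so no coordination is required.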

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mich Talebzadeh
Thanks Mike for this. This is Scala. As expected it adds the id column to the end of the column list starting from 0 0

    scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id()).show(2)
    +---+---+-+-+---

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread janardhan shetty
Mike, any suggestions on doing it for consecutive IDs?

On Aug 5, 2016 9:08 AM, "Tony Lane" wrote:
> Mike.
>
> I have figured out how to do this. Thanks for the suggestion. It works great. I am trying to figure out the performance impact of this.
>
> thanks again

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Mike.

I have figured out how to do this. Thanks for the suggestion. It works great. I am trying to figure out the performance impact of this.

thanks again

On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane wrote:
> @mike - this looks great. How can I do this in Java? What

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
@mike - this looks great. How can I do this in Java? What is the performance implication on a large dataset?

@sonal - I can't have a collision in the values.

On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger wrote:
> You can use the monotonically_increasing_id

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
You can use the monotonically_increasing_id method to generate guaranteed unique (but not necessarily consecutive) IDs. Call something like:

    df.withColumn("id", monotonically_increasing_id())

You don't mention which language you're using but you'll need to pull in the sql.functions

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Sonal Goyal
Hi Tony,

Would hash on the bid work for you?

    hash(cols: Column*): Column
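
One thing to weigh with a hash-based id is the chance of two different bids mapping to the same value. The sketch below uses the standard birthday-bound approximation and assumes a 32-bit integer hash; that width, and the function name collision_probability, are assumptions for this illustration.

```python
import math


def collision_probability(n_keys: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n_keys share a hash value.

    Uses the birthday bound p ~= 1 - exp(-n(n-1) / (2 * 2**bits)) and
    assumes the hash spreads keys uniformly over its range.
    """
    if n_keys < 2:
        return 0.0
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, collision_probability(n))
```

Under these assumptions a collision becomes more likely than not somewhere below 100,000 distinct keys, so a 32-bit hash alone does not suit a dataset that cannot tolerate duplicates.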

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Ayan - basically I have a dataset with the following structure, where the bid values are unique strings:

    bid: String
    val: integer

I need unique int values for these string bids to do some processing in the dataset, like:

    id: int (unique integer id for each bid)
    bid: String
    val: integer

-Tony

On Fri, Aug 5,

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread ayan guha
Hi,

Can you explain a little further?

best
Ayan

On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane wrote:
> I have a row with structure like
>
> identifier: String
> value: int
>
> All identifiers are unique and I want to generate a unique long id for the data and get a row

Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
I have a row with structure like

    identifier: String
    value: int

All identifiers are unique and I want to generate a unique long id for the data and get a row object back for further processing. I understand using the zipWithUniqueId function on RDD, but that would mean first converting to RDD and
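
For reference, the numbering scheme behind zipWithUniqueId can be modelled in plain Python: with n partitions, the k-th item of partition p gets id k * n + p, so ids are unique but may have gaps. This is a model of the scheme only, not Spark code; the list-of-lists representation and the function name are assumptions for this sketch.

```python
from typing import List, Sequence, Tuple


def zip_with_unique_id(
    partitions: Sequence[Sequence[str]],
) -> List[List[Tuple[str, int]]]:
    """Pair every row with a unique id using the k * n + p scheme."""
    n = len(partitions)  # total number of partitions
    return [
        # The k-th row of partition p gets id k * n + p; no row counts
        # from other partitions are needed.
        [(row, k * n + p) for k, row in enumerate(part)]
        for p, part in enumerate(partitions)
    ]


if __name__ == "__main__":
    for part in zip_with_unique_id([["a", "b", "c"], ["d"], ["e", "f"]]):
        print(part)
```

Each partition can number its rows independently because ids from different partitions differ in their remainder modulo n.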