Hi,

We use monotonically_increasing_id() as well, but just cache the table first 
like Ankur suggested. With that method, we get the same keys in all derived 
tables. 
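For reference, roughly what we do (just a sketch -- sourceDf and the column
names are placeholders, and the count() is there to force materialization
of the cache):

    import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

    // Assign the IDs once, then cache so the column is not recomputed
    // when each derived table is built.
    val withId = sourceDf.withColumn("row_id", monotonically_increasing_id())
    withId.cache()
    withId.count()  // action to materialize the cached IDs

    // Derived tables now see the same row_id for the same rows.
    val x = withId.filter(col("amount") > 0)
    val y = withId.select(col("row_id"), col("name"))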

Thanks,
Subhash

> On Apr 7, 2017, at 7:32 PM, Everett Anderson <ever...@nuna.com.INVALID> wrote:
> 
> Hi,
> 
> Thanks, but that's using a random UUID. Certainly unlikely to have 
> collisions, but not guaranteed.
> 
> I'd prefer something like monotonically_increasing_id or RDD's 
> zipWithUniqueId, but with better behavioral characteristics -- so they 
> don't surprise people when 2+ outputs derived from an original table end 
> up with different IDs for the same rows.
> 
> It seems like this would be possible under the covers, but it would carry 
> the performance penalty of perhaps needing a count() and then also a 
> checkpoint.
> 
> I was hoping there's a better way.
> 
> 
>> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <secs...@gmail.com> wrote:
>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>> 
>> 
>>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <ever...@nuna.com.invalid> 
>>> wrote:
>>> Hi,
>>> 
>>> What's the best way to assign a truly unique row ID (rather than a hash) to 
>>> a DataFrame/Dataset?
>>> 
>>> I originally thought that functions.monotonically_increasing_id would do 
>>> this, but it turns out to have a rather unfortunate property: if you add 
>>> it as a column to table A and then derive tables X, Y, and Z and save 
>>> those, the row ID values in X, Y, and Z may end up different. I assume 
>>> this is because Spark delays the actual computation until each of those 
>>> tables is computed, so the IDs are regenerated separately for each one.
>>> 
>> 
>> 
>> 
>> -- 
>> Thanks,
>> 
>> Tim
> 
