Re: Assigning a unique row ID

2017-04-10 Thread Everett Anderson
Indeed, I tried persist with MEMORY_AND_DISK and it works! (I'm wary of MEMORY_ONLY for this as it could potentially recompute shards if it couldn't entirely cache in memory.) Thanks for the help, everybody!! On Sat, Apr 8, 2017 at 11:54 AM, Everett Anderson wrote: > > > On

Re: Assigning a unique row ID

2017-04-08 Thread Everett Anderson
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram wrote: > Hi, > > We use monotonically_increasing_id() as well, but just cache the table > first like Ankur suggested. With that method, we get the same keys in all > derived tables. > Ah, okay, awesome. Let me give that a

Re: Assigning a unique row ID

2017-04-07 Thread Subhash Sriram
Hi, We use monotonically_increasing_id() as well, but just cache the table first like Ankur suggested. With that method, we get the same keys in all derived tables. Thanks, Subhash Sent from my iPhone > On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote: > > Hi,

Re: Assigning a unique row ID

2017-04-07 Thread Everett Anderson
Hi, Thanks, but that's using a random UUID. Certainly unlikely to have collisions, but not guaranteed. I'd prefer something like monotonically_increasing_id or RDD's zipWithUniqueId but with better behavioral characteristics -- so they don't surprise people when 2+ outputs derived from an
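For readers comparing the RDD options named in this thread, here is a rough plain-Python model of the two numbering schemes. It is not Spark code: the function names and the sample partitions are hypothetical, and it only mirrors the documented behavior that zipWithUniqueId gives items in the kth of n partitions the IDs k, n+k, 2n+k, and so on, while zipWithIndex numbers items consecutively in partition order.

```python
# Plain-Python model of two RDD numbering schemes; not Spark code.
# Function names and sample data are hypothetical, for illustration only.


def model_zip_with_unique_id(partitions):
    """Give the i-th item of partition k the ID i * n + k (n = partition count)."""
    n = len(partitions)
    return [
        (item, i * n + k)
        for k, items in enumerate(partitions)
        for i, item in enumerate(items)
    ]


def model_zip_with_index(partitions):
    """Number items consecutively across partitions, in partition order."""
    flat = [item for items in partitions for item in items]
    return list(zip(flat, range(len(flat))))


parts = [["a", "b", "c"], ["d"], ["e", "f"]]
unique_ids = model_zip_with_unique_id(parts)  # unique, but may leave gaps
indexed = model_zip_with_index(parts)         # consecutive, no gaps
```

Both schemes produce unique IDs, but both depend on which partition a row lands in and where it sits inside that partition, so they are only stable if the underlying data is not recomputed with a different layout.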

Re: Assigning a unique row ID

2017-04-07 Thread Ankur Srivastava
You can use zipWithIndex, or the approach Tim suggested, or even the one you are using, but I believe the issue is that tableA is being materialized every time you perform the new transformations. Are you caching/persisting the table A? If you do that you should not see this behavior. Thanks Ankur On
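The effect described here can be modeled without Spark. The sketch below is a plain-Python stand-in, not the poster's code: compute_table_a and the evaluation counter are hypothetical, and the row rotation simply stands in for whatever makes a recomputed table come back in a different order. It shows that position-based IDs disagree when table A is evaluated once per derived table, and agree when A is materialized once and reused, which is what caching or persisting the table is meant to achieve.

```python
from itertools import count

# Hypothetical counter standing in for "each recomputation of table A".
_evaluations = count()


def compute_table_a():
    """Model a lazily evaluated table whose row order differs per evaluation."""
    rows = ["x", "y", "z"]
    shift = next(_evaluations) % len(rows)
    reordered = rows[shift:] + rows[:shift]
    # Assign IDs by position, as a position-based ID generator would.
    return {row: row_id for row_id, row in enumerate(reordered)}


# Without caching: each derived table re-evaluates table A and sees different IDs.
derived_b = compute_table_a()
derived_c = compute_table_a()
uncached_ids_match = derived_b == derived_c

# With caching: table A is materialized once and both derived tables reuse it.
cached_a = compute_table_a()
derived_d = dict(cached_a)
derived_e = dict(cached_a)
cached_ids_match = derived_d == derived_e
```

In this model uncached_ids_match is False and cached_ids_match is True.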

Re: Assigning a unique row ID

2017-04-07 Thread Tim Smith
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson wrote: > Hi, > > What's the best way to assign a truly unique row ID (rather than a hash) > to a

Assigning a unique row ID

2017-04-07 Thread Everett Anderson
Hi, What's the best way to assign a truly unique row ID (rather than a hash) to a DataFrame/Dataset? I originally thought that functions.monotonically_increasing_id would do this, but it seems to have a rather unfortunate property that if you add it as a column to table A and then derive tables
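As background for the question, the Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below models only that layout in plain Python; it is not Spark code, and the function names and sample partitions are hypothetical. It illustrates why the same row can receive a different ID if the data is re-evaluated with a different partition layout.

```python
# Plain-Python model of the documented ID layout; this is not Spark code.
RECORD_BITS = 33  # lower 33 bits hold the row's position within its partition


def model_monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Combine a partition index and a per-partition row number into one ID."""
    return (partition_index << RECORD_BITS) + row_in_partition


def assign_ids(partitions):
    """Return (id, row) pairs for a list of partitions (each a list of rows)."""
    return [
        (model_monotonic_id(p, i), row)
        for p, rows in enumerate(partitions)
        for i, row in enumerate(rows)
    ]


# The same six rows, split across partitions in two different ways.
layout_a = [["a", "b", "c"], ["d", "e", "f"]]
layout_b = [["a", "b"], ["c", "d"], ["e", "f"]]

ids_a = {row: rid for rid, row in assign_ids(layout_a)}
ids_b = {row: rid for rid, row in assign_ids(layout_b)}
```

Within either layout every ID is unique, but row "c" gets 2 under layout_a and 8589934592 under layout_b, so two outputs derived from separately evaluated copies of the same table need not agree on a row's ID.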