Indeed, I tried persist with MEMORY_AND_DISK and it works! (I'm wary of MEMORY_ONLY for this, as it could recompute any partitions that don't fit in memory, and those could then come back with different IDs.)
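
For anyone finding this thread later, here is a rough pure-Python model (not Spark code) of why persisting helps. It assumes the ID layout described in the Spark documentation for monotonically_increasing_id, where the partition index goes in the upper bits and the record number within the partition in the lower 33 bits. The function name monotonic_ids and the sample rows are made up for illustration, and the "second evaluation" simply assumes the rows landed in partitions differently on recomputation.

```python
from typing import Dict, List


def monotonic_ids(partitions: List[List[str]]) -> Dict[str, int]:
    """Hypothetical model of the documented layout: partition index in the
    upper bits, record number within the partition in the lower 33 bits."""
    ids = {}
    for part_idx, rows in enumerate(partitions):
        for row_idx, row in enumerate(rows):
            ids[row] = (part_idx << 33) + row_idx
    return ids


# Two assumed evaluations of the same logical table: same rows, but the
# rows ended up in different partitions the second time around.
first_eval = [["a", "b"], ["c", "d"]]
second_eval = [["a", "c"], ["b", "d"]]

# Without persisting, each derived table triggers its own evaluation.
ids_x = monotonic_ids(first_eval)   # IDs seen while writing table X
ids_y = monotonic_ids(second_eval)  # IDs seen while writing table Y

# Persisting is modeled as computing the IDs once and reusing the result.
persisted = monotonic_ids(first_eval)
ids_x_cached = dict(persisted)
ids_y_cached = dict(persisted)

print(ids_x["b"], ids_y["b"])        # differ between evaluations
print(ids_x_cached == ids_y_cached)  # identical once persisted
```

In this model, row "b" gets ID 1 in the first evaluation and 8589934592 in the second, while both tables derived from the single persisted result agree.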
Thanks for the help, everybody!!

On Sat, Apr 8, 2017 at 11:54 AM, Everett Anderson <ever...@nuna.com> wrote:

> On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram <subhash.sri...@gmail.com> wrote:
>
>> Hi,
>>
>> We use monotonically_increasing_id() as well, but just cache the table
>> first like Ankur suggested. With that method, we get the same keys in all
>> derived tables.
>
> Ah, okay, awesome. Let me give that a go.
>
>> Thanks,
>> Subhash
>>
>> Sent from my iPhone
>>
>> On Apr 7, 2017, at 7:32 PM, Everett Anderson <ever...@nuna.com.INVALID> wrote:
>>
>> Hi,
>>
>> Thanks, but that's using a random UUID. Certainly unlikely to have
>> collisions, but not guaranteed.
>>
>> I'd rather prefer something like monotonically_increasing_id or RDD's
>> zipWithUniqueId but with better behavioral characteristics -- so they don't
>> surprise people when 2+ outputs derived from an original table end up not
>> having the same IDs for the same rows, anymore.
>>
>> It seems like this would be possible under the covers, but would have the
>> performance penalty of needing to do perhaps a count() and then also a
>> checkpoint.
>>
>> I was hoping there's a better way.
>>
>> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <secs...@gmail.com> wrote:
>>
>>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>>
>>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <ever...@nuna.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> What's the best way to assign a truly unique row ID (rather than a
>>>> hash) to a DataFrame/Dataset?
>>>>
>>>> I originally thought that functions.monotonically_increasing_id would
>>>> do this, but it seems to have a rather unfortunate property that if you add
>>>> it as a column to table A and then derive tables X, Y, Z and save those,
>>>> the row ID values in X, Y, and Z may end up different. I assume this is
>>>> because it delays the actual computation to the point where each of those
>>>> tables is computed.
>>>
>>> --
>>> Thanks,
>>>
>>> Tim