Indeed, I tried persist with MEMORY_AND_DISK and it works! (I'm wary of
MEMORY_ONLY for this, as it could recompute any partitions that don't fit
in memory, and the recomputed IDs might not match.)
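
For anyone finding this thread later, here is a rough illustration of why
recomputation can change the IDs. The Spark docs say the current implementation
of monotonically_increasing_id puts the partition ID in the upper 31 bits and
the record number within each partition in the lower 33 bits. The snippet below
is plain Python, not Spark, and assign_ids is a hypothetical helper that only
mimics that layout. It assumes a recomputed plan may place rows in partitions
differently, which is the behavior suspected earlier in this thread.

```python
# Plain-Python sketch (not Spark itself) of the documented ID layout of
# monotonically_increasing_id: partition index in the upper 31 bits,
# per-partition record number in the lower 33 bits.

def assign_ids(partitions):
    """Return {row: id} for a list of partitions (each a list of rows)."""
    ids = {}
    for partition_index, rows in enumerate(partitions):
        for record_number, row in enumerate(rows):
            ids[row] = (partition_index << 33) + record_number
    return ids

# The same four rows, split into partitions in two different ways, as might
# happen if the plan is recomputed rather than read from a persisted copy.
first_run = assign_ids([["a", "b"], ["c", "d"]])
second_run = assign_ids([["a", "c"], ["b", "d"]])

print(first_run)
print(second_run)
```

Both runs produce the same set of IDs, but rows "b" and "c" swap values.
Persisting the table after adding the ID column avoids this because derived
tables then read the stored IDs rather than regenerating them.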

Thanks for the help, everybody!!

On Sat, Apr 8, 2017 at 11:54 AM, Everett Anderson <ever...@nuna.com> wrote:

>
>
> On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram <subhash.sri...@gmail.com>
> wrote:
>
>> Hi,
>>
>> We use monotonically_increasing_id() as well, but just cache the table
>> first like Ankur suggested. With that method, we get the same keys in all
>> derived tables.
>>
>
> Ah, okay, awesome. Let me give that a go.
>
>
>
>>
>> Thanks,
>> Subhash
>>
>> On Apr 7, 2017, at 7:32 PM, Everett Anderson <ever...@nuna.com.INVALID>
>> wrote:
>>
>> Hi,
>>
>> Thanks, but that's using a random UUID. Certainly unlikely to have
>> collisions, but not guaranteed.
>>
>> I'd prefer something like monotonically_increasing_id or RDD's
>> zipWithUniqueId but with better behavioral characteristics -- so they don't
>> surprise people when 2+ outputs derived from an original table end up not
>> having the same IDs for the same rows.
>>
>> It seems like this would be possible under the covers, but would have the
>> performance penalty of needing to do perhaps a count() and then also a
>> checkpoint.
>>
>> I was hoping there's a better way.
>>
>>
>> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <secs...@gmail.com> wrote:
>>
>>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>>
>>>
>>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <
>>> ever...@nuna.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> What's the best way to assign a truly unique row ID (rather than a
>>>> hash) to a DataFrame/Dataset?
>>>>
>>>> I originally thought that functions.monotonically_increasing_id would
>>>> do this, but it seems to have a rather unfortunate property that if you add
>>>> it as a column to table A and then derive tables X, Y, Z and save those,
>>>> the row ID values in X, Y, and Z may end up different. I assume this is
>>>> because it delays the actual computation to the point where each of those
>>>> tables is computed.
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>>
>>> Tim
>>>
>>
>>
>
