Indeed, I tried persisting with MEMORY_AND_DISK and it works! (I'm wary of
MEMORY_ONLY for this, since it could potentially recompute partitions if the
table couldn't be cached entirely in memory.)
Thanks for the help, everybody!!
On Sat, Apr 8, 2017 at 11:54 AM, Everett Anderson wrote:
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram
wrote:
> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.
>
Ah, okay, awesome. Let me give that a try!
Hi,
We use monotonically_increasing_id() as well, but just cache the table first
like Ankur suggested. With that method, we get the same keys in all derived
tables.
Thanks,
Subhash
Sent from my iPhone
> On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote:
Hi,
Thanks, but that's using a random UUID. Collisions are certainly unlikely,
but uniqueness isn't guaranteed.
I'd prefer something like monotonically_increasing_id or RDD's
zipWithUniqueId, but with better behavioral characteristics -- so they don't
surprise people when 2+ outputs derived from the same input end up with
different IDs.
You can use zipWithIndex, or the approach Tim suggested, or even the one you
are using, but I believe the issue is that tableA is being materialized
every time you run the new transformations. Are you caching/persisting
table A? If you do that, you should not see this behavior.
Thanks
Ankur
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson
wrote:
Hi,
What's the best way to assign a truly unique row ID (rather than a hash) to
a DataFrame/Dataset?
I originally thought that functions.monotonically_increasing_id would do
this, but it seems to have a rather unfortunate property: if you add it as a
column to table A and then derive tables from A, the derived tables can end
up with different ID values, because A may be recomputed for each one.