Maurin,
I don't know the technical reason why, but try removing the 'limit 100'
part of your query. I was trying to do something similar the other week,
and what I found is that each executor doesn't necessarily get the same
100 rows. Joins would fail or return a bunch of nulls when keys from one
side's sample weren't present in the other's.
Guillermo,
I think you're after an associative algorithm where A is ultimately
associated with D, correct? Jakob would be correct if that is a typo -- a
sort would be all that's necessary in that case.
I believe you're looking for something else though, if I understand
correctly.
This seems like a
Hi all,
Just wondering what the actual logic governing DataFrame.dropDuplicates()
is? For example:
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen',
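As I understand it, the behavior can be modeled in plain Python like this (a sketch of the semantics, not Spark itself): one row is kept per distinct value of the compared columns, and with no subset given, all columns are compared. Which duplicate survives is not guaranteed in Spark since rows are spread across partitions; this model keeps the first seen.

```python
def drop_duplicates(rows, subset=None):
    """Keep one row (here: the first seen) per distinct key, where the
    key is the chosen subset of columns, or all columns if subset is None."""
    seen = set()
    out = []
    for row in rows:
        if subset is None:
            key = tuple(sorted(row.items()))
        else:
            key = tuple(row.get(c) for c in subset)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {'name': 'Alice', 'age': 5, 'height': 80},
    {'name': 'Alice', 'age': 5, 'height': 80},   # exact duplicate: dropped
    {'name': 'Alice', 'age': 10, 'height': 80},  # differs on age: kept
]
print(len(drop_duplicates(rows)))                   # 2 rows survive
print(len(drop_duplicates(rows, subset=['name'])))  # 1 row survives
```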
Hi Arunkumar,
From the Scala documentation, it's recommended to use the agg function
for performing any actual statistics programmatically on your data.
df.describe() is meant only for data exploration.
See Aggregator here:
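For reference, the stddev figure describe() shows is (as far as I know) the sample standard deviation. If you just want to sanity-check a number from describe() against your own computation, here's that statistic in plain Python (no Spark needed, sample data made up):

```python
import math

def sample_stddev(values):
    """Sample standard deviation (n - 1 denominator), the statistic
    reported by DataFrame.describe() for numeric columns."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

heights = [80.0, 80.0, 95.0]
print(sample_stddev(heights))  # sqrt(75), about 8.66
```

For anything you feed into further computation, though, agg gives you typed values directly instead of the formatted summary describe() is meant for.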