Hi all,
Just wondering what the actual logic governing DataFrame.dropDuplicates()
is? For example:

>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=10, height=80, itemsInPocket=['pen', 'pencil'])]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+--------------------+
|age|height| name|       itemsInPocket|
+---+------+-----+--------------------+
|  5|    80|Alice|[pen, pencil, paper]|
| 10|    80|Alice|       [pen, pencil]|
+---+------+-----+--------------------+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+--------------------+
|age|height| name|       itemsInPocket|
+---+------+-----+--------------------+
|  5|    80|Alice|[pen, pencil, paper]|
+---+------+-----+--------------------+
What determines which row is kept and which is dropped? Is it the first to
appear, or is it arbitrary?

I would like to guarantee that the row with the longest itemsInPocket list
is kept. How can I do that?

Thanks,

James
