Hi all,

Just wondering what the actual logic governing DataFrame.dropDuplicates() is. For example:
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=10, height=80, itemsInPocket=['pen', 'pencil'])]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+--------------------+
|age|height| name|       itemsInPocket|
+---+------+-----+--------------------+
|  5|    80|Alice|[pen, pencil, paper]|
| 10|    80|Alice|       [pen, pencil]|
+---+------+-----+--------------------+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+--------------------+
|age|height| name|       itemsInPocket|
+---+------+-----+--------------------+
|  5|    80|Alice|[pen, pencil, paper]|
+---+------+-----+--------------------+

What determines which row is kept and which is deleted? The first to appear? Or is it random? I would like to guarantee that the row with the longest itemsInPocket list is kept. How can I do that?

Thanks,
James
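In case it clarifies what I'm after, here is a plain-Python sketch (not Spark, and keep_longest_pocket is just a hypothetical helper name) of the deduplication semantics I want: for each (name, height) key, keep the row whose itemsInPocket list is longest.

```python
def keep_longest_pocket(rows, keys=('name', 'height')):
    """For each distinct key tuple, keep the row whose
    itemsInPocket list is the longest seen so far."""
    best = {}  # key tuple -> best row dict so far
    for row in rows:
        k = tuple(row[key] for key in keys)
        if k not in best or len(row['itemsInPocket']) > len(best[k]['itemsInPocket']):
            best[k] = row
    return list(best.values())

rows = [
    {'name': 'Alice', 'age': 5, 'height': 80, 'itemsInPocket': ['pen', 'pencil', 'paper']},
    {'name': 'Alice', 'age': 5, 'height': 80, 'itemsInPocket': ['pen', 'pencil', 'paper']},
    {'name': 'Alice', 'age': 10, 'height': 80, 'itemsInPocket': ['pen', 'pencil']},
]
# All three rows share (name='Alice', height=80), so only the row
# with the 3-item list survives.
print(keep_longest_pocket(rows))
```

Is there an idiomatic way to express this on a DataFrame, rather than collecting and doing it by hand like this?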