[ https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
shengyao piao resolved SPARK-21581.
-----------------------------------
    Resolution: Not A Problem

> Spark 2.x distinct return incorrect result
> ------------------------------------------
>
>                 Key: SPARK-21581
>                 URL: https://issues.apache.org/jira/browse/SPARK-21581
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: shengyao piao
>
> Hi all,
> I'm using Spark 2.x on CDH 5.11.
> I have a JSON file as follows.
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
> {code}
> I then read this file and call distinct on it.
> ・Spark code
> {code}
> val s = spark.read.json("sample.json")
> s.count
> res13: Long = 10
> s.distinct.count
> res14: Long = 6  <- It should be 10
> {code}
> I know the incorrect result is caused by the mixed types in the salary field.
> But when I run the same code in Spark 1.6, the result is 10.
> So I think it's a bug in Spark 2.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
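
The mixed types in the salary field that the report mentions can be confirmed without Spark. The following is a minimal sketch in plain Python (standard library only), not part of the original report: it scans JSON-lines text and lists every field whose values do not all share one type. The names `SAMPLE`, `field_types` and `mixed_fields` are hypothetical, and `SAMPLE` is a three-record subset of sample.json. It says nothing about how Spark infers the schema or why distinct returned 6; it only surfaces the input inconsistency.

```python
import json
from collections import defaultdict

# Three records taken from sample.json in the report (subset, for brevity).
SAMPLE = """\
{"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
{"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
{"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
"""


def field_types(json_lines):
    """Map each field name to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


def mixed_fields(json_lines):
    """Return only the fields whose values have more than one type."""
    return {
        key: sorted(types)
        for key, types in field_types(json_lines).items()
        if len(types) > 1
    }


if __name__ == "__main__":
    print(mixed_fields(SAMPLE))
```

Running a check like this before loading the file makes it easier to decide whether to clean the data (for example, replacing the empty strings with null) before handing it to Spark.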