[jira] [Commented] (SPARK-26336) left_anti join with Na Values
[ https://issues.apache.org/jira/browse/SPARK-26336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724098#comment-16724098 ]

Marco Gaido commented on SPARK-26336:
-------------------------------------

[~csevilla] the point is always the same, i.e. the presence of {{NULL}} (Python's None is SQL's NULL). {{NULL = NULL}} returns {{NULL}}, not {{true}}. This is how every DB works. You can try it in MySQL, Postgres, or whichever you prefer.

> left_anti join with Na Values
> -----------------------------
>
>                 Key: SPARK-26336
>                 URL: https://issues.apache.org/jira/browse/SPARK-26336
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Carlos
>            Priority: Major
>
> When I join two DataFrames whose data contains NA values, the left_anti
> join does not work as expected, because it does not match records with NA values.
> Example:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import *
> spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
> data = [(1, "Test"), (2, "Test"), (3, None)]
> df1 = spark.createDataFrame(data, ("id", "columndata"))
> df2 = spark.createDataFrame(data, ("id", "columndata"))
> df_joined = df1.join(df2, df1.columns, 'left_anti')
> {code}
> df_joined contains data even though the two DataFrames are the same.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
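
The comparison rule described in the comment above can be checked without a Spark cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in SQL engine (an assumption made here for convenience; the thread itself names only MySQL and Postgres). The function name null_comparisons is hypothetical.

```python
import sqlite3


def null_comparisons():
    """Evaluate a few NULL comparisons in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        # NULL = NULL yields NULL (Python None), not true.
        # 1 = 1 yields 1 (true); NULL IS NULL yields 1 (true).
        return conn.execute(
            "SELECT NULL = NULL, 1 = 1, NULL IS NULL"
        ).fetchone()
    finally:
        conn.close()


result = null_comparisons()
print(result)  # (None, 1, 1)
```

Because a join condition keeps a pair of rows only when the condition evaluates to true, a NULL result from {{=}} means the two rows are treated as not matching.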
[jira] [Commented] (SPARK-26336) left_anti join with Na Values
[ https://issues.apache.org/jira/browse/SPARK-26336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724081#comment-16724081 ]

Carlos commented on SPARK-26336:
--------------------------------

[~mgaido] I think I chose bad objects for my example.

    data1 = {
        'id': 1,
        'name': 'Carlos',
        'surname': 'Sevilla',
        'address': None,
        'Country': 'ESP'
    }

    data2 = {
        'id': 1,
        'name': 'Carlos',
        'surname': 'Sevilla',
        'address': None,
        'Country': 'ESP'
    }

These two variables contain the SAME data. If I do a left_anti join (an inner join does not work either), it should return no results, no rows, because both DataFrames contain exactly the same data.
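
If the intent is for two NULLs to count as equal, the join needs a null-safe comparison rather than plain {{=}}. Spark SQL provides a null-safe equality operator, {{<=>}}, for this purpose; nobody in this thread proposes it, so treat the following as an illustrative sketch only. To keep it runnable without Spark, it uses Python's built-in sqlite3 module, where {{IS}} plays the null-safe role. The table names t1 and t2 and the function name anti_join are hypothetical.

```python
import sqlite3

# Same rows as in the issue description, loaded into both tables.
ROWS = [(1, "Test"), (2, "Test"), (3, None)]
ALLOWED_OPERATORS = {"=", "IS"}


def anti_join(operator):
    """Return rows of t1 with no match in t2 under the given comparison."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    conn = sqlite3.connect(":memory:")
    try:
        for table in ("t1", "t2"):
            conn.execute(f"CREATE TABLE {table} (id INTEGER, columndata TEXT)")
            conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", ROWS)
        query = f"""
            SELECT t1.id, t1.columndata
            FROM t1
            WHERE NOT EXISTS (
                SELECT 1 FROM t2
                WHERE t1.id {operator} t2.id
                  AND t1.columndata {operator} t2.columndata
            )
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


print(anti_join("="))   # [(3, None)]  plain equality: the NULL row never matches
print(anti_join("IS"))  # []           null-safe equality: every row matches
```

With plain equality the row containing NULL survives the anti join, which is the behaviour reported in this issue; with the null-safe comparison the result is empty, which is the behaviour the reporter expected.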
[jira] [Commented] (SPARK-26336) left_anti join with Na Values
[ https://issues.apache.org/jira/browse/SPARK-26336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724052#comment-16724052 ]

Marco Gaido commented on SPARK-26336:
-------------------------------------

That's correct, because NULLs do not match each other. The usual way to implement an anti join in other DBs (e.g. Postgres) is to do a left join and keep only the rows where the column from the right side is NULL. If you do that with your example, 1 row is returned.
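
The left-join-plus-filter formulation mentioned above can be sketched as follows. This is a minimal example that assumes Python's built-in sqlite3 module as the SQL engine instead of Postgres or Spark; the table names t1 and t2 and the function name anti_join_via_left_join are hypothetical, and the rows are the ones from the issue description.

```python
import sqlite3

# Rows taken from the example in the issue description.
ROWS = [(1, "Test"), (2, "Test"), (3, None)]


def anti_join_via_left_join(rows):
    """Emulate an anti join: left join, then keep rows with no right-side match."""
    conn = sqlite3.connect(":memory:")
    try:
        for table in ("t1", "t2"):
            conn.execute(f"CREATE TABLE {table} (id INTEGER, columndata TEXT)")
            conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        query = """
            SELECT t1.id, t1.columndata
            FROM t1
            LEFT JOIN t2
              ON t1.id = t2.id AND t1.columndata = t2.columndata
            WHERE t2.id IS NULL
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


unmatched = anti_join_via_left_join(ROWS)
print(unmatched)  # [(3, None)]
```

Exactly one row comes back, the one whose columndata is NULL, because its join condition evaluates to NULL rather than true and so the left join finds no partner for it.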