[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022969#comment-17022969 ] Rahul Kumar Challapalli commented on SPARK-30218:
--------------------------------------------------

[~dongjoon] I am not sure, but I was pointing out what the OP was asking. Since we don't disambiguate the columns in this case, should we keep this issue open?

> Columns used in inequality conditions for joins not resolved correctly in case of common lineage
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30218
>                 URL: https://issues.apache.org/jira/browse/SPARK-30218
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: Francesco Cavrini
>            Priority: Major
>              Labels: correctness
>
> When columns from different DataFrames that share a common lineage are used in inequality join conditions, they are not resolved correctly. In particular, both the column from the left DF and the one from the right DF are resolved to the same column, making the inequality condition either always satisfied or never satisfied.
> A minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
>
> data = spark.createDataFrame(
>     [["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 2], ["id2", "A", 3],
>      ["id1", "B", 1], ["id1", "B", 5], ["id2", "B", 10]],
>     ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id  | kind | timestamp | id  | kind | timestamp |
> | id1 | A    | 0         | id1 | B    | 1         |
> | id1 | A    | 0         | id1 | B    | 5         |
> | id1 | A    | 1         | id1 | B    | 1         |
> | id1 | A    | 1         | id1 | B    | 5         |
> | id2 | A    | 2         | id2 | B    | 10        |
> | id2 | A    | 3         | id2 | B    | 10        |
> which violates the condition that the timestamp from the right DF should be between df_left["timestamp"] and df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution: both sides of each inequality resolve to the left DF's timestamp#2L.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && (timestamp#2L <= (timestamp#2L + cast(2 as bigint)))))
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> :     +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>    +- Filter (kind#37 = B)
>       +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note that the columns used in the equality condition of the join (id#0 = id#36) have been resolved correctly.
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022596#comment-17022596 ] Rahul Kumar Challapalli commented on SPARK-30218:
--------------------------------------------------

We are currently detecting that there is a self-join, but the OP seems to be asking why Spark doesn't disambiguate the columns. So I am not sure we can close this issue. Thoughts?

> Columns used in inequality conditions for joins not resolved correctly in case of common lineage
> -------------------------------------------------------------------------------------------------
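A commonly suggested workaround for this class of self-join ambiguity (a sketch against the reproduction above, assuming the same `spark` session; it only changes how the conditions are written, not anything in Spark itself) is to build the join conditions from alias-qualified F.col references instead of df["column"] lookups, so the analyzer can tell the two lineages apart:

{code:python}
import pyspark.sql.functions as F

data = spark.createDataFrame(
    [["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 2], ["id2", "A", 3],
     ["id1", "B", 1], ["id1", "B", 5], ["id2", "B", 10]],
    ["id", "kind", "timestamp"])

df_left = data.where(F.col("kind") == "A").alias("left")
df_right = data.where(F.col("kind") == "B").alias("right")

# "left.timestamp" and "right.timestamp" are resolved against the join's two
# aliased children, so they map to distinct attributes instead of both
# collapsing onto the left DF's timestamp column.
conds = [F.col("left.id") == F.col("right.id"),
         F.col("right.timestamp").between(F.col("left.timestamp"),
                                          F.col("left.timestamp") + 2)]
res = df_left.join(df_right, conds, how="left")
{code}

An equivalent alternative is to rename the right-hand columns with withColumnRenamed before the join, so that no two input columns share a name in the first place.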
[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018828#comment-17018828 ] Rahul Kumar Challapalli commented on SPARK-30332:
--------------------------------------------------

If you cannot narrow down the problem, would you be able to provide the dataset and logs?

> When running sql query with limit catalyst throw StackOverFlow exception
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30332
>                 URL: https://issues.apache.org/jira/browse/SPARK-30332
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: spark version 3.0.0-preview
>            Reporter: Izek Greenfield
>            Priority: Major
>
> Running that SQL:
> {code:sql}
> SELECT BT_capital.asof_date, BT_capital.run_id, BT_capital.v, BT_capital.id, BT_capital.entity,
>        BT_capital.level_1, BT_capital.level_2, BT_capital.level_3, BT_capital.level_4, BT_capital.level_5,
>        BT_capital.level_6, BT_capital.path_bt_capital, BT_capital.line_item, t0.target_line_item,
>        t0.line_description, BT_capital.col_item, BT_capital.rep_amount,
>        root.orgUnitId, root.cptyId, root.instId, root.startDate, root.maturityDate, root.amount,
>        root.nominalAmount, root.quantity, root.lkupAssetLiability, root.lkupCurrency, root.lkupProdType,
>        root.interestResetDate, root.interestResetTerm, root.noticePeriod, root.historicCostAmount, root.dueDate,
>        root.lkupResidence, root.lkupCountryOfUltimateRisk, root.lkupSector, root.lkupIndustry,
>        root.lkupAccountingPortfolioType, root.lkupLoanDepositTerm, root.lkupFixedFloating, root.lkupCollateralType,
>        root.lkupRiskType, root.lkupEligibleRefinancing, root.lkupHedging, root.lkupIsOwnIssued,
>        root.lkupIsSubordinated, root.lkupIsQuoted, root.lkupIsSecuritised, root.lkupIsSecuritisedServiced,
>        root.lkupIsSyndicated, root.lkupIsDeRecognised, root.lkupIsRenegotiated, root.lkupIsTransferable,
>        root.lkupIsNewBusiness, root.lkupIsFiduciary, root.lkupIsNonPerforming, root.lkupIsInterGroup,
>        root.lkupIsIntraGroup, root.lkupIsRediscounted, root.lkupIsCollateral, root.lkupIsExercised,
>        root.lkupIsImpaired, root.facilityId, root.lkupIsOTC, root.lkupIsDefaulted, root.lkupIsSavingsPosition,
>        root.lkupIsForborne, root.lkupIsDebtRestructuringLoan, root.interestRateAAR, root.interestRateAPRC,
>        root.custom1, root.custom2, root.custom3, root.lkupSecuritisationType, root.lkupIsCashPooling,
>        root.lkupIsEquityParticipationGTE10, root.lkupIsConvertible, root.lkupEconomicHedge,
>        root.lkupIsNonCurrHeldForSale, root.lkupIsEmbeddedDerivative, root.lkupLoanPurpose, root.lkupRegulated,
>        root.lkupRepaymentType, root.glAccount, root.lkupIsRecourse, root.lkupIsNotFullyGuaranteed,
>        root.lkupImpairmentStage, root.lkupIsEntireAmountWrittenOff, root.lkupIsLowCreditRisk,
>        root.lkupIsOBSWithinIFRS9, root.lkupIsUnderSpecialSurveillance, root.lkupProtection,
>        root.lkupIsGeneralAllowance, root.lkupSectorUltimateRisk, root.cptyOrgUnitId, root.name,
>        root.lkupNationality, root.lkupSize, root.lkupIsSPV, root.lkupIsCentralCounterparty, root.lkupIsMMRMFI,
>        root.lkupIsKeyManagement, root.lkupIsOtherRelatedParty, root.lkupResidenceProvince, root.lkupIsTradingBook,
>        root.entityHierarchy_entityId, root.entityHierarchy_Residence, root.lkupLocalCurrency,
>        root.cpty_entityhierarchy_entityId, root.lkupRelationship, root.cpty_lkupRelationship,
>        root.entityNationality, root.lkupRepCurrency, root.startDateFinancialYear, root.numEmployees,
>        root.numEmployeesTotal, root.collateralAmount, root.guaranteeAmount, root.impairmentSpecificIndividual,
>        root.impairmentSpecificCollective, root.impairmentGeneral, root.creditRiskAmount,
>        root.provisionSpecificIndividual, root.provisionSpecificCollective, root.provisionGeneral,
>        root.writeOffAmount, root.interest, root.fairValueAmount, root.grossCarryingAmount, root.carryingAmount,
>        root.code, root.lkupInstrumentType, root.price, root.amountAtIssue, root.yield, root.totalFacilityAmount,
>        root.facility_rate, root.spec_indiv_est, root.spec_coll_est, root.coll_inc_loss, root.impairment_amount,
>        root.provision_amount, root.accumulated_impairment, root.exclusionFlag, root.lkupIsHoldingCompany,
>        root.instrument_startDate, root.entityResidence,
>        fxRate.enumerator, fxRate.lkupFromCurrency, fxRate.rate, fxRate.custom1, fxRate.custom2, fxRate.custom3,
>        GB_position.lkupIsECGDGuaranteed, GB_position.lkupIsMultiAcctOffsetMortgage, GB_position.lkupIsIndexLinked,
>        GB_position.lkupIsRetail, GB_position.lkupCollateralLocation, GB_position.percentAboveBBR,
>        GB_position.lkupIsMoreInArrears, GB_position.lkupIsArrearsCapitalised, GB_position.lkupColla
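Without the dataset, one way to sanity-check whether sheer plan depth (rather than the data itself) is what blows the stack is a synthetic query with a deeply nested expression tree. The sketch below is hypothetical, it does not use the reporter's schema, and the depth constant is just a knob to tune:

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Build a deeply nested expression: each iteration wraps one more Add node,
# so Catalyst's recursive tree traversals go one frame deeper.
expr = F.col("id")
for _ in range(10000):  # depth needed to overflow is JVM/stack-size dependent
    expr = expr + 1

# Analyzing/optimizing this plan may throw java.lang.StackOverflowError
# with the default JVM thread stack size.
spark.range(100).select(expr.alias("deep")).limit(10).explain()
{code}

If that reproduces it, the usual mitigation (not a fix) is a larger driver stack, e.g. spark-submit --driver-java-options -Xss16m; note that setting spark.driver.extraJavaOptions from inside an already-running driver has no effect in client mode, because the driver JVM has already started.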