BsoBird opened a new issue, #8624:
URL: https://github.com/apache/iceberg/issues/8624
### Apache Iceberg version
1.3.1 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
I have found that when I configure multiple catalogs in the same SparkSession, Iceberg does not work in some cases.

Example configuration:
```
# iceberg catalog
spark.sql.catalog.datacenter org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.datacenter.type hadoop
spark.sql.catalog.datacenter.warehouse /iceberg-catalog/warehouse

# paimon catalog
spark.sql.catalog.paimon org.apache.paimon.spark.SparkCatalog
spark.sql.catalog.paimon.warehouse hdfs:///paimon/warehouse
spark.jars /data/kyuubi/spark_aux_lib/paimon-spark-3.3-0.6-20230922.002014-17.jar
```
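
For reproduction outside Kyuubi, the same catalog setup can be expressed directly on a `SparkSession` builder. This is only a minimal sketch: the `spark.sql.extensions` line is not in the configuration above, but MERGE INTO on Iceberg tables requires the Iceberg SQL extensions, and the plan in the error below indicates they are enabled.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the same two-catalog setup built programmatically.
// Warehouse paths match the configuration above. The Paimon Spark jar listed
// in spark.jars above also has to be available to the driver and executors
// (e.g. via --jars / spark.jars at submit time).
val spark = SparkSession.builder()
  .appName("multi-catalog-merge-into")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.datacenter", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.datacenter.type", "hadoop")
  .config("spark.sql.catalog.datacenter.warehouse", "/iceberg-catalog/warehouse")
  .config("spark.sql.catalog.paimon", "org.apache.paimon.spark.SparkCatalog")
  .config("spark.sql.catalog.paimon.warehouse", "hdfs:///paimon/warehouse")
  .getOrCreate()
```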
When I execute a MERGE INTO statement in the Iceberg catalog, it fails during planning.
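For reference, the statement is roughly of the shape sketched below. This is not the exact statement; it is reconstructed from the failed logical plan in the stack trace (the target and source tables, the join keys, and the rank-based deduplication are taken from the plan, but the exact column list and clauses may differ):

```scala
// Sketch only: reconstructed from the logical plan in the error below.
spark.sql("""
  MERGE INTO datacenter.dwd.b_std_category t
  USING (
    SELECT data_from, partner, plat_code,
           shop_id    AS uni_shop_id,
           cid        AS category_id,
           parent_cid AS parent_category_id,
           name       AS category_name,
           root_cid, is_leaf, tenant,
           modified   AS last_sync
    FROM (
      SELECT *,
             row_number() OVER (PARTITION BY shop_id, cid ORDER BY modified DESC) AS rank
      FROM dw_base_temp.category_analyse_result
    ) deduped
    WHERE rank = 1
  ) s
  ON t.uni_shop_id = s.uni_shop_id AND t.category_id = s.category_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```

The error message is as follows: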
```
org.apache.spark.SparkException: The Spark SQL phase planning failed with an internal error. Please, fill a bug report in, and provide the full stack trace.
    at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:500)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:184)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:145)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:138)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:158)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:185)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:184)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:158)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:151)
    at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:204)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:249)
    at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:218)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
    at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:83)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.kyuubi.engine.spark.operation.SparkOperation.$anonfun$withLocalProperties$1(SparkOperation.scala:155)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
    at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:139)
    at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.executeStatement(ExecuteStatement.scala:78)
    at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$1.run(ExecuteStatement.scala:100)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.AssertionError: assertion failed: No plan for ReplaceIcebergData RelationV2[data_from#13985, partner#13986, plat_code#13987, uni_shop_id#13988, category_id#13989, parent_category_id#13990, category_name#13991, root_cid#13992, is_leaf#13993, tenant#13994, last_sync#13995] datacenter.dwd.b_std_category, IcebergWrite(table=datacenter.dwd.b_std_category, format=ORC)
+- Sort [icebergbuckettransform(64, uni_shop_id#14019) ASC NULLS FIRST], false
   +- RepartitionByExpression [icebergbuckettransform(64, uni_shop_id#14019)], 12288
      +- Project [data_from#14016, partner#14017, plat_code#14018, uni_shop_id#14019, category_id#14020, parent_category_id#14021, category_name#14022, root_cid#14023, is_leaf#14024, tenant#14025, last_sync#14026]
         +- MergeRows[data_from#14016, partner#14017, plat_code#14018, uni_shop_id#14019, category_id#14020, parent_category_id#14021, category_name#14022, root_cid#14023, is_leaf#14024, tenant#14025, last_sync#14026, _file#14027]
            +- Join FullOuter, ((uni_shop_id#13979 = uni_shop_id#13988) AND (category_id#13980 = category_id#13989)), leftHint=(strategy=no_broadcast_hash)
               :- NoStatsUnaryNode
               :  +- Project [data_from#13985, partner#13986, plat_code#13987, uni_shop_id#13988, category_id#13989, parent_category_id#13990, category_name#13991, root_cid#13992, is_leaf#13993, tenant#13994, last_sync#13995, _file#14010, true AS __row_from_target#14013, monotonically_increasing_id() AS __row_id#14014L]
               :     +- Filter dynamicpruning#14075 [_file#14010]
               :        :  +- Project [_file#14074]
               :        :     +- Join LeftSemi, ((uni_shop_id#13979 = uni_shop_id#14066) AND (category_id#13980 = category_id#14067))
               :        :        :- Filter isnotnull(category_id#14067)
               :        :        :  +- RelationV2[uni_shop_id#14066, category_id#14067, _file#14074] datacenter.dwd.b_std_category
               :        :        +- Project [uni_shop_id#13979, category_id#13980]
               :        :           +- Filter ((rank#13984 = 1) AND (isnotnull(uni_shop_id#13979) AND isnotnull(category_id#13980)))
               :        :              +- Window [row_number() windowspecdefinition(shop_id#14000, cid#14001, modified#14004 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#13984], [shop_id#14000, cid#14001], [modified#14004 DESC NULLS LAST]
               :        :                 +- Project [shop_id#14000 AS uni_shop_id#13979, cid#14001 AS category_id#13980, shop_id#14000, cid#14001, modified#14004]
               :        :                    +- HiveTableRelation [`dw_base_temp`.`category_analyse_result`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [data_from#13997, partner#13998, plat_code#13999, shop_id#14000, cid#14001, parent_cid#14002, nam..., Partition Cols: []]
               :        +- RelationV2[data_from#13985, partner#13986, plat_code#13987, uni_shop_id#13988, category_id#13989, parent_category_id#13990, category_name#13991, root_cid#13992, is_leaf#13993, tenant#13994, last_sync#13995, _file#14010] datacenter.dwd.b_std_category
               +- Project [data_from#14028, partner#14029, plat_code#14030, uni_shop_id#13979, category_id#13980, parent_category_id#13981, category_name#13982, root_cid#14036, is_leaf#14037, tenant#14038, last_sync#13983, true AS __row_from_source#14015]
                  +- Filter (rank#13984 = 1)
                     +- Window [row_number() windowspecdefinition(shop_id#14031, cid#14032, modified#14035 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#13984], [shop_id#14031, cid#14032], [modified#14035 DESC NULLS LAST]
                        +- Project [data_from#14028, partner#14029, plat_code#14030, shop_id#14031 AS uni_shop_id#13979, cid#14032 AS category_id#13980, parent_cid#14033 AS parent_category_id#13981, name#14034 AS category_name#13982, root_cid#14036, is_leaf#14037, tenant#14038, modified#14035 AS last_sync#13983, shop_id#14031, cid#14032, modified#14035]
                           +- HiveTableRelation [`dw_base_temp`.`category_analyse_result`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [data_from#14028, partner#14029, plat_code#14030, shop_id#14031, cid#14032, parent_cid#14033, nam..., Partition Cols: []]
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
    at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
    at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
    at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:75)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
    at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:459)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:145)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:185)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
    ... 55 more
```