[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
    [ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394550#comment-16394550 ]

kevin yu commented on SPARK-19737:
----------------------------------

[~LANDAIS Christophe], I submitted a PR under SPARK-23486. Can you try it and see if it helps?

> New analysis rule for reporting unregistered functions without relying on relation resolution
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19737
>                 URL: https://issues.apache.org/jira/browse/SPARK-19737
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Major
>             Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large
> number of files stored on S3, it may take the analyzer a long time before
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires
> all child expressions to be resolved first. Therefore, {{ResolveRelations}}
> has to be executed first to resolve all columns referenced by the unresolved
> function invocation. This further leads to partition discovery for {{t}},
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits
> between the "Substitution" batch and the "Resolution" batch to avoid running
> it repeatedly and to make sure it gets executed before {{ResolveRelations}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
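
The three steps proposed for {{LookupFunctions}} in the description above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's Catalyst code: `UnresolvedFunction`, `AnalysisError`, and `lookup_functions` are hypothetical names, the registry is modelled as a plain set of lower-cased names, and column arguments are left out because the rule never needs to resolve them.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class UnresolvedFunction:
    """Toy stand-in for an unresolved function call such as foo(a).

    Only nested function calls are kept as children; column references
    are omitted because this check never inspects them.
    """
    name: str
    children: List["UnresolvedFunction"] = field(default_factory=list)


class AnalysisError(Exception):
    """Hypothetical error type raised for an unregistered function."""


def lookup_functions(exprs: Iterable[UnresolvedFunction], registry: Set[str]) -> None:
    """Fail fast on any function name missing from the registry.

    Nothing is resolved here, so no relation (and no partition
    discovery) is ever touched.
    """
    stack = list(exprs)
    while stack:
        expr = stack.pop()                      # step 1: match an unresolved call
        if expr.name.lower() not in registry:   # step 2: look the name up
            raise AnalysisError(f"Undefined function: {expr.name}")  # step 3: report
        stack.extend(expr.children)
```

With a registry of `{"max", "min"}`, a call tree for `max(min(a))` passes silently, while `foo(a)` raises `AnalysisError` immediately, without table {{t}} ever being looked at.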
[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
    [ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372289#comment-16372289 ]

Cheng Lian commented on SPARK-19737:
------------------------------------

[~LANDAIS Christophe], I filed SPARK-23486 for this. It should be relatively straightforward to fix, and I'd like to have a new contributor try it as a starter task. Thanks for reporting!
[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
    [ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371377#comment-16371377 ]

LANDAIS Christophe commented on SPARK-19737:
--------------------------------------------

Hello,

After migrating our application from Spark 2.1.1 to Spark 2.2.1, we see a major degradation in Spark SQL timing. One insert takes 5 seconds in 2.1.1 and 75 seconds in 2.2.1.

Looking at the executor traces (I forced the configuration to one executor), we see that the time is spent between the moment spark.sql("insert into") is called and the moment the task is submitted to the executor.

My application traces:
2018-02-21 06:30:53 - Executor[1] Going to execute request …
2018-02-21 06:32:08 - Executor[1] request executed (tag: NO_TAG) (table: ca4mn.sys_4g_pcmd_mme_15min) (date: 20180221061500) - duration (s) 74.846

Executor trace:
18/02/21 06:30:52 INFO Executor: Finished task 0.0 in stage 3.0 (TID 1). 4675 bytes result sent to driver (landais note: this is the previous task, which has terminated)
18/02/21 06:32:06 INFO CoarseGrainedExecutorBackend: Got assigned task 2

What is Spark doing between 06:30:53 and 06:32:06?

I took several thread dumps in the container while execution was in progress, 2 seconds apart. They are identical. The thread dump is at the end of this comment.

The thread dump shows that the time is spent verifying that functions exist: this is the SPARK-19737 modification. My SQL request contains 1000 function calls because we aggregate over many columns. The functions are MAX, MIN, etc.

Could you please change this check to make it faster? For example, by performing only one check for each distinct function? Or by introducing a Spark parameter to bypass this check?
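
The suggestion of one check per distinct function can be sketched as follows. This is a minimal illustration, not the change made under SPARK-23486: `check_functions_once` and the `exists` callback are hypothetical names, `exists` stands in for whatever expensive lookup (for example a metastore round trip) the real check performs, and folding names to lower case is an assumption.

```python
from typing import Callable, Dict, Iterable, Set


def check_functions_once(names: Iterable[str],
                         exists: Callable[[str], bool]) -> Set[str]:
    """Return the unknown function names, calling `exists` at most once
    per distinct (lower-cased) name, however often the name is used."""
    checked: Dict[str, bool] = {}
    for name in names:
        key = name.lower()
        if key not in checked:      # only the first occurrence pays for a lookup
            checked[key] = exists(key)
    return {key for key, known in checked.items() if not known}
```

For a query with 1000 calls spread over a handful of aggregate functions such as MAX and MIN, this reduces the number of lookups from 1000 to the number of distinct function names.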
Thread dump

"Executor[1]" #95 prio=5 os_prio=0 tid=0x7f587f355800 nid=0x7c runnable [0x7f57549f7000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	- locked <0x8913b110> (a java.io.BufferedInputStream)
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:654)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:641)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:1158)
	at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
	at com.sun.proxy.$Proxy31.getDatabase(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1301)
	at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1290)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply$mcZ$sp(HiveClientImpl.scala:358)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:358)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:358)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
	- locked <0x8900dd88> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
	at
[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
    [ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896127#comment-15896127 ]

Apache Spark commented on SPARK-19737:
--------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17168

> New analysis rule for reporting unregistered functions without relying on relation resolution
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19737
>                 URL: https://issues.apache.org/jira/browse/SPARK-19737
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Cheng Lian
>             Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large
> number of files stored on S3, it may take the analyzer a long time
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires
> all child expressions to be resolved first. Therefore, {{ResolveRelations}}
> has to be executed first to resolve all columns referenced by the unresolved
> function invocation. This further leads to partition discovery for {{t}},
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits
> between the "Substitution" batch and the "Resolution" batch to avoid running
> it repeatedly and to make sure it gets executed before {{ResolveRelations}}.