[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-03-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394550#comment-16394550
 ] 

kevin yu commented on SPARK-19737:
--

[~LANDAIS Christophe], I submitted a PR under SPARK-23486. Can you try it and 
see if it helps?

 

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Major
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time to 
> realize that {{foo}} is not registered.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered function
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch, both to avoid 
> running it repeatedly and to make sure it gets executed before 
> {{ResolveRelations}}.
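
The proposal quoted above can be sketched roughly as follows. This is a toy model in Python, not Spark's Catalyst code: the classes Expr, UnresolvedAttribute and UnresolvedFunction, the AnalysisError type, the batch labels and the analyze driver are all hypothetical stand-ins chosen for illustration. It only shows the idea that a name-only lookup can fail before any relation is resolved.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


class AnalysisError(Exception):
    """Raised when a query references a function that is not registered."""


@dataclass
class Expr:
    """Toy expression node (hypothetical; Catalyst expressions are far richer)."""
    children: List["Expr"] = field(default_factory=list)


@dataclass
class UnresolvedAttribute(Expr):
    name: str = ""


@dataclass
class UnresolvedFunction(Expr):
    name: str = ""


def lookup_functions(expr: Expr, registry: Set[str]) -> None:
    """Check function names only; never resolves children or relations."""
    if isinstance(expr, UnresolvedFunction) and expr.name not in registry:
        raise AnalysisError(f"Undefined function: '{expr.name}'")
    for child in expr.children:
        lookup_functions(child, registry)


def resolve_relations(expr: Expr, registry: Set[str]) -> None:
    """Placeholder for the expensive step (partition discovery in the issue)."""


Rule = Callable[[Expr, Set[str]], None]

# Batch order mirrors the proposal: the lookup batch runs once, before resolution.
BATCHES: List[Tuple[str, List[Rule]]] = [
    ("Lookup functions (run once)", [lookup_functions]),
    ("Resolution", [resolve_relations]),
]


def analyze(expr: Expr, registry: Set[str], trace: List[str]) -> None:
    """Run each batch in order, recording which rules were started."""
    for _batch_name, rules in BATCHES:
        for rule in rules:
            trace.append(rule.__name__)
            rule(expr, registry)
```

With a registry that lacks foo, analyzing the toy equivalent of SELECT foo(a) FROM t raises AnalysisError while the trace contains only lookup_functions, so the placeholder for relation resolution is never reached.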



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372289#comment-16372289
 ] 

Cheng Lian commented on SPARK-19737:


[~LANDAIS Christophe], I filed SPARK-23486 for this. It should be relatively 
straightforward to fix, and I'd like a new contributor to try it as a 
starter task. Thanks for reporting!







[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-02-21 Thread LANDAIS Christophe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371377#comment-16371377
 ] 

LANDAIS Christophe commented on SPARK-19737:


Hello,

While migrating our application from Spark 2.1.1 to Spark 2.2.1, we see a major 
degradation in Spark SQL timing. One insert takes 5 seconds in 2.1.1 and 75 
seconds in 2.2.1. Looking at the executor traces (I forced the configuration to 
one executor), we see that the time is spent between the spark.sql("insert into") 
call and the task being submitted to the executor.

My application traces:

2018-02-21 06:30:53 - Executor[1] Going to execute request …

2018-02-21 06:32:08 - Executor[1] request executed (tag: NO_TAG) (table: ca4mn.sys_4g_pcmd_mme_15min) (date: 20180221061500) - duration (s)  74.846

Executor trace:

18/02/21 06:30:52 INFO Executor: Finished task 0.0 in stage 3.0 (TID 1). 4675 bytes result sent to driver  (landais note: this is the previous task, which has finished)

18/02/21 06:32:06 INFO CoarseGrainedExecutorBackend: Got assigned task 2

What is Spark doing between 06:30:53 and 06:32:06? I took several thread dumps 
in the container while execution was in progress, with a delay of 2 seconds 
between dumps. They are identical. The thread dump is at the end of this 
comment.

The thread dump shows that the time is spent verifying that functions exist: 
this is the SPARK-19737 modification.

My SQL request contains 1000 functions because we are aggregating on many 
columns. The functions are MAX, MIN, etc.

Could you please make a change that improves this check? For example, checking 
each distinct function only once? Or introducing a Spark parameter to bypass 
the check?
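
The "one check for each different function" idea could look roughly like the sketch below. It is written in Python for illustration and is not taken from Spark or from any patch; find_unregistered and the function_exists callback are hypothetical names, and the callback stands in for whatever lookup is slow (in the thread dump below, a metastore round trip).

```python
from typing import Callable, Iterable, List


def find_unregistered(names: Iterable[str],
                      function_exists: Callable[[str], bool]) -> List[str]:
    """Return unregistered names, calling function_exists once per distinct name."""
    seen = set()
    missing = []
    for name in names:
        if name in seen:
            continue  # already checked; skip the expensive lookup
        seen.add(name)
        if not function_exists(name):
            missing.append(name)
    return missing


# Usage sketch: 1000 invocations but only two distinct function names.
lookups: List[str] = []


def slow_function_exists(name: str) -> bool:
    # Hypothetical stand-in for a costly catalog lookup; records each call.
    lookups.append(name)
    return name in {"max", "min"}


missing = find_unregistered(["max", "min"] * 500, slow_function_exists)
```

In this sketch the slow lookup runs twice rather than 1000 times, because repeated names are skipped before the callback is invoked.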



Thread dump

"Executor[1]" #95 prio=5 os_prio=0 tid=0x7f587f355800 nid=0x7c runnable [0x7f57549f7000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x8913b110> (a java.io.BufferedInputStream)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:654)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:641)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:1158)
        at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
        at com.sun.proxy.$Proxy31.getDatabase(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1301)
        at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1290)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply$mcZ$sp(HiveClientImpl.scala:358)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:358)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:358)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
        at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
        at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
        - locked <0x8900dd88> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
        at

[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896127#comment-15896127
 ] 

Apache Spark commented on SPARK-19737:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17168



