[jira] [Created] (SPARK-46091) [KUBERNETES] Respect the existing kubernetes container SPARK_LOCAL_DIRS env

2023-11-24 Thread Fei Wang (Jira)
Fei Wang created SPARK-46091:


 Summary: [KUBERNETES] Respect the existing kubernetes container 
SPARK_LOCAL_DIRS env
 Key: SPARK-46091
 URL: https://issues.apache.org/jira/browse/SPARK-46091
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.5.0
Reporter: Fei Wang


Respect the user-defined SPARK_LOCAL_DIRS container env when setting up local dirs.

For example, we use hostPath for the Spark local dirs, but we do not mount the
sub disks directly into the pod; we mount a root path into the Spark
driver/executor pod.

For example, the root path is `/hadoop`, and there are sub disks under it, like
`/hadoop/1`, `/hadoop/2`, `/hadoop/3`, `/hadoop/4`.

So we want to define SPARK_LOCAL_DIRS in the driver/executor pod env.

But currently, the user-specified SPARK_LOCAL_DIRS does not take effect.
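For illustration, a hedged sketch of the intended setup (the confs are Spark's documented per-container env settings; the paths are the example sub disks above, and the report is that Spark currently overrides this env):
{code}
# sketch: point driver/executor local dirs at the sub disks under /hadoop
--conf spark.kubernetes.driverEnv.SPARK_LOCAL_DIRS=/hadoop/1,/hadoop/2,/hadoop/3,/hadoop/4
--conf spark.executorEnv.SPARK_LOCAL_DIRS=/hadoop/1,/hadoop/2,/hadoop/3,/hadoop/4
{code}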

 

 






[jira] [Updated] (SPARK-43540) Add working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Summary: Add working directory into classpath on the driver in K8S cluster 
mode  (was: Add current working directory into classpath on the driver in K8S 
cluster mode)

> Add working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In YARN cluster mode, the passed files/jars are accessible from the 
> classloader. This does not appear to be the case in Kubernetes cluster mode.
> After SPARK-33782, spark.files, spark.jars and spark.archives are placed under 
> the current working directory on the driver in K8S cluster mode, but spark.files 
> and spark.jars do not seem to be accessible from the classloader.
>  
> We need to add the current working directory into the classpath.
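As a stopgap, a hedged workaround sketch (assuming `./` resolves to the driver's working directory, where the files above are placed; the master URL, class and jar names are placeholders):
{code}
spark-submit \
  --master k8s://https://<k8s-apiserver> \
  --deploy-mode cluster \
  --conf spark.driver.extraClassPath=./ \
  --class org.example.Main \
  app.jar
{code}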






[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Description: 
In YARN cluster mode, the passed files/jars are accessible from the classloader.
This does not appear to be the case in Kubernetes cluster mode.

After SPARK-33782, spark.files, spark.jars and spark.archives are placed under
the current working directory on the driver in K8S cluster mode, but spark.files
and spark.jars do not seem to be accessible from the classloader.

We need to add the current working directory into the classpath.

  was:
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, it places spark.files, spark.jars and spark.files under the 
current working directory on the driver in K8S cluster mode. but the 
spark.files and spark.jars seems are not accessible by the classloader.

 

we need to add the current working directory to classpath.


> Add current working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In YARN cluster mode, the passed files/jars are accessible from the 
> classloader. This does not appear to be the case in Kubernetes cluster mode.
> After SPARK-33782, spark.files, spark.jars and spark.archives are placed under 
> the current working directory on the driver in K8S cluster mode, but spark.files 
> and spark.jars do not seem to be accessible from the classloader.
>  
> We need to add the current working directory into the classpath.






[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Description: 
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, it places spark.files, spark.jars and spark.files under the 
current working directory on the driver in K8S cluster mode. but the 
spark.files and spark.jars seems are not accessible by the classloader.

 

we need to add the current working directory to classpath.

  was:
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, for  Kubernetes cluster mode, it places 


> Add current working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In Yarn cluster modes, the passed files/jars are able to be accessed in the 
> classloader. Looks like this is not the case in Kubernetes cluster mode.
> After SPARK-33782, it places spark.files, spark.jars and spark.files under 
> the current working directory on the driver in K8S cluster mode. but the 
> spark.files and spark.jars seems are not accessible by the classloader.
>  
> we need to add the current working directory to classpath.






[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Description: 
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, for  Kubernetes cluster mode, it places 

  was:
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

 


> Add current working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In Yarn cluster modes, the passed files/jars are able to be accessed in the 
> classloader. Looks like this is not the case in Kubernetes cluster mode.
> After SPARK-33782, for  Kubernetes cluster mode, it places 






[jira] [Created] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)
Fei Wang created SPARK-43540:


 Summary: Add current working directory into classpath on the 
driver in K8S cluster mode
 Key: SPARK-43540
 URL: https://issues.apache.org/jira/browse/SPARK-43540
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Fei Wang


In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

 






[jira] [Updated] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Summary: [K8S] Mounts the hadoop config map on the executor pod  (was: 
[K8S] Mount hadoop config map in executor side)

> [K8S] Mounts the hadoop config map on the executor pod
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor side.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executors still need the Hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.
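For context, a hedged sketch of the HDFS client settings the executor is missing; shipping them through spark.hadoop.* may serve as a partial workaround (the nameservice `zeus` is from the report; the namenode hosts are made-up placeholders):
{code}
--conf spark.hadoop.dfs.nameservices=zeus
--conf spark.hadoop.dfs.ha.namenodes.zeus=nn1,nn2
--conf spark.hadoop.dfs.namenode.rpc-address.zeus.nn1=nn1.example.com:8020
--conf spark.hadoop.dfs.namenode.rpc-address.zeus.nn2=nn2.example.com:8020
--conf spark.hadoop.dfs.client.failover.proxy.provider.zeus=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
{code}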






[jira] [Updated] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop
config map will not be mounted on the executor pod.

Per the [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executors still need the Hadoop configuration.

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the
executor cannot.

So we still need to mount the hadoop config map on the executor side.

  was:
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.


> [K8S] Mounts the hadoop config map on the executor pod
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map will not be mounted on the executor pod.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executors still need the Hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.






[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.

  was:
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not mounted in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.


> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
> config map is not in executor side.
> Per the  [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still need the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
> executor can not.
> so we still need to mount the hadoop config map in executor side.






[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not mounted in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.

  was:
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.


> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
> config map is not mounted in executor side.
> Per the  
> [https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
>  description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still need the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
> executor can not.
> so we still need to mount the hadoop config map in executor side.






[jira] [Commented] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722674#comment-17722674
 ] 

Fei Wang commented on SPARK-43504:
--

gentle ping [~vanzin]  [~dongjoon] [~ifilonenko] 

> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
> config map is not in executor side.
> Per the  
> [https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
>  description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still need the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
> executor can not.
> so we still need to mount the hadoop config map in executor side.






[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.

  was:
Since SPARK-25815[,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.


> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
> config map is not in executor side.
> Per the  
> [https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
>  description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still need the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
> executor can not.
> so we still need to mount the hadoop config map in executor side.






[jira] [Created] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)
Fei Wang created SPARK-43504:


 Summary: [K8S] Mount hadoop config map in executor side
 Key: SPARK-43504
 URL: https://issues.apache.org/jira/browse/SPARK-43504
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Fei Wang


Since SPARK-25815[,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.






[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores

2023-05-08 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43419:
-
Description: 
Make limit.cores able to fall back to request.cores.

Currently, without limit.cores, we hit the issue below:
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. pods "" is forbidden: failed quota: 
high-qos-limit-requests: must specify limits.cpu. {code}
If spark.kubernetes.executor/driver.limit.cores is not specified, how about
treating request.cores as limit.cores?

  was:
make limit.cores be able to be fallen back to request.cores

now without limit.cores, we will meet below issue:
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. pods "" is forbidden: failed quota: 
high-qos-limit-requests: must specify limits.cpu. {code}
If spark.kubernetes.executor/driver.limit.cores, how about treat request.cores 
as limit.cores?


> [K8S] Make limit.cores be able to be fallen back to request.cores
> -
>
> Key: SPARK-43419
> URL: https://issues.apache.org/jira/browse/SPARK-43419
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Make limit.cores able to fall back to request.cores.
> Currently, without limit.cores, we hit the issue below:
> {code:java}
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "" is forbidden: failed quota: 
> high-qos-limit-requests: must specify limits.cpu. {code}
> If spark.kubernetes.executor/driver.limit.cores is not specified, how about 
> treating request.cores as limit.cores?
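Until such a fallback exists, a hedged sketch of the workaround is to satisfy the ResourceQuota check (`must specify limits.cpu`) by setting the limits explicitly (the core counts below are illustrative):
{code}
--conf spark.kubernetes.driver.limit.cores=1
--conf spark.kubernetes.executor.request.cores=2
--conf spark.kubernetes.executor.limit.cores=2
{code}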






[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores

2023-05-08 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43419:
-
Description: 
make limit.cores be able to be fallen back to request.cores

now without limit.cores, we will meet below issue:
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. pods "" is forbidden: failed quota: 
high-qos-limit-requests: must specify limits.cpu. {code}
If spark.kubernetes.executor/driver.limit.cores, how about treat request.cores 
as limit.cores?

  was:
make limit.cores be able to be fallen back to request.cores

now without limit.cores, we will meet below issue:
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. pods "" is forbidden: failed quota: 
high-qos-limit-requests: must specify limits.cpu. {code}
If spark.kubernetes.executor/driver.limit.cores, treat request.cores as 
limit.cores.


> [K8S] Make limit.cores be able to be fallen back to request.cores
> -
>
> Key: SPARK-43419
> URL: https://issues.apache.org/jira/browse/SPARK-43419
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> make limit.cores be able to be fallen back to request.cores
> now without limit.cores, we will meet below issue:
> {code:java}
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "" is forbidden: failed quota: 
> high-qos-limit-requests: must specify limits.cpu. {code}
> If spark.kubernetes.executor/driver.limit.cores, how about treat 
> request.cores as limit.cores?






[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores

2023-05-08 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43419:
-
Description: 
make limit.cores be able to be fallen back to request.cores

now without limit.cores, we will meet below issue:
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. pods "" is forbidden: failed quota: 
high-qos-limit-requests: must specify limits.cpu. {code}
If spark.kubernetes.executor/driver.limit.cores, treat request.cores as 
limit.cores.

  was:
make limit.cores be able to be fallen back to request.cores

 

If spark.kubernetes.executor/driver.limit.cores, treat request.cores as 
limit.cores.


> [K8S] Make limit.cores be able to be fallen back to request.cores
> -
>
> Key: SPARK-43419
> URL: https://issues.apache.org/jira/browse/SPARK-43419
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> make limit.cores be able to be fallen back to request.cores
> now without limit.cores, we will meet below issue:
> {code:java}
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "" is forbidden: failed quota: 
> high-qos-limit-requests: must specify limits.cpu. {code}
> If spark.kubernetes.executor/driver.limit.cores, treat request.cores as 
> limit.cores.






[jira] [Created] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores

2023-05-08 Thread Fei Wang (Jira)
Fei Wang created SPARK-43419:


 Summary: [K8S] Make limit.cores be able to be fallen back to 
request.cores
 Key: SPARK-43419
 URL: https://issues.apache.org/jira/browse/SPARK-43419
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Fei Wang


make limit.cores be able to be fallen back to request.cores

 

If spark.kubernetes.executor/driver.limit.cores, treat request.cores as 
limit.cores.






[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965735#comment-15965735
 ] 

Fei Wang commented on SPARK-20184:
--

Also tested my test case on the master branch:
1. Java version
192:spark wangfei$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

2. Spark start command
192:spark wangfei$ bin/spark-sql --master local[4] --driver-memory 16g

3. Test result
SQL:
{code}
select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100;
{code}
codegen on: about 1.4s
codegen off: about 0.6s
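For completeness, a hedged way to inspect the generated code that this analysis points at: EXPLAIN CODEGEN prints the whole-stage generated Java (column list trimmed here for brevity):
{code}
explain codegen select dim_1, dim_2, sum(c1), sum(c2) from sum_table_50w_3 
group by dim_1, dim_2 limit 100;
{code}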


> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).
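For reference, a hedged sketch of the two mitigations described in the report, using the same local[4] setup as the test above:
{code}
# 1) disable whole-stage codegen entirely
bin/spark-sql --master local[4] --conf spark.sql.codegen.wholeStage=false
# 2) keep codegen on, but let the JVM JIT-compile the huge generated methods
bin/spark-sql --master local[4] --driver-java-options "-XX:-DontCompileHugeMethods"
{code}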






[jira] [Comment Edited] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597
 ] 

Fei Wang edited comment on SPARK-20184 at 4/12/17 9:21 AM:
---

Try this:
1. Create the table:
{code}
// spark-shell snippet (spark.implicits._ in scope for toDF)
val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")
{code}
2. Query the table:
{code}
select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100
{code}


was (Author: scwf):
try this :
1. create table
{code}

val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

{code}

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).






[jira] [Comment Edited] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597
 ] 

Fei Wang edited comment on SPARK-20184 at 4/12/17 9:21 AM:
---

try this :
1. create table
{code}

val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

{code}

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100


was (Author: scwf):
try this :
1. create table
[code]
val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

df.write.format("csv").saveAsTable("sum_table_50w_1")

[code]

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).






[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597
 ] 

Fei Wang commented on SPARK-20184:
--

try this :
1. create table
[code]
val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

df.write.format("csv").saveAsTable("sum_table_50w_1")

[code]

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).






[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-11 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965308#comment-15965308
 ] 

Fei Wang commented on SPARK-20184:
--

Tested with a smaller table of 100,000 rows:
Codegen on: 2.6s
Codegen off: 1.5s

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).






[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-04 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Summary: performance regression for complex/long sql when enable whole 
stage codegen  (was: performance regression for complex/long sql when enable 
codegen)

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).






[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-04 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
The performance of the following SQL gets much worse in Spark 2.x with codegen
on than with codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s


After some analysis, I think this is related to the huge Java method (a Java
method of thousands of lines) generated by codegen.
And if I set -XX:-DontCompileHugeMethods, the performance gets much
better (about 7s).

  was:
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
whole stage codege  off(spark.sql.codegen.wholeStage = false):6s


After some analysis i think this is related to the huge java method(a java 
method of thousand lines) which generated by codegen.
And If i config -XX:-DontCompileHugeMethods the performance get much 
better(about 7s).


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with codegen 
> on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis, I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I set -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).






[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-04 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
whole stage codege  off(spark.sql.codegen.wholeStage = false):6s


After some analysis i think this is related to the huge java method(a java 
method of thousand lines) which generated by codegen.
And If i config -XX:-DontCompileHugeMethods the performance get much 
better(about 7s).

  was:
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on:40s
codegen off:6s


After some analysis i think this is related to the huge java method(a java 
method of thousand lines) which generated by codegen.
And If i config -XX:-DontCompileHugeMethods the performance get much 
better(about 7s).


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codege  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).






[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on:40s
codegen off:6s


After some analysis i think this is related to the huge java method(a java 
method of thousand lines) which generated by codegen.
And If i config -XX:-DontCompileHugeMethods the performance get much 
better(about 7s).

  was:
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on:40s
codegen off:6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on:40s
> codegen off:6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
Execute following sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on:40s
codegen off:6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on:40s
> codegen off:6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
The performance of following SQL get much worse in spark 2.x  in contrast with 
codegen off.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on: 40s
> codegen off:6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952123#comment-15952123
 ] 

Fei Wang commented on SPARK-20184:
--

[~r...@databricks.com] [~davies] Maybe we need to split the huge method into 
small ones for codegen.
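
Roughly the idea, as an illustrative sketch only (this is not Spark's actual 
codegen code; emitSplit and the 100-statement chunk size are made up for 
illustration): emit the generated statements as many small helper methods, so 
each compiled method stays under the JIT's default huge-method limit of 8000 
bytecode bytes.

// Illustrative sketch: chunk generated statements into helper methods so no
// single generated Java method grows past the JIT's huge-method threshold.
def emitSplit(stmts: Seq[String], perMethod: Int = 100): String = {
  val helpers = stmts.grouped(perMethod).zipWithIndex.map { case (body, i) =>
    s"private void apply_$i(InternalRow row) {\n  ${body.mkString("\n  ")}\n}"
  }.toSeq
  val calls = helpers.indices.map(i => s"apply_$i(row);").mkString("\n  ")
  (s"public void apply(InternalRow row) {\n  $calls\n}" +: helpers).mkString("\n\n")
}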

> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
Execute following sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute following sql with spark 2.x when codegen enabled,   the performance 
> is much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

Num of rows of aggtable is about 3500.


codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.


> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Summary: performance regression for complex/long sql when enable codegen  
(was: performance regression for complex sql when enable codegen)

> performance regression for complex/long sql when enable codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method(a java 
method thousand of lines) which generated when codegen on. And If i config 
-XX:-DontCompileHugeMethods the performance get much better.

  was:
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method which 
generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
performance get much better.


> performance regression for complex sql when enable codegen
> --
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method(a java 
> method thousand of lines) which generated when codegen on. And If i config 
> -XX:-DontCompileHugeMethods the performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method which 
generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
performance get much better.

  was:
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method which 
generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
performance of codegen on get much better.


> performance regression for complex sql when enable codegen
> --
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method which 
> generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
> performance get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20184) performance regression for complex sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-20184:
-
Description: 
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
much worse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method which 
generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
performance of codegen on get much better.

  was:
Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
muchworse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method which 
generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
performance of codegen on get much better.


> performance regression for complex sql when enable codegen
> --
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
> much worse than the case when turn off codegen.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> codegen on: 40s
> codegen off:   6s
> after some analysis, i think this is related to the huge java method which 
> generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
> performance of codegen on get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20184) performance regression for complex sql when enable codegen

2017-04-01 Thread Fei Wang (JIRA)
Fei Wang created SPARK-20184:


 Summary: performance regression for complex sql when enable codegen
 Key: SPARK-20184
 URL: https://issues.apache.org/jira/browse/SPARK-20184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 1.6.0
Reporter: Fei Wang


Execute flowing sql with spark 2.x when codegen enabled,   the performance is 
muchworse than the case when turn off codegen.

SELECT
 sum(COUNTER_57) 
,sum(COUNTER_71) 
,sum(COUNTER_3)  
,sum(COUNTER_70) 
,sum(COUNTER_66) 
,sum(COUNTER_75) 
,sum(COUNTER_69) 
,sum(COUNTER_55) 
,sum(COUNTER_63) 
,sum(COUNTER_68) 
,sum(COUNTER_56) 
,sum(COUNTER_37) 
,sum(COUNTER_51) 
,sum(COUNTER_42) 
,sum(COUNTER_43) 
,sum(COUNTER_1)  
,sum(COUNTER_76) 
,sum(COUNTER_54) 
,sum(COUNTER_44) 
,sum(COUNTER_46) 
,DIM_1 
,DIM_2 
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

codegen on: 40s
codegen off:   6s


after some analysis, i think this is related to the huge java method which 
generated when codegen on. And If i config -XX:-DontCompileHugeMethods the 
performance of codegen on get much better.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-28 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-17556:
-
Attachment: (was: executor broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
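
For context, the driver-side pattern in question looks roughly like the sketch 
below (smallDf and spark are assumed names, not the planner's actual code):

// Sketch: today a broadcast join stages the small side via the driver.
val rows = smallDf.collect()                   // executors -> driver round-trip
val b    = spark.sparkContext.broadcast(rows)  // driver -> executors again
// Executor-side broadcast would build `b` from the executors' partitions
// directly, skipping the collect() hop through the driver.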



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-28 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-17556:
-
Attachment: executor broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor broadcast.pdf, 
> executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-26 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-17556:
-
Attachment: executor broadcast.pdf

update design doc

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-26 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-17556:
-
Attachment: (was: executor broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516772#comment-15516772
 ] 

Fei Wang commented on SPARK-17556:
--


[~viirya] in this case, how about notifying the driver to re-persist the RDD? 

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516732#comment-15516732
 ] 

Fei Wang commented on SPARK-17556:
--

That's a good point! 
In your solution, the broadcast RDD must be persisted first, right?
How do you handle the case of lost executors (all replicas of a piece being 
lost)?
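
(For reference, persisting the to-be-broadcast RDD with replication would look 
like the sketch below; resultRdd is an assumed name.)

// Sketch: persist with 2 replicas so losing one executor does not lose the
// only copy of a block; recomputation still has to cover total loss.
import org.apache.spark.storage.StorageLevel
val persisted = resultRdd.persist(StorageLevel.MEMORY_AND_DISK_2)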


> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516639#comment-15516639
 ] 

Fei Wang edited comment on SPARK-17556 at 9/23/16 2:50 PM:
---

Yes, the main difference is that it does not introduce overhead to the driver; 
for the broadcast, the executor does need the full result of an RDD. I took a 
look at your PR; I think you also collect the full result of that RDD to an 
executor, right?


was (Author: scwf):
Yes, the main different is is does not introduce overhead to driver,  for 
broadcast the executor do need all the result of an RDD, i task a look of your 
PR,  i think you also collect all the result of that rdd to executor, right?

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516639#comment-15516639
 ] 

Fei Wang commented on SPARK-17556:
--

Yes, the main different is is does not introduce overhead to driver,  for 
broadcast the executor do need all the result of an RDD, i task a look of your 
PR,  i think you also collect all the result of that rdd to executor, right?

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-17556:
-
Comment: was deleted

(was: Not correct, I just collect the broadcast ref to the driver but not the 
data :).)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516545#comment-15516545
 ] 

Fei Wang commented on SPARK-17556:
--

Not correct, I just collect the broadcast ref to the driver but not the 
data :).



> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516546#comment-15516546
 ] 

Fei Wang commented on SPARK-17556:
--

Not correct, I just collect the broadcast ref to the driver but not the 
data :).



> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516386#comment-15516386
 ] 

Fei Wang edited comment on SPARK-17556 at 9/23/16 1:15 PM:
---

[~rxin] attached a design doc for the executor-based broadcast. Will soon file 
a PR for this.

[~viirya] We have an executor-based broadcast implementation in our internal 
product system, which is based on the design doc I attached. Now we are 
contributing it to open source. Could you help review it? Thanks.


was (Author: scwf):
[~rxin] attached a design doc for the executor based broadcast. Will soon file 
a PR for this.

[~viirya] We have a executor based broadcast implementation in our inner 
produce system which is based on the design doc i attached. Now we are 
contributing it to opensource, Can you help to review this, thanks.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516386#comment-15516386
 ] 

Fei Wang edited comment on SPARK-17556 at 9/23/16 1:05 PM:
---

[~rxin] attached a design doc for the executor based broadcast. Will soon file 
a PR for this.

[~viirya] We have a executor based broadcast implementation in our inner 
produce system which is based on the design doc i attached. Now we are 
contributing it to opensource, Can you help to review this, thanks.


was (Author: scwf):
[~rxin] attached a design doc for the executor based broadcast. Will soon file 
a PR for this.

[~viirya] We have a executor based broadcast implementation in our inner 
produce system which is based on the design doc i attached. Can you help to 
review this, thanks.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516386#comment-15516386
 ] 

Fei Wang commented on SPARK-17556:
--

[~rxin] attached a design doc for the executor based broadcast. Will soon file 
a PR for this.

[~viirya] We have a executor based broadcast implementation in our inner 
produce system which is based on the design doc i attached. Can you help to 
review this, thanks.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-23 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-17556:
-
Attachment: executor broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17644) The failed stage never resubmitted due to abort stage in another thread

2016-09-23 Thread Fei Wang (JIRA)
Fei Wang created SPARK-17644:


 Summary: The failed stage never resubmitted due to abort stage in 
another thread
 Key: SPARK-17644
 URL: https://issues.apache.org/jira/browse/SPARK-17644
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.0.0, 1.6.0
Reporter: Fei Wang


There is a race condition between FetchFailed handling and resubmitting a 
failed stage: job1 and job2 run in different threads. If job1 fails 4 times 
due to FetchFailed and is aborted, then job2 cannot post ResubmitFailedStages, 
because the failedStages set in DAGScheduler is not empty at that point.
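
As a hypothetical sketch of the racing structure (assumes a SparkContext sc; 
the actual FetchFailed comes from lost shuffle output, which is not simulated 
here):

// Hypothetical sketch: two shuffle jobs racing in one SparkContext.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

val rdd1 = sc.parallelize(1 to 1000).map(x => (x % 10, x)).reduceByKey(_ + _)
val rdd2 = sc.parallelize(1 to 1000).map(x => (x % 10, x)).reduceByKey(_ + _)

Future { rdd1.count() } // job1: 4 FetchFailed attempts -> stage aborted, but its
                        // stages can linger in DAGScheduler.failedStages
Future { rdd2.count() } // job2: on its own FetchFailed, sees failedStages
                        // non-empty and never posts ResubmitFailedStages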



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12742) org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to

2016-01-10 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-12742:
-
Summary: org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to   
(was: org.apache.spark.sql.hive.LogicalPlanToSQLSuite failuer)

> org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to 
> ---
>
> Key: SPARK-12742
> URL: https://issues.apache.org/jira/browse/SPARK-12742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 
> milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
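
A plausible fix sketch for the suite setup (hypothetical, not the committed patch): make setup idempotent by dropping leftovers before recreating the table, so a dirty metastore from an earlier aborted run cannot break beforeAll.

{code}
import org.apache.spark.sql.SQLContext

object SuiteSetupSketch {
  // Assumes the test harness supplies a SQLContext backed by a Hive metastore.
  def prepareTestTable(sqlContext: SQLContext): Unit = {
    sqlContext.sql("DROP TABLE IF EXISTS t1")       // clear leftovers from prior runs
    sqlContext.range(0, 10).write.saveAsTable("t1") // recreate deterministically
  }
}
{code}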



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12742) org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists

2016-01-10 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-12742:
-
   Due Date: 11/Jan/16
Component/s: SQL

> org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already 
> exists
> ---
>
> Key: SPARK-12742
> URL: https://issues.apache.org/jira/browse/SPARK-12742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 
> milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12742) org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists

2016-01-10 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-12742:
-
Summary: org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to 
Table already exists  (was: org.apache.spark.sql.hive.LogicalPlanToSQLSuite 
failure due to )

> org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already 
> exists
> ---
>
> Key: SPARK-12742
> URL: https://issues.apache.org/jira/browse/SPARK-12742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 
> milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12742) org.apache.spark.sql.hive.LogicalPlanToSQLSuite failuer

2016-01-10 Thread Fei Wang (JIRA)
Fei Wang created SPARK-12742:


 Summary: org.apache.spark.sql.hive.LogicalPlanToSQLSuite failuer
 Key: SPARK-12742
 URL: https://issues.apache.org/jira/browse/SPARK-12742
 Project: Spark
  Issue Type: Bug
Reporter: Fei Wang


[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 
milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
[info]   at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
[info]   at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
[info]   at 
org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
[info]   at 
org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
[info]   at 
org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12222) deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception

2015-12-08 Thread Fei Wang (JIRA)
Fei Wang created SPARK-1:


 Summary: deserialize RoaringBitmap using Kryo serializer throw 
Buffer underflow exception
 Key: SPARK-1
 URL: https://issues.apache.org/jira/browse/SPARK-1
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Fei Wang


Here are some problems when deserializing RoaringBitmap; see the example below.
Run this piece of code:
```
import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput}
import java.io.{DataInput, DataOutput, FileInputStream, FileOutputStream}
import org.roaringbitmap.RoaringBitmap

class KryoInputDataInputBridge(input: KryoInput) extends DataInput {
  override def readLong(): Long = input.readLong()
  override def readChar(): Char = input.readChar()
  override def readFloat(): Float = input.readFloat()
  override def readByte(): Byte = input.readByte()
  override def readShort(): Short = input.readShort()
  override def readUTF(): String = input.readString() // readString in kryo 
does utf8
  override def readInt(): Int = input.readInt()
  override def readUnsignedShort(): Int = input.readShortUnsigned()
  override def skipBytes(n: Int): Int = input.skip(n.toLong).toInt
  override def readFully(b: Array[Byte]): Unit = input.read(b)
  override def readFully(b: Array[Byte], off: Int, len: Int): Unit = 
input.read(b, off, len)
  override def readLine(): String = throw new 
UnsupportedOperationException("readLine")
  override def readBoolean(): Boolean = input.readBoolean()
  override def readUnsignedByte(): Int = input.readByteUnsigned()
  override def readDouble(): Double = input.readDouble()
}

class KryoOutputDataOutputBridge(output: KryoOutput) extends DataOutput {
  override def writeFloat(v: Float): Unit = output.writeFloat(v)
  // There is no "readChars" counterpart, except maybe "readLine", which is 
not supported
  override def writeChars(s: String): Unit = throw new 
UnsupportedOperationException("writeChars")
  override def writeDouble(v: Double): Unit = output.writeDouble(v)
  override def writeUTF(s: String): Unit = output.writeString(s) // 
writeString in kryo does UTF8
  override def writeShort(v: Int): Unit = output.writeShort(v)
  override def writeInt(v: Int): Unit = output.writeInt(v)
  override def writeBoolean(v: Boolean): Unit = output.writeBoolean(v)
  override def write(b: Int): Unit = output.write(b)
  override def write(b: Array[Byte]): Unit = output.write(b)
  override def write(b: Array[Byte], off: Int, len: Int): Unit = 
output.write(b, off, len)
  override def writeBytes(s: String): Unit = output.writeString(s)
  override def writeChar(v: Int): Unit = output.writeChar(v.toChar)
  override def writeLong(v: Long): Unit = output.writeLong(v)
  override def writeByte(v: Int): Unit = output.writeByte(v)
}
val outStream = new FileOutputStream("D:\\wfserde")
val output = new KryoOutput(outStream)
val bitmap = new RoaringBitmap
bitmap.add(1)
bitmap.add(3)
bitmap.add(5)
bitmap.serialize(new KryoOutputDataOutputBridge(output))
output.flush()
output.close()

val inStream = new FileInputStream("D:\\wfserde")
val input = new KryoInput(inStream)
val ret = new RoaringBitmap
ret.deserialize(new KryoInputDataInputBridge(input))

println(ret)
```

this will throw a `Buffer underflow` error:
```
com.esotericsoftware.kryo.KryoException: Buffer underflow.
at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
at com.esotericsoftware.kryo.io.Input.skip(Input.java:131)
at com.esotericsoftware.kryo.io.Input.skip(Input.java:264)
at 
org.apache.spark.sql.SQLQuerySuite$$anonfun$6$KryoInputDataInputBridge$1.skipBytes
```

After some investigation, I found this is caused by a bug in Kryo's 
`Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119), 
and we call this method in `KryoInputDataInputBridge`.

So I think we can fix this issue in two ways:
1) upgrade the Kryo version to 2.23.0 or 2.24.0, which has fixed this bug in Kryo 
(I am not sure the upgrade is safe in Spark, can you check it? @davies )

2) bypass Kryo's buggy `Input.skip(long count)` by directly calling the other 
`skip` method in Kryo's 
Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124),
i.e. write the bug-fixed version of `Input.skip(long count)` in 
KryoInputDataInputBridge's `skipBytes` method:
```
class KryoInputDataInputBridge(input: KryoInput) extends DataInput {
  ...
  override def skipBytes(n: Int): Int = {
    var remaining: Long = n
    while (remaining > 0) {
      // skip at most Int.MaxValue bytes per call, going through Kryo's
      // int-based skip, which does not have the long-overload bug
      val skip = Math.min(Integer.MAX_VALUE, remaining).asInstanceOf[Int]
      input.skip(skip)
      remaining -= skip
    }
    n
  }
  ...
}
```



--
This 

[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2015-09-08 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735995#comment-14735995
 ] 

Fei Wang commented on SPARK-4131:
-

Can you try the example from my test suite to see if it works? 
https://github.com/apache/spark/pull/4380/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R535

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10030) Managed memory leak detected when cache table

2015-08-17 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699075#comment-14699075
 ] 

Fei Wang commented on SPARK-10030:
--

[~hyukjin.kwon] this is definitely a Spark SQL bug, so we opened a JIRA.

 Managed memory leak detected when cache table
 -

 Key: SPARK-10030
 URL: https://issues.apache.org/jira/browse/SPARK-10030
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: wangwei
Priority: Blocker

 I tested the latest spark-1.5.0 in local, standalone, and yarn mode, followed the 
 steps below, and then errors occurred.
 1. create table cache_test(id int,  name string) stored as textfile ;
 2. load data local inpath 
 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
 cache_test;
 3. cache table test as select * from cache_test distribute by id;
 configuration:
 spark.driver.memory 5g
 spark.executor.memory   28g
 spark.cores.max  21
 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 
 67108864 bytes, TID = 434
 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 
 434)
 java.util.NoSuchElementException: key not found: val_54
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:88)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698974#comment-14698974
 ] 

Fei Wang commented on SPARK-10030:
--

[~liancheng] can you have a look at this issue?

 Managed memory leak detected when cache table
 -

 Key: SPARK-10030
 URL: https://issues.apache.org/jira/browse/SPARK-10030
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: wangwei
Priority: Blocker

 I tested the latest spark-1.5.0 in local, standalone, and yarn mode, followed the 
 steps below, and then errors occurred.
 1. create table cache_test(id int,  name string) stored as textfile ;
 2. load data local inpath 
 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
 cache_test;
 3. cache table test as select * from cache_test distribute by id;
 configuration:
 spark.driver.memory 5g
 spark.executor.memory   28g
 spark.cores.max  21
 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 
 67108864 bytes, TID = 434
 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 
 434)
 java.util.NoSuchElementException: key not found: val_54
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:88)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-14 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698124#comment-14698124
 ] 

Fei Wang commented on SPARK-9725:
-

good job :)

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-13 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696323#comment-14696323
 ] 

Fei Wang commented on SPARK-9725:
-

in some of our cases (with long table names) it also shows a garbled string.

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-13 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696324#comment-14696324
 ] 

Fei Wang commented on SPARK-9725:
-

we tested on SUSE and Red Hat; both have this issue.
We have found some clues and will post them here later.

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-13 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696449#comment-14696449
 ] 

Fei Wang commented on SPARK-9725:
-

[~davies] please see the comment from peizhongshuai

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8890) Reduce memory consumption for dynamic partition insert

2015-08-09 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14679447#comment-14679447
 ] 

Fei Wang commented on SPARK-8890:
-

Should this be part of Tungsten?

 Reduce memory consumption for dynamic partition insert
 --

 Key: SPARK-8890
 URL: https://issues.apache.org/jira/browse/SPARK-8890
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.5.0


 Currently, InsertIntoHadoopFsRelation can run out of memory if the number of 
 table partitions is large. The problem is that we open one output writer for 
 each partition, and when data are randomized and when the number of 
 partitions is large, we open a large number of output writers, leading to OOM.
 The solution here is to inject a sorting operation once the number of active 
 partitions is beyond a certain point (e.g. 50?)
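
A self-contained sketch of that sort-based fallback (toy types; println stands in for a real output writer, and nothing here is the actual InsertIntoHadoopFsRelation code):

{code}
object SortedWriteSketch {
  // Once rows are sorted by partition key, at most one writer is open at a
  // time, no matter how many distinct partitions the task sees.
  def writeSorted(rows: Seq[(String, String)]): Unit = {
    var current: Option[String] = None
    rows.sortBy(_._1).foreach { case (part, value) =>
      if (current != Some(part)) {
        current.foreach(p => println(s"close writer for $p"))
        println(s"open writer for $part")
        current = Some(part)
      }
      println(s"$part <- $value") // stand-in for writer.write(value)
    }
    current.foreach(p => println(s"close writer for $p"))
  }

  def main(args: Array[String]): Unit =
    writeSorted(Seq(("b", "2"), ("a", "1"), ("b", "3"), ("a", "4")))
}
{code}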



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-08 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662847#comment-14662847
 ] 

Fei Wang commented on SPARK-9725:
-

And I did not set SPARK_PREPEND_CLASSES.

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-07 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661383#comment-14661383
 ] 

Fei Wang commented on SPARK-9725:
-

How much memory did you set for the executor?

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-07 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661389#comment-14661389
 ] 

Fei Wang edited comment on SPARK-9725 at 8/7/15 6:21 AM:
-

I tried master again and reproduced this issue:

{code}
M151:/home/wf/spark # bin/spark-sql
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
spark-sql> CREATE TABLE IF NOT EXISTS srcbasdfasdfasdf (key INT, value STRING);
OK
Time taken: 3.493 seconds
spark-sql> CREATE TABLE IF NOT EXISTS src (key INT, value STRING);
OK
Time taken: 0.181 seconds
spark-sql> show tables;
false
srcbasdffalse
Time taken: 0.211 seconds, Fetched 2 row(s)
spark-sql> 
{code}


was (Author: scwf):
I tried master again and reproduced this issue:

M151:/home/wf/spark # bin/spark-sql
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
spark-sql> CREATE TABLE IF NOT EXISTS srcbasdfasdfasdf (key INT, value STRING);
OK
Time taken: 3.493 seconds
spark-sql> CREATE TABLE IF NOT EXISTS src (key INT, value STRING);
OK
Time taken: 0.181 seconds
spark-sql> show tables;
false
srcbasdffalse
Time taken: 0.211 seconds, Fetched 2 row(s)
spark-sql> 


 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-07 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661389#comment-14661389
 ] 

Fei Wang edited comment on SPARK-9725 at 8/7/15 6:21 AM:
-

I tried master again and reproduced this issue:


M151:/home/wf/spark # bin/spark-sql
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
spark-sql> CREATE TABLE IF NOT EXISTS srcbasdfasdfasdf (key INT, value STRING);
OK
Time taken: 3.493 seconds
spark-sql> CREATE TABLE IF NOT EXISTS src (key INT, value STRING);
OK
Time taken: 0.181 seconds
spark-sql> show tables;
false
srcbasdffalse
Time taken: 0.211 seconds, Fetched 2 row(s)
spark-sql> 



was (Author: scwf):
I tried master again and reproduced this issue:

{code}
M151:/home/wf/spark # bin/spark-sql
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
spark-sql> CREATE TABLE IF NOT EXISTS srcbasdfasdfasdf (key INT, value STRING);
OK
Time taken: 3.493 seconds
spark-sql> CREATE TABLE IF NOT EXISTS src (key INT, value STRING);
OK
Time taken: 0.181 seconds
spark-sql> show tables;
false
srcbasdffalse
Time taken: 0.211 seconds, Fetched 2 row(s)
spark-sql> 
{code}

 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-07 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661389#comment-14661389
 ] 

Fei Wang commented on SPARK-9725:
-

I tried master again and reproduced this issue:

M151:/home/wf/spark # bin/spark-sql
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: backward-delete-word
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
[INFO] Unable to bind key for unsupported operation: up-history
[INFO] Unable to bind key for unsupported operation: down-history
spark-sql> CREATE TABLE IF NOT EXISTS srcbasdfasdfasdf (key INT, value STRING);
OK
Time taken: 3.493 seconds
spark-sql> CREATE TABLE IF NOT EXISTS src (key INT, value STRING);
OK
Time taken: 0.181 seconds
spark-sql> show tables;
false
srcbasdffalse
Time taken: 0.211 seconds, Fetched 2 row(s)
spark-sql> 


 spark sql query string field return empty/garbled string
 

 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Assignee: Davies Liu
Priority: Blocker

 to reproduce it:
 1 deploy spark in cluster mode (I use standalone mode locally)
 2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
spark.executor.memory 36g
 3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-06 Thread Fei Wang (JIRA)
Fei Wang created SPARK-9725:
---

 Summary: spark sql query string field return empty/garbled string
 Key: SPARK-9725
 URL: https://issues.apache.org/jira/browse/SPARK-9725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Fei Wang
Priority: Blocker


to reproduce it:
1 deploy spark in cluster mode (I use standalone mode locally)
2 set executor memory >= 32g, e.g. the following config in spark-defaults.conf:
   spark.executor.memory 36g

3 run spark-sql.sh with `show tables`; it returns an empty/garbled string



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8968) dynamic partitioning in spark sql performance issue due to the high GC overhead

2015-07-10 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-8968:

Summary: dynamic partitioning in spark sql performance issue due to the 
high GC overhead  (was: shuffled by the partition columns when dynamic 
partitioning to optimize the memory overhead)

 dynamic partitioning in spark sql performance issue due to the high GC 
 overhead
 ---

 Key: SPARK-8968
 URL: https://issues.apache.org/jira/browse/SPARK-8968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang

 Dynamic partitioning currently shows bad performance for big data due to the 
 GC/memory overhead. This is because each task opens one writer per partition 
 to write the data, which causes many small files and high GC. We 
 can shuffle data by the partition columns so that each partition will have 
 only one partition file, which also reduces the GC overhead.
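
A user-level sketch of the same idea (assuming a HiveContext in scope; `raw_logs` and `logs_by_day` are hypothetical tables, the latter partitioned by `dt`): DISTRIBUTE BY routes all rows of one dynamic partition to one task, so each task keeps roughly one writer open.

{code}
sqlContext.sql(
  """INSERT OVERWRITE TABLE logs_by_day PARTITION (dt)
    |SELECT key, value, dt FROM raw_logs
    |DISTRIBUTE BY dt""".stripMargin)
{code}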



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8968) dynamic partitioning in spark sql performance issue due to the high GC overhead

2015-07-10 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621935#comment-14621935
 ] 

Fei Wang commented on SPARK-8968:
-

Changed; how about this?

 dynamic partitioning in spark sql performance issue due to the high GC 
 overhead
 ---

 Key: SPARK-8968
 URL: https://issues.apache.org/jira/browse/SPARK-8968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang

 Dynamic partitioning currently shows bad performance for big data due to the 
 GC/memory overhead. This is because each task opens one writer per partition 
 to write the data, which causes many small files and high GC. We 
 can shuffle data by the partition columns so that each partition will have 
 only one partition file, which also reduces the GC overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8968) shuffled by the partition clomns when dynamic partitioning to optimize the memory overhead

2015-07-09 Thread Fei Wang (JIRA)
Fei Wang created SPARK-8968:
---

 Summary: shuffled by the partition columns when dynamic 
partitioning to optimize the memory overhead
 Key: SPARK-8968
 URL: https://issues.apache.org/jira/browse/SPARK-8968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang


Dynamic partitioning currently shows bad performance for big data due to the 
GC/memory overhead. This is because each task opens one writer per partition 
to write the data, which causes many small files and high GC. We can 
shuffle data by the partition columns so that each partition will have only one 
partition file, which also reduces the GC overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7173) Support YARN node label expressions for the application master

2015-06-07 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576547#comment-14576547
 ] 

Fei Wang commented on SPARK-7173:
-

Is this resolved by SPARK-6470?

 Support YARN node label expressions for the application master
 --

 Key: SPARK-7173
 URL: https://issues.apache.org/jira/browse/SPARK-7173
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7866) print the format string in dataframe explain

2015-05-26 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang closed SPARK-7866.
---
Resolution: Won't Fix

This was a wrong print in IntelliJ IDEA; the output is actually OK, not a problem.

 print the format string in dataframe explain
 

 Key: SPARK-7866
 URL: https://issues.apache.org/jira/browse/SPARK-7866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Trivial

 QueryExecution.toString gives a formatted and clear string, so we should print it in 
 the DataFrame.explain method



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7866) print the format string in dataframe explain

2015-05-26 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7866:
---

 Summary: print the format string in dataframe explain
 Key: SPARK-7866
 URL: https://issues.apache.org/jira/browse/SPARK-7866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


QueryExecution.toString gives a formatted and clear string, so we should print it in 
the DataFrame.explain method.
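
A toy sketch of what the proposal amounts to (simplified stand-ins, not Spark's real QueryExecution class or the actual patch):

{code}
object ExplainSketch {
  case class QueryExecution(parsed: String, optimized: String, physical: String) {
    override def toString: String =
      s"== Parsed ==\n$parsed\n== Optimized ==\n$optimized\n== Physical ==\n$physical"
  }

  def explain(qe: QueryExecution, extended: Boolean): Unit =
    if (extended) println(qe.toString) // the formatted multi-section dump
    else println(qe.physical)          // physical plan only

  def main(args: Array[String]): Unit =
    explain(QueryExecution("'Project...", "Project...", "PhysicalRDD..."), extended = true)
}
{code}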



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7866) print the format string in dataframe explain

2015-05-26 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559048#comment-14559048
 ] 

Fei Wang commented on SPARK-7866:
-

Got it, thanks Owen.

 print the format string in dataframe explain
 

 Key: SPARK-7866
 URL: https://issues.apache.org/jira/browse/SPARK-7866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Trivial

 QueryExecution.toString gives a formatted and clear string, so we should print it in 
 the DataFrame.explain method



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3323) yarn website's Tracking UI links to the Standby RM

2015-05-19 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang resolved SPARK-3323.
-
Resolution: Fixed

 yarn website's Tracking UI links to the Standby RM
 --

 Key: SPARK-3323
 URL: https://issues.apache.org/jira/browse/SPARK-3323
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Fei Wang

 When running a big application, this situation sometimes occurs:
 clicking the Tracking UI of the running application links to the 
 Standby RM,
 with info as follows:
 "This is standby RM. Redirecting to the current active RM: some address"
 But the address of this website is actually the same as that "some address".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6997) Convert StringType in LocalTableScan

2015-05-19 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang closed SPARK-6997.
---
Resolution: Won't Fix

 Convert StringType in LocalTableScan 
 -

 Key: SPARK-6997
 URL: https://issues.apache.org/jira/browse/SPARK-6997
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Fei Wang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6997) Convert StringType in LocalTableScan

2015-05-19 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549924#comment-14549924
 ] 

Fei Wang commented on SPARK-6997:
-

This is not necessary since we already convert the data from Scala objects to 
Catalyst rows/types when constructing LocalRelation. I am closing this.

 Convert StringType in LocalTableScan 
 -

 Key: SPARK-6997
 URL: https://issues.apache.org/jira/browse/SPARK-6997
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Fei Wang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7289) Combine Limit and Sort to avoid total ordering

2015-05-19 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang closed SPARK-7289.
---
Resolution: Won't Fix

Users usually do not write SQL like this.

 Combine Limit and Sort to avoid total ordering
 --

 Key: SPARK-7289
 URL: https://issues.apache.org/jira/browse/SPARK-7289
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang

 Optimize the following SQL:
 select key from (select * from testData order by key) t limit 5
 from 
 == Parsed Logical Plan ==
 'Limit 5
  'Project ['key]
   'Subquery t
'Sort ['key ASC], true
 'Project [*]
  'UnresolvedRelation [testData], None
 == Analyzed Logical Plan ==
 Limit 5
  Project [key#0]
   Subquery t
Sort [key#0 ASC], true
 Project [key#0,value#1]
  Subquery testData
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 
 == Optimized Logical Plan ==
 Limit 5
  Project [key#0]
   Sort [key#0 ASC], true
LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 
 == Physical Plan ==
 Limit 5
  Project [key#0]
   Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
 PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 
 to
 == Parsed Logical Plan ==
 'Limit 5
  'Project ['key]
   'Subquery t
'Sort ['key ASC], true
 'Project [*]
  'UnresolvedRelation [testData], None
 == Analyzed Logical Plan ==
 Limit 5
  Project [key#0]
   Subquery t
Sort [key#0 ASC], true
 Project [key#0,value#1]
  Subquery testData
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
 == Optimized Logical Plan ==
 Project [key#0]
  Limit 5
   Sort [key#0 ASC], true
LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 
 == Physical Plan ==
 Project [key#0]
  TakeOrdered 5, [key#0 ASC]
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
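
A self-contained sketch of that rewrite on a toy plan ADT (not Catalyst's real classes or rule API):

{code}
sealed trait Plan
case class Scan(table: String) extends Plan
case class Sort(keys: Seq[String], child: Plan) extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
case class TakeOrdered(n: Int, keys: Seq[String], child: Plan) extends Plan

object CombineLimitSortSketch {
  // Limit over Project over Sort collapses into a top-K operator, avoiding
  // the total ordering (and its range-partitioning exchange) entirely.
  def optimize(plan: Plan): Plan = plan match {
    case Limit(n, Project(cols, Sort(keys, child))) =>
      Project(cols, TakeOrdered(n, keys, child))
    case other => other
  }

  def main(args: Array[String]): Unit =
    println(optimize(
      Limit(5, Project(Seq("key"), Sort(Seq("key ASC"), Scan("testData"))))))
}
{code}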



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5240) Adding `createDataSourceTable` interface to Catalog

2015-05-19 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang closed SPARK-5240.
---
Resolution: Won't Fix

 Adding `createDataSourceTable` interface to Catalog
 ---

 Key: SPARK-5240
 URL: https://issues.apache.org/jira/browse/SPARK-5240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Fei Wang

 Adding `createDataSourceTable` interface to Catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7656) use CatalystConf in FunctionRegistry

2015-05-14 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7656:
---

 Summary: use CatalystConf in FunctionRegistry
 Key: SPARK-7656
 URL: https://issues.apache.org/jira/browse/SPARK-7656
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


should use CatalystConf in FunctionRegistry



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7631) treenode argString should not print children

2015-05-14 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7631:
---

 Summary: treenode argString should not print children
 Key: SPARK-7631
 URL: https://issues.apache.org/jira/browse/SPARK-7631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


spark-sql> explain extended
  select * from (
  select key from src union all
  select key from src) t;

the Spark plan prints its children in argString:
 
== Physical Plan ==
Union[ HiveTableScan [key#1], (MetastoreRelation default, src, None), None,
 HiveTableScan [key#3], (MetastoreRelation default, src, None), None]
 HiveTableScan [key#1], (MetastoreRelation default, src, None), None
 HiveTableScan [key#3], (MetastoreRelation default, src, None), None
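
A toy illustration of the fix direction (not the real TreeNode code): when building argString from the case-class fields, skip any field that is already a child, so children are printed only once, by the tree itself.

{code}
object ArgStringSketch {
  case class Node(name: String, flag: Boolean, children: Seq[Node]) {
    def argString: String = productIterator.flatMap {
      case cs: Seq[_] if cs == children => None // already rendered as tree children
      case other                        => Some(other.toString)
    }.mkString(", ")
  }

  def main(args: Array[String]): Unit = {
    val union = Node("Union", flag = true, children = Seq(Node("Scan", false, Nil)))
    println(union.argString) // "Union, true" -- the child Scan is not repeated
  }
}
{code}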



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is windowfunction analysis error

2015-05-14 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7659:
---

 Summary: Sort by attributes that are not present in the SELECT 
clause when there is windowfunction analysis error
 Key: SPARK-7659
 URL: https://issues.apache.org/jira/browse/SPARK-7659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


The following SQL gets an analysis error:
select month,
sum(product) over (partition by month)
from windowData order by area
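
A hedged workaround sketch (assuming a HiveContext in scope, since window functions require one at this version): add the sort column to the projection so the analyzer can resolve the ORDER BY alongside the window function.

{code}
sqlContext.sql(
  """SELECT month, area, SUM(product) OVER (PARTITION BY month) AS s
    |FROM windowData
    |ORDER BY area""".stripMargin)
{code}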



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6929) Alias for more complex expression causes attribute not been able to resolve

2015-05-09 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536627#comment-14536627
 ] 

Fei Wang commented on SPARK-6929:
-

Spark SQL uses c_0 as an internal alias name; I think you can try another alias 
name that does not match c_${number}.
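
For illustration, a simplified rename following that suggestion (the aggregate is trimmed down; `col_a`/`col_b` are arbitrary names chosen to avoid the internal c_<n> pattern):

{code}
sqlContext.sql(
  """SELECT g_0.test1 AS col_a, AVG(g_0.test2) AS col_b
    |FROM sometable AS g_0
    |GROUP BY g_0.test1 ORDER BY col_a LIMIT 502""".stripMargin)
{code}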

 Alias for more complex expression causes attribute not been able to resolve
 ---

 Key: SPARK-6929
 URL: https://issues.apache.org/jira/browse/SPARK-6929
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michał Warecki
Priority: Critical

 I've extracted the minimal query that doesn't work with aliases. You can remove 
 the tstudent expression ((tstudent((COUNT(g_0.test2_value) - 1)) from that query 
 and the result will be the same. In the exception you can see that c_0 is not 
 resolved but c_1 causes that problem.
 {code}
 SELECT g_0.test1 AS c_0, (AVG(g_0.test2) - ((tstudent((COUNT(g_0.test2_value) 
 - 1)) * stddev(g_0.test2_value)) / sqrt(convert(COUNT(g_0.test2), long AS 
 c_1 FROM sometable AS g_0 GROUP BY g_0.test1 ORDER BY c_0 LIMIT 502
 {code}
 cause exception:
 {code}
 Remote org.apache.spark.sql.AnalysisException: cannot resolve 'c_0' given 
 input columns c_0, c_1; line 1 pos 246
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 

[jira] [Created] (SPARK-7303) push down project if possible when the child is sort

2015-05-01 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7303:
---

 Summary: push down project if possible when the child is sort
 Key: SPARK-7303
 URL: https://issues.apache.org/jira/browse/SPARK-7303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


Optimize the case of `project(_, sort)`; an example is:

`select key from (select * from testData order by key) t`

optimize it from
```
== Parsed Logical Plan ==
'Project ['key]
 'Subquery t
  'Sort ['key ASC], true
   'Project [*]
    'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Project [key#0]
 Subquery t
  Sort [key#0 ASC], true
   Project [key#0,value#1]
    Subquery testData
     LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Optimized Logical Plan ==
Project [key#0]
 Sort [key#0 ASC], true
  LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
Project [key#0]
 Sort [key#0 ASC], true
  Exchange (RangePartitioning [key#0 ASC], 5), []
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 
```

to 
```
== Parsed Logical Plan ==
'Project ['key]
 'Subquery t
  'Sort ['key ASC], true
   'Project [*]
    'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Project [key#0]
 Subquery t
  Sort [key#0 ASC], true
   Project [key#0,value#1]
    Subquery testData
     LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Optimized Logical Plan ==
Sort [key#0 ASC], true
 Project [key#0]
  LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
Sort [key#0 ASC], true
 Exchange (RangePartitioning [key#0 ASC], 5), []
  Project [key#0]
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 
```
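
A minimal sketch of such an optimizer rule, assuming the 1.3-era Catalyst API (the guard in a real rule would need to be stricter about aliases):

```
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Push the projection below the sort when the sort keys survive the
// projection, so less data flows through the range-partitioned exchange.
object PushProjectThroughSort extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p @ Project(fields, Sort(order, global, child))
        if order.forall(_.references.subsetOf(p.outputSet)) =>
      Sort(order, global, Project(fields, child))
  }
}
```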



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7289) Combine Limit and Sort to avoid total ordering

2015-05-01 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-7289:

Description: 
Optimize the following SQL:

select key from (select * from testData order by key) t limit 5

from 

== Parsed Logical Plan ==
'Limit 5
 'Project ['key]
  'Subquery t
   'Sort ['key ASC], true
    'Project [*]
     'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Project [key#0]
  Subquery t
   Sort [key#0 ASC], true
    Project [key#0,value#1]
     Subquery testData
      LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Optimized Logical Plan ==
Limit 5
 Project [key#0]
  Sort [key#0 ASC], true
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
Limit 5
 Project [key#0]
  Sort [key#0 ASC], true
   Exchange (RangePartitioning [key#0 ASC], 5), []
    PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 

to

== Parsed Logical Plan ==
'Limit 5
 'Project ['key]
  'Subquery t
   'Sort ['key ASC], true
    'Project [*]
     'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Project [key#0]
  Subquery t
   Sort [key#0 ASC], true
    Project [key#0,value#1]
     Subquery testData
      LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Optimized Logical Plan ==
Project [key#0]
 Limit 5
  Sort [key#0 ASC], true
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
Project [key#0]
 TakeOrdered 5, [key#0 ASC]
  PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
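
The physical plan above is the point of the change: Limit over a global Sort becomes a single TakeOrdered operator. A minimal sketch of the planning rule with toy nodes (the real strategy lives in Spark's SparkStrategies):

{code}
// Toy plan nodes, for illustration only.
sealed trait Plan
case class Sort(order: Seq[String], global: Boolean, child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
case class TakeOrdered(n: Int, order: Seq[String], child: Plan) extends Plan

// Each partition keeps only its top-n rows in a bounded priority queue
// and the driver merges them, so no total ordering is ever materialized.
def combineLimitAndSort(plan: Plan): Plan = plan match {
  case Limit(n, Sort(order, true, child)) => TakeOrdered(n, order, child)
  case other => other
}
{code}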

  was:
Optimize the following SQL:
`select key from (select * from testData limit 5) t order by key limit 5`

optimize it from 
```
== Parsed Logical Plan ==
'Limit 5
 'Sort ['key ASC], true
  'Project ['key]
   'Subquery t
    'Limit 5
     'Project [*]
      'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   Subquery t
    Limit 5
     Project [key#0,value#1]
      Subquery testData
       LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Optimized Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   Limit 5
    LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
TakeOrdered 5, [key#0 ASC]
 Project [key#0]
  Limit 5
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 

```
to 
```
== Parsed Logical Plan ==
'Limit 5
 'Sort ['key ASC], true
  'Project ['key]
   'Subquery t
    'Limit 5
     'Project [*]
      'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   Subquery t
    Limit 5
     Project [key#0,value#1]
      Subquery testData
       LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Optimized Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Physical Plan ==
TakeOrdered 5, [key#0 ASC]
 Project [key#0]
  PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 
```

Summary: Combine Limit and Sort to avoid total ordering  (was: push 
down sort when its child is Limit)

 Combine Limit and Sort to avoid total ordering
 --

 Key: SPARK-7289
 URL: https://issues.apache.org/jira/browse/SPARK-7289
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang

 Optimize the following SQL:
 select key from (select * from testData order by key) t limit 5

 from 

 == Parsed Logical Plan ==
 'Limit 5
  'Project ['key]
   'Subquery t
    'Sort ['key ASC], true
     'Project [*]
      'UnresolvedRelation [testData], None

 == Analyzed Logical Plan ==
 Limit 5
  Project [key#0]
   Subquery t
    Sort [key#0 ASC], true
     Project [key#0,value#1]
      Subquery testData
       LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

 == Optimized Logical Plan ==
 Limit 5
  Project [key#0]
   Sort [key#0 ASC], true
    LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

 == Physical Plan ==
 Limit 5
  Project [key#0]
   Sort [key#0 ASC], true
    Exchange (RangePartitioning [key#0 ASC], 5), []
     PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 

 to

 == Parsed Logical Plan ==
 'Limit 5
  'Project ['key]
   'Subquery t
    'Sort ['key ASC], true
     'Project [*]
      'UnresolvedRelation [testData], None

 == Analyzed Logical Plan ==
 Limit 5
  Project [key#0]
   Subquery t
    Sort [key#0 ASC], true
     Project [key#0,value#1]
      Subquery testData
       LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

 == Optimized Logical Plan ==
 Project [key#0]
  Limit 5
   Sort [key#0 ASC], true
    LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

 == Physical Plan ==
 Project [key#0]
  TakeOrdered 5, [key#0 ASC]
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, 

[jira] [Created] (SPARK-7289) push down sort when its child is Limit

2015-04-30 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7289:
---

 Summary: push down sort when its child is Limit
 Key: SPARK-7289
 URL: https://issues.apache.org/jira/browse/SPARK-7289
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


Optimize the following SQL:
`select key from (select * from testData limit 5) t order by key limit 5`

optimize it from 
```
== Parsed Logical Plan ==
'Limit 5
 'Sort ['key ASC], true
  'Project ['key]
   'Subquery t
    'Limit 5
     'Project [*]
      'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   Subquery t
    Limit 5
     Project [key#0,value#1]
      Subquery testData
       LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Optimized Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   Limit 5
    LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
TakeOrdered 5, [key#0 ASC]
 Project [key#0]
  Limit 5
   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 

```
to 
```
== Parsed Logical Plan ==
'Limit 5
 'Sort ['key ASC], true
  'Project ['key]
   'Subquery t
    'Limit 5
     'Project [*]
      'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   Subquery t
    Limit 5
     Project [key#0,value#1]
      Subquery testData
       LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Optimized Logical Plan ==
Limit 5
 Sort [key#0 ASC], true
  Project [key#0]
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Physical Plan ==
TakeOrdered 5, [key#0 ASC]
 Project [key#0]
  PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7232) Add a Substitution batch for spark sql analyzer

2015-04-29 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7232:
---

 Summary: Add a Substitution batch for spark sql analyzer
 Key: SPARK-7232
 URL: https://issues.apache.org/jira/browse/SPARK-7232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


Add a new batch named `Substitution` before the Resolution batch. The 
motivation is that there are cases where we want to perform substitutions on 
the parsed logical plan before resolving it.
Consider these two cases:
1. CTE: for a CTE we first build a raw logical plan:
'With Map(q1 -> 'Subquery q1
 'Project ['key]
   'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None

The `With` logical plan stores a map of (q1 -> subquery); we first want to take 
off the With node and substitute the q1 UnresolvedRelation with the subquery.

2. Window functions: a user may define named windows; we also need to 
substitute the window name in the child with the concrete window definition. 
This should also be done in the Substitution batch.
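
A minimal sketch of the CTE half of such a batch, with toy plan nodes (names hypothetical; the real rule works on Catalyst's LogicalPlan):

{code}
// Toy plan nodes, for illustration only.
sealed trait Plan
case class With(child: Plan, cteRelations: Map[String, Plan]) extends Plan
case class UnresolvedRelation(name: String) extends Plan
case class Project(exprs: Seq[String], child: Plan) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

// Bottom-up rewrite helper.
def transform(p: Plan)(f: PartialFunction[Plan, Plan]): Plan = {
  val rewritten = p match {
    case With(c, m)      => With(transform(c)(f), m)
    case Project(e, c)   => Project(e, transform(c)(f))
    case Filter(cond, c) => Filter(cond, transform(c)(f))
    case leaf            => leaf
  }
  f.applyOrElse(rewritten, identity[Plan])
}

// Strip the With node and splice each CTE definition in place of the
// UnresolvedRelation that names it, before any resolution happens.
def substituteCTE(plan: Plan): Plan = plan match {
  case With(child, ctes) =>
    transform(child) { case UnresolvedRelation(n) if ctes.contains(n) => ctes(n) }
  case other => other
}
{code}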




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7163) minor refactoring for HiveQl

2015-04-27 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7163:
---

 Summary: minor refactoring for HiveQl
 Key: SPARK-7163
 URL: https://issues.apache.org/jira/browse/SPARK-7163
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


HiveQl has grown into a large object; refactor it to make it cleaner and more 
readable:
1. Move the ASTNode-related util methods/objects to a new object named 
HiveASTNodeUtil.
2. Delete unused methods in HiveQl.
3. Override `sqlParser` in HiveContext with `ExtendedHiveQlParser`, instead of 
creating a new `ddlParserWithHiveQL` and calling `HiveQl.parseSql` in 
HiveContext.
4. Rename HiveQl to HiveQlConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7123) support table.star in sqlcontext

2015-04-24 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7123:
---

 Summary: support table.star in sqlcontext
 Key: SPARK-7123
 URL: https://issues.apache.org/jira/browse/SPARK-7123
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


Support this SQL:

SELECT r.*
FROM testData l join testData2 r on (l.key = r.a)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7093) Using newPredicate in NestedLoopJoin to enable code generation

2015-04-23 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7093:
---

 Summary: Using newPredicate in NestedLoopJoin to enable code 
generation
 Key: SPARK-7093
 URL: https://issues.apache.org/jira/browse/SPARK-7093
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Fei Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7109) Push down left side filter for left semi join

2015-04-23 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7109:
---

 Summary: Push down left side filter for left semi join
 Key: SPARK-7109
 URL: https://issues.apache.org/jira/browse/SPARK-7109
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


Currently the Spark SQL optimizer only pushes down the right-side filter; we 
can also push down the left-side filter for a left semi join.
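
A minimal sketch of the rule, assuming the 1.3-era Catalyst API (simplified; the shipped patch may differ):

{code}
import org.apache.spark.sql.catalyst.expressions.{And, PredicateHelper}
import org.apache.spark.sql.catalyst.plans.LeftSemi
import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Conjuncts of the semi-join condition that reference only left-side
// attributes become a Filter on the left child, mirroring what is
// already done for the right side.
object PushLeftSemiJoinFilter extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Join(left, right, LeftSemi, Some(condition)) =>
      val (leftOnly, rest) = splitConjunctivePredicates(condition)
        .partition(_.references.subsetOf(left.outputSet))
      val newLeft =
        if (leftOnly.nonEmpty) Filter(leftOnly.reduce(And), left) else left
      Join(newLeft, right, LeftSemi, rest.reduceOption(And))
  }
}
{code}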



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5659) Flaky test: o.a.s.streaming.ReceiverSuite.block

2015-04-22 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508302#comment-14508302
 ] 

Fei Wang commented on SPARK-5659:
-

My local run of dev/run-tests also hits this issue.

org.apache.spark.streaming.ReceiverSuite.block generator throttling

Failing for the past 2 builds (since Aborted #3); duration: 2.2 s
Error Message

126 was greater than or equal to 95.0, but 126 was not less than or equal to 
105.0 # records in received blocks = 
[91,369,294,39,100,100,101,99,100,100,100,100,100,101,100,99,5], not between 
95.0 and 105.0, on average
Stacktrace

sbt.ForkMain$ForkError: 126 was greater than or equal to 95.0, but 126 was not 
less than or equal to 105.0 # records in received blocks = 
[91,369,294,39,100,100,101,99,100,100,100,100,100,101,100,99,5], not between 
95.0 and 105.0, on average
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply$mcV$sp(ReceiverSuite.scala:207)
at 
org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply(ReceiverSuite.scala:158)
at 
org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply(ReceiverSuite.scala:158)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.streaming.ReceiverSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceiverSuite.scala:39)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at 
org.apache.spark.streaming.ReceiverSuite.runTest(ReceiverSuite.scala:39)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at 
org.apache.spark.streaming.ReceiverSuite.org$scalatest$BeforeAndAfter$$super$run(ReceiverSuite.scala:39)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at org.apache.spark.streaming.ReceiverSuite.run(ReceiverSuite.scala:39)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

 Flaky test: 
