[jira] [Commented] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695203#comment-15695203 ] Sean Owen commented on SPARK-18584: --- It sounds like you ran both jobs as the same user but expect them to have different authorization. This is an error from HDFS ACLs. It isn't a Spark problem because it sounds like the jobs were not run as the user you meant to. > multiple Spark Thrift Servers running in the same machine throws > org.apache.hadoop.security.AccessControlException > -- > > Key: SPARK-18584 > URL: https://issues.apache.org/jira/browse/SPARK-18584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0 > spark2.0.2 >Reporter: tanxinz > Fix For: 2.0.2 > > > In spark2.0.2 , I have two users(etl , dev ) start Spark Thrift Server in the > same machine . I connected by beeline etl STS to execute a command,and > throwed org.apache.hadoop.security.AccessControlException.I don't know why is > dev user perform,not etl. > ``` > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denied: user=dev, access=EXECUTE, > inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::--- > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18581) MultivariateGaussian returns pdf value larger than 1
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-18581: Description: When training GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix. The computation is simplified by using the pseudo-determinant of a positive definite matrix. However, if the eigenvalues are all between 0 and 1, log(pseudo-determinant) will be a negative number such as -50. As a result, the logpdf becomes positive (pdf > 1). The related code is the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigenvalue vector here. If many of its elements are between 0 and 1, then logPseudoDetSigma could be negative. Maybe we should just use the breeze 'det' operation on sigma to get the right but slow answer instead of a quick, wrong one. was: When training GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix. The computation is simplified by using the pseudo-determinant of a positive definite matrix. However, if the eigenvalues are all between 0 and 1, log(pseudo-determinant) will be a negative number such as -50. As a result, the logpdf becomes positive (pdf > 1). The related code is the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} Maybe we should just use the breeze 'det' operation on sigma to get the right but slow answer instead of a quick, wrong one. > MultivariateGaussian returns pdf value larger than 1 > > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probabilities much larger than > 1. That led me to the fact that the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that the problem lies in the computation of > the determinant of the covariance matrix. > The computation is simplified by using the pseudo-determinant of a positive > definite matrix. However, if the eigenvalues are all between 0 and 1, > log(pseudo-determinant) will be a negative number such as -50. As a result, > the logpdf becomes positive (pdf > 1). > The related code is the following: > // In function: MultivariateGaussian.calculateCovarianceConstants() > {code} > val logPseudoDetSigma = d.activeValuesIterator.filter(_ > > tol).map(math.log).sum > {code} > d is the eigenvalue vector here. If many of its elements are between 0 and > 1, then logPseudoDetSigma could be negative. > Maybe we should just use the breeze 'det' operation on sigma to get the right > but slow answer instead of a quick, wrong one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
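For readers less familiar with the density involved, the standard multivariate Gaussian log-density below (general background, not code quoted from the Spark source) shows why a large negative log-determinant inflates the pdf:

{code}
\log p(x) = -\frac{1}{2}\Big( k \log(2\pi) + \log\det\Sigma + (x - \mu)^\top \Sigma^{-1} (x - \mu) \Big)
{code}

With \log\det\Sigma \approx -50, the constant part of the log-density gains roughly +25, so for points near the mean the log-pdf is positive and the reported pdf can be enormous, consistent with the 10^5 values described above.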
[jira] [Created] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer
Yanbo Liang created SPARK-18587: --- Summary: Remove handleInvalid from QuantileDiscretizer Key: SPARK-18587 URL: https://issues.apache.org/jira/browse/SPARK-18587 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Priority: Critical {{handleInvalid}} only takes effect when {{Bucketizer}} transforms a dataset that contains NaN; however, when the training dataset contains NaN, {{QuantileDiscretizer}} will always ignore them. So we should keep {{handleInvalid}} in {{Bucketizer}} and remove it from {{QuantileDiscretizer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
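To make the split of responsibilities concrete, here is a minimal, hypothetical sketch against the 2.1-era ML API (the column name and toy data are invented for illustration, and it assumes {{Bucketizer}} exposes {{setHandleInvalid}} as discussed in this issue):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("qd-sketch").getOrCreate()
import spark.implicits._

// Toy data: the NaN row is exactly what handleInvalid is about.
val df = Seq(1.0, 2.0, 3.0, 4.0, Double.NaN).toDF("hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(2)

// Fitting computes the split points and simply ignores the NaN row.
val bucketizer = discretizer.fit(df)

// Only the fitted Bucketizer decides what to do with NaN at transform time.
bucketizer.setHandleInvalid("skip").transform(df).show()
{code}

This mirrors the argument above: NaN handling is a transform-time concern, so the parameter belongs on {{Bucketizer}} rather than on {{QuantileDiscretizer}}.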
[jira] [Commented] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695291#comment-15695291 ] Apache Spark commented on SPARK-18587: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/16011 > Remove handleInvalid from QuantileDiscretizer > - > > Key: SPARK-18587 > URL: https://issues.apache.org/jira/browse/SPARK-18587 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Critical > > {{handleInvalid}} only takes effect when {{Bucketizer}} transforms a dataset that > contains NaN; however, when the training dataset contains NaN, > {{QuantileDiscretizer}} will always ignore them. So we should keep > {{handleInvalid}} in {{Bucketizer}} and remove it from > {{QuantileDiscretizer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18587: Assignee: (was: Apache Spark) > Remove handleInvalid from QuantileDiscretizer > - > > Key: SPARK-18587 > URL: https://issues.apache.org/jira/browse/SPARK-18587 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Critical > > {{handleInvalid}} only takes effect when {{Bucketizer}} transforms a dataset that > contains NaN; however, when the training dataset contains NaN, > {{QuantileDiscretizer}} will always ignore them. So we should keep > {{handleInvalid}} in {{Bucketizer}} and remove it from > {{QuantileDiscretizer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18587: Assignee: Apache Spark > Remove handleInvalid from QuantileDiscretizer > - > > Key: SPARK-18587 > URL: https://issues.apache.org/jira/browse/SPARK-18587 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Critical > > {{handleInvalid}} only takes effect when {{Bucketizer}} transforms a dataset that > contains NaN; however, when the training dataset contains NaN, > {{QuantileDiscretizer}} will always ignore them. So we should keep > {{handleInvalid}} in {{Bucketizer}} and remove it from > {{QuantileDiscretizer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-18581: Summary: MultivariateGaussian does not check if covariance matrix is invertible (was: MultivariateGaussian returns pdf value larger than 1) > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. However, if the eigen value is all between 0 and 1, > log(pseudo-determinant) will be a negative number like, -50. As a result, > the logpdf becomes positive (pdf > 1) > The related code that the following: > // In function: MultivariateGaussian.calculateCovarianceConstants() > {code} > val logPseudoDetSigma = d.activeValuesIterator.filter(_ > > tol).map(math.log).sum > {code} > d is the eigen value vector here. If lots of its elements are between 0 and > 1, then logPseudoDetSigma could be negative. > Maybe we should just use the breeze 'det' opertion on sigma to get the right > but slow answer instead of a quick, wrong one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-18581: Description: When training GaussianMixtureModel, I found some probability much larger than 1. That leads me to that fact that, the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that problem lies in the computation of determinant of the covariance matrix. The computation is simplified by using pseudo-determinant of a positive defined matrix. However, if the eigen value is all between 0 and 1, log(pseudo-determinant) will be a negative number like, -50. As a result, the logpdf becomes positive (pdf > 1) The related code that the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigen value vector here. If lots of its elements are between 0 and 1, then logPseudoDetSigma could be negative. was: When training GaussianMixtureModel, I found some probability much larger than 1. That leads me to that fact that, the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that problem lies in the computation of determinant of the covariance matrix. The computation is simplified by using pseudo-determinant of a positive defined matrix. However, if the eigen value is all between 0 and 1, log(pseudo-determinant) will be a negative number like, -50. As a result, the logpdf becomes positive (pdf > 1) The related code that the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigen value vector here. If lots of its elements are between 0 and 1, then logPseudoDetSigma could be negative. Maybe we should just use the breeze 'det' opertion on sigma to get the right but slow answer instead of a quick, wrong one. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. However, if the eigen value is all between 0 and 1, > log(pseudo-determinant) will be a negative number like, -50. As a result, > the logpdf becomes positive (pdf > 1) > The related code that the following: > // In function: MultivariateGaussian.calculateCovarianceConstants() > {code} > val logPseudoDetSigma = d.activeValuesIterator.filter(_ > > tol).map(math.log).sum > {code} > d is the eigen value vector here. If lots of its elements are between 0 and > 1, then logPseudoDetSigma could be negative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-18581: Description: When training GaussianMixtureModel, I found some probability much larger than 1. That leads me to that fact that, the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that problem lies in the computation of determinant of the covariance matrix. The computation is simplified by using pseudo-determinant of a positive defined matrix. However, if the eigen value is all between 0 and 1, log(pseudo-determinant) will be a negative number like, -50. As a result, the logpdf could be positive, thus pdf > 1 The related code that the following: // In function: MultivariateGaussian.calculateCovarianceConstants {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigen value vector here. If lots of its elements are between 0 and 1, then logPseudoDetSigma could be negative. was: When training GaussianMixtureModel, I found some probability much larger than 1. That leads me to that fact that, the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that problem lies in the computation of determinant of the covariance matrix. The computation is simplified by using pseudo-determinant of a positive defined matrix. However, if the eigen value is all between 0 and 1, log(pseudo-determinant) will be a negative number like, -50. As a result, the logpdf could be positive, thus pdf > 1 The related code that the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigen value vector here. If lots of its elements are between 0 and 1, then logPseudoDetSigma could be negative. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. However, if the eigen value is all between 0 and 1, > log(pseudo-determinant) will be a negative number like, -50. As a result, > the logpdf could be positive, thus pdf > 1 > The related code that the following: > // In function: MultivariateGaussian.calculateCovarianceConstants > {code} > val logPseudoDetSigma = d.activeValuesIterator.filter(_ > > tol).map(math.log).sum > {code} > d is the eigen value vector here. If lots of its elements are between 0 and > 1, then logPseudoDetSigma could be negative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-18581: Description: When training GaussianMixtureModel, I found some probability much larger than 1. That leads me to that fact that, the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that problem lies in the computation of determinant of the covariance matrix. The computation is simplified by using pseudo-determinant of a positive defined matrix. However, if the eigen value is all between 0 and 1, log(pseudo-determinant) will be a negative number like, -50. As a result, the logpdf could be positive, thus pdf > 1 The related code that the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigen value vector here. If lots of its elements are between 0 and 1, then logPseudoDetSigma could be negative. was: When training GaussianMixtureModel, I found some probability much larger than 1. That leads me to that fact that, the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that problem lies in the computation of determinant of the covariance matrix. The computation is simplified by using pseudo-determinant of a positive defined matrix. However, if the eigen value is all between 0 and 1, log(pseudo-determinant) will be a negative number like, -50. As a result, the logpdf becomes positive (pdf > 1) The related code that the following: // In function: MultivariateGaussian.calculateCovarianceConstants() {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigen value vector here. If lots of its elements are between 0 and 1, then logPseudoDetSigma could be negative. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. However, if the eigen value is all between 0 and 1, > log(pseudo-determinant) will be a negative number like, -50. As a result, > the logpdf could be positive, thus pdf > 1 > The related code that the following: > // In function: MultivariateGaussian.calculateCovarianceConstants() > {code} > val logPseudoDetSigma = d.activeValuesIterator.filter(_ > > tol).map(math.log).sum > {code} > d is the eigen value vector here. If lots of its elements are between 0 and > 1, then logPseudoDetSigma could be negative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18575) Keep same style: adjust the position of driver log links
[ https://issues.apache.org/jira/browse/SPARK-18575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18575. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 16001 [https://github.com/apache/spark/pull/16001] > Keep same style: adjust the position of driver log links > > > Key: SPARK-18575 > URL: https://issues.apache.org/jira/browse/SPARK-18575 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Priority: Minor > Fix For: 2.1.0 > > > {{NOT BUG}}, just adjust the position of driver log link to keep the same > style with other executors log link. > !https://cloud.githubusercontent.com/assets/7402327/20590092/f8bddbb8-b25b-11e6-9aaf-3b5b3073df10.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18575) Keep same style: adjust the position of driver log links
[ https://issues.apache.org/jira/browse/SPARK-18575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18575: -- Assignee: Genmao Yu Target Version/s: (was: 2.0.3, 2.1.0) Priority: Trivial (was: Minor) > Keep same style: adjust the position of driver log links > > > Key: SPARK-18575 > URL: https://issues.apache.org/jira/browse/SPARK-18575 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Trivial > Fix For: 2.1.0 > > > {{NOT BUG}}, just adjust the position of driver log link to keep the same > style with other executors log link. > !https://cloud.githubusercontent.com/assets/7402327/20590092/f8bddbb8-b25b-11e6-9aaf-3b5b3073df10.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695359#comment-15695359 ] Sean Owen commented on SPARK-18581: --- It sounds like there's definitely a problem here but why is the pseudo-determinant the issue? it's possible for eigenvalues to be in (0,1) in which case the sum of logs is negative. Is that a problem per se? > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. However, if the eigen value is all between 0 and 1, > log(pseudo-determinant) will be a negative number like, -50. As a result, > the logpdf could be positive, thus pdf > 1 > The related code that the following: > // In function: MultivariateGaussian.calculateCovarianceConstants > {code} > val logPseudoDetSigma = d.activeValuesIterator.filter(_ > > tol).map(math.log).sum > {code} > d is the eigen value vector here. If lots of its elements are between 0 and > 1, then logPseudoDetSigma could be negative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18554) leader master lost the leadership, when the slave becomes master, the previous app's state displays as waiting
[ https://issues.apache.org/jira/browse/SPARK-18554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18554. --- Resolution: Duplicate [~jerryshao], is anything blocking you from proceeding on that PR? > leader master lost the leadership, when the slave becomes master, the > previous app's state displays as waiting > -- > > Key: SPARK-18554 > URL: https://issues.apache.org/jira/browse/SPARK-18554 > Project: Spark > Issue Type: Bug > Components: Deploy, Web UI >Affects Versions: 1.6.1 > Environment: java1.8 >Reporter: liujianhui >Priority: Minor > > when the leader master loses the leadership and the slave becomes master, the > state of the app in the web UI will display as waiting; the code is as follows: > case MasterChangeAcknowledged(appId) => { > idToApp.get(appId) match { > case Some(app) => > logInfo("Application has been re-registered: " + appId) > app.state = ApplicationState.WAITING > case None => > logWarning("Master change ack from unknown app: " + appId) > } > if (canCompleteRecovery) { completeRecovery() } > The state of the app should be RUNNING instead of WAITING. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-18324: --- Assignee: Yanbo Liang > ML, Graph 2.1 QA: Programming guide update and migration guide > -- > > Key: SPARK-18324 > URL: https://issues.apache.org/jira/browse/SPARK-18324 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Critical > > Before the release, we need to update the MLlib and GraphX Programming > Guides. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18119) Namenode safemode check is only performed on one namenode, which can stall the startup of the Spark History server
[ https://issues.apache.org/jira/browse/SPARK-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18119. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15648 [https://github.com/apache/spark/pull/15648] > Namenode safemode check is only performed on one namenode, which can stall the > startup of the Spark History server > > > Key: SPARK-18119 > URL: https://issues.apache.org/jira/browse/SPARK-18119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 > Environment: HDFS cdh 5.5.0 with HA namenode >Reporter: Nicolas Fraison >Priority: Minor > Fix For: 2.1.0 > > > Spark History server startup is stuck when one of the 2 HA namenodes is in > safemode, displaying this log message: HDFS is still in safe mode. Waiting... > This happens even if one of the 2 namenodes is in active mode, because it only > requests the first of the 2 available namenodes in an HA configuration -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18119) Namenode safemode check is only performed on one namenode, which can stall the startup of the Spark History server
[ https://issues.apache.org/jira/browse/SPARK-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18119: -- Assignee: Nicolas Fraison > Namenode safemode check is only performed on one namenode, which can stall the > startup of the Spark History server > > > Key: SPARK-18119 > URL: https://issues.apache.org/jira/browse/SPARK-18119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 > Environment: HDFS cdh 5.5.0 with HA namenode >Reporter: Nicolas Fraison >Assignee: Nicolas Fraison >Priority: Minor > Fix For: 2.1.0 > > > Spark History server startup is stuck when one of the 2 HA namenodes is in > safemode, displaying this log message: HDFS is still in safe mode. Waiting... > This happens even if one of the 2 namenodes is in active mode, because it only > requests the first of the 2 available namenodes in an HA configuration -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18586) netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193
[ https://issues.apache.org/jira/browse/SPARK-18586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18586: -- Priority: Major (was: Critical) Spark doesn't use netty 3, but it is pulled in as a transitive dependency. We can't get rid of it, but, it also isn't even necessarily exposed. Do these CVEs even affect Spark? We can try managing the version up to 3.8.3 to resolve one, or 3.9.x to resolve both, but this won't change the version of Netty that ends up on the classpath if deploying on an existing cluster. > netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193 > > > Key: SPARK-18586 > URL: https://issues.apache.org/jira/browse/SPARK-18586 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: meiyoula > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
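For anyone who does want to manage the transitive version up in their own build, a fragment like the following sketch could do it (sbt-style and purely illustrative; the coordinates and the exact 3.9.x patch release should be checked against the real dependency tree rather than taken from here):

{code}
// Hypothetical sbt override: force the transitive netty 3 artifact to a newer 3.9.x release.
dependencyOverrides += "io.netty" % "netty" % "3.9.9.Final"
{code}

As noted above, this only changes what a build packages; it does not change the netty jar already present on an existing cluster's classpath.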
[jira] [Resolved] (SPARK-18548) OnlineLDAOptimizer reads the same broadcast data after deletion
[ https://issues.apache.org/jira/browse/SPARK-18548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18548. --- Resolution: Duplicate > OnlineLDAOptimizer reads the same broadcast data after deletion > --- > > Key: SPARK-18548 > URL: https://issues.apache.org/jira/browse/SPARK-18548 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: Xiaoye Sun >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > In submitMiniBatch() called by OnlineLDAOptimizer, the broadcast variable > expElogbeta is deleted before its second use, which causes the > executor to read the same large broadcast data twice. I suggest moving the > broadcast data deletion (expElogbetaBc.unpersist()) later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18481: Description: Remove deprecated methods for ML. We removed was:Remove deprecated methods for ML. > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > Remove deprecated methods for ML. > We removed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18481: Priority: Major (was: Minor) > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Remove deprecated methods for ML. > We removed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18481: Priority: Minor (was: Major) > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > Remove deprecated methods for ML. > We removed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18481: Description: Remove deprecated methods for ML. We removed the following public APIs: org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol org.apache.spark.ml.regression.LinearRegressionSummary.model org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees was: Remove deprecated methods for ML. We removed > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > Remove deprecated methods for ML. > We removed the following public APIs: > org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees > org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol > org.apache.spark.ml.regression.LinearRegressionSummary.model > org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18481: Description: Remove deprecated methods for ML. We removed the following public APIs in this JIRA: org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol org.apache.spark.ml.regression.LinearRegressionSummary.model org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees was: Remove deprecated methods for ML. We removed the following public APIs: org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol org.apache.spark.ml.regression.LinearRegressionSummary.model org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > Remove deprecated methods for ML. > We removed the following public APIs in this JIRA: > org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees > org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol > org.apache.spark.ml.regression.LinearRegressionSummary.model > org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-18587: --- Assignee: Yanbo Liang > Remove handleInvalid from QuantileDiscretizer > - > > Key: SPARK-18587 > URL: https://issues.apache.org/jira/browse/SPARK-18587 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > > {{handleInvalid}} only takes effect when {{Bucketizer}} transforms a dataset that > contains NaN; however, when the training dataset contains NaN, > {{QuantileDiscretizer}} will always ignore them. So we should keep > {{handleInvalid}} in {{Bucketizer}} and remove it from > {{QuantileDiscretizer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695473#comment-15695473 ] Yanbo Liang commented on SPARK-18318: - Finished reviewing all classes that were added or changed after 2.0, opened SPARK-18587 and SPARK-18481 for the two major issues, and addressed other minor issues in PR #16009. Thanks. > ML, Graph 2.1 QA: API: New Scala APIs, docs > --- > > Key: SPARK-18318 > URL: https://issues.apache.org/jira/browse/SPARK-18318 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-18581: Description: When training GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix. The computation is simplified by using the pseudo-determinant of a positive definite matrix. In my case, one feature is 0 for all data points. As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 0 => the pseudo-determinant will be very close to zero. Thus, log(pseudo-determinant) will be a large negative number, which finally makes the logpdf very big and the pdf even bigger (> 1). As said in the comments of MultivariateGaussian.scala, """ Singular values are considered to be non-zero only if they exceed a tolerance based on machine precision. """ But if a singular value is considered to be zero, that means the covariance matrix is not invertible, which contradicts the assumption that it should be invertible. So we should check whether any singular value is smaller than the tolerance before computing the pseudo-determinant was: When training GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that the value returned by MultivariateGaussian.pdf can be 10^5, etc. After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix. The computation is simplified by using the pseudo-determinant of a positive definite matrix. However, if the eigenvalues are all between 0 and 1, log(pseudo-determinant) will be a negative number such as -50. As a result, the logpdf could be positive, thus pdf > 1. The related code is the following: // In function: MultivariateGaussian.calculateCovarianceConstants {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} d is the eigenvalue vector here. If many of its elements are between 0 and 1, then logPseudoDetSigma could be negative. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probabilities much larger than > 1. That led me to the fact that the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that the problem lies in the computation of > the determinant of the covariance matrix. > The computation is simplified by using the pseudo-determinant of a positive > definite matrix. > In my case, one feature is 0 for all data points. > As a result, the covariance matrix is not invertible <=> det(covariance matrix) = > 0 => the pseudo-determinant will be very close to zero. > Thus, log(pseudo-determinant) will be a large negative number, which finally > makes the logpdf very big and the pdf even bigger (> 1). > As said in the comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, that means the covariance matrix > is not invertible, which contradicts the assumption that it should > be invertible. > So we should check whether any singular value is smaller than the tolerance > before computing the pseudo-determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695481#comment-15695481 ] Hao Ren commented on SPARK-18581: - [~srowen] I have updated the description. The problem is that my covariance matrix is not invertible, since one of the features is zero for all data points. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probabilities much larger than > 1. That led me to the fact that the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that the problem lies in the computation of > the determinant of the covariance matrix. > The computation is simplified by using the pseudo-determinant of a positive > definite matrix. > In my case, one feature is 0 for all data points. > As a result, the covariance matrix is not invertible <=> det(covariance matrix) = > 0 => the pseudo-determinant will be very close to zero. > Thus, log(pseudo-determinant) will be a large negative number, which finally > makes the logpdf very big and the pdf even bigger (> 1). > As said in the comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, that means the covariance matrix > is not invertible, which contradicts the assumption that it should > be invertible. > So we should check whether any singular value is smaller than the tolerance > before computing the pseudo-determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695552#comment-15695552 ] Sean Owen commented on SPARK-18581: --- Yes, but it need not be invertible, for the reason you give. It looks like it handles this in the code. pinvS is a pseudo-inverse of the eigenvalue diagonal matrix, which can have zeroes. Backing up though, I re-read and see you're saying you get a PDF > 1, but that's perfectly normal. A PDF does not need to be <= 1. Are you, however, saying you observe a big numeric inaccuracy in this case? > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probabilities much larger than > 1. That led me to the fact that the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that the problem lies in the computation of > the determinant of the covariance matrix. > The computation is simplified by using the pseudo-determinant of a positive > definite matrix. > In my case, one feature is 0 for all data points. > As a result, the covariance matrix is not invertible <=> det(covariance matrix) = > 0 => the pseudo-determinant will be very close to zero. > Thus, log(pseudo-determinant) will be a large negative number, which finally > makes the logpdf very big and the pdf even bigger (> 1). > As said in the comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, that means the covariance matrix > is not invertible, which contradicts the assumption that it should > be invertible. > So we should check whether any singular value is smaller than the tolerance > before computing the pseudo-determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
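A quick univariate example (standard probability background, not anything Spark-specific) illustrates why a density above 1 is not by itself a bug:

{code}
p(0) = \frac{1}{\sigma\sqrt{2\pi}} \quad \text{for } \mathcal{N}(0,\sigma^2), \qquad p(0) > 1 \iff \sigma < \frac{1}{\sqrt{2\pi}} \approx 0.399
{code}

So the open question in this thread is whether the pseudo-determinant path yields numerically wrong values for a singular covariance matrix, not merely values larger than 1.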
[jira] [Commented] (SPARK-16151) Make generated params non-final
[ https://issues.apache.org/jira/browse/SPARK-16151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695614#comment-15695614 ] zhengruifeng commented on SPARK-16151: -- Some params should be made non-final, such as {{setSolver}} [SPARK-18518] and {{setStepSize}} [https://github.com/apache/spark/pull/15913]. I can work on this. > Make generated params non-final > --- > > Key: SPARK-16151 > URL: https://issues.apache.org/jira/browse/SPARK-16151 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > Currently, all generated param instances are final. There are several > scenarios where we need them to be non-final: > 1. We don't have a guideline about where we should document the param doc. > Some were inherited from the generated param traits directly, while some were > documented in the setters if we want to make changes. I think it is better to > have all documented in the param instances, which appear at the top of the > generated API doc. However, this requires inheriting the param instance. > {code} > /** > * new doc > */ > val param: Param = super.param > {code} > We can use `@inherit_doc` if we just want to add a few words, e.g., the > default value, to the inherited doc. > 2. We might want to update the embedded doc in the param instance. > Opened this JIRA for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16151) Make generated params non-final
[ https://issues.apache.org/jira/browse/SPARK-16151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695624#comment-15695624 ] zhengruifeng commented on SPARK-16151: -- I suggest that we add an option to make generated params non-final: {code} private case class ParamDesc[T: ClassTag]( name: String, doc: String, defaultValueStr: Option[String] = None, isValid: String = "", finalFields: Boolean = true, // new option to control whether the generated param has a final field or not finalMethods: Boolean = true, isExpertParam: Boolean = false) { {code} > Make generated params non-final > --- > > Key: SPARK-16151 > URL: https://issues.apache.org/jira/browse/SPARK-16151 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > Currently, all generated param instances are final. There are several > scenarios where we need them to be non-final: > 1. We don't have a guideline about where we should document the param doc. > Some were inherited from the generated param traits directly, while some were > documented in the setters if we want to make changes. I think it is better to > have all documented in the param instances, which appear at the top of the > generated API doc. However, this requires inheriting the param instance. > {code} > /** > * new doc > */ > val param: Param = super.param > {code} > We can use `@inherit_doc` if we just want to add a few words, e.g., the > default value, to the inherited doc. > 2. We might want to update the embedded doc in the param instance. > Opened this JIRA for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
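Roughly, a shared-param trait emitted with {{finalFields = false}} might look like the sketch below (illustrative only; the trait body, doc string, and visibility are assumptions, not the actual SharedParamsCodeGen output, and the real generated traits live in org.apache.spark.ml.param.shared and are private[ml]):

{code}
import org.apache.spark.ml.param.{Param, Params}

// Illustrative non-final shared param: a concrete estimator could override
// `solver` to refine its documentation or add validation.
trait HasSolver extends Params {
  val solver: Param[String] =
    new Param[String](this, "solver", "the solver algorithm for optimization")

  final def getSolver: String = $(solver)
}
{code}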
[jira] [Updated] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3359: - Assignee: Hyukjin Kwon > `sbt/sbt unidoc` doesn't work with Java 8 > - > > Key: SPARK-3359 > URL: https://issues.apache.org/jira/browse/SPARK-3359 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Minor > > It seems that Java 8 is stricter on JavaDoc. I got many error messages like > {code} > [error] > /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: > error: modifier private not allowed here > [error] private abstract interface SparkHadoopMapRedUtil { > [error] ^ > {code} > This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18584: -- Target Version/s: (was: 2.0.2) Fix Version/s: (was: 2.0.2) > multiple Spark Thrift Servers running in the same machine throws > org.apache.hadoop.security.AccessControlException > -- > > Key: SPARK-18584 > URL: https://issues.apache.org/jira/browse/SPARK-18584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0 > spark2.0.2 >Reporter: tanxinz > > In spark2.0.2 , I have two users(etl , dev ) start Spark Thrift Server in the > same machine . I connected by beeline etl STS to execute a command,and > throwed org.apache.hadoop.security.AccessControlException.I don't know why is > dev user perform,not etl. > ``` > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denied: user=dev, access=EXECUTE, > inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::--- > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator
[ https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17251: Assignee: (was: Apache Spark) > "ClassCastException: OuterReference cannot be cast to NamedExpression" for > correlated subquery on the RHS of an IN operator > --- > > Key: SPARK-17251 > URL: https://issues.apache.org/jira/browse/SPARK-17251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen > > The following test case produces a ClassCastException in the analyzer: > {code} > CREATE TABLE t1(a INTEGER); > INSERT INTO t1 VALUES(1),(2); > CREATE TABLE t2(b INTEGER); > INSERT INTO t2 VALUES(1); > SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2); > {code} > Here's the exception: > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to > org.apache.spark.sql.catalyst.expressions.NamedExpression > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48) > at > scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) > at scala.collection.immutable.List.exists(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44) > at > org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.c
[jira] [Commented] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator
[ https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695645#comment-15695645 ] Apache Spark commented on SPARK-17251: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16012 > "ClassCastException: OuterReference cannot be cast to NamedExpression" for > correlated subquery on the RHS of an IN operator > --- > > Key: SPARK-17251 > URL: https://issues.apache.org/jira/browse/SPARK-17251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen > > The following test case produces a ClassCastException in the analyzer: > {code} > CREATE TABLE t1(a INTEGER); > INSERT INTO t1 VALUES(1),(2); > CREATE TABLE t2(b INTEGER); > INSERT INTO t2 VALUES(1); > SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2); > {code} > Here's the exception: > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to > org.apache.spark.sql.catalyst.expressions.NamedExpression > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48) > at > scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) > at scala.collection.immutable.List.exists(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44) > at > org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.
[jira] [Assigned] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator
[ https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17251: Assignee: Apache Spark > "ClassCastException: OuterReference cannot be cast to NamedExpression" for > correlated subquery on the RHS of an IN operator > --- > > Key: SPARK-17251 > URL: https://issues.apache.org/jira/browse/SPARK-17251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > The following test case produces a ClassCastException in the analyzer: > {code} > CREATE TABLE t1(a INTEGER); > INSERT INTO t1 VALUES(1),(2); > CREATE TABLE t2(b INTEGER); > INSERT INTO t2 VALUES(1); > SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2); > {code} > Here's the exception: > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to > org.apache.spark.sql.catalyst.expressions.NamedExpression > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48) > at > scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) > at scala.collection.immutable.List.exists(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44) > at > org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) > at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58) > at
[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A
[ https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695800#comment-15695800 ] Genmao Yu commented on SPARK-18512: --- [~giuseppe.bonaccorso] How can i reproduce this failure? Can u provide some code snippet? > FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and > S3A > > > Key: SPARK-18512 > URL: https://issues.apache.org/jira/browse/SPARK-18512 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 > Environment: AWS EMR 5.0.1 > Spark 2.0.1 > S3 EU-West-1 (S3A with read-after-write consistency) >Reporter: Giuseppe Bonaccorso > > After a few hours of streaming processing and data saving in Parquet format, > I got always this exception: > {code:java} > java.io.FileNotFoundException: No such file or directory: > s3a://xxx/_temporary/0/task_ > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334) > at > org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at > 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488) > {code} > I've also tried s3:// and s3n://, but it always happens after 3-5 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
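Since the report does not yet include code, a minimal write pattern that exercises the same commit path might look like the sketch below; the bucket, prefix, and schema are placeholders invented for illustration, not taken from the report.
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch only; bucket/prefix are placeholders.
val spark = SparkSession.builder().appName("s3a-parquet-write").getOrCreate()

// Each batch write commits through FileOutputCommitter, which lists and merges
// s3a://<bucket>/<prefix>/_temporary/0/... -- the step that fails in the stack trace above.
spark.range(0, 1000000).toDF("id")
  .write
  .mode("append")
  .parquet("s3a://some-bucket/some-prefix/")
{code}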
[jira] [Commented] (SPARK-18565) subtractByKey modifes values in the source RDD
[ https://issues.apache.org/jira/browse/SPARK-18565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695809#comment-15695809 ] Dmitry Dzhus commented on SPARK-18565: -- The problem was that I assumed that caching and forcing an RDD guarantees that it will never be re-evaluated. inverseItemIDMap is built using a non-determenistic operation, and multiple uses of it also give different results: {code} val itemIDMap: RDD[(ContentKey, InternalContentId)] = rawEvents .map(_.content) .distinct .zipWithUniqueId() .map(u => (u._1, u._2.toInt)) .cache() logger.info(s"Built a map of ${itemIDMap.count()} item IDs") val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] = itemIDMap.map(_.swap).cache() {code} I made the operation stable by adding {{.sortBy(c => c)}} and this solved the issue. > subtractByKey modifes values in the source RDD > -- > > Key: SPARK-18565 > URL: https://issues.apache.org/jira/browse/SPARK-18565 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: Amazon Elastic MapReduce (emr-5.2.0) >Reporter: Dmitry Dzhus > > I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala > 2.11.x: > Relevant code: > {code} > object Types { >type ContentId = Int >type ContentKey = Tuple2[Int, ContentId] >type InternalContentId = Int > } > val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] = > itemIDMap.map(_.swap).cache() > logger.info(s"Built an inverse map of ${inverseItemIDMap.count()} item IDs") > logger.info(inverseItemIDMap.collect().mkString("I->E ", "\nI->E ", "")) > val superfluousItems: RDD[(InternalContentId, Int)] = .. .cache() > logger.info(superfluousItems.collect().mkString("SI ", "\nSI ", "")) > val filteredInverseItemIDMap: RDD[(InternalContentId, ContentKey)] = > inverseItemIDMap.subtractByKey(superfluousItems).cache() // <<===!!! > logger.info(s"${filteredInverseItemIDMap.count()} items in the filtered > inverse ID mapping") > logger.info(filteredInverseItemIDMap.collect().mkString("F I->E ", "\nF I->E > ", "")) > {code} > The operation in question is {{.subtractByKey}}. Both RDDs involved are > cached and forced via {{count()}} prior to calling {{subtractByKey}}, so I > would expect the result to be unaffected by how exactly superfluousItems is > built. > I added debugging output and filtered the resulting logs by relevant > InternalContentId values (829911, 830071). Output: > {code} > Built an inverse map of 827354 item IDs > . > . > I->E (829911,(2,1135081)) > I->E (830071,(1,2295102)) > . > . > 748190 items in the training set had less than 28 ratings > SI (829911,3) > . > . > 79164 items in the filtered inverse ID mapping > F I->E (830071,(2,1135081)) > {code} > There's no element with key 830071 in {{superfluousItems}} (SI), so it's not > removed from the source RDD. However, its value is for some reason replaced > with the one from key 829911. How could this be? I cannot reproduce it > locally - only when running on a multi-machine cluster. Is this a bug or I'm > missing something? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
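For reference, the fix described in the comment amounts to inserting a sort before {{zipWithUniqueId()}} so that the assigned IDs are stable across re-evaluations. A sketch, assuming the {{ContentKey}}/{{InternalContentId}} aliases and the {{rawEvents}} RDD (with a {{content: ContentKey}} field) from the issue description:
{code}
// Stabilised ID assignment: a deterministic order gives deterministic IDs on re-evaluation.
val itemIDMap: RDD[(ContentKey, InternalContentId)] = rawEvents
  .map(_.content)
  .distinct
  .sortBy(c => c)                              // the added line
  .zipWithUniqueId()
  .map { case (key, id) => (key, id.toInt) }
  .cache()

val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] = itemIDMap.map(_.swap).cache()
{code}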
[jira] [Resolved] (SPARK-18565) subtractByKey modifes values in the source RDD
[ https://issues.apache.org/jira/browse/SPARK-18565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Dzhus resolved SPARK-18565. -- Resolution: Invalid > subtractByKey modifes values in the source RDD > -- > > Key: SPARK-18565 > URL: https://issues.apache.org/jira/browse/SPARK-18565 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: Amazon Elastic MapReduce (emr-5.2.0) >Reporter: Dmitry Dzhus > > I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala > 2.11.x: > Relevant code: > {code} > object Types { >type ContentId = Int >type ContentKey = Tuple2[Int, ContentId] >type InternalContentId = Int > } > val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] = > itemIDMap.map(_.swap).cache() > logger.info(s"Built an inverse map of ${inverseItemIDMap.count()} item IDs") > logger.info(inverseItemIDMap.collect().mkString("I->E ", "\nI->E ", "")) > val superfluousItems: RDD[(InternalContentId, Int)] = .. .cache() > logger.info(superfluousItems.collect().mkString("SI ", "\nSI ", "")) > val filteredInverseItemIDMap: RDD[(InternalContentId, ContentKey)] = > inverseItemIDMap.subtractByKey(superfluousItems).cache() // <<===!!! > logger.info(s"${filteredInverseItemIDMap.count()} items in the filtered > inverse ID mapping") > logger.info(filteredInverseItemIDMap.collect().mkString("F I->E ", "\nF I->E > ", "")) > {code} > The operation in question is {{.subtractByKey}}. Both RDDs involved are > cached and forced via {{count()}} prior to calling {{subtractByKey}}, so I > would expect the result to be unaffected by how exactly superfluousItems is > built. > I added debugging output and filtered the resulting logs by relevant > InternalContentId values (829911, 830071). Output: > {code} > Built an inverse map of 827354 item IDs > . > . > I->E (829911,(2,1135081)) > I->E (830071,(1,2295102)) > . > . > 748190 items in the training set had less than 28 ratings > SI (829911,3) > . > . > 79164 items in the filtered inverse ID mapping > F I->E (830071,(2,1135081)) > {code} > There's no element with key 830071 in {{superfluousItems}} (SI), so it's not > removed from the source RDD. However, its value is for some reason replaced > with the one from key 829911. How could this be? I cannot reproduce it > locally - only when running on a multi-machine cluster. Is this a bug or I'm > missing something? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18356. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15965 [https://github.com/apache/spark/pull/15965] > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Assignee: zakaria hili >Priority: Minor > Labels: easyfix > Fix For: 2.2.0 > > > Hello, > I'm newbie in spark, but I think that I found a small problem that can affect > spark Kmeans performances. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark Kmeans with Dataframes to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning : > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched in spark source code to find the source of this problem, then I > realized there is two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my dataframe is cached, the fit method transform my dataframe into an > internally rdd which is not cached. > Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) responsible for converting the dataframe into > rdd then call Kmeans Algorithm > ->The second class (mllib package) implements Kmeans Algorithm, and here > spark verify if the rdd is cached, if not a warning will be generated. > So, the solution of this problem is to cache the rdd before running Kmeans > Algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All what we need is to add two lines: > Cache rdd just after dataframe transformation, then uncached it after > training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
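The proposed change boils down to persisting the internally converted RDD around the training call and releasing it afterwards. A rough sketch of that pattern follows; it is illustrative only (assuming a DataFrame {{df}} with a {{features}} vector column), not the actual patched Spark source.
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.storage.StorageLevel

// Persist the RDD derived from the DataFrame before handing it to the k-means trainer.
val instances = df.select("features").rdd.map(_.getAs[Vector](0))

val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

// ... run the k-means training loop on `instances` here ...

if (handlePersistence) instances.unpersist()
{code}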
[jira] [Updated] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18356: -- Assignee: zakaria hili > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Assignee: zakaria hili >Priority: Minor > Labels: easyfix > Fix For: 2.2.0 > > > Hello, > I'm newbie in spark, but I think that I found a small problem that can affect > spark Kmeans performances. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark Kmeans with Dataframes to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning : > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched in spark source code to find the source of this problem, then I > realized there is two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my dataframe is cached, the fit method transform my dataframe into an > internally rdd which is not cached. > Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) responsible for converting the dataframe into > rdd then call Kmeans Algorithm > ->The second class (mllib package) implements Kmeans Algorithm, and here > spark verify if the rdd is cached, if not a warning will be generated. > So, the solution of this problem is to cache the rdd before running Kmeans > Algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All what we need is to add two lines: > Cache rdd just after dataframe transformation, then uncached it after > training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18538) Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
[ https://issues.apache.org/jira/browse/SPARK-18538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18538: -- Priority: Blocker (was: Critical) > Concurrent Fetching DataFrameReader JDBC APIs Do Not Work > - > > Key: SPARK-18538 > URL: https://issues.apache.org/jira/browse/SPARK-18538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > {code} > def jdbc( > url: String, > table: String, > columnName: String, > lowerBound: Long, > upperBound: Long, > numPartitions: Int, > connectionProperties: Properties): DataFrame > {code} > {code} > def jdbc( > url: String, > table: String, > predicates: Array[String], > connectionProperties: Properties): DataFrame > {code} > The above two DataFrameReader JDBC APIs ignore the user-specified parameters > of parallelism degree -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
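For context, these are the call shapes whose parallelism settings were being ignored; a usage sketch with placeholder connection details:
{code}
import java.util.Properties

// Placeholder URL, table, and credentials; only the call shapes matter here.
val props = new Properties()
props.setProperty("user", "test")
props.setProperty("password", "test")

// Range-partitioned read: splits the scan of `id` over [0, 1000000) into 8 partitions.
val byRange = spark.read.jdbc(
  "jdbc:postgresql://host:5432/db", "public.events",
  "id", 0L, 1000000L, 8, props)

// Predicate-partitioned read: one partition per predicate.
val byPredicates = spark.read.jdbc(
  "jdbc:postgresql://host:5432/db", "public.events",
  Array("region = 'eu'", "region = 'us'"), props)
{code}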
[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695922#comment-15695922 ] holdenk commented on SPARK-18541: - Making it easier for PySpark SQL users to specify metadata sounds interesting/useful. I'd probably try and choose something closer to the scala API (e.g. implement `as` instead of `aliasWithMetadata`). What do [~davies] / [~marmbrus] think? > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Priority: Minor > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest for it. > Just have to dust off my knowledge of doctest and the location of the python > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
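For comparison, the Scala API already exposes this through {{Column.as(alias, metadata)}}; a small sketch, assuming a DataFrame {{df}} with an {{age}} column (the metadata content is invented):
{code}
import org.apache.spark.sql.types.Metadata

// Attach metadata while aliasing a column, then read it back from the schema.
val md = Metadata.fromJson("""{"comment": "age in completed years"}""")
val renamed = df.col("age").as("age_years", md)

df.select(renamed).schema("age_years").metadata   // carries the supplied metadata
{code}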
[jira] [Updated] (SPARK-18532) Code generation memory issue
[ https://issues.apache.org/jira/browse/SPARK-18532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-18532: Component/s: (was: Spark Core) SQL > Code generation memory issue > > > Key: SPARK-18532 > URL: https://issues.apache.org/jira/browse/SPARK-18532 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: osx / macbook pro / local spark >Reporter: Georg Heiler > > Trying to create a spark data frame with multiple additional columns based on > conditions like this > df > .withColumn("name1", someCondition1) > .withColumn("name2", someCondition2) > .withColumn("name3", someCondition3) > .withColumn("name4", someCondition4) > .withColumn("name5", someCondition5) > .withColumn("name6", someCondition6) > .withColumn("name7", someCondition7) > I am faced with the following exception in case more than 6 .withColumn > clauses are added > org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > f even more columns are created e.g. around 20 I do no longer receive the > aforementioned exception, but rather get the following error after 5 minutes > of waiting: > java.lang.OutOfMemoryError: GC overhead limit exceeded > What I want to perform is a spelling/error correction. some simple cases > could be handled easily via a map& replacement in a UDF. Still, several other > cases with multiple chained conditions remain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
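A self-contained way to provoke the same pattern (conditions and column names made up, not the reporter's) is to fold many {{withColumn}} calls over a small DataFrame:
{code}
import org.apache.spark.sql.functions._

// Made-up conditions; the point is only the number of chained withColumn calls,
// which all land in one generated projection.
val base = spark.range(100).toDF("id")
val wide = (1 to 20).foldLeft(base) { (df, i) =>
  df.withColumn(s"name$i", when(col("id") % i === 0, lit(i)).otherwise(lit(0)))
}
wide.show()   // depending on the complexity of the conditions, the affected versions
              // can hit the 64 KB generated-method limit here
{code}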
[jira] [Updated] (SPARK-18502) Spark does not handle columns that contain backquote (`)
[ https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-18502: Component/s: (was: Spark Core) SQL > Spark does not handle columns that contain backquote (`) > > > Key: SPARK-18502 > URL: https://issues.apache.org/jira/browse/SPARK-18502 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Barry Becker >Priority: Minor > > I know that if a column contains dots or hyphens we can put > backquotes/backticks around it, but what if the column contains a backtick > (`)? Can the back tick be escaped by some means? > Here is an example of the sort of error I see > {code} > org.apache.spark.sql.AnalysisException: syntax error in attribute name: > `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90) > org.apache.spark.sql.Column.(Column.scala:113) > org.apache.spark.sql.Column$.apply(Column.scala:36) > org.apache.spark.sql.functions$.min(functions.scala:407) > com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158) > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
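A minimal way to see the failure (literal value invented) is to create a column whose name itself contains a backtick and then refer to it by name:
{code}
import org.apache.spark.sql.functions.{lit, min}

// Invented data; only the backtick inside the column name matters.
val df = spark.range(1).select(lit("2016-11-25").as("Invoice`Date"))

df.select(min("Invoice`Date"))   // min(columnName) parses the name and throws the
                                 // "syntax error in attribute name" shown in the report
{code}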
[jira] [Commented] (SPARK-18405) Add yarn-cluster mode support to Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695936#comment-15695936 ] holdenk commented on SPARK-18405: - Even in cluster mode you could overwhelm the node running the thriftserver interface, or are you proposing to expose the thrifserver interface on multiple nodes not just the driver program? > Add yarn-cluster mode support to Spark Thrift Server > > > Key: SPARK-18405 > URL: https://issues.apache.org/jira/browse/SPARK-18405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.2, 2.0.0, 2.0.1 >Reporter: Prabhu Kasinathan > Labels: Spark, ThriftServer2 > > Currently, spark thrift server can run only on yarn-client mode. > Can we add Yarn-Cluster mode support to spark thrift server? > This will help us to launch multiple spark thrift server with different spark > configurations and it really help in large distributed clusters where there > is requirement to run complex sqls through STS. With client mode, there is a > chance to overload local host with too much driver memory. > Please let me know your thoughts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695939#comment-15695939 ] holdenk commented on SPARK-18128: - Thanks! :) I'll start working on this issue once we start work on 2.2 :) > Add support for publishing to PyPI > -- > > Key: SPARK-18128 > URL: https://issues.apache.org/jira/browse/SPARK-18128 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: holdenk > > After SPARK-1267 is done we should add support for publishing to PyPI similar > to how we publish to maven central. > Note: one of the open questions is what to do about package name since > someone has registered the package name PySpark on PyPI - we could use > ApachePySpark or we could try and get find who registered PySpark and get > them to transfer it to us (since they haven't published anything so maybe > fine?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren commented on SPARK-18581: - After reading the code comments, I find it takes into consideration on degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If { v.t * v * -0.5 } is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant will be very close to zero, > Thus, log(pseudo-determinant) will be a large negative number which finally > make logpdf very biger, pdf will be even bigger > 1. > As said in comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, means the covariance matrix > is non invertible which is a contradiction to the assumption that it should > be invertible. 
> So we should check whether any singular value is smaller than the tolerance > before computing the pseudo-determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
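For readers following the numbers in the comment above, the quantity the quoted code computes is the log-density of a (possibly degenerate) multivariate normal, written with the pseudo-determinant and pseudo-inverse:
{code}
\log p(x) = -\tfrac{1}{2}\left( k \log(2\pi) + \log \operatorname{pdet}(\Sigma) \right)
            - \tfrac{1}{2}\,(x-\mu)^{\top}\, \Sigma^{+}\,(x-\mu)
{code}
Here {{pdet}} is the product of the eigenvalues above the tolerance and Σ⁺ the corresponding pseudo-inverse; the first term is the constant {{u}} in the quoted code (with k = mu.size) and the second is the {{v.t * v * -0.5}} contribution, which is why a very small pseudo-determinant pushes {{u}}, and hence the pdf, up.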
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:04 PM: --- After reading the code comments, I find it takes into consideration on degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If `v.t * v * -0.5` is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration on degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If { v.t * v * -0.5 } is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant will be very close
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:05 PM: --- After reading the code comments, I find it takes into consideration on degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration on degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If `v.t * v * -0.5` is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant will be ve
[jira] [Updated] (SPARK-18108) Partition discovery fails with explicitly written long partitions
[ https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-18108: Component/s: (was: Spark Core) SQL > Partition discovery fails with explicitly written long partitions > - > > Key: SPARK-18108 > URL: https://issues.apache.org/jira/browse/SPARK-18108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Richard Moorhead >Priority: Minor > Attachments: stacktrace.out > > > We have parquet data written from Spark1.6 that, when read from 2.0.1, > produces errors. > {code} > case class A(a: Long, b: Int) > val as = Seq(A(1,2)) > //partition explicitly written > spark.createDataFrame(as).write.parquet("/data/a=1/") > spark.read.parquet("/data/").collect > {code} > The above code fails; stack trace attached. > If an integer used, explicit partition discovery succeeds. > {code} > case class A(a: Int, b: Int) > val as = Seq(A(1,2)) > //partition explicitly written > spark.createDataFrame(as).write.parquet("/data/a=1/") > spark.read.parquet("/data/").collect > {code} > The action succeeds. Additionally, if 'partitionBy' is used instead of > explicit writes, partition discovery succeeds. > Question: Is the first example a reasonable use case? > [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319] > seems to default to Integer types unless the partition value exceeds the > integer type's length. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:06 PM: --- After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration on degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:07 PM: --- After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (this variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinan
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:08 PM: --- After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-deter
[jira] [Commented] (SPARK-18554) leader master lost the leadership, when the slave become master, the perivious app's state display as waitting
[ https://issues.apache.org/jira/browse/SPARK-18554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695947#comment-15695947 ] Saisai Shao commented on SPARK-18554: - Nothing is actually blocking it; it's just that no one has reviewed that PR, and it is not a big issue (only the state is not shown correctly). > leader master lost the leadership, when the slave become master, the > perivious app's state display as waitting > -- > > Key: SPARK-18554 > URL: https://issues.apache.org/jira/browse/SPARK-18554 > Project: Spark > Issue Type: Bug > Components: Deploy, Web UI >Affects Versions: 1.6.1 > Environment: java1.8 >Reporter: liujianhui >Priority: Minor > > when the leader master loses its leadership and the standby becomes master, the > state of the app in the web UI will display as WAITING; the relevant code is as follows > case MasterChangeAcknowledged(appId) => { > idToApp.get(appId) match { > case Some(app) => > logInfo("Application has been re-registered: " + appId) > app.state = ApplicationState.WAITING > case None => > logWarning("Master change ack from unknown app: " + appId) > } > if (canCompleteRecovery) { completeRecovery() } > the state of the app should be RUNNING instead of WAITING -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
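For readability, here is the recovery handler quoted in the description above, reformatted, with the one-line change the reporter is proposing (a sketch of the suggestion only, not a merged fix; the surrounding Master.scala context is elided):
{code}
case MasterChangeAcknowledged(appId) =>
  idToApp.get(appId) match {
    case Some(app) =>
      logInfo("Application has been re-registered: " + appId)
      // Proposed: restore RUNNING rather than parking the re-registered app in WAITING,
      // so the web UI reflects that the application kept running across the failover.
      app.state = ApplicationState.RUNNING
    case None =>
      logWarning("Master change ack from unknown app: " + appId)
  }
  if (canCompleteRecovery) { completeRecovery() }
{code}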
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:09 PM: --- After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? According to wikipedia, I can't see any reason the pdf could be larger than 1. Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, c
[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] Hao Ren edited comment on SPARK-18581 at 11/25/16 2:14 PM: --- After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? According to wikipedia, I can't see any reason the pdf could be larger than 1. Maybe I missed something. Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 was (Author: invkrh): After reading the code comments, I find it takes into consideration the degenerate case of multivariate normal distribution: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case I agree that the covariance matrix need not be invertible. However, the pdf of gaussian should always be smaller than 1, shouldn't it ? According to wikipedia, I can't see any reason the pdf could be larger than 1. 
Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function: The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' as the following: DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234) Meanwhile, the non-zero tolerance = 1.8514678433708895E-13 thus, {code} val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum {code} logPseudoDetSigma = -58.40781006437829 -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 = u (same variable name in the code) Knowing that {code} private[mllib] def logpdf(x: BV[Double]): Double = { val delta = x - breezeMu val v = rootSigmaInv * delta u + v.t * v * -0.5 // u is used here } {code} If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf will be about 6 => pdf = exp(6) = 403.4287934927351 In the gaussian mixture model case, some of the gaussian distributions could have a 'u' value much bigger, which results in a pdf = 2E10 > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant
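To make the arithmetic in the comment above easy to check, here is a minimal plain-Scala sketch that recomputes the constant term from the values reported there (the log pseudo-determinant and the dimension 25 are taken directly from the numbers quoted in the comment):
{code}
// Values reported in the comment above for the 25-dimensional case.
val logPseudoDetSigma = -58.40781006437829   // sum of log of the eigenvalues above the tolerance
val k = 25                                   // mu.size

// The constant term "u" used by MultivariateGaussian.logpdf.
val u = -0.5 * (k * math.log(2.0 * math.Pi) + logPseudoDetSigma)   // ~= 6.2304

// When the quadratic term v.t * v * -0.5 is close to zero (x near the mean),
// logpdf ~= u, so the density is math.exp(u) ~= 508, a value well above 1.
val pdfNearMean = math.exp(u)
{code}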
[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695956#comment-15695956 ] holdenk commented on SPARK-17788: - This is semi-expected behaviour: the range partitioner (and really all Spark partitioners) doesn't support splitting a single key across partitions (e.g. if 70% of your data has the same key and you are partitioning on that key, 70% of that data is going to end up in the same partition). We could try to fix this in a few ways - either by having Spark SQL do something special in this case, by having Spark's sortBy automatically add "noise" to the key when the sampling indicates there is too much data for a given key, or by allowing partitioners to be non-deterministic and updating the general sortBy logic in Spark. I think this would be something good for us to consider - but it's probably going to take a while (and certainly not in time for 2.1.0). > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
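A minimal sketch of the "add noise to the key" workaround mentioned in the comment above (an illustration only, assuming an RDD[Double] named `values`; the salt width of 100 is arbitrary):
{code}
import scala.util.Random

// Sort by a composite key (value, random salt). Rows that share the same value get distinct
// keys, so the RangePartitioner can spread them over several partitions; the global order on
// the value itself is preserved and ties are simply broken arbitrarily.
val salted = values.map(v => ((v, Random.nextInt(100)), v))
val sorted = salted.sortByKey().values
{code}
Note that the salt is non-deterministic, so recomputation of a lost partition may place ties differently; that caveat is exactly the "allowing partitioners to be non-deterministic" point raised above.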
[jira] [Updated] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-17788: Target Version/s: (was: 2.1.0) > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > ​But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695967#comment-15695967 ] holdenk commented on SPARK-636: --- If you have a logging system you want to initialize wouldn't using an object with lazy initialization on call be sufficient? > Add mechanism to run system management/configuration tasks on all workers > - > > Key: SPARK-636 > URL: https://issues.apache.org/jira/browse/SPARK-636 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen > > It would be useful to have a mechanism to run a task on all workers in order > to perform system management tasks, such as purging caches or changing system > properties. This is useful for automated experiments and benchmarking; I > don't envision this being used for heavy computation. > Right now, I can mimic this with something like > {code} > sc.parallelize(0 until numMachines, numMachines).foreach { } > {code} > but this does not guarantee that every worker runs a task and requires my > user code to know the number of workers. > One sample use case is setup and teardown for benchmark tests. For example, > I might want to drop cached RDDs, purge shuffle data, and call > {{System.gc()}} between test runs. It makes sense to incorporate some of > this functionality, such as dropping cached RDDs, into Spark itself, but it > might be helpful to have a general mechanism for running ad-hoc tasks like > {{System.gc()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
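If it helps, a minimal sketch of the lazy-initialization pattern suggested above (the logger name and setup here are hypothetical, and `rdd` stands for whatever RDD is being processed):
{code}
import java.util.logging.{Level, Logger}

object WorkerSideSetup {
  // Evaluated at most once per JVM (i.e., once per executor), the first time it is touched.
  lazy val logger: Logger = {
    val l = Logger.getLogger("com.example.worker")   // hypothetical logger name
    l.setLevel(Level.INFO)                           // any one-time, per-executor setup goes here
    l
  }
}

rdd.foreachPartition { iter =>
  WorkerSideSetup.logger.info("partition initialized")   // triggers the lazy init on first use
  iter.foreach(_ => ())                                   // real per-record work would go here
}
{code}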
[jira] [Commented] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695975#comment-15695975 ] Herman van Hovell commented on SPARK-18588: --- cc [~zsxwing] > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
Herman van Hovell created SPARK-18588: - Summary: KafkaSourceStressForDontFailOnDataLossSuite is flaky Key: SPARK-18588 URL: https://issues.apache.org/jira/browse/SPARK-18588 Project: Spark Issue Type: Bug Components: Structured Streaming Reporter: Herman van Hovell https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5190) Allow spark listeners to be added before spark context gets initialized.
[ https://issues.apache.org/jira/browse/SPARK-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696003#comment-15696003 ] holdenk commented on SPARK-5190: This seems to be fixed, but we forgot to close (cc [~joshrosen]) > Allow spark listeners to be added before spark context gets initialized. > > > Key: SPARK-5190 > URL: https://issues.apache.org/jira/browse/SPARK-5190 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis > > Currently, you need the spark context to add spark listener events. But, if > you wait until the spark context gets created before adding your listener you > might miss events like blockManagerAdded or executorAdded. We should fix this > so you can attach a listener to the spark context before it starts any > initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
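One way to attach a listener before any events are fired today is the spark.extraListeners configuration, which instantiates the listed classes during SparkContext start-up; a short sketch (the listener class name is hypothetical):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("listener-demo")
  .setMaster("local[2]")
  // Registered before the context finishes initializing, so events such as
  // executorAdded / blockManagerAdded are not missed.
  .set("spark.extraListeners", "com.example.MyListener")   // hypothetical SparkListener subclass

val sc = new SparkContext(conf)
{code}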
[jira] [Resolved] (SPARK-3348) Support user-defined SparkListeners properly
[ https://issues.apache.org/jira/browse/SPARK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-3348. Resolution: Duplicate > Support user-defined SparkListeners properly > > > Key: SPARK-3348 > URL: https://issues.apache.org/jira/browse/SPARK-3348 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > > Because of the current initialization ordering, user-defined SparkListeners > do not receive certain events that are posted before application code is run. > We need to expose a constructor that allows the given SparkListeners to > receive all events. > There has been interest in this for a while, but I have searched through the > JIRA history and have not found a related issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696016#comment-15696016 ] holdenk commented on SPARK-5997: That could work, although we'd probably want a different API and we'd need to be clear that the result doesn't have a known partitioner. > Increase partition count without performing a shuffle > - > > Key: SPARK-5997 > URL: https://issues.apache.org/jira/browse/SPARK-5997 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Andrew Ash > > When decreasing partition count with rdd.repartition() or rdd.coalesce(), the > user has the ability to choose whether or not to perform a shuffle. However > when increasing partition count there is no option of whether to perform a > shuffle or not -- a shuffle always occurs. > This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call > that performs a repartition to a higher partition count without a shuffle. > The motivating use case is to decrease the size of an individual partition > enough that the .toLocalIterator has significantly reduced memory pressure on > the driver, as it loads a partition at a time into the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
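For context on the current behaviour being discussed, a small sketch with today's API (nothing new here, just illustrating why a shuffle-free increase is being requested; `sc` is an existing SparkContext):
{code}
val rdd = sc.parallelize(1 to 1000, 10)

// coalesce without a shuffle can only reduce the partition count; asking for more
// partitions than the RDD already has leaves it unchanged.
rdd.coalesce(100).getNumPartitions      // still 10

// repartition (coalesce with shuffle = true) can grow the count, but always shuffles.
rdd.repartition(100).getNumPartitions   // 100
{code}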
[jira] [Closed] (SPARK-6522) Standardize Random Number Generation
[ https://issues.apache.org/jira/browse/SPARK-6522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-6522. -- Resolution: Fixed Fix Version/s: 1.1.0 > Standardize Random Number Generation > > > Key: SPARK-6522 > URL: https://issues.apache.org/jira/browse/SPARK-6522 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: RJ Nowling >Priority: Minor > Fix For: 1.1.0 > > > Generation of random numbers in Spark has to be handled carefully since > references to RNGs copy the state to the workers. As such, a separate RNG > needs to be seeded for each partition. Each time random numbers are used in > Spark's libraries, the RNG seeding is re-implemented, leaving open the > possibility of mistakes. > It would be useful if RNG seeding was standardized through utility functions > or random number generation functions that can be called in Spark pipelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6522) Standardize Random Number Generation
[ https://issues.apache.org/jira/browse/SPARK-6522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696024#comment-15696024 ] holdenk commented on SPARK-6522: We have a standardized RDD generator in MLlib (see the RandomRDDs object). > Standardize Random Number Generation > > > Key: SPARK-6522 > URL: https://issues.apache.org/jira/browse/SPARK-6522 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: RJ Nowling >Priority: Minor > Fix For: 1.1.0 > > > Generation of random numbers in Spark has to be handled carefully since > references to RNGs copy the state to the workers. As such, a separate RNG > needs to be seeded for each partition. Each time random numbers are used in > Spark's libraries, the RNG seeding is re-implemented, leaving open the > possibility of mistakes. > It would be useful if RNG seeding was standardized through utility functions > or random number generation functions that can be called in Spark pipelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
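A short usage sketch of that generator (MLlib's RandomRDDs; the sizes and seed here are arbitrary):
{code}
import org.apache.spark.mllib.random.RandomRDDs

// Each partition gets its own seed derived from the one passed in, so workers
// do not end up sharing or copying a single RNG's state.
val normals = RandomRDDs.normalRDD(sc, 1000000L, 10, seed = 42L)    // N(0, 1) samples
val uniforms = RandomRDDs.uniformRDD(sc, 1000000L, 10, seed = 42L)  // U[0, 1] samples
{code}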
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696030#comment-15696030 ] Sean Owen commented on SPARK-18581: --- The PDF is largest at the mean, and it can be > 1 if the determinant of the covariance matrix is sufficiently small. This is like the univariate case where the variance is small - the distribution is very "peaked" at the mean and the PDF gets arbitrarily high. See for example https://www.wolframalpha.com/input/?i=Plot+N(0,1e-10) Can you compare this to the results you get with R or something similar, to see if the numbers match? Numerical accuracy does become an issue as the matrix gets near-singular, but that's what this cutoff is supposed to help address. We might be able to rearrange some of the math for better accuracy too. But let's first verify there's an issue. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant will be very close to zero, > Thus, log(pseudo-determinant) will be a large negative number which finally > make logpdf very biger, pdf will be even bigger > 1. > As said in comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, means the covariance matrix > is non invertible which is a contradiction to the assumption that it should > be invertible. > So we should check if there a single value is smaller than the tolerance > before computing the pseudo determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
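As a quick sanity check of the point above: the univariate density of N(0, sigma^2) at the mean is 1 / (sigma * sqrt(2 * pi)), which already exceeds 1 as soon as sigma < 1 / sqrt(2 * pi), roughly 0.399. A two-line Scala illustration:
{code}
def pdfAtMean(sigma: Double): Double = 1.0 / (sigma * math.sqrt(2.0 * math.Pi))

pdfAtMean(1.0)    // ~0.3989, the familiar standard-normal peak
pdfAtMean(1e-5)   // ~39894.2, a perfectly legitimate density value far above 1
{code}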
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696056#comment-15696056 ] Hao Ren commented on SPARK-18581: - Thank you for the clarification. I totally missed that part. I will compare the result to R. > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant will be very close to zero, > Thus, log(pseudo-determinant) will be a large negative number which finally > make logpdf very biger, pdf will be even bigger > 1. > As said in comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, means the covariance matrix > is non invertible which is a contradiction to the assumption that it should > be invertible. > So we should check if there a single value is smaller than the tolerance > before computing the pseudo determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696134#comment-15696134 ] Herman van Hovell commented on SPARK-17788: --- Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregation with skewed values). In the long run, there is an ongoing effort to reduce skew for joining, see SPARK-9862 for more information. I have created the following little Spark program to illustrate how range partitioning works: {noformat} import org.apache.spark.sql.Row // Set the partitions and parallelism to relatively low values so we can read the results. spark.conf.set("spark.default.parallelism", "20") spark.conf.set("spark.sql.shuffle.partitions", "20") // Create a skewed data frame. val df = spark .range(10000000) .select( $"id", (rand(34) * when($"id" % 10 <= 7, lit(1.0)).otherwise(lit(10.0))).as("value")) // Make a summary per partition. The partition intervals should not overlap and the number of // elements in a partition should roughly be the same for all partitions. case class PartitionSummary(count: Long, min: Double, max: Double, range: Double) val res = df.orderBy($"value").mapPartitions { iterator => val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, Double.NegativeInfinity)) { case ((count, min, max), Row(_, value: Double)) => (count + 1L, Math.min(min, value), Math.max(max, value)) } Iterator.single(PartitionSummary(count, min, max, max - min)) } // Get results and make them look nice res.orderBy($"min") .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), $"range".cast("decimal(5,3)")) .show(30) {noformat} This yields the following results (notice how the partition range varies and the row count is relatively similar): {noformat} +--+-+--+-+ | count| min| max|range| +--+-+--+-+ |484005|0.000| 0.059|0.059| |426212|0.059| 0.111|0.052| |381796|0.111| 0.157|0.047| |519954|0.157| 0.221|0.063| |496842|0.221| 0.281|0.061| |539082|0.281| 0.347|0.066| |516798|0.347| 0.410|0.063| |558487|0.410| 0.478|0.068| |419825|0.478| 0.529|0.051| |402257|0.529| 0.578|0.049| |557225|0.578| 0.646|0.068| |518626|0.646| 0.710|0.063| |611478|0.710| 0.784|0.075| |544556|0.784| 0.851|0.066| |454356|0.851| 0.906|0.055| |450535|0.906| 0.961|0.055| |575996|0.961| 2.290|1.329| |525915|2.290| 4.920|2.630| |518757|4.920| 7.510|2.590| |497298|7.510|10.000|2.490| +--+-+--+-+ {noformat} > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a 
page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40ma
[jira] [Closed] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-17788. - Resolution: Duplicate > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > ​But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696154#comment-15696154 ] Herman van Hovell commented on SPARK-17788: --- I am closing this one as a duplicate. Feel free to reopen if you disagree. > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > ​But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18220) ClassCastException occurs when using select query on ORC file
[ https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18220: -- Description: Error message is below. {noformat} == 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 9223372036854775807} 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 1220 bytes result sent to driver 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 42) in 116 ms on localhost (executor driver) (19/20) 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 35) java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ORC dump info. == File Version: 0.12 with HIVE_8732 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from hdfs://XXX/part-0 with {include: null, offset: 0, length: 9223372036854775807} 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema. Rows: 7 Compression: ZLIB Compression size: 262144 Type: struct {noformat} was: Error message is below. == 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 9223372036854775807} 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
1220 bytes result sent to driver 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 42) in 116 ms on localhost (executor driver) (19/20) 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 35) java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:
[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696188#comment-15696188 ] holdenk commented on SPARK-17788: - I don't think this is a duplicate - it's related, but a join doesn't necessarily use a range partitioner, and sortBy is a different operation. I agree the potential solution could share a lot of the same underlying implementation. > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reopened SPARK-17788: - This is somewhat distinct from the join case, but certainly related. > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > ​But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696134#comment-15696134 ] Herman van Hovell edited comment on SPARK-17788 at 11/25/16 4:10 PM: - Spark makes a sketch of your data as soon when you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. On the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregation with skewed values). On the long run, there is an ongoing effort to reduce skew for joining, see SPARK-9862 for more information. I have creates the follow little spark program to illustrate how range partitioning works: {noformat} import org.apache.spark.sql.Row // Set the partitions and parallelism to relatively low value so we can read the results. spark.conf.set("spark.default.parallelism", "20") spark.conf.set("spark.sql.shuffle.partitions", "20") // Create a skewed data frame. val df = spark .range(1000) .select( $"id", (rand(34) * when($"id" % 10 <= 7, lit(1.0)).otherwise(lit(10.0))).as("value")) // Make a summary per partition. The partition intervals should not overlap and the number of // elements in a partition should roughly be the same for all partitions. case class PartitionSummary(count: Long, min: Double, max: Double, range: Double) val res = df.orderBy($"value").mapPartitions { iterator => val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, Double.NegativeInfinity)) { case ((count, min, max), Row(_, value: Double)) => (count + 1L, Math.min(min, value), Math.max(max, value)) } Iterator.single(PartitionSummary(count, min, max, max - min)) } // Get results and make them look nice res.orderBy($"min") .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), $"range".cast("decimal(5,3)")) .show(30) {noformat} This yields the following results (notice how the partition range varies and the row count is relatively similar): {noformat} +--+-+--+-+ | count| min| max|range| +--+-+--+-+ |484005|0.000| 0.059|0.059| |426212|0.059| 0.111|0.052| |381796|0.111| 0.157|0.047| |519954|0.157| 0.221|0.063| |496842|0.221| 0.281|0.061| |539082|0.281| 0.347|0.066| |516798|0.347| 0.410|0.063| |558487|0.410| 0.478|0.068| |419825|0.478| 0.529|0.051| |402257|0.529| 0.578|0.049| |557225|0.578| 0.646|0.068| |518626|0.646| 0.710|0.063| |611478|0.710| 0.784|0.075| |544556|0.784| 0.851|0.066| |454356|0.851| 0.906|0.055| |450535|0.906| 0.961|0.055| |575996|0.961| 2.290|1.329| |525915|2.290| 4.920|2.630| |518757|4.920| 7.510|2.590| |497298|7.510|10.000|2.490| +--+-+--+-+ {noformat} was (Author: hvanhovell): Spark makes a sketch of your data as soon when you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk]] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. On the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregation with skewed values). On the long run, there is an ongoing effort to reduce skew for joining, see SPARK-9862 for more information. 
I have creates the follow little spark program to illustrate how range partitioning works: {noformat} import org.apache.spark.sql.Row // Set the partitions and parallelism to relatively low value so we can read the results. spark.conf.set("spark.default.parallelism", "20") spark.conf.set("spark.sql.shuffle.partitions", "20") // Create a skewed data frame. val df = spark .range(1000) .select( $"id", (rand(34) * when($"id" % 10 <= 7, lit(1.0)).otherwise(lit(10.0))).as("value")) // Make a summary per partition. The partition intervals should not overlap and the number of // elements in a partition should roughly be the same for all partitions. case class PartitionSummary(count: Long, min: Double, max: Double, range: Double) val res = df.orderBy($"value").mapPartitions { iterator => val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, Double.NegativeInfinity)) { case ((count, min, max), Row(_, value: Double)) => (count + 1L, Math.min(min, value), Math.max(max, value)) } Iterator.single(PartitionSummary(count, min, max, max - min)) } // Get results and make them look nice res.orderBy($"min") .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), $"range".cast("decimal(5,3)")) .show(30) {noformat} This yields
[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696134#comment-15696134 ] Herman van Hovell edited comment on SPARK-17788 at 11/25/16 4:56 PM: - Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregation with skewed values). In the long run, there is an ongoing effort to reduce skew for joining, see SPARK-9862 for more information. I have created the following little Spark program to illustrate how range partitioning works:
{noformat}
import org.apache.spark.sql.Row

// Set the partitions and parallelism to a relatively low value so we can read the results.
spark.conf.set("spark.default.parallelism", "20")
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Create a skewed data frame.
val df = spark
  .range(10000000)
  .select(
    $"id",
    (rand(34) * when($"id" % 10 <= 7, lit(1.0)).otherwise(lit(10.0))).as("value"))

// Make a summary per partition. The partition intervals should not overlap and the number of
// elements in a partition should roughly be the same for all partitions.
case class PartitionSummary(count: Long, min: Double, max: Double, range: Double)
val res = df.orderBy($"value").mapPartitions { iterator =>
  val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, Double.NegativeInfinity)) {
    case ((count, min, max), Row(_, value: Double)) =>
      (count + 1L, Math.min(min, value), Math.max(max, value))
  }
  Iterator.single(PartitionSummary(count, min, max, max - min))
}

// Get results and make them look nice
res.orderBy($"min")
  .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), $"range".cast("decimal(5,3)"))
  .show(30)
{noformat}
This yields the following results (notice how the partition range varies and the row count is relatively similar):
{noformat}
+------+-----+------+-----+
| count|  min|   max|range|
+------+-----+------+-----+
|484005|0.000| 0.059|0.059|
|426212|0.059| 0.111|0.052|
|381796|0.111| 0.157|0.047|
|519954|0.157| 0.221|0.063|
|496842|0.221| 0.281|0.061|
|539082|0.281| 0.347|0.066|
|516798|0.347| 0.410|0.063|
|558487|0.410| 0.478|0.068|
|419825|0.478| 0.529|0.051|
|402257|0.529| 0.578|0.049|
|557225|0.578| 0.646|0.068|
|518626|0.646| 0.710|0.063|
|611478|0.710| 0.784|0.075|
|544556|0.784| 0.851|0.066|
|454356|0.851| 0.906|0.055|
|450535|0.906| 0.961|0.055|
|575996|0.961| 2.290|1.329|
|525915|2.290| 4.920|2.630|
|518757|4.920| 7.510|2.590|
|497298|7.510|10.000|2.490|
+------+-----+------+-----+
{noformat}
was (Author: hvanhovell): Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregation with skewed values). In the long run, there is an ongoing effort to reduce skew for joining, see SPARK-9862 for more information.
I have creates the follow little spark program to illustrate how range partitioning works: {noformat} import org.apache.spark.sql.Row // Set the partitions and parallelism to relatively low value so we can read the results. spark.conf.set("spark.default.parallelism", "20") spark.conf.set("spark.sql.shuffle.partitions", "20") // Create a skewed data frame. val df = spark .range(1000) .select( $"id", (rand(34) * when($"id" % 10 <= 7, lit(1.0)).otherwise(lit(10.0))).as("value")) // Make a summary per partition. The partition intervals should not overlap and the number of // elements in a partition should roughly be the same for all partitions. case class PartitionSummary(count: Long, min: Double, max: Double, range: Double) val res = df.orderBy($"value").mapPartitions { iterator => val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, Double.NegativeInfinity)) { case ((count, min, max), Row(_, value: Double)) => (count + 1L, Math.min(min, value), Math.max(max, value)) } Iterator.single(PartitionSummary(count, min, max, max - min)) } // Get results and make them look nice res.orderBy($"min") .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), $"range".cast("decimal(5,3)")) .show(30) {noformat} This yields
[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696321#comment-15696321 ] Herman van Hovell commented on SPARK-17788: --- That is fair. The solution is not that straightforward TBH: - Always add some kind of tie breaking value to the range. This could be random, but I'd rather add something like monotonically_increasing_id(). This always incurs some cost. - Only add a tie-breaker when the you have (suspect) skew. Here we need to add some heavy hitter algorithm, which is potentially much more resource intensive than reservoir sampling. The other thing is that when we suspect skew, we would need to scan the data again (which would make the total of scans 3). So I would be slightly in favor of option 1 and a flag to disable it. > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > ​But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696321#comment-15696321 ] Herman van Hovell edited comment on SPARK-17788 at 11/25/16 5:09 PM: - That is fair. The solution is not that straightforward TBH: - Always add some kind of tie breaking value to the range. This could be random, but I'd rather add something like monotonically_increasing_id(). This always incurs some cost. - Only add a tie-breaker when the you have (suspect) skew. Here we need to add some heavy hitter algorithm, which is potentially much more resource intensive than reservoir sampling. The other thing is that when we suspect skew, we would need to scan the data again (which would make the total of scans 3). So I would be slightly in favor of option 1 and a flag to disable it. was (Author: hvanhovell): That is fair. The solution is not that straightforward TBH: - Always add some kind of tie breaking value to the range. This could be random, but I'd rather add something like monotonically_increasing_id(). This always incurs some cost. - Only add a tie-breaker when the you have (suspect) skew. Here we need to add some heavy hitter algorithm, which is potentially much more resource intensive than reservoir sampling. The other thing is that when we suspect skew, we would need to scan the data again (which would make the total of scans 3). So I would be slightly in favor of option 1 and a flag to disable it. > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > ​But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
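As a rough user-side illustration of option 1 above (not the proposed built-in change, just the idea behind it), a unique id can already be appended manually as the last sort key; the DataFrame below is an invented stand-in with many tied values.

{noformat}
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Stand-in for a skewed dataset: 10 distinct values repeated many times.
val df = spark.range(10000000L).selectExpr("id % 10 AS value")

// The per-row unique id breaks ties, so the range partitioner can split
// identical `value`s across several partitions instead of one huge task.
val sorted = df
  .withColumn("tie", monotonically_increasing_id())
  .orderBy(col("value"), col("tie"))
  .drop("tie")
{noformat}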
[jira] [Commented] (SPARK-18407) Inferred partition columns cause assertion error
[ https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696337#comment-15696337 ] Burak Yavuz commented on SPARK-18407: - This is also resolved as part of https://issues.apache.org/jira/browse/SPARK-18510 > Inferred partition columns cause assertion error > > > Key: SPARK-18407 > URL: https://issues.apache.org/jira/browse/SPARK-18407 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Michael Armbrust >Priority: Critical > > [This > assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408] > fails when you run a stream against json data that is stored in partitioned > folders, if you manually specify the schema and that schema omits the > partitioned columns. > My hunch is that we are inferring those columns even though the schema is > being passed in manually and adding them to the end. > While we are fixing this bug, it would be nice to make the assertion better. > Truncating is not terribly useful as, at least in my case, it truncated the > most interesting part. I changed it to this while debugging: > {code} > s""" > |Batch does not have expected schema > |Expected: ${output.mkString(",")} > |Actual: ${newPlan.output.mkString(",")} > | > |== Original == > |$logicalPlan > | > |== Batch == > |$newPlan >""".stripMargin > {code} > I also tried specifying the partition columns in the schema and now it > appears that they are filled with corrupted data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
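To make the reported setup easier to follow, here is a hypothetical sketch of it (the path, column names, and types are all invented): JSON data written under partitioned folders is read as a stream with a hand-written schema that leaves out the partition column, which is the combination the assertion reportedly trips on.

{code}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical layout on disk: /data/events/date=2016-11-25/part-*.json
// The hand-written schema covers only the fields inside the files and omits
// the `date` partition column, which Spark then infers from the folder names
// and appends on its own.
val fileSchema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val events = spark.readStream
  .schema(fileSchema)
  .json("/data/events")
{code}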
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696373#comment-15696373 ] Apache Spark commented on SPARK-3359: - User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16013 > `sbt/sbt unidoc` doesn't work with Java 8 > - > > Key: SPARK-3359 > URL: https://issues.apache.org/jira/browse/SPARK-3359 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Minor > > It seems that Java 8 is stricter on JavaDoc. I got many error messages like > {code} > [error] > /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: > error: modifier private not allowed here > [error] private abstract interface SparkHadoopMapRedUtil { > [error] ^ > {code} > This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18487) Add task completion listener to HashAggregate to avoid memory leak
[ https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-18487. --- Resolution: Not A Problem > Add task completion listener to HashAggregate to avoid memory leak > -- > > Key: SPARK-18487 > URL: https://issues.apache.org/jira/browse/SPARK-18487 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > The methods such as Dataset.show and take use Limit (CollectLimitExec) which > leverages SparkPlan.executeTake to efficiently collect required number of > elements back to the driver. > However, under wholestage codege, we usually release resources after all > elements are consumed (e.g., HashAggregate). In this case, we will not > release the resources and cause memory leak with Dataset.show, for example. > We can add task completion listener to HashAggregate to avoid the memory leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18487) Add task completion listener to HashAggregate to avoid memory leak
[ https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696458#comment-15696458 ] Reynold Xin commented on SPARK-18487: - As discussed on the pull request, this is not an issue. > Add task completion listener to HashAggregate to avoid memory leak > -- > > Key: SPARK-18487 > URL: https://issues.apache.org/jira/browse/SPARK-18487 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > The methods such as Dataset.show and take use Limit (CollectLimitExec) which > leverages SparkPlan.executeTake to efficiently collect required number of > elements back to the driver. > However, under wholestage codege, we usually release resources after all > elements are consumed (e.g., HashAggregate). In this case, we will not > release the resources and cause memory leak with Dataset.show, for example. > We can add task completion listener to HashAggregate to avoid the memory leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
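For context on what the proposal refers to (purely illustrative; as noted above, the JIRA was closed as not a problem), a task completion listener is registered through TaskContext and runs when the task finishes, whether or not the task's iterator was fully consumed. A minimal sketch, with a mutable buffer standing in for a real per-task resource:

{code}
import org.apache.spark.TaskContext

val rdd = spark.sparkContext.parallelize(1 to 100, 4)

// Take only the first element of each partition; the listener still runs when
// the task completes, which is where per-task resources would be released.
val firsts = rdd.mapPartitions { iter =>
  val buffer = new scala.collection.mutable.ArrayBuffer[Int]()  // stand-in for a resource
  TaskContext.get().addTaskCompletionListener { _ =>
    buffer.clear()  // cleanup runs even though the iterator is not fully consumed
  }
  iter.take(1)
}
firsts.collect()
{code}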
[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file
[ https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696541#comment-15696541 ] Herman van Hovell commented on SPARK-18220: --- I tried reproducing this but to no avail. [~jerryjung] could you give us a reproducible example? > ClassCastException occurs when using select query on ORC file > - > > Key: SPARK-18220 > URL: https://issues.apache.org/jira/browse/SPARK-18220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jerryjung > Labels: orcfile, sql > > Error message is below. > {noformat} > == > 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from > hdfs://xxx/part-00022 with {include: [true], offset: 0, length: > 9223372036854775807} > 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). > 1220 bytes result sent to driver > 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID > 42) in 116 ms on localhost (executor driver) (19/20) > 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID > 35) > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ORC dump info. > == > File Version: 0.12 with HIVE_8732 > 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from > hdfs://XXX/part-0 with {include: null, offset: 0, length: > 9223372036854775807} > 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on > read. Using file schema. 
> Rows: 7 > Compression: ZLIB > Compression size: 262144 > Type: > struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
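As a guess at the kind of setup behind the trace (the table and column names are invented, and a Hive-enabled SparkSession is assumed), the failing path appears to involve a Hive table with a varchar column whose rows are read back from ORC files through the Hive table reader, so a repro attempt might look like:

{noformat}
// Hypothetical repro attempt, not a confirmed reproduction of the report.
spark.sql("CREATE TABLE tb_varchar_orc (name VARCHAR(20)) STORED AS ORC")
spark.sql("INSERT INTO tb_varchar_orc VALUES ('abc'), ('defgh')")
spark.sql("SELECT name FROM tb_varchar_orc").show()
{noformat}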
[jira] [Commented] (SPARK-18543) SaveAsTable(CTAS) using overwrite could change table definition
[ https://issues.apache.org/jira/browse/SPARK-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696586#comment-15696586 ] Thomas Sebastian commented on SPARK-18543: -- I would like to take a look at this fix, if you have not already started. > SaveAsTable(CTAS) using overwrite could change table definition > --- > > Key: SPARK-18543 > URL: https://issues.apache.org/jira/browse/SPARK-18543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2 >Reporter: Xiao Li >Assignee: Xiao Li > > When the mode is OVERWRITE, we drop the Hive serde tables and create a data > source table. This is not right. > {code} > val tableName = "tab1" > withTable(tableName) { > sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS > key, 'abc' AS value") > val df = sql(s"SELECT key, value FROM $tableName") > df.write.mode(SaveMode.Overwrite).saveAsTable(tableName) > val tableMeta = > spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)) > assert(tableMeta.provider == > Some(spark.sessionState.conf.defaultDataSourceName)) > } > {code} > Based on the definition of OVERWRITE, no change should be made on the table > definition. When recreate the table, we need to create a Hive serde table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double
[ https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696598#comment-15696598 ] Thomas Sebastian commented on SPARK-18527: -- I am interested to work on this. > UDAFPercentile (bigint, array) needs explicity cast to double > - > > Key: SPARK-18527 > URL: https://issues.apache.org/jira/browse/SPARK-18527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell >Reporter: Fabian Boehnlein > > Same bug as SPARK-16228 but > {code}_FUNC_(bigint, array) {code} > instead of > {code}_FUNC_(bigint, double){code} > Fix of SPARK-16228 only fixes the non-array case that was hit. > {code} > sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)") > {code} > fails in Spark 2 shell. > Longer example > {code} > case class Record(key: Long, value: String) > val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, > s"val_$i"))) > recordsDF.createOrReplaceTempView("records") > sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, > 0.2, 0.1)) AS test FROM records") > org.apache.spark.sql.AnalysisException: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.had > oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible > choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164) > at > org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83) > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56) > at > org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
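On the affected versions, a workaround consistent with the issue title is to cast the percentile values explicitly so the call matches the (bigint, array of double) signature; a sketch, reusing the `records` temp view from the example above:

{code}
// Explicit DOUBLE casts keep Hive's UDAFPercentile resolver happy until the
// implicit conversion is fixed.
spark.sql("""
  SELECT percentile(key,
                    array(CAST(0.95 AS DOUBLE), CAST(0.5 AS DOUBLE), CAST(0.1 AS DOUBLE))) AS test
  FROM records
""").show()
{code}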
[jira] [Resolved] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18436. --- Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.1.0 2.0.3 > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan >Assignee: Jiang Xingbo > Labels: jdbc, sql > Fix For: 2.0.3, 2.1.0 > > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
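Until a fixed version is deployed, the empty IN list can also be avoided on the caller side; a sketch, assuming the same `filter` array as in the report and calling the loaded JDBC DataFrame `jdbcDF` (a placeholder name):

{code}
import org.apache.spark.sql.functions.{col, lit}

// An empty IN list can never match anything, so short-circuit it instead of
// letting the JDBC source render "cl_ult IN ()".
val sortDF =
  if (filter.isEmpty) jdbcDF.filter(lit(false))
  else jdbcDF.filter(col("cl_ult").isin(filter: _*))
{code}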
[jira] [Created] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
Nicholas Chammas created SPARK-18589: Summary: persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child" Key: SPARK-18589 URL: https://issues.apache.org/jira/browse/SPARK-18589 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.0.2, 2.1.0 Environment: Python 3.5, Java 8 Reporter: Nicholas Chammas Priority: Minor Smells like another optimizer bug that's similar to SPARK-17100 and SPARK-18254. I'm seeing this on 2.0.2 and on master at commit {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. I don't have a minimal repro for this yet, but the error I'm seeing is: {code} py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires attributes from more than one child. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) at scala.collection.immutable.Stream.foreach(Stream.scala:594) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) 
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83) at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2555) at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMet
[jira] [Commented] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696717#comment-15696717 ] Nicholas Chammas commented on SPARK-18589: -- cc [~davies] [~hvanhovell] > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Priority: Minor > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExec
[jira] [Created] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
Felix Cheung created SPARK-18590: Summary: R - Include package vignettes and help pages, build source package in Spark distribution Key: SPARK-18590 URL: https://issues.apache.org/jira/browse/SPARK-18590 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung Assignee: Felix Cheung Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18590: - Description: We should include in Spark distribution the built source package for SparkR. This will enable help and vignettes when the package is used. Also this source package is what we would release to CRAN. > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Blocker > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696848#comment-15696848 ] Apache Spark commented on SPARK-18590: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16014 > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Blocker > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18590: Assignee: Apache Spark (was: Felix Cheung) > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Blocker > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18590: Assignee: Felix Cheung (was: Apache Spark) > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Blocker > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18405) Add yarn-cluster mode support to Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696978#comment-15696978 ] Jeff Zhang commented on SPARK-18405: I think he means launching multiple Spark Thrift Servers in yarn-cluster mode so that the workload can be evenly distributed across them. It is also much easier to request a large container for the AM in yarn-cluster mode, whereas in yarn-client mode the memory of the local host may be limited and is not under YARN's capacity control. > Add yarn-cluster mode support to Spark Thrift Server > > > Key: SPARK-18405 > URL: https://issues.apache.org/jira/browse/SPARK-18405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.2, 2.0.0, 2.0.1 >Reporter: Prabhu Kasinathan > Labels: Spark, ThriftServer2 > > Currently, spark thrift server can run only on yarn-client mode. > Can we add Yarn-Cluster mode support to spark thrift server? > This will help us to launch multiple spark thrift server with different spark > configurations and it really help in large distributed clusters where there > is requirement to run complex sqls through STS. With client mode, there is a > chance to overload local host with too much driver memory. > Please let me know your thoughts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18405) Add yarn-cluster mode support to Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696978#comment-15696978 ] Jeff Zhang edited comment on SPARK-18405 at 11/26/16 1:01 AM: -- I think he means launching multiple Spark Thrift Servers in yarn-cluster mode so that the workload can be evenly distributed across them (e.g. one Spark Thrift Server per department). It is also much easier to request a large container for the AM in yarn-cluster mode, whereas in yarn-client mode the memory of the local host may be limited and is not under YARN's capacity control. was (Author: zjffdu): I think he means launching multiple Spark Thrift Servers in yarn-cluster mode so that the workload can be evenly distributed across them. It is also much easier to request a large container for the AM in yarn-cluster mode, whereas in yarn-client mode the memory of the local host may be limited and is not under YARN's capacity control. > Add yarn-cluster mode support to Spark Thrift Server > > > Key: SPARK-18405 > URL: https://issues.apache.org/jira/browse/SPARK-18405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.2, 2.0.0, 2.0.1 >Reporter: Prabhu Kasinathan > Labels: Spark, ThriftServer2 > > Currently, spark thrift server can run only on yarn-client mode. > Can we add Yarn-Cluster mode support to spark thrift server? > This will help us to launch multiple spark thrift server with different spark > configurations and it really help in large distributed clusters where there > is requirement to run complex sqls through STS. With client mode, there is a > chance to overload local host with too much driver memory. > Please let me know your thoughts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
Takeshi Yamamuro created SPARK-18591: Summary: Replace hash-based aggregates with sort-based ones if inputs already sorted Key: SPARK-18591 URL: https://issues.apache.org/jira/browse/SPARK-18591 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.2 Reporter: Takeshi Yamamuro Spark currently uses sort-based aggregates only under a limited condition: the cases where Spark cannot use partial aggregates and hash-based ones. However, if the input ordering already satisfies the requirements of sort-based aggregates, it seems sort-based ones are faster than hash-based ones.
{code}
./bin/spark-shell --conf spark.sql.shuffle.partitions=1

val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS value").sort($"key").cache

def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")
  result
}

timer {
  df.groupBy("key").count().count
}

// codegen'd hash aggregate
Elapsed time: 7.116962977s

// non-codegen'd sort aggregate
Elapsed time: 3.088816662s
{code}
If codegen'd sort-based aggregates are supported in SPARK-16844, this seems to make the performance gap bigger;
{code}
- codegen'd sort aggregate
Elapsed time: 1.645234684s
{code}
Therefore, it'd be better to use sort-based ones in this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
[ https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15697111#comment-15697111 ] Takeshi Yamamuro commented on SPARK-18591: -- If it's worth trying this, I'll do. I just made a prototype here; https://github.com/maropu/spark/commit/32b716cf02dfe8cba5b08b2dc3297bc061156630#diff-7d06cf071190dcbeda2fed6b039ec5d0R55 > Replace hash-based aggregates with sort-based ones if inputs already sorted > --- > > Key: SPARK-18591 > URL: https://issues.apache.org/jira/browse/SPARK-18591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently uses sort-based aggregates only in limited condition; the > cases where spark cannot use partial aggregates and hash-based ones. > However, if input ordering has already satisfied the requirements of > sort-based aggregates, it seems sort-based ones are faster than the other. > {code} > ./bin/spark-shell --conf spark.sql.shuffle.partitions=1 > val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS > value").sort($"key").cache > def timer[R](block: => R): R = { > val t0 = System.nanoTime() > val result = block > val t1 = System.nanoTime() > println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s") > result > } > timer { > df.groupBy("key").count().count > } > // codegen'd hash aggregate > Elapsed time: 7.116962977s > // non-codegen'd sort aggregarte > Elapsed time: 3.088816662s > {code} > If codegen'd sort-based aggregates are supported in SPARK-16844, this seems > to make the performance gap bigger; > {code} > - codegen'd sort aggregate > Elapsed time: 1.645234684s > {code} > Therefore, it'd be better to use sort-based ones in this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator
[ https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15697176#comment-15697176 ] Apache Spark commented on SPARK-17251: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16015
> "ClassCastException: OuterReference cannot be cast to NamedExpression" for
> correlated subquery on the RHS of an IN operator
> ---
>
> Key: SPARK-17251
> URL: https://issues.apache.org/jira/browse/SPARK-17251
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Josh Rosen
>
> The following test case produces a ClassCastException in the analyzer:
> {code}
> CREATE TABLE t1(a INTEGER);
> INSERT INTO t1 VALUES(1),(2);
> CREATE TABLE t2(b INTEGER);
> INSERT INTO t2 VALUES(1);
> SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
> {code}
> Here's the exception:
> {code}
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression
> at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
> at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
> at scala.collection.immutable.List.exists(List.scala:84)
> at org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
> at org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
> at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
> at org.apache.
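The quoted test case uses Hive-style DDL; a DDL-free spark-shell variant using temporary views (a sketch assuming Spark 2.0.x and the default spark session) should exercise the same analysis path:
{code}
// Temporary views stand in for the tables t1/t2 from the test case above.
import spark.implicits._

Seq(1, 2).toDF("a").createOrReplaceTempView("t1")
Seq(1).toDF("b").createOrReplaceTempView("t2")

// `a` does not exist in t2, so the inner `a` resolves as an outer (correlated) reference,
// which is the situation the ClassCastException above comes from.
spark.sql("SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2)").show()
{code}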