[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212848#comment-15212848 ] Jeff Zhang commented on SPARK-13587:

bq. Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes?

The problem is that most of the time you are not the administrator and don't have permission to do that. It's inefficient to ask your administrator to install the environment for you.

bq. one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark.

Some packages are binary and need to be compiled. And it is not easy to do dependency management this way.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and not easy to switch between different environments)
> Python now has two different virtualenv implementations: one is the native virtualenv, the other is through conda. This JIRA aims to bring these two tools to the distributed environment.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
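For readers unfamiliar with virtualenv, the per-worker bootstrapping step the proposal needs can be sketched with the Python standard library alone. This is an illustrative sketch, not the patch's actual implementation; `bootstrap_env` and the `pyspark_env` directory name are invented for the example, and a real executor-side bootstrap would enable pip and install the user's requirements next.

```python
import tempfile
import venv
from pathlib import Path

def bootstrap_env(base_dir: str) -> Path:
    """Create an isolated virtualenv under base_dir and return its path.

    with_pip=False keeps the sketch fast and offline; clear=True makes the
    call idempotent if a stale environment is left behind.
    """
    env_dir = Path(base_dir) / "pyspark_env"
    venv.EnvBuilder(with_pip=False, clear=True).create(env_dir)
    return env_dir

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = bootstrap_env(tmp)
        # A created venv always contains a pyvenv.cfg marker file.
        print((env / "pyvenv.cfg").exists())
```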
[jira] [Updated] (SPARK-14177) Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES"
[ https://issues.apache.org/jira/browse/SPARK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14177:

Description:
Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of the DDL command {{ALTER DATABASE}} is
{code}
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
{code}
{{ALTER DATABASE}} adds new (key, value) pairs to DBPROPERTIES.
The syntax of the DDL command {{DESCRIBE DATABASE}} is
{code}
DESCRIBE DATABASE [EXTENDED] db_name
{code}
{{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When EXTENDED is specified, it also shows the database's properties.

was: Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL The syntax of DDL command for ALTER DATABASE is {code} ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...) {code} {{ALTER DATABASE}} is to add new (key, value) pairs into DBPROPERTIES The syntax of DDL command for DESCRIBE DATABASE is {code} DESCRIBE DATABASE [EXTENDED] db_name {code} {{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When extended is true, it also shows the database's properties

> Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES"
>
> Key: SPARK-14177
> URL: https://issues.apache.org/jira/browse/SPARK-14177
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
> The syntax of the DDL command {{ALTER DATABASE}} is
> {code}
> ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
> {code}
> {{ALTER DATABASE}} adds new (key, value) pairs to DBPROPERTIES.
> The syntax of the DDL command {{DESCRIBE DATABASE}} is
> {code}
> DESCRIBE DATABASE [EXTENDED] db_name
> {code}
> {{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When EXTENDED is specified, it also shows the database's properties.
[jira] [Created] (SPARK-14177) Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES"
Xiao Li created SPARK-14177: --- Summary: Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES" Key: SPARK-14177 URL: https://issues.apache.org/jira/browse/SPARK-14177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li

Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of the DDL command {{ALTER DATABASE}} is
{code}
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
{code}
{{ALTER DATABASE}} adds new (key, value) pairs to DBPROPERTIES.
The syntax of the DDL command {{DESCRIBE DATABASE}} is
{code}
DESCRIBE DATABASE [EXTENDED] db_name
{code}
{{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When EXTENDED is specified, it also shows the database's properties.
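As a toy illustration of the parsing task this ticket asks for, the two statements can be recognized with a couple of regular expressions. This is not Spark's actual parser; the function name, the returned dict shape, and the simplistic property grammar (no quoted values) are all assumptions made for the sketch.

```python
import re

# Toy grammar for the two DDL statements discussed in the ticket.
ALTER_DB = re.compile(
    r"ALTER\s+(?:DATABASE|SCHEMA)\s+(\w+)\s+SET\s+DBPROPERTIES\s*\((.+)\)\s*$",
    re.IGNORECASE,
)
DESCRIBE_DB = re.compile(
    r"DESCRIBE\s+DATABASE\s+(EXTENDED\s+)?(\w+)\s*$", re.IGNORECASE
)

def parse_ddl(sql: str):
    """Return a small dict describing the statement, or None if unrecognized."""
    m = ALTER_DB.match(sql.strip())
    if m:
        # Split "k1=v1, k2=v2" into a properties dict.
        props = dict(kv.strip().split("=", 1) for kv in m.group(2).split(","))
        return {"op": "alter_db", "db": m.group(1), "props": props}
    m = DESCRIBE_DB.match(sql.strip())
    if m:
        return {"op": "describe_db", "db": m.group(2),
                "extended": m.group(1) is not None}
    return None
```

A real implementation would live in the SQL grammar rather than use regexes, but the sketch shows the pieces of information each command has to surface.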
[jira] [Commented] (SPARK-14176) Add processing time trigger
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212828#comment-15212828 ] Apache Spark commented on SPARK-14176: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11976 > Add processing time trigger > --- > > Key: SPARK-14176 > URL: https://issues.apache.org/jira/browse/SPARK-14176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Add a processing time trigger to control the batch processing speed
[jira] [Assigned] (SPARK-14176) Add processing time trigger
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14176: Assignee: Shixiong Zhu (was: Apache Spark) > Add processing time trigger > --- > > Key: SPARK-14176 > URL: https://issues.apache.org/jira/browse/SPARK-14176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Add a processing time trigger to control the batch processing speed
[jira] [Assigned] (SPARK-14176) Add processing time trigger
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14176: Assignee: Apache Spark (was: Shixiong Zhu) > Add processing time trigger > --- > > Key: SPARK-14176 > URL: https://issues.apache.org/jira/browse/SPARK-14176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Add a processing time trigger to control the batch processing speed
[jira] [Created] (SPARK-14176) Add processing time trigger
Shixiong Zhu created SPARK-14176: Summary: Add processing time trigger Key: SPARK-14176 URL: https://issues.apache.org/jira/browse/SPARK-14176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Add a processing time trigger to control the batch processing speed
[jira] [Assigned] (SPARK-14175) Simplify whole stage codegen interface
[ https://issues.apache.org/jira/browse/SPARK-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14175: Assignee: Apache Spark (was: Davies Liu) > Simplify whole stage codegen interface > -- > > Key: SPARK-14175 > URL: https://issues.apache.org/jira/browse/SPARK-14175 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > 1. remove consumeChild > 2. always create code for UnsafeRow and variables.
[jira] [Commented] (SPARK-14175) Simplify whole stage codegen interface
[ https://issues.apache.org/jira/browse/SPARK-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212820#comment-15212820 ] Apache Spark commented on SPARK-14175: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11975 > Simplify whole stage codegen interface > -- > > Key: SPARK-14175 > URL: https://issues.apache.org/jira/browse/SPARK-14175 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > 1. remove consumeChild > 2. always create code for UnsafeRow and variables.
[jira] [Assigned] (SPARK-14175) Simplify whole stage codegen interface
[ https://issues.apache.org/jira/browse/SPARK-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14175: Assignee: Davies Liu (was: Apache Spark) > Simplify whole stage codegen interface > -- > > Key: SPARK-14175 > URL: https://issues.apache.org/jira/browse/SPARK-14175 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > 1. remove consumeChild > 2. always create code for UnsafeRow and variables.
[jira] [Created] (SPARK-14175) Simplify whole stage codegen interface
Davies Liu created SPARK-14175: -- Summary: Simplify whole stage codegen interface Key: SPARK-14175 URL: https://issues.apache.org/jira/browse/SPARK-14175 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu 1. remove consumeChild 2. always create code for UnsafeRow and variables.
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212776#comment-15212776 ] zhengruifeng commented on SPARK-14174: -- There is another sklearn example for MiniBatch KMeans: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212772#comment-15212772 ] Apache Spark commented on SPARK-14174: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/11974

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Assigned] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14174: Assignee: (was: Apache Spark)

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Assigned] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14174: Assignee: Apache Spark

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Assignee: Apache Spark
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Created] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
zhengruifeng created SPARK-14174: Summary: Accelerate KMeans via Mini-Batch EM Key: SPARK-14174 URL: https://issues.apache.org/jira/browse/SPARK-14174 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Minor

The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

I have implemented mini-batch k-means in MLlib, and the acceleration is really significant. The MiniBatch KMeans is named XMeans in the following lines.

{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}

All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:

KMeans: 2876 sec
MiniBatch KMeans (fraction=0.1): 263 sec
MiniBatch KMeans (fraction=0.01): 90 sec

With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
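For readers unfamiliar with the algorithm, the mini-batch update rule can be sketched in a few lines of plain Python. This is illustrative only and unrelated to the MLlib implementation discussed above; the function and parameter names are invented for the example, and it uses the classic online update with a per-center decaying step size.

```python
import random

def mini_batch_kmeans(points, k, batch_size, iterations, seed=123):
    """Mini-batch k-means over 2-D points; returns the list of k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    counts = [0] * k  # per-center assignment counts drive the learning rate

    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            # Assign the point to its nearest center (squared Euclidean).
            j = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                            + (p[1] - centers[i][1]) ** 2)
            counts[j] += 1
            eta = 1.0 / counts[j]  # decaying step size, as in Sculley's variant
            cx, cy = centers[j]
            # Move the center a little toward the point.
            centers[j] = (cx + eta * (p[0] - cx), cy + eta * (p[1] - cy))
    return centers

if __name__ == "__main__":
    # Two well-separated blobs; the centers usually land near the blob means.
    rng = random.Random(0)
    blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
    blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)]
    print(mini_batch_kmeans(blob_a + blob_b, k=2, batch_size=40, iterations=50))
```

Because each center is a running average of the points assigned to it, the updates touch only `batch_size` points per iteration instead of the whole dataset, which is exactly where the speedups reported above come from.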
[jira] [Resolved] (SPARK-14109) HDFSMetadataLog throws AbstractFileSystem exception with common schemes like s3n
[ https://issues.apache.org/jira/browse/SPARK-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-14109. --- Resolution: Fixed

> HDFSMetadataLog throws AbstractFileSystem exception with common schemes like s3n
> -
>
> Key: SPARK-14109
> URL: https://issues.apache.org/jira/browse/SPARK-14109
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Tathagata Das
> Assignee: Tathagata Das
>
> HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many schemes for which there are FileSystem implementations. In those cases, rather than failing completely, we should fall back to the FileSystem-based implementation, and log a warning that there may be file consistency issues in case the log directory is concurrently modified.
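The fix described above is an instance of a general pattern: try the preferred API, and on failure fall back to a weaker one while logging a warning about the lost guarantee. A minimal language-agnostic sketch in Python (the class and function names here are invented for illustration and are not Spark's):

```python
import logging

logger = logging.getLogger("metadata_log")

class AtomicRenamer:
    """Preferred backend: atomic rename (stands in for the FileContext API)."""
    def __init__(self, supported: bool):
        self.supported = supported

    def rename(self, src: str, dst: str) -> str:
        if not self.supported:
            # Mirrors the case where no FileContext exists for a scheme.
            raise IOError("no implementation for this scheme")
        return f"atomic:{src}->{dst}"

class BestEffortRenamer:
    """Fallback backend (stands in for the FileSystem-based implementation)."""
    def rename(self, src: str, dst: str) -> str:
        return f"best-effort:{src}->{dst}"

def rename_with_fallback(preferred: AtomicRenamer, src: str, dst: str) -> str:
    try:
        return preferred.rename(src, dst)
    except IOError:
        logger.warning(
            "Falling back to non-atomic rename; concurrent modification of "
            "the log directory may cause consistency issues.")
        return BestEffortRenamer().rename(src, dst)
```

The key design point, as in the ticket, is that the fallback degrades a guarantee (atomicity) rather than failing the job outright, and the warning makes the degraded guarantee visible to operators.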
[jira] [Commented] (SPARK-14139) Dataset loses nullability in operations with RowEncoder
[ https://issues.apache.org/jira/browse/SPARK-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212743#comment-15212743 ] koert kuipers commented on SPARK-14139: --- It is not clear to me whether the goal should remain to derive the schema from the logical plan, or to revert back to a schema from the encoder. I am going to assume a schema from the logical plan, and will try to fix nullable for the logical plan.

> Dataset loses nullability in operations with RowEncoder
> ---
>
> Key: SPARK-14139
> URL: https://issues.apache.org/jira/browse/SPARK-14139
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: koert kuipers
> Priority: Minor
>
> When I do
> {noformat}
> val df1 = sc.makeRDD(1 to 3).toDF
> val df2 = df1.map(row => Row(row(0).asInstanceOf[Int] + 1))(RowEncoder(df1.schema))
> println(s"schema before ${df1.schema} and after ${df2.schema}")
> {noformat}
> I get:
> {noformat}
> schema before StructType(StructField(value,IntegerType,false)) and after StructType(StructField(value,IntegerType,true))
> {noformat}
> The change in field nullable is unexpected and I consider it a bug.
> This bug was introduced in:
> [SPARK-13244][SQL] Migrates DataFrame to Dataset
[jira] [Updated] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
[ https://issues.apache.org/jira/browse/SPARK-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-14173: --

Description: When I submit a streaming application in *yarn cluster* mode, I can't find "spark.executor.extraJavaOptions" in the Spark UI. The log prints a warning: {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}

was: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning: {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}

> Ignoring config property “spark.executor.extraJavaOptions”
> --
>
> Key: SPARK-14173
> URL: https://issues.apache.org/jira/browse/SPARK-14173
> Project: Spark
> Issue Type: Bug
> Components: Streaming, YARN
> Affects Versions: 1.5.2
> Reporter: liyan
>
> When I submit a streaming application in *yarn cluster* mode, I can't find "spark.executor.extraJavaOptions" in the Spark UI.
> The log prints a warning:
> {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}
[jira] [Updated] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
[ https://issues.apache.org/jira/browse/SPARK-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-14173: -- Description: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning: {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote} was: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning:Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" > Ignoring config property “spark.executor.extraJavaOptions” > -- > > Key: SPARK-14173 > URL: https://issues.apache.org/jira/browse/SPARK-14173 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN >Affects Versions: 1.5.2 >Reporter: liyan > > when i submit streaming application on *yarn cluster* , i can't find > "spark.executor.extraJavaOptions" in Spark UI. > the log prints warning: > {quote}Ignoring none-spark config property : > "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}
[jira] [Updated] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
[ https://issues.apache.org/jira/browse/SPARK-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-14173: -- Description: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning:Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" was: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning:Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" > Ignoring config property “spark.executor.extraJavaOptions” > -- > > Key: SPARK-14173 > URL: https://issues.apache.org/jira/browse/SPARK-14173 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN >Affects Versions: 1.5.2 >Reporter: liyan > > when i submit streaming application on *yarn cluster* , i can't find > "spark.executor.extraJavaOptions" in Spark UI. > the log prints warning:Ignoring none-spark config property : > "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"
[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212726#comment-15212726 ] Jack Franson commented on SPARK-4743: - Hi, I'm still running into "Task not serializable" errors with aggregateByKey when my initial value is an Avro object (I've configured the Kryo serializer to handle it via a custom KryoRegistrator). I don't see the issue when I pass in an empty instance of the Avro object (via new MyAvroObject()), as the initial value, but then I get an exception from the Kryo serializer since required fields are null. To get around that, I tried creating an initial value with all the required fields set to defaults, but then I was hit with a java.io.NotSerializableException causing the "Task not serializable" exception to fail the job, which seems to indicate that Java serialization is taking over again. This is on Spark 1.5.2 with Avro 1.7.7. The line throwing the fatal Exception is org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304). This is the closest ticket I could find around this issue, so I'm wondering if there are further tweaks that the Spark libraries can make to use the SparkEnv.serializer, or if the problem is on my end (any tips in that case would be much appreciated!). Thanks for your help. > Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and > foldByKey > > > Key: SPARK-4743 > URL: https://issues.apache.org/jira/browse/SPARK-4743 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ivan Vergiliev >Assignee: Ivan Vergiliev > Labels: performance > Fix For: 1.3.0 > > > AggregateByKey and foldByKey in PairRDDFunctions both use the closure > serializer to serialize and deserialize the initial value. This means that > the Java serializer is always used, which can be very expensive if there's a > large number of groups. 
Calling combineByKey manually and using the normal > serializer instead of the closure one improved the performance on the dataset > I'm testing with by about 30-35%. > I'm not familiar enough with the codebase to be certain that replacing the > serializer here is OK, but it works correctly in my tests, and it's only > serializing a single value of type U, which should be serializable by the > default one since it can be the output of a job. Let me know if I'm missing > anything. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
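The combineByKey workaround mentioned in the comment above can be sketched as follows. This is a hedged illustration, not the reporter's actual code: `MyKryoRegistrator` is a hypothetical registrator name, and the point is only that `combineByKey` builds its zero value inside `createCombiner` on the executor, so the initial value is never shipped through the closure serializer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

object CombineByKeyWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("combineByKey-workaround")
      .set("spark.serializer", classOf[KryoSerializer].getName)
      // .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // hypothetical
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Equivalent of aggregateByKey(zero)(seqOp, combOp), except the zero value
    // is created per-key on the executor instead of being serialized from the
    // driver with the closure serializer.
    val aggregated = pairs.combineByKey(
      (v: Int) => Seq(v),                  // createCombiner: build the "zero" here
      (acc: Seq[Int], v: Int) => acc :+ v, // mergeValue
      (a: Seq[Int], b: Seq[Int]) => a ++ b // mergeCombiners
    )
    aggregated.collect().foreach(println)
    sc.stop()
  }
}
```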
[jira] [Created] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
liyan created SPARK-14173: - Summary: Ignoring config property “spark.executor.extraJavaOptions” Key: SPARK-14173 URL: https://issues.apache.org/jira/browse/SPARK-14173 Project: Spark Issue Type: Bug Components: Streaming, YARN Affects Versions: 1.5.2 Reporter: liyan When I submit a streaming application in *yarn-cluster* mode, I can't find "spark.executor.extraJavaOptions" in the Spark UI. The log prints the warning: Ignoring non-spark config property: "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
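The quoted warning prints the whole string "spark.executor.extraJavaOptions= -Xloggc:..." as a single property name, which suggests the key and value were fused somewhere (for example, a malformed spark-defaults.conf line). A minimal sketch of setting the option as a proper key/value pair, assuming the GC log path from the report:

```scala
import org.apache.spark.SparkConf

// Key and value passed separately, so Spark recognizes the property.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.executor.extraJavaOptions", "-Xloggc:/home/streaming/gc.log")

// The spark-submit equivalent is:
//   spark-submit --conf "spark.executor.extraJavaOptions=-Xloggc:/home/streaming/gc.log" ...
```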
[jira] [Updated] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yingji Zhang updated SPARK-14172: - Description: When the hive sql contains nondeterministic fields, spark plan will not push down the partition predicate to the HiveTableScan. For example: {code} -- consider following query which uses a random function to sample rows SELECT * FROM table_a WHERE partition_col = 'some_value' AND rand() < 0.01; {code} The spark plan will not push down the partition predicate to HiveTableScan which ends up scanning all partitions data from the table. was: When the hive sql contains nondeterministic fields, spark plan will not push down the partition predicate to the HiveTableScan. For example: -- consider following query which uses a random function to sample rows SELECT * FROM table_a WHERE partition_col = 'some_value' AND rand() < 0.01; The spark plan will not push down the partition predicate to HiveTableScan which ends up scanning all partitions data from the table. > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the hive sql contains nondeterministic fields, spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider following query which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The spark plan will not push down the partition predicate to HiveTableScan > which ends up scanning all partitions data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
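Until the pushdown itself is fixed, one hedged workaround sketch (assuming a {{SQLContext}} in scope as {{sqlContext}}) is to keep the partition predicate deterministic and on its own, so it can reach HiveTableScan, and to sample separately rather than putting {{rand()}} into the WHERE clause:

```scala
// Only the deterministic partition predicate goes into the query,
// so partition pruning is not blocked by a nondeterministic filter.
val partitioned = sqlContext.sql(
  "SELECT * FROM table_a WHERE partition_col = 'some_value'")

// DataFrame.sample draws an approximate 1% sample over the already-pruned rows.
val sampled = partitioned.sample(withReplacement = false, fraction = 0.01)
```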
[jira] [Updated] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianfeng Hu updated SPARK-14171: Description: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! {code}--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM src LIMIT 1"), sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) + +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct key) FROM src LIMIT 1"), + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) } test("UDFIntegerToString") { {code} When running the test suite, we can see this error: {code} - Generic UDAF aggregates *** FAILED *** org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) at 
org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ... Cause: java.lang.reflect.InvocationTargetException: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second argument must be a constant, but double was passed instead. 
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ... {code} was: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! {{--- a/sql/hive/src/test/scala/org/ap
[jira] [Updated] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianfeng Hu updated SPARK-14171: Description: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! {{--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM src LIMIT 1"), sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) + +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct key) FROM src LIMIT 1"), + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) } test("UDFIntegerToString") { }} When running the test suite, we can see this error: {{ - Generic UDAF aggregates *** FAILED *** org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) at 
org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ... Cause: java.lang.reflect.InvocationTargetException: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second argument must be a constant, but double was passed instead. 
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ... }} was: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! ```--- a/sql/hive/src/test/scala/org/apache/spark/sql/
[jira] [Created] (SPARK-14172) Hive table partition predicate not passed down correctly
Yingji Zhang created SPARK-14172: Summary: Hive table partition predicate not passed down correctly Key: SPARK-14172 URL: https://issues.apache.org/jira/browse/SPARK-14172 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Yingji Zhang Priority: Critical When the hive sql contains nondeterministic fields, spark plan will not push down the partition predicate to the HiveTableScan. For example: -- consider following query which uses a random function to sample rows SELECT * FROM table_a WHERE partition_col = 'some_value' AND rand() < 0.01; The spark plan will not push down the partition predicate to HiveTableScan which ends up scanning all partitions data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
Jianfeng Hu created SPARK-14171: --- Summary: UDAF aggregates argument object inspector not parsed correctly Key: SPARK-14171 URL: https://issues.apache.org/jira/browse/SPARK-14171 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Jianfeng Hu Priority: Blocker For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! ```--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM src LIMIT 1"), sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) + +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct key) FROM src LIMIT 1"), + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) } test("UDFIntegerToString") {``` When running the test suite, we can see this error: ``` - Generic UDAF aggregates *** FAILED *** org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) at 
org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ... Cause: java.lang.reflect.InvocationTargetException: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second argument must be a constant, but double was passed instead. 
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ... ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
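As a hedged workaround sketch for the failure above (assuming a {{SQLContext}} in scope as {{sqlContext}}), computing the Hive UDAF and the distinct aggregate in separate queries keeps DistinctAggregationRewriter away from {{percentile_approx}}, whose second argument must remain a constant:

```scala
// Each query has a single aggregate, so no distinct-aggregation rewrite
// touches the percentile_approx argument.
val p = sqlContext.sql("SELECT percentile_approx(key, 0.9) AS p90 FROM src")
val d = sqlContext.sql("SELECT count(DISTINCT key) AS distinct_keys FROM src")

// Both results are single rows; a cross join recombines them into one row.
val combined = p.join(d)
```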
[jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212671#comment-15212671 ] Reynold Xin commented on SPARK-12436: - I don't have everything paged in, but why isn't an empty string just a string type? > If all values of a JSON field is null, JSON's inferSchema should return > NullType instead of StringType > -- > > Key: SPARK-12436 > URL: https://issues.apache.org/jira/browse/SPARK-12436 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > Labels: starter > > Right now, JSON's inferSchema will return {{StringType}} for a field that > always has null values or an {{ArrayType(StringType)}} for a field that > always has empty array values. Although this behavior makes writing JSON data > to other data sources easy (i.e. when writing data, we do not need to remove > those {{NullType}} or {{ArrayType(NullType)}} columns), it makes downstream > applications hard-pressed to reason about the actual schema of the data and thus makes > schema merging hard. We should allow JSON's inferSchema to return {{NullType}} > and {{ArrayType(NullType)}}. Also, we need to make sure that when we write > data out, we should remove those {{NullType}} or {{ArrayType(NullType)}} > columns first. > Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same > thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). > To finish this work, we need to finish the following sub-tasks: > * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. > * Determine whether we need to add the operation of removing {{NullType}} and > {{ArrayType(NullType)}} columns from the data that will be written out for all > data sources (i.e. data sources based on our data source API and Hive tables). > Or, we should just add this operation for certain data sources (e.g. > Parquet). 
For example, we may not need this operation for Hive because Hive > has VoidObjectInspector. > * Implement the change and get it merged to Spark master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
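The write-side cleanup proposed above can be sketched as follows. This is a hedged illustration, not the eventual implementation: the input path is hypothetical, only top-level columns are pruned, and a {{SQLContext}} is assumed in scope as {{sqlContext}}.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, NullType}

// A column whose inferred type is NullType (or ArrayType(NullType))
// carries no recoverable values and can be dropped before writing.
def prunableType(dt: DataType): Boolean = dt match {
  case NullType               => true
  case ArrayType(NullType, _) => true
  case _                      => false
}

val df = sqlContext.read.json("events.json") // hypothetical path
val kept = df.schema.fields.collect {
  case f if !prunableType(f.dataType) => f.name
}
val pruned = df.select(kept.map(name => df.col(name)): _*)
```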
[jira] [Commented] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
[ https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212670#comment-15212670 ] Reynold Xin commented on SPARK-1153: [~ntietz] changing this will very likely make performance regress for long ids, due to the lack of specialization. You might want to look into graphframes for more general graph functionality too: https://github.com/graphframes/graphframes > Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs. > -- > > Key: SPARK-1153 > URL: https://issues.apache.org/jira/browse/SPARK-1153 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 0.9.0 >Reporter: Deepak Nulu > > Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be > able to use {{UUID}} as the vertex ID type because the data I want to process > with GraphX uses that type for its primary keys. Others might have a different > type for their primary keys. Generalizing {{VertexId}} (with a type class) > will help in such cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
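Since {{VertexId}} stays {{Long}} for specialization, one hedged workaround sketch (assuming a {{SparkContext}} in scope as {{sc}}) is to assign each UUID a unique Long and keep the mapping for translating results back:

```scala
import java.util.UUID
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

val uuids: RDD[UUID] = sc.parallelize(Seq(UUID.randomUUID(), UUID.randomUUID()))

// zipWithUniqueId assigns a distinct Long per element without a shuffle;
// the resulting pairs serve as a UUID -> VertexId lookup.
val uuidToId: RDD[(UUID, VertexId)] = uuids.zipWithUniqueId()

// Vertex attributes carry the original UUID, so results can be mapped back.
val vertices: RDD[(VertexId, UUID)] = uuidToId.map(_.swap)
// Edges expressed over UUIDs would be joined through uuidToId the same way
// before constructing Graph(vertices, edges).
```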
[jira] [Resolved] (SPARK-14073) Move streaming-flume back to Spark
[ https://issues.apache.org/jira/browse/SPARK-14073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14073. - Resolution: Fixed Fix Version/s: 2.0.0 > Move streaming-flume back to Spark > -- > > Key: SPARK-14073 > URL: https://issues.apache.org/jira/browse/SPARK-14073 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14170) Remove the PR template before pushing changes
[ https://issues.apache.org/jira/browse/SPARK-14170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14170: Assignee: Apache Spark > Remove the PR template before pushing changes > - > > Key: SPARK-14170 > URL: https://issues.apache.org/jira/browse/SPARK-14170 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > As (briefly) discussed in the mailing list, it would be nice to not include > the PR template in every commit message (when people forget to delete it). > This can be done by making some small changes to the template, and update the > merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14170) Remove the PR template before pushing changes
[ https://issues.apache.org/jira/browse/SPARK-14170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212653#comment-15212653 ] Apache Spark commented on SPARK-14170: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11973 > Remove the PR template before pushing changes > - > > Key: SPARK-14170 > URL: https://issues.apache.org/jira/browse/SPARK-14170 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > As (briefly) discussed in the mailing list, it would be nice to not include > the PR template in every commit message (when people forget to delete it). > This can be done by making some small changes to the template, and update the > merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14170) Remove the PR template before pushing changes
[ https://issues.apache.org/jira/browse/SPARK-14170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14170: Assignee: (was: Apache Spark) > Remove the PR template before pushing changes > - > > Key: SPARK-14170 > URL: https://issues.apache.org/jira/browse/SPARK-14170 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > As (briefly) discussed in the mailing list, it would be nice to not include > the PR template in every commit message (when people forget to delete it). > This can be done by making some small changes to the template, and update the > merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14170) Remove the PR template before pushing changes
Marcelo Vanzin created SPARK-14170: -- Summary: Remove the PR template before pushing changes Key: SPARK-14170 URL: https://issues.apache.org/jira/browse/SPARK-14170 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Minor As (briefly) discussed in the mailing list, it would be nice to not include the PR template in every commit message (when people forget to delete it). This can be done by making some small changes to the template, and update the merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14013) Properly implement temporary functions in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14013: Assignee: (was: Apache Spark) > Properly implement temporary functions in SessionCatalog > > > Key: SPARK-14013 > URL: https://issues.apache.org/jira/browse/SPARK-14013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or > > Right now `SessionCatalog` just contains `CatalogFunction`, which is > metadata. In the future the catalog should probably take in a function > registry or something. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14013) Properly implement temporary functions in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212634#comment-15212634 ] Apache Spark commented on SPARK-14013: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11972 > Properly implement temporary functions in SessionCatalog > > > Key: SPARK-14013 > URL: https://issues.apache.org/jira/browse/SPARK-14013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or > > Right now `SessionCatalog` just contains `CatalogFunction`, which is > metadata. In the future the catalog should probably take in a function > registry or something. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14013) Properly implement temporary functions in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14013: Assignee: Apache Spark > Properly implement temporary functions in SessionCatalog > > > Key: SPARK-14013 > URL: https://issues.apache.org/jira/browse/SPARK-14013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > > Right now `SessionCatalog` just contains `CatalogFunction`, which is > metadata. In the future the catalog should probably take in a function > registry or something. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13955: Assignee: Apache Spark > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Apache Spark > > I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly > jar is not uploaded to HDFS. This may be a known issue from the SPARK-11157 work; > creating this ticket to track it. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
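Per the warning quoted above, a hedged configuration sketch: pointing {{spark.yarn.jars}} (or {{spark.yarn.archive}}) at a pre-staged HDFS location avoids the fallback that uploads everything under SPARK_HOME, which here sweeps in unrelated jars like apache-rat. The HDFS paths below are hypothetical.

```scala
import org.apache.spark.SparkConf

// Jars staged once on HDFS; executors fetch them from there instead of
// having the client upload the local SPARK_HOME contents per application.
val conf = new SparkConf()
  .set("spark.yarn.jars", "hdfs://localhost:9000/spark/jars/*")
// or, as a single archive:
//   .set("spark.yarn.archive", "hdfs://localhost:9000/spark/spark-libs.zip")
```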
[jira] [Commented] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212609#comment-15212609 ] Apache Spark commented on SPARK-13955: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11970 > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly > jar is not uploaded to HDFS. This may be a known issue from the work in progress on > SPARK-11157; creating this ticket to track it. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13955: Assignee: (was: Apache Spark) > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly > jar is not uploaded to HDFS. This may be a known issue from the work in progress on > SPARK-11157; creating this ticket to track it. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14169) Add UninterruptibleThread
Shixiong Zhu created SPARK-14169: Summary: Add UninterruptibleThread Key: SPARK-14169 URL: https://issues.apache.org/jira/browse/SPARK-14169 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14169) Add UninterruptibleThread
[ https://issues.apache.org/jira/browse/SPARK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212608#comment-15212608 ] Apache Spark commented on SPARK-14169: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11971 > Add UninterruptibleThread > - > > Key: SPARK-14169 > URL: https://issues.apache.org/jira/browse/SPARK-14169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Extract the workaround for HADOOP-10622 introduced by #11940 into > UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14169) Add UninterruptibleThread
[ https://issues.apache.org/jira/browse/SPARK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14169: Assignee: Apache Spark (was: Shixiong Zhu) > Add UninterruptibleThread > - > > Key: SPARK-14169 > URL: https://issues.apache.org/jira/browse/SPARK-14169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Extract the workaround for HADOOP-10622 introduced by #11940 into > UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14169) Add UninterruptibleThread
[ https://issues.apache.org/jira/browse/SPARK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14169: Assignee: Shixiong Zhu (was: Apache Spark) > Add UninterruptibleThread > - > > Key: SPARK-14169 > URL: https://issues.apache.org/jira/browse/SPARK-14169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Extract the workaround for HADOOP-10622 introduced by #11940 into > UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212590#comment-15212590 ] holdenk commented on SPARK-14141: - So with RDDs there is `toLocalIterator` which you could use to do this (although you should make sure your input is cached first). > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212587#comment-15212587 ] holdenk commented on SPARK-14141: - The more I look at this, the more I think it's not a good fit for Spark. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212563#comment-15212563 ] Luke Miner commented on SPARK-14141: Is there any way to do this process in chunks: read a chunk of data into a dict and then append to a pandas dataframe with the pre-specified datatypes? The big advantage of a pandas dataframe with categorical datatypes is that it can potentially have a much much smaller memory footprint. However, if everything is loaded into a huge dict beforehand, there's much less of an upside. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
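The memory-footprint claim in this thread is easy to check in plain pandas, with no Spark involved. A minimal sketch comparing an object-dtype column against the same low-cardinality data stored as `category`:

```python
import pandas as pd

# A low-cardinality string column: the kind of data where 'category' pays off.
values = ["red", "green", "blue"] * 100_000

as_object = pd.Series(values, dtype="object")
as_category = pd.Series(values, dtype="category")

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

# The categorical encoding stores each distinct string once plus small
# integer codes per row, so it is far smaller than the object column,
# which holds a full Python string object for every row.
print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
assert cat_bytes < obj_bytes
```

This is why building the frame with the final dtypes up front (rather than materializing everything as objects first) matters for the chunked approach discussed above: the peak memory is dominated by whatever intermediate representation exists before the cast.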
[jira] [Updated] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14159: -- Target Version/s: 1.6.2, 2.0.0 (was: 2.0.0) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-14159: --- > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Joseph K. Bradley (was: Apache Spark) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Apache Spark (was: Joseph K. Bradley) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14159. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11965 [https://github.com/apache/spark/pull/11965] > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212548#comment-15212548 ] holdenk commented on SPARK-14141: - So following up, `from_records` doesn't take dtypes, although we could convert to a dict-like structure as an intermediate step and use `from_dict` if the user specified specific types. It's less clear to me that this is what we want to do, but I'll make a WIP PR so we can take a look and see if it looks like a reasonable change or something we'd rather not expose. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
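One way such an API could work internally is a sketch like the following. The helper name and signature here are hypothetical illustrations of the proposed {{toPandas(dtypes=...)}} behaviour, not actual Spark code: build the frame from records first, then cast the requested columns with `astype`:

```python
import pandas as pd

def to_pandas_with_dtypes(rows, columns, dtypes=None):
    """Hypothetical helper mimicking the proposed toPandas(dtypes=...) API:
    build the frame from record tuples, then cast the requested columns."""
    pdf = pd.DataFrame.from_records(rows, columns=columns)
    if dtypes:
        # astype accepts a {column: dtype} mapping, including 'category'.
        pdf = pdf.astype(dtypes)
    return pdf

rows = [(1.0, True, "a"), (2.5, False, "b"), (3.0, True, "a")]
pdf = to_pandas_with_dtypes(
    rows,
    columns=["a", "c", "d"],
    dtypes={"a": "float64", "c": "bool", "d": "category"},
)
print(pdf.dtypes)
```

Note the caveat raised in this thread still applies: the records exist as plain Python objects before the cast, so the intermediate representation, not the final frame, sets the peak memory.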
[jira] [Assigned] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14168: Assignee: Imran Rashid (was: Apache Spark) > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and logs an error when it wasn't. However, > it turns out its OK for there to be memory that wasn't released when an > Iterator isn't read to completion, eg., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if tasks fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. Eg., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212501#comment-15212501 ] Apache Spark commented on SPARK-14168: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/11969 > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and logs an error when it wasn't. However, > it turns out its OK for there to be memory that wasn't released when an > Iterator isn't read to completion, eg., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if tasks fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. Eg., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212497#comment-15212497 ] Imran Rashid edited comment on SPARK-11293 at 3/25/16 10:27 PM: I've seen a few people misled by the error msg, so I'd like to try to downgrade that. I've created a separate ticket, [SPARK-14168 | https://issues.apache.org/jira/browse/SPARK-14168], just for changing the msg, in case there is some fix in store here. was (Author: irashid): I've seen a few people misled by the error msg, so I'd like to try to downgrade that. I've created a separate ticket just for changing the msg, in case there is some fix in store here. > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14168: Assignee: Apache Spark (was: Imran Rashid) > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and logs an error when it wasn't. However, > it turns out its OK for there to be memory that wasn't released when an > Iterator isn't read to completion, eg., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if tasks fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. Eg., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212497#comment-15212497 ] Imran Rashid commented on SPARK-11293: -- I've seen a few people misled by the error msg, so I'd like to try to downgrade that. I've created a separate ticket just for changing the msg, in case there is some fix in store here. > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
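The {{CompletionIterator}} mentioned above is a Scala utility inside Spark; the following is only an illustrative Python analogue of the pattern it implements: wrap an iterator and fire a cleanup callback exactly once, as soon as the underlying iterator is exhausted.

```python
class CompletionIterator:
    """Sketch of the CompletionIterator pattern: run a cleanup callback
    exactly once when the wrapped iterator is fully consumed."""

    def __init__(self, it, on_complete):
        self._it = iter(it)
        self._on_complete = on_complete
        self._done = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            # Fire the cleanup (e.g. releasing sorter memory) only once.
            if not self._done:
                self._done = True
                self._on_complete()
            raise

released = []
it = CompletionIterator(range(3), lambda: released.append("freed"))
assert list(it) == [0, 1, 2]   # consuming to the end triggers cleanup
assert released == ["freed"]   # and it runs exactly once
```

Note the flip side of this design: if the consumer abandons the iterator early, as {{take()}} does, the callback never runs. That unconsumed-iterator case is exactly the benign "leak" that the managed-memory warning discussion in SPARK-14168 is about.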
[jira] [Created] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
Imran Rashid created SPARK-14168: Summary: Managed Memory Leak Msg Should Only Be a Warning Key: SPARK-14168 URL: https://issues.apache.org/jira/browse/SPARK-14168 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor When a task is completed, executors check to see if all managed memory for the task was correctly released, and log an error when it wasn't. However, it turns out it's OK for there to be memory that wasn't released when an Iterator isn't read to completion, e.g., with {{rdd.take()}}. This results in a scary error msg in the executor logs: {noformat} 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = 16259594 bytes, TID = 24 {noformat} Furthermore, if a task fails for any reason, this msg is also triggered. This can lead users to believe that the failure was from the memory leak, when the root cause could be entirely different. E.g., the same error msg appears in executor logs with this clearly broken user code run with {{spark-shell --master 'local-cluster[2,2,1024]'}} {code} sc.parallelize(0 to 1000, 2).map(x => x % 1 -> x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") }.collect {code} We should downgrade the msg to a warning and link to a more detailed explanation. See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14091) Improve performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14091. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11911 [https://github.com/apache/spark/pull/11911] > Improve performance of SparkContext.getCallSite() > - > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 2.0.0 > > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated eagerly, causing thread dumps to be computed anyway. This has an > impact when lots of RDDs are created (e.g., close to 3-7 seconds is spent > when 1000+ RDDs are present, which is significant when the entire query > runtime is on the order of 10-20 seconds). > Creating this jira to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
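The fix described in this ticket, evaluating {{Utils.getCallSite()}} only when no local property already supplies the call site, is ordinary laziness. A small Python sketch of the idea (function names here are illustrative stand-ins, not Spark's actual code):

```python
def expensive_call_site():
    # Stand-in for Utils.getCallSite(): pretend walking the stack is costly.
    # A counter tracks how many times the expensive path actually runs.
    expensive_call_site.calls += 1
    return "fn at example.py:1"

expensive_call_site.calls = 0

def get_call_site(local_override=None, compute=expensive_call_site):
    # The original bug: compute() was invoked eagerly even when
    # local_override (the SHORT_FORM/LONG_FORM local property) already
    # supplied the value. The lazy version below only falls back to the
    # expensive computation when no override is set.
    return local_override if local_override is not None else compute()

# With a dummy call site installed, the expensive path is never taken.
assert get_call_site(local_override="dummy") == "dummy"
assert expensive_call_site.calls == 0

# Without an override, it is computed once, on demand.
assert get_call_site() == "fn at example.py:1"
assert expensive_call_site.calls == 1
```

Avoiding the eager computation is what removes the multi-second overhead when thousands of RDDs are created under {{withDummyCallSite}}.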
[jira] [Updated] (SPARK-14091) Improve performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14091: --- Summary: Improve performance of SparkContext.getCallSite() (was: Consider improving performance of SparkContext.getCallSite()) > Improve performance of SparkContext.getCallSite() > - > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated earlier, causing thread dumps to be computed. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can have a significant impact when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14091: --- Assignee: Rajesh Balamohan > Consider improving performance of SparkContext.getCallSite() > > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated earlier, causing thread dumps to be computed. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can have a significant impact when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-14167. --- Resolution: Won't Fix I am closing this issue based on [~joshrosen]'s comment: "I'm actually wary of this change. There's a number of times where I've found redundant returns but have chosen not to remove them because I was afraid of future code refactorings or changes accidentally making the implicit return no longer take effect. If this isn't strictly necessary, I'd prefer to not do this cleanup." > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs
[ https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212442#comment-15212442 ] Gayathri Murali commented on SPARK-13783: - Thanks [~josephkb]. I can go first, as I am almost done making changes. I could definitely review [~yanboliang]'s code and would really appreciate the same help. > Model export/import for spark.ml: GBTs > -- > > Key: SPARK-13783 > URL: https://issues.apache.org/jira/browse/SPARK-13783 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both GBTClassifier and GBTRegressor. The implementation > should reuse the one for DecisionTree*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212441#comment-15212441 ] Apache Spark commented on SPARK-14167: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11968 > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14167: Assignee: Apache Spark > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14167: Assignee: (was: Apache Spark) > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212438#comment-15212438 ] holdenk commented on SPARK-13842: - This makes some additional sense when we consider that `StructType` in Scala has an `apply` function which (when given a single column name) returns the corresponding `StructField` - so part of this could be viewed as API parity. > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
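The proposal above can be exercised without Spark. Below is a minimal pure-Python stand-in (not the real `pyspark.sql.types` classes) showing the suggested `__iter__` and `__getitem__` behavior, including the KeyError on a missing field name:

```python
class StructField:
    """Tiny stand-in for pyspark.sql.types.StructField."""
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class StructType:
    """Tiny stand-in for pyspark.sql.types.StructType illustrating the
    proposed dunder methods; attribute names mirror the real class."""
    def __init__(self, fields):
        self.fields = fields
        self.names = [f.name for f in fields]

    def __iter__(self):
        """Iterate the fields upon request."""
        return iter(self.fields)

    def __getitem__(self, key):
        """Return the corresponding StructField by name."""
        fields_dict = dict(zip(self.names, self.fields))
        try:
            return fields_dict[key]
        except KeyError:
            raise KeyError('No field named {}'.format(key))
```

This mirrors the Scala `StructType.apply(fieldName)` parity argument: `schema["abc"]` returns the matching field, and iteration yields fields in declaration order.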
[jira] [Created] (SPARK-14167) Remove redundant `return` in Scala code
Dongjoon Hyun created SPARK-14167: - Summary: Remove redundant `return` in Scala code Key: SPARK-14167 URL: https://issues.apache.org/jira/browse/SPARK-14167 Project: Spark Issue Type: Task Reporter: Dongjoon Hyun Priority: Trivial Spark Scala code takes advantage of the `return` statement for control flow in many cases, but that does not mean *redundant* `return` statements are needed. This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs
[ https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212428#comment-15212428 ] Joseph K. Bradley commented on SPARK-13783: --- I'd prefer what [~GayathriMurali] mentioned; that's what is done in spark.mllib. That should be more efficient (taking more advantage of columnar storage). I do want us to save Params for each tree since that will be more robust to future code changes (rather than re-creating them based on the GBT params). However, that may require some code refactoring so that the GBT can get a set of {{jsonParams}} for each tree. Given that, the GBT could store that JSON in another DataFrame. How does that sound? It may make sense to implement export/import for one ensemble model before the other since both might require changes to the single-tree save/load. Would you mind helping to review each other's work? Who would prefer to go first? Thanks! > Model export/import for spark.ml: GBTs > -- > > Key: SPARK-13783 > URL: https://issues.apache.org/jira/browse/SPARK-13783 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both GBTClassifier and GBTRegressor. The implementation > should reuse the one for DecisionTree*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
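A rough sketch of the per-tree {{jsonParams}} idea discussed above, assuming each tree's Params can be rendered as a JSON object. The function names are hypothetical; real code would store these rows in a DataFrame next to the node data rather than in a Python list:

```python
import json

def trees_to_rows(tree_params):
    """Serialize each tree's params to one (treeID, jsonParams) row,
    so the whole ensemble's metadata fits in a single table."""
    return [(tree_id, json.dumps(params, sort_keys=True))
            for tree_id, params in enumerate(tree_params)]

def rows_to_trees(rows):
    """Reconstruct per-tree params, restoring tree order by treeID."""
    return [json.loads(p) for _, p in sorted(rows)]
```

Round-tripping through this format is lossless for JSON-representable params, which is the robustness property the comment is after: trees are restored from their own saved Params rather than re-derived from the GBT's params.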
[jira] [Issue Comment Deleted] (SPARK-6725) Model export/import for Pipeline API (Scala)
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Comment: was deleted (was: Ping! Is anyone interested in picking up the GBT or RandomForest issues to get them into 2.0?) > Model export/import for Pipeline API (Scala) > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212410#comment-15212410 ] holdenk commented on SPARK-14141: - I can take a crack at this, seems pretty reasonable & small. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
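What the proposed {{dtypes}} mapping could do can be sketched with plain columnar data. This stand-in applies per-column converters; a real implementation would call pandas' `astype` on each Series instead, and the function name and supported dtype strings here are invented for illustration:

```python
def apply_dtypes(columns, dtypes):
    """Convert each named column according to a dtype mapping.

    columns: dict of column name -> list of values
    dtypes:  dict of column name -> dtype string (subset supported here)
    Columns without a requested dtype are passed through unchanged.
    """
    converters = {'float64': float, 'int64': int, 'bool': bool, 'str': str}
    out = {}
    for name, values in columns.items():
        conv = converters.get(dtypes.get(name))
        out[name] = [conv(v) for v in values] if conv else list(values)
    return out
```

The memory argument in the request would come from pandas itself (e.g. `category` columns storing codes instead of objects); this sketch only shows where the dtype mapping would plug in.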
[jira] [Updated] (SPARK-13783) Model export/import for spark.ml: GBTs
[ https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13783: -- Target Version/s: 2.0.0 > Model export/import for spark.ml: GBTs > -- > > Key: SPARK-13783 > URL: https://issues.apache.org/jira/browse/SPARK-13783 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both GBTClassifier and GBTRegressor. The implementation > should reuse the one for DecisionTree*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13784) Model export/import for spark.ml: RandomForests
[ https://issues.apache.org/jira/browse/SPARK-13784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13784: -- Target Version/s: 2.0.0 > Model export/import for spark.ml: RandomForests > --- > > Key: SPARK-13784 > URL: https://issues.apache.org/jira/browse/SPARK-13784 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both RandomForestClassifier and RandomForestRegressor. The > implementation should reuse the one for DecisionTree*. > It should augment NodeData with a tree ID so that all nodes can be stored in > a single DataFrame. It should reconstruct the trees in a distributed fashion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
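The NodeData-with-tree-ID layout described above can be illustrated without Spark. A sketch under assumed, hypothetical names (the real version would store the rows in a DataFrame and rebuild trees in a distributed fashion):

```python
from collections import defaultdict

def flatten_forest(forest):
    """Augment each node record with its tree ID so all nodes of all
    trees fit in one table: one (treeID, *node) row per node."""
    rows = []
    for tree_id, nodes in enumerate(forest):
        for node in nodes:
            rows.append((tree_id,) + node)
    return rows

def rebuild_forest(rows):
    """Group rows by tree ID and restore the per-tree node lists."""
    trees = defaultdict(list)
    for tree_id, *node in rows:
        trees[tree_id].append(tuple(node))
    return [trees[i] for i in sorted(trees)]
```

Grouping by tree ID is what makes the reconstruction naturally parallel: each group is independent, so each tree can be rebuilt on a separate task.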
[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API (Scala)
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Target Version/s: 2.0.0 > Model export/import for Pipeline API (Scala) > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API (Scala)
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Summary: Model export/import for Pipeline API (Scala) (was: Model export/import for Pipeline API) > Model export/import for Pipeline API (Scala) > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212402#comment-15212402 ] Joseph K. Bradley commented on SPARK-6725: -- Ping! Is anyone interested in picking up the GBT or RandomForest issues to get them into 2.0? > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14081: Assignee: Apache Spark > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford >Assignee: Apache Spark > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
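The type-preserving behavior the report asks for can be sketched with plain columnar data. This is a hypothetical helper, not Spark's actual fix: the point is that the fill value is cast to each column's declared type, rather than the column being widened to the fill value's type.

```python
def fill_zero(columns):
    """Null-fill with zero while preserving each column's declared dtype.

    columns: dict of name -> (dtype, values). The zero used for filling
    is chosen per column, so 'float' stays 'float' instead of being
    widened to 'double'.
    """
    zero_of = {'int': 0, 'bigint': 0, 'float': 0.0, 'double': 0.0}
    return {name: (dtype, [zero_of[dtype] if v is None else v for v in vals])
            for name, (dtype, vals) in columns.items()}
```

Compare this with the reported behavior, where a single `fill(0: Double)` value drives a cast of every numeric column to double.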
[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212395#comment-15212395 ] Apache Spark commented on SPARK-14081: -- User 'traviscrawford' has created a pull request for this issue: https://github.com/apache/spark/pull/11967 > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14081: Assignee: (was: Apache Spark) > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14165) NoSuchElementException: None.get when joining DataFrames with Seq of fields of different case
[ https://issues.apache.org/jira/browse/SPARK-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14165: -- Issue Type: Bug (was: Improvement) Hm yeah this is a problem then. I'm not 100% sure which case sensitivity rules apply but seems like it's case-insensitive here. In that case whatever went to get column "abc" in the second table should succeed above. > NoSuchElementException: None.get when joining DataFrames with Seq of fields > of different case > - > > Key: SPARK-14165 > URL: https://issues.apache.org/jira/browse/SPARK-14165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> val left = Seq((1,"a")).toDF("id", "abc") > left: org.apache.spark.sql.DataFrame = [id: int, abc: string] > scala> val right = Seq((1,"a")).toDF("id", "ABC") > right: org.apache.spark.sql.DataFrame = [id: int, ABC: string] > scala> left.join(right, Seq("abc")) > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1426) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1417) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:58) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2299) > at 
org.apache.spark.sql.Dataset.join(Dataset.scala:553) > at org.apache.spark.sql.Dataset.join(Dataset.scala:526) > ... 51 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
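Until the analyzer handles this more gracefully, one way to express the intended join is to reference each DataFrame's own column explicitly instead of a shared-name {{Seq}} (a sketch against the {{left}}/{{right}} frames above; each name must match its own side exactly):

{code}
scala> // Qualify the columns so no case-insensitive shared-name lookup is needed
scala> left.join(right, left("abc") === right("ABC"))
{code}

Note this keeps both columns in the output, unlike the {{Seq}}-based using-join, so one side may need to be dropped afterwards.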
[jira] [Commented] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212369#comment-15212369 ] Apache Spark commented on SPARK-14159: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/11965 > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Joseph K. Bradley (was: Apache Spark) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Apache Spark (was: Joseph K. Bradley) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212368#comment-15212368 ] Jacek Laskowski commented on SPARK-14108: - I'd like to see the code to show case it since: {code} scala> sqlContext.emptyDataFrame.count res93: Long = 0 {code} It's simply "can't reproduce" then. > calling count() on empty dataframe throws java.util.NoSuchElementException > -- > > Key: SPARK-14108 > URL: https://issues.apache.org/jira/browse/SPARK-14108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Tested in Hadoop 2.7.2 EMR 4.x >Reporter: Krishna Shekhram >Priority: Minor > > When calling count() on empty dataframe, then spark code still tries to > iterate through the empty iterator and throws > java.util.NoSuchElementException. > Stacktrace : > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) > at scala.collection.IterableLike$class.head(IterableLike.scala:91) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514) > Code Snippet: > This code fails > if(this.df !=null){ > long countOfRows = this.df.count(); > } > If I do this then it works > if(this.df !=null && ! 
this.df.rdd().isEmpty()){ > long countOfRows = this.df.count(); }
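As the discussion suggests, {{count()}} on an empty DataFrame is expected to return 0, so the {{isEmpty}} guard should not be necessary. A minimal defensive wrapper (a hypothetical helper, not Spark API) would only need the null check:

{code}
// Hypothetical helper: a null DataFrame reference counts as zero rows;
// an empty-but-non-null DataFrame is handled by count() itself, which returns 0.
def safeCount(df: org.apache.spark.sql.DataFrame): Long =
  Option(df).map(_.count()).getOrElse(0L)
{code}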
[jira] [Commented] (SPARK-14165) NoSuchElementException: None.get when joining DataFrames with Seq of fields of different case
[ https://issues.apache.org/jira/browse/SPARK-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212365#comment-15212365 ] Jacek Laskowski commented on SPARK-14165: - Right, but: {code} scala> left.join(right, $"abc" === $"ABC") org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: abc#378, abc#386.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:261) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:145) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$12$$anonfun$applyOrElse$6$$anonfun$24.apply(Analyzer.scala:572) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$12$$anonfun$applyOrElse$6$$anonfun$24.apply(Analyzer.scala:572) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$12$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:572) {code} It sees the two columns as the same, doesn't it? 
> NoSuchElementException: None.get when joining DataFrames with Seq of fields > of different case > - > > Key: SPARK-14165 > URL: https://issues.apache.org/jira/browse/SPARK-14165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> val left = Seq((1,"a")).toDF("id", "abc") > left: org.apache.spark.sql.DataFrame = [id: int, abc: string] > scala> val right = Seq((1,"a")).toDF("id", "ABC") > right: org.apache.spark.sql.DataFrame = [id: int, ABC: string] > scala> left.join(right, Seq("abc")) > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1426) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1417) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:58) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2299) > a
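A sketch of a disambiguating variant of the same join, using aliases so each side's column can be named without triggering the ambiguous-reference error (assuming the {{left}}/{{right}} frames from the snippet above):

{code}
scala> // Alias each side, then qualify the columns through the aliases
scala> left.as("l").join(right.as("r"), $"l.abc" === $"r.ABC")
{code}

Whether the qualified names resolve still depends on the case-sensitivity setting in play, which is exactly the ambiguity under discussion here.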
[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212364#comment-15212364 ] Xusen Yin commented on SPARK-13786: --- I'll work on it. > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is not a real need to make Python > Evaluators be MLWritable, as far as I can tell.
[jira] [Issue Comment Deleted] (SPARK-11666) Find the best `k` by cutting bisecting k-means cluster tree without recomputation
[ https://issues.apache.org/jira/browse/SPARK-11666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak KÖSE updated SPARK-11666: --- Comment: was deleted (was: Hi, can you share links for references about that?) > Find the best `k` by cutting bisecting k-means cluster tree without > recomputation > - > > Key: SPARK-11666 > URL: https://issues.apache.org/jira/browse/SPARK-11666 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > For example, scikit-learn's hierarchical clustering supports a feature to > extract a partial tree from the result. We should support a feature like that > in order to reduce compute cost.
[jira] [Created] (SPARK-14166) Add deterministic sampling like in Hive
Ruslan Dautkhanov created SPARK-14166: - Summary: Add deterministic sampling like in Hive Key: SPARK-14166 URL: https://issues.apache.org/jira/browse/SPARK-14166 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Ruslan Dautkhanov Priority: Minor It would be great to have Spark support deterministic sampling too: {quote} set hive.sample.seednumber=12345; SELECT * FROM table_a TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id); {quote} Note that sampling is based on hash(individual_id). https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling In this case sampling is deterministic. When we have new data loads, we get very stable samples and use it all the time in Hive. The only reason for "BUCKET x OUT OF y " syntax in Hive is "If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table."
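For comparison, Spark's existing row-level sampling already accepts an explicit seed, which makes a sample repeatable across runs on the same data, though it is not hash-bucketed on a column the way Hive's {{TABLESAMPLE ... ON individual_id}} is (sketch; {{table_a}} is assumed to be a DataFrame):

{code}
// Seeded Bernoulli sampling: sample(withReplacement, fraction, seed).
// Deterministic for a fixed seed and input, but the selected rows shift
// whenever the underlying data changes.
val sampled = table_a.sample(false, 17.0 / 25.0, 12345L)
{code}

The Hive behavior requested here is stronger: the same individual_id always hashes to the same bucket, so samples stay stable under new data loads.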
[jira] [Comment Edited] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212363#comment-15212363 ] Burak KÖSE edited comment on SPARK-14108 at 3/25/16 8:38 PM: - Please give an example test case. was (Author: whisper): Please give a test case. > calling count() on empty dataframe throws java.util.NoSuchElementException > -- > > Key: SPARK-14108 > URL: https://issues.apache.org/jira/browse/SPARK-14108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Tested in Hadoop 2.7.2 EMR 4.x >Reporter: Krishna Shekhram >Priority: Minor > > When calling count() on empty dataframe, then spark code still tries to > iterate through the empty iterator and throws > java.util.NoSuchElementException. > Stacktrace : > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) > at scala.collection.IterableLike$class.head(IterableLike.scala:91) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514) > Code Snippet: > This code fails > if(this.df !=null){ > long countOfRows = this.df.count(); > } > If I do this then it works > if(this.df !=null && ! 
this.df.rdd().isEmpty()){ > long countOfRows = this.df.count(); }
[jira] [Commented] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212363#comment-15212363 ] Burak KÖSE commented on SPARK-14108: Please give a test case. > calling count() on empty dataframe throws java.util.NoSuchElementException > -- > > Key: SPARK-14108 > URL: https://issues.apache.org/jira/browse/SPARK-14108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Tested in Hadoop 2.7.2 EMR 4.x >Reporter: Krishna Shekhram >Priority: Minor > > When calling count() on empty dataframe, then spark code still tries to > iterate through the empty iterator and throws > java.util.NoSuchElementException. > Stacktrace : > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) > at scala.collection.IterableLike$class.head(IterableLike.scala:91) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514) > Code Snippet: > This code fails > if(this.df !=null){ > long countOfRows = this.df.count(); > } > If I do this then it works > if(this.df !=null && ! 
this.df.rdd().isEmpty()){ > long countOfRows = this.df.count(); }
[jira] [Updated] (SPARK-14165) NoSuchElementException: None.get when joining DataFrames with Seq of fields of different case
[ https://issues.apache.org/jira/browse/SPARK-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14165: -- Issue Type: Improvement (was: Bug) It's case sensitive right? your tables don't actually both have a column "abc" that your join condition claims. Are you just looking to improve the exception? that's fine. > NoSuchElementException: None.get when joining DataFrames with Seq of fields > of different case > - > > Key: SPARK-14165 > URL: https://issues.apache.org/jira/browse/SPARK-14165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> val left = Seq((1,"a")).toDF("id", "abc") > left: org.apache.spark.sql.DataFrame = [id: int, abc: string] > scala> val right = Seq((1,"a")).toDF("id", "ABC") > right: org.apache.spark.sql.DataFrame = [id: int, ABC: string] > scala> left.join(right, Seq("abc")) > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1426) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1417) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:58) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2299) > at 
org.apache.spark.sql.Dataset.join(Dataset.scala:553) > at org.apache.spark.sql.Dataset.join(Dataset.scala:526) > ... 51 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14131) Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite
[ https://issues.apache.org/jira/browse/SPARK-14131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14131. -- Resolution: Fixed Fix Version/s: 2.0.0 > Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite > --- > > Key: SPARK-14131 > URL: https://issues.apache.org/jira/browse/SPARK-14131 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 > ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we > interrupt some thread running Shell.runCommand, we may hit this issue.
[jira] [Commented] (SPARK-14123) Function related commands
[ https://issues.apache.org/jira/browse/SPARK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212356#comment-15212356 ] Bo Meng commented on SPARK-14123: - I will be working on this. Thanks. > Function related commands > - > > Key: SPARK-14123 > URL: https://issues.apache.org/jira/browse/SPARK-14123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > We should support TOK_CREATEFUNCTION/TOK_DROPFUNCTION. > For now, we can throw exceptions for TOK_CREATEMACRO/TOK_DROPMACRO.
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: Please go through the current example code and list possible duplicates. 
To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > To find out all examples of ml/mllib that don't contain "example on": > {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} > Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** 
TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14164) Improve input layer validation of MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14164: -- Summary: Improve input layer validation of MultilayerPerceptronClassifier (was: Improve input layer validation to MultilayerPerceptronClassifier) > Improve input layer validation of MultilayerPerceptronClassifier > > > Key: SPARK-14164 > URL: https://issues.apache.org/jira/browse/SPARK-14164 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Dongjoon Hyun >Priority: Minor > > This issue improves the input layer validation of > MultilayerPerceptronClassifier and adds related testcases. > {code:title=MultilayerPerceptronClassifier.scala|borderStyle=solid} > -// TODO: how to check ALSO that all elements are greater than 0? > -ParamValidators.arrayLengthGt(1) > +(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1 > {code}
[jira] [Updated] (SPARK-14164) Improve input layer validation of MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-14164:
----------------------------------
Description:
This issue improves an input layer validation and adds related testcases to MultilayerPerceptronClassifier.
{code:title=MultilayerPerceptronClassifier.scala|borderStyle=solid}
-// TODO: how to check ALSO that all elements are greater than 0?
-ParamValidators.arrayLengthGt(1)
+(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
{code}

was:
This issue improves an input layer validation and related testcases to MultilayerPerceptronClassifier.
{code:title=MultilayerPerceptronClassifier.scala|borderStyle=solid}
-// TODO: how to check ALSO that all elements are greater than 0?
-ParamValidators.arrayLengthGt(1)
+(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
{code}

> Improve input layer validation of MultilayerPerceptronClassifier
> ----------------------------------------------------------------
>
> Key: SPARK-14164
> URL: https://issues.apache.org/jira/browse/SPARK-14164
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Dongjoon Hyun
> Priority: Minor
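The behavior of the new Scala validator, {{(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1}}, can be sketched as a Python analogue: every layer size must be strictly positive, and there must be at least two layers (input and output). The function name {{valid_layers}} is hypothetical, for illustration only.

```python
def valid_layers(layers):
    """Python analogue of the updated MultilayerPerceptronClassifier
    layers validator: all sizes > 0 AND at least 2 layers.

    The old validator (ParamValidators.arrayLengthGt(1)) only checked the
    length, so e.g. [4, 0, 3] passed; the new predicate rejects it."""
    return all(n > 0 for n in layers) and len(layers) > 1
```

This makes clear what the TODO in the removed line was asking for: checking the length alone was not enough, since non-positive layer sizes slipped through.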
[jira] [Assigned] (SPARK-14164) Improve input layer validation to MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14164:
------------------------------------
Assignee: (was: Apache Spark)

> Improve input layer validation to MultilayerPerceptronClassifier
> ----------------------------------------------------------------
>
> Key: SPARK-14164
> URL: https://issues.apache.org/jira/browse/SPARK-14164
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Dongjoon Hyun
> Priority: Minor
[jira] [Assigned] (SPARK-14164) Improve input layer validation to MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14164:
------------------------------------
Assignee: Apache Spark

> Improve input layer validation to MultilayerPerceptronClassifier
> ----------------------------------------------------------------
>
> Key: SPARK-14164
> URL: https://issues.apache.org/jira/browse/SPARK-14164
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Minor