[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212848#comment-15212848 ] Jeff Zhang commented on SPARK-13587:

bq. Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes?

The problem is that most of the time you are not the administrator and don't have permission to do that. It's inefficient to ask your administrator to install the environment for you.

bq. one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark.

Some packages are binary and need to be compiled. And it is not easy to do dependency management this way.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and not easy to switch between different environments)
> Python now has two different virtualenv implementations: one is the native virtualenv, the other is through conda. This JIRA aims to bring these two tools to the distributed environment.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
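For readers unfamiliar with virtualenv, the per-worker bootstrapping step the proposal needs can be sketched with the Python standard library alone. This is an illustrative sketch, not the patch's actual implementation; `bootstrap_env` and the `pyspark_env` directory name are invented for the example, and a real executor-side bootstrap would enable pip and install the user's requirements next.

```python
import tempfile
import venv
from pathlib import Path

def bootstrap_env(base_dir: str) -> Path:
    """Create an isolated virtualenv under base_dir and return its path.

    with_pip=False keeps the sketch fast and offline; clear=True makes the
    call idempotent if a stale environment is left behind.
    """
    env_dir = Path(base_dir) / "pyspark_env"
    venv.EnvBuilder(with_pip=False, clear=True).create(env_dir)
    return env_dir

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = bootstrap_env(tmp)
        # A created venv always contains a pyvenv.cfg marker file.
        print((env / "pyvenv.cfg").exists())
```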
[jira] [Updated] (SPARK-14177) Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES"
[ https://issues.apache.org/jira/browse/SPARK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14177:

Description:
Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of the DDL command {{ALTER DATABASE}} is
{code}
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
{code}
{{ALTER DATABASE}} adds new (key, value) pairs to DBPROPERTIES.
The syntax of the DDL command {{DESCRIBE DATABASE}} is
{code}
DESCRIBE DATABASE [EXTENDED] db_name
{code}
{{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When EXTENDED is specified, it also shows the database's properties.

was: Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL The syntax of DDL command for ALTER DATABASE is {code} ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...) {code} {{ALTER DATABASE}} is to add new (key, value) pairs into DBPROPERTIES The syntax of DDL command for DESCRIBE DATABASE is {code} DESCRIBE DATABASE [EXTENDED] db_name {code} {{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When extended is true, it also shows the database's properties

> Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES"
>
> Key: SPARK-14177
> URL: https://issues.apache.org/jira/browse/SPARK-14177
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
> The syntax of the DDL command {{ALTER DATABASE}} is
> {code}
> ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
> {code}
> {{ALTER DATABASE}} adds new (key, value) pairs to DBPROPERTIES.
> The syntax of the DDL command {{DESCRIBE DATABASE}} is
> {code}
> DESCRIBE DATABASE [EXTENDED] db_name
> {code}
> {{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When EXTENDED is specified, it also shows the database's properties.
[jira] [Created] (SPARK-14177) Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES"
Xiao Li created SPARK-14177: --- Summary: Parse DDL command: "DESCRIBE DATABASE" and "ALTER DATABASE SET DBPROPERTIES" Key: SPARK-14177 URL: https://issues.apache.org/jira/browse/SPARK-14177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li

Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of the DDL command {{ALTER DATABASE}} is
{code}
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
{code}
{{ALTER DATABASE}} adds new (key, value) pairs to DBPROPERTIES.
The syntax of the DDL command {{DESCRIBE DATABASE}} is
{code}
DESCRIBE DATABASE [EXTENDED] db_name
{code}
{{DESCRIBE DATABASE}} shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When EXTENDED is specified, it also shows the database's properties.
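As a toy illustration of the parsing task this ticket asks for, the two statements can be recognized with a couple of regular expressions. This is not Spark's actual parser; the function name, the returned dict shape, and the simplistic property grammar (no quoted values) are all assumptions made for the sketch.

```python
import re

# Toy grammar for the two DDL statements discussed in the ticket.
ALTER_DB = re.compile(
    r"ALTER\s+(?:DATABASE|SCHEMA)\s+(\w+)\s+SET\s+DBPROPERTIES\s*\((.+)\)\s*$",
    re.IGNORECASE,
)
DESCRIBE_DB = re.compile(
    r"DESCRIBE\s+DATABASE\s+(EXTENDED\s+)?(\w+)\s*$", re.IGNORECASE
)

def parse_ddl(sql: str):
    """Return a small dict describing the statement, or None if unrecognized."""
    m = ALTER_DB.match(sql.strip())
    if m:
        # Split "k1=v1, k2=v2" into a properties dict.
        props = dict(kv.strip().split("=", 1) for kv in m.group(2).split(","))
        return {"op": "alter_db", "db": m.group(1), "props": props}
    m = DESCRIBE_DB.match(sql.strip())
    if m:
        return {"op": "describe_db", "db": m.group(2),
                "extended": m.group(1) is not None}
    return None
```

A real implementation would live in the SQL grammar rather than use regexes, but the sketch shows the pieces of information each command has to surface.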
[jira] [Commented] (SPARK-14176) Add processing time trigger
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212828#comment-15212828 ] Apache Spark commented on SPARK-14176: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11976 > Add processing time trigger > --- > > Key: SPARK-14176 > URL: https://issues.apache.org/jira/browse/SPARK-14176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Add a processing time trigger to control the batch processing speed
[jira] [Assigned] (SPARK-14176) Add processing time trigger
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14176: Assignee: Shixiong Zhu (was: Apache Spark) > Add processing time trigger > --- > > Key: SPARK-14176 > URL: https://issues.apache.org/jira/browse/SPARK-14176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Add a processing time trigger to control the batch processing speed
[jira] [Assigned] (SPARK-14176) Add processing time trigger
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14176: Assignee: Apache Spark (was: Shixiong Zhu) > Add processing time trigger > --- > > Key: SPARK-14176 > URL: https://issues.apache.org/jira/browse/SPARK-14176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Add a processing time trigger to control the batch processing speed
[jira] [Created] (SPARK-14176) Add processing time trigger
Shixiong Zhu created SPARK-14176: Summary: Add processing time trigger Key: SPARK-14176 URL: https://issues.apache.org/jira/browse/SPARK-14176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Add a processing time trigger to control the batch processing speed
[jira] [Assigned] (SPARK-14175) Simplify whole stage codegen interface
[ https://issues.apache.org/jira/browse/SPARK-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14175: Assignee: Apache Spark (was: Davies Liu) > Simplify whole stage codegen interface > -- > > Key: SPARK-14175 > URL: https://issues.apache.org/jira/browse/SPARK-14175 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > 1. remove consumeChild > 2. always create code for UnsafeRow and variables.
[jira] [Commented] (SPARK-14175) Simplify whole stage codegen interface
[ https://issues.apache.org/jira/browse/SPARK-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212820#comment-15212820 ] Apache Spark commented on SPARK-14175: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11975 > Simplify whole stage codegen interface > -- > > Key: SPARK-14175 > URL: https://issues.apache.org/jira/browse/SPARK-14175 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > 1. remove consumeChild > 2. always create code for UnsafeRow and variables.
[jira] [Assigned] (SPARK-14175) Simplify whole stage codegen interface
[ https://issues.apache.org/jira/browse/SPARK-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14175: Assignee: Davies Liu (was: Apache Spark) > Simplify whole stage codegen interface > -- > > Key: SPARK-14175 > URL: https://issues.apache.org/jira/browse/SPARK-14175 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > 1. remove consumeChild > 2. always create code for UnsafeRow and variables.
[jira] [Created] (SPARK-14175) Simplify whole stage codegen interface
Davies Liu created SPARK-14175: -- Summary: Simplify whole stage codegen interface Key: SPARK-14175 URL: https://issues.apache.org/jira/browse/SPARK-14175 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu 1. remove consumeChild 2. always create code for UnsafeRow and variables.
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212776#comment-15212776 ] zhengruifeng commented on SPARK-14174: -- There is another sklearn example for MiniBatch KMeans: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212772#comment-15212772 ] Apache Spark commented on SPARK-14174: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/11974

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Assigned] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14174: Assignee: (was: Apache Spark)

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Assigned] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14174: Assignee: Apache Spark

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Assignee: Apache Spark
> Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Created] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
zhengruifeng created SPARK-14174: Summary: Accelerate KMeans via Mini-Batch EM Key: SPARK-14174 URL: https://issues.apache.org/jira/browse/SPARK-14174 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Minor

The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

I have implemented mini-batch k-means in MLlib, and the acceleration is really significant. The MiniBatch KMeans is named XMeans in the following lines.

{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}

All three training runs reached the maximum number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly:

KMeans: 2876 sec
MiniBatch KMeans (fraction=0.1): 263 sec
MiniBatch KMeans (fraction=0.01): 90 sec

With an appropriate fraction, the bigger the dataset, the higher the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
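For readers unfamiliar with the algorithm, the mini-batch update rule can be sketched in a few lines of plain Python. This is illustrative only and unrelated to the MLlib implementation discussed above; the function and parameter names are invented for the example, and it uses the classic online update with a per-center decaying step size.

```python
import random

def mini_batch_kmeans(points, k, batch_size, iterations, seed=123):
    """Mini-batch k-means over 2-D points; returns the list of k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    counts = [0] * k  # per-center assignment counts drive the learning rate

    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            # Assign the point to its nearest center (squared Euclidean).
            j = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                            + (p[1] - centers[i][1]) ** 2)
            counts[j] += 1
            eta = 1.0 / counts[j]  # decaying step size, as in Sculley's variant
            cx, cy = centers[j]
            # Move the center a little toward the point.
            centers[j] = (cx + eta * (p[0] - cx), cy + eta * (p[1] - cy))
    return centers

if __name__ == "__main__":
    # Two well-separated blobs; the centers usually land near the blob means.
    rng = random.Random(0)
    blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
    blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)]
    print(mini_batch_kmeans(blob_a + blob_b, k=2, batch_size=40, iterations=50))
```

Because each center is a running average of the points assigned to it, the updates touch only `batch_size` points per iteration instead of the whole dataset, which is exactly where the speedups reported above come from.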
[jira] [Resolved] (SPARK-14109) HDFSMetadataLog throws AbstractFileSystem exception with common schemes like s3n
[ https://issues.apache.org/jira/browse/SPARK-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-14109. --- Resolution: Fixed

> HDFSMetadataLog throws AbstractFileSystem exception with common schemes like s3n
> -
>
> Key: SPARK-14109
> URL: https://issues.apache.org/jira/browse/SPARK-14109
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Tathagata Das
> Assignee: Tathagata Das
>
> HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many schemes for which there are FileSystem implementations. In those cases, rather than failing completely, we should fall back to the FileSystem-based implementation, and log a warning that there may be file consistency issues in case the log directory is concurrently modified.
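The fix described above is an instance of a general pattern: try the preferred API, and on failure fall back to a weaker one while logging a warning about the lost guarantee. A minimal language-agnostic sketch in Python (the class and function names here are invented for illustration and are not Spark's):

```python
import logging

logger = logging.getLogger("metadata_log")

class AtomicRenamer:
    """Preferred backend: atomic rename (stands in for the FileContext API)."""
    def __init__(self, supported: bool):
        self.supported = supported

    def rename(self, src: str, dst: str) -> str:
        if not self.supported:
            # Mirrors the case where no FileContext exists for a scheme.
            raise IOError("no implementation for this scheme")
        return f"atomic:{src}->{dst}"

class BestEffortRenamer:
    """Fallback backend (stands in for the FileSystem-based implementation)."""
    def rename(self, src: str, dst: str) -> str:
        return f"best-effort:{src}->{dst}"

def rename_with_fallback(preferred: AtomicRenamer, src: str, dst: str) -> str:
    try:
        return preferred.rename(src, dst)
    except IOError:
        logger.warning(
            "Falling back to non-atomic rename; concurrent modification of "
            "the log directory may cause consistency issues.")
        return BestEffortRenamer().rename(src, dst)
```

The key design point, as in the ticket, is that the fallback degrades a guarantee (atomicity) rather than failing the job outright, and the warning makes the degraded guarantee visible to operators.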
[jira] [Commented] (SPARK-14139) Dataset loses nullability in operations with RowEncoder
[ https://issues.apache.org/jira/browse/SPARK-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212743#comment-15212743 ] koert kuipers commented on SPARK-14139: --- It is not clear to me whether the goal should remain to derive the schema from the logical plan, or to revert back to a schema from the encoder. I am going to assume a schema from the logical plan, and will try to fix nullable for the logical plan.

> Dataset loses nullability in operations with RowEncoder
> ---
>
> Key: SPARK-14139
> URL: https://issues.apache.org/jira/browse/SPARK-14139
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: koert kuipers
> Priority: Minor
>
> When I do
> {noformat}
> val df1 = sc.makeRDD(1 to 3).toDF
> val df2 = df1.map(row => Row(row(0).asInstanceOf[Int] + 1))(RowEncoder(df1.schema))
> println(s"schema before ${df1.schema} and after ${df2.schema}")
> {noformat}
> I get:
> {noformat}
> schema before StructType(StructField(value,IntegerType,false)) and after StructType(StructField(value,IntegerType,true))
> {noformat}
> The change in field nullable is unexpected and I consider it a bug.
> This bug was introduced in:
> [SPARK-13244][SQL] Migrates DataFrame to Dataset
[jira] [Updated] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
[ https://issues.apache.org/jira/browse/SPARK-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-14173: --

Description: When I submit a streaming application in *yarn cluster* mode, I can't find "spark.executor.extraJavaOptions" in the Spark UI. The log prints a warning: {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}

was: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning: {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}

> Ignoring config property “spark.executor.extraJavaOptions”
> --
>
> Key: SPARK-14173
> URL: https://issues.apache.org/jira/browse/SPARK-14173
> Project: Spark
> Issue Type: Bug
> Components: Streaming, YARN
> Affects Versions: 1.5.2
> Reporter: liyan
>
> When I submit a streaming application in *yarn cluster* mode, I can't find "spark.executor.extraJavaOptions" in the Spark UI.
> The log prints a warning:
> {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}
[jira] [Updated] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
[ https://issues.apache.org/jira/browse/SPARK-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-14173: -- Description: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning: {quote}Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote} was: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning:Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" > Ignoring config property “spark.executor.extraJavaOptions” > -- > > Key: SPARK-14173 > URL: https://issues.apache.org/jira/browse/SPARK-14173 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN >Affects Versions: 1.5.2 >Reporter: liyan > > when i submit streaming application on *yarn cluster* , i can't find > "spark.executor.extraJavaOptions" in Spark UI. > the log prints warning: > {quote}Ignoring none-spark config property : > "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"{quote}
[jira] [Updated] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
[ https://issues.apache.org/jira/browse/SPARK-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-14173: -- Description: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning:Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" was: when i submit streaming application on *yarn cluster* , i can't find "spark.executor.extraJavaOptions" in Spark UI. the log prints warning:Ignoring none-spark config property : "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" > Ignoring config property “spark.executor.extraJavaOptions” > -- > > Key: SPARK-14173 > URL: https://issues.apache.org/jira/browse/SPARK-14173 > Project: Spark > Issue Type: Bug > Components: Streaming, YARN >Affects Versions: 1.5.2 >Reporter: liyan > > when i submit streaming application on *yarn cluster* , i can't find > "spark.executor.extraJavaOptions" in Spark UI. > the log prints warning:Ignoring none-spark config property : > "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log"
[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212726#comment-15212726 ] Jack Franson commented on SPARK-4743: - Hi, I'm still running into "Task not serializable" errors with aggregateByKey when my initial value is an Avro object (I've configured the Kryo serializer to handle it via a custom KryoRegistrator). I don't see the issue when I pass in an empty instance of the Avro object (via new MyAvroObject()), as the initial value, but then I get an exception from the Kryo serializer since required fields are null. To get around that, I tried creating an initial value with all the required fields set to defaults, but then I was hit with a java.io.NotSerializableException causing the "Task not serializable" exception to fail the job, which seems to indicate that Java serialization is taking over again. This is on Spark 1.5.2 with Avro 1.7.7. The line throwing the fatal Exception is org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304). This is the closest ticket I could find around this issue, so I'm wondering if there are further tweaks that the Spark libraries can make to use the SparkEnv.serializer, or if the problem is on my end (any tips in that case would be much appreciated!). Thanks for your help. > Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and > foldByKey > > > Key: SPARK-4743 > URL: https://issues.apache.org/jira/browse/SPARK-4743 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ivan Vergiliev >Assignee: Ivan Vergiliev > Labels: performance > Fix For: 1.3.0 > > > AggregateByKey and foldByKey in PairRDDFunctions both use the closure > serializer to serialize and deserialize the initial value. This means that > the Java serializer is always used, which can be very expensive if there's a > large number of groups. 
Calling combineByKey manually and using the normal > serializer instead of the closure one improved the performance on the dataset > I'm testing with by about 30-35%. > I'm not familiar enough with the codebase to be certain that replacing the > serializer here is OK, but it works correctly in my tests, and it's only > serializing a single value of type U, which should be serializable by the > default one since it can be the output of a job. Let me know if I'm missing > anything. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
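The combineByKey workaround mentioned in the comment above can be sketched as follows. This is a hedged illustration, not the reporter's actual code: `MyKryoRegistrator` is a hypothetical registrator name, and the point is only that `combineByKey` builds its zero value inside `createCombiner` on the executor, so the initial value is never shipped through the closure serializer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

object CombineByKeyWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("combineByKey-workaround")
      .set("spark.serializer", classOf[KryoSerializer].getName)
      // .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // hypothetical
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Equivalent of aggregateByKey(zero)(seqOp, combOp), except the zero value
    // is created per-key on the executor instead of being serialized from the
    // driver with the closure serializer.
    val aggregated = pairs.combineByKey(
      (v: Int) => Seq(v),                  // createCombiner: build the "zero" here
      (acc: Seq[Int], v: Int) => acc :+ v, // mergeValue
      (a: Seq[Int], b: Seq[Int]) => a ++ b // mergeCombiners
    )
    aggregated.collect().foreach(println)
    sc.stop()
  }
}
```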
[jira] [Created] (SPARK-14173) Ignoring config property “spark.executor.extraJavaOptions”
liyan created SPARK-14173: - Summary: Ignoring config property “spark.executor.extraJavaOptions” Key: SPARK-14173 URL: https://issues.apache.org/jira/browse/SPARK-14173 Project: Spark Issue Type: Bug Components: Streaming, YARN Affects Versions: 1.5.2 Reporter: liyan When I submit a streaming application in *yarn-cluster* mode, I can't find "spark.executor.extraJavaOptions" in the Spark UI. The log prints the warning: Ignoring non-spark config property: "spark.executor.extraJavaOptions= -Xloggc:/home/streaming/gc.log" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
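The quoted warning prints the whole string "spark.executor.extraJavaOptions= -Xloggc:..." as a single property name, which suggests the key and value were fused somewhere (for example, a malformed spark-defaults.conf line). A minimal sketch of setting the option as a proper key/value pair, assuming the GC log path from the report:

```scala
import org.apache.spark.SparkConf

// Key and value passed separately, so Spark recognizes the property.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.executor.extraJavaOptions", "-Xloggc:/home/streaming/gc.log")

// The spark-submit equivalent is:
//   spark-submit --conf "spark.executor.extraJavaOptions=-Xloggc:/home/streaming/gc.log" ...
```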
[jira] [Updated] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yingji Zhang updated SPARK-14172: - Description: When the hive sql contains nondeterministic fields, spark plan will not push down the partition predicate to the HiveTableScan. For example: {code} -- consider following query which uses a random function to sample rows SELECT * FROM table_a WHERE partition_col = 'some_value' AND rand() < 0.01; {code} The spark plan will not push down the partition predicate to HiveTableScan which ends up scanning all partitions data from the table. was: When the hive sql contains nondeterministic fields, spark plan will not push down the partition predicate to the HiveTableScan. For example: -- consider following query which uses a random function to sample rows SELECT * FROM table_a WHERE partition_col = 'some_value' AND rand() < 0.01; The spark plan will not push down the partition predicate to HiveTableScan which ends up scanning all partitions data from the table. > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the hive sql contains nondeterministic fields, spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider following query which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The spark plan will not push down the partition predicate to HiveTableScan > which ends up scanning all partitions data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
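Until the pushdown itself is fixed, one hedged workaround sketch (assuming a {{SQLContext}} in scope as {{sqlContext}}) is to keep the partition predicate deterministic and on its own, so it can reach HiveTableScan, and to sample separately rather than putting {{rand()}} into the WHERE clause:

```scala
// Only the deterministic partition predicate goes into the query,
// so partition pruning is not blocked by a nondeterministic filter.
val partitioned = sqlContext.sql(
  "SELECT * FROM table_a WHERE partition_col = 'some_value'")

// DataFrame.sample draws an approximate 1% sample over the already-pruned rows.
val sampled = partitioned.sample(withReplacement = false, fraction = 0.01)
```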
[jira] [Updated] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianfeng Hu updated SPARK-14171: Description: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! {code}--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM src LIMIT 1"), sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) + +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct key) FROM src LIMIT 1"), + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) } test("UDFIntegerToString") { {code} When running the test suite, we can see this error: {code} - Generic UDAF aggregates *** FAILED *** org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) at 
org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ... Cause: java.lang.reflect.InvocationTargetException: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second argument must be a constant, but double was passed instead. 
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ... {code} was: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! {{--- a/sql/hive/src/test/scala/org/ap
[jira] [Updated] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianfeng Hu updated SPARK-14171: Description: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! {{--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM src LIMIT 1"), sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) + +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct key) FROM src LIMIT 1"), + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) } test("UDFIntegerToString") { }} When running the test suite, we can see this error: {{ - Generic UDAF aggregates *** FAILED *** org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) at 
org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ... Cause: java.lang.reflect.InvocationTargetException: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second argument must be a constant, but double was passed instead. 
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ... }} was: For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! ```--- a/sql/hive/src/test/scala/org/apache/spark/sql/
[jira] [Created] (SPARK-14172) Hive table partition predicate not passed down correctly
Yingji Zhang created SPARK-14172: Summary: Hive table partition predicate not passed down correctly Key: SPARK-14172 URL: https://issues.apache.org/jira/browse/SPARK-14172 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Yingji Zhang Priority: Critical When the hive sql contains nondeterministic fields, spark plan will not push down the partition predicate to the HiveTableScan. For example: -- consider following query which uses a random function to sample rows SELECT * FROM table_a WHERE partition_col = 'some_value' AND rand() < 0.01; The spark plan will not push down the partition predicate to HiveTableScan which ends up scanning all partitions data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
Jianfeng Hu created SPARK-14171: --- Summary: UDAF aggregates argument object inspector not parsed correctly Key: SPARK-14171 URL: https://issues.apache.org/jira/browse/SPARK-14171 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Jianfeng Hu Priority: Blocker For example, when using percentile_approx and count distinct together, it raises an error complaining the argument is not constant. We have a test case to reproduce. Could you help look into a fix of this? This was working in previous version (Spark 1.4 + Hive 0.13). Thanks! ```--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM src LIMIT 1"), sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) + +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct key) FROM src LIMIT 1"), + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) } test("UDFIntegerToString") {``` When running the test suite, we can see this error: ``` - Generic UDAF aggregates *** FAILED *** org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) at 
org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) at org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ... Cause: java.lang.reflect.InvocationTargetException: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second argument must be a constant, but double was passed instead. 
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) at org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606) at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ... ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
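As a hedged workaround sketch for the failure above (assuming a {{SQLContext}} in scope as {{sqlContext}}), computing the Hive UDAF and the distinct aggregate in separate queries keeps DistinctAggregationRewriter away from {{percentile_approx}}, whose second argument must remain a constant:

```scala
// Each query has a single aggregate, so no distinct-aggregation rewrite
// touches the percentile_approx argument.
val p = sqlContext.sql("SELECT percentile_approx(key, 0.9) AS p90 FROM src")
val d = sqlContext.sql("SELECT count(DISTINCT key) AS distinct_keys FROM src")

// Both results are single rows; a cross join recombines them into one row.
val combined = p.join(d)
```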
[jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212671#comment-15212671 ] Reynold Xin commented on SPARK-12436: - I don't have everything paged in, but why isn't an empty string just a string type? > If all values of a JSON field is null, JSON's inferSchema should return > NullType instead of StringType > -- > > Key: SPARK-12436 > URL: https://issues.apache.org/jira/browse/SPARK-12436 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > Labels: starter > > Right now, JSON's inferSchema will return {{StringType}} for a field that > always has null values or an {{ArrayType(StringType)}} for a field that > always has empty array values. Although this behavior makes writing JSON data > to other data sources easy (i.e. when writing data, we do not need to remove > those {{NullType}} or {{ArrayType(NullType)}} columns), it makes downstream > applications hard-pressed to reason about the actual schema of the data and thus makes > schema merging hard. We should allow JSON's inferSchema to return {{NullType}} > and {{ArrayType(NullType)}}. Also, we need to make sure that when we write > data out, we should remove those {{NullType}} or {{ArrayType(NullType)}} > columns first. > Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same > thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). > To finish this work, we need to finish the following sub-tasks: > * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. > * Determine whether we need to add the operation of removing {{NullType}} and > {{ArrayType(NullType)}} columns from the data that will be written out for all > data sources (i.e. data sources based on our data source API and Hive tables). > Or, we should just add this operation for certain data sources (e.g. > Parquet). 
For example, we may not need this operation for Hive because Hive > has VoidObjectInspector. > * Implement the change and get it merged to Spark master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
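The write-side cleanup proposed above can be sketched as follows. This is a hedged illustration, not the eventual implementation: the input path is hypothetical, only top-level columns are pruned, and a {{SQLContext}} is assumed in scope as {{sqlContext}}.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, NullType}

// A column whose inferred type is NullType (or ArrayType(NullType))
// carries no recoverable values and can be dropped before writing.
def prunableType(dt: DataType): Boolean = dt match {
  case NullType               => true
  case ArrayType(NullType, _) => true
  case _                      => false
}

val df = sqlContext.read.json("events.json") // hypothetical path
val kept = df.schema.fields.collect {
  case f if !prunableType(f.dataType) => f.name
}
val pruned = df.select(kept.map(name => df.col(name)): _*)
```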
[jira] [Commented] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
[ https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212670#comment-15212670 ] Reynold Xin commented on SPARK-1153: [~ntietz] changing this will very likely make performance regress for long ids, due to the lack of specialization. You might want to look into graphframes for more general graph functionality too: https://github.com/graphframes/graphframes > Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs. > -- > > Key: SPARK-1153 > URL: https://issues.apache.org/jira/browse/SPARK-1153 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 0.9.0 >Reporter: Deepak Nulu > > Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be > able to use {{UUID}} as the vertex ID type because the data I want to process > with GraphX uses that type for its primary keys. Others might have a different > type for their primary keys. Generalizing {{VertexId}} (with a type class) > will help in such cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
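Since {{VertexId}} stays {{Long}} for specialization, one hedged workaround sketch (assuming a {{SparkContext}} in scope as {{sc}}) is to assign each UUID a unique Long and keep the mapping for translating results back:

```scala
import java.util.UUID
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

val uuids: RDD[UUID] = sc.parallelize(Seq(UUID.randomUUID(), UUID.randomUUID()))

// zipWithUniqueId assigns a distinct Long per element without a shuffle;
// the resulting pairs serve as a UUID -> VertexId lookup.
val uuidToId: RDD[(UUID, VertexId)] = uuids.zipWithUniqueId()

// Vertex attributes carry the original UUID, so results can be mapped back.
val vertices: RDD[(VertexId, UUID)] = uuidToId.map(_.swap)
// Edges expressed over UUIDs would be joined through uuidToId the same way
// before constructing Graph(vertices, edges).
```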
[jira] [Resolved] (SPARK-14073) Move streaming-flume back to Spark
[ https://issues.apache.org/jira/browse/SPARK-14073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14073. - Resolution: Fixed Fix Version/s: 2.0.0 > Move streaming-flume back to Spark > -- > > Key: SPARK-14073 > URL: https://issues.apache.org/jira/browse/SPARK-14073 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14170) Remove the PR template before pushing changes
[ https://issues.apache.org/jira/browse/SPARK-14170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14170: Assignee: Apache Spark > Remove the PR template before pushing changes > - > > Key: SPARK-14170 > URL: https://issues.apache.org/jira/browse/SPARK-14170 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > As (briefly) discussed in the mailing list, it would be nice to not include > the PR template in every commit message (when people forget to delete it). > This can be done by making some small changes to the template, and update the > merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14170) Remove the PR template before pushing changes
[ https://issues.apache.org/jira/browse/SPARK-14170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212653#comment-15212653 ] Apache Spark commented on SPARK-14170: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11973 > Remove the PR template before pushing changes > - > > Key: SPARK-14170 > URL: https://issues.apache.org/jira/browse/SPARK-14170 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > As (briefly) discussed in the mailing list, it would be nice to not include > the PR template in every commit message (when people forget to delete it). > This can be done by making some small changes to the template, and update the > merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14170) Remove the PR template before pushing changes
[ https://issues.apache.org/jira/browse/SPARK-14170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14170: Assignee: (was: Apache Spark) > Remove the PR template before pushing changes > - > > Key: SPARK-14170 > URL: https://issues.apache.org/jira/browse/SPARK-14170 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > As (briefly) discussed in the mailing list, it would be nice to not include > the PR template in every commit message (when people forget to delete it). > This can be done by making some small changes to the template, and update the > merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14170) Remove the PR template before pushing changes
Marcelo Vanzin created SPARK-14170: -- Summary: Remove the PR template before pushing changes Key: SPARK-14170 URL: https://issues.apache.org/jira/browse/SPARK-14170 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Minor As (briefly) discussed in the mailing list, it would be nice to not include the PR template in every commit message (when people forget to delete it). This can be done by making some small changes to the template, and update the merge script used by committers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14013) Properly implement temporary functions in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14013: Assignee: (was: Apache Spark) > Properly implement temporary functions in SessionCatalog > > > Key: SPARK-14013 > URL: https://issues.apache.org/jira/browse/SPARK-14013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or > > Right now `SessionCatalog` just contains `CatalogFunction`, which is > metadata. In the future the catalog should probably take in a function > registry or something. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14013) Properly implement temporary functions in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212634#comment-15212634 ] Apache Spark commented on SPARK-14013: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11972 > Properly implement temporary functions in SessionCatalog > > > Key: SPARK-14013 > URL: https://issues.apache.org/jira/browse/SPARK-14013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or > > Right now `SessionCatalog` just contains `CatalogFunction`, which is > metadata. In the future the catalog should probably take in a function > registry or something. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14013) Properly implement temporary functions in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14013: Assignee: Apache Spark > Properly implement temporary functions in SessionCatalog > > > Key: SPARK-14013 > URL: https://issues.apache.org/jira/browse/SPARK-14013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > > Right now `SessionCatalog` just contains `CatalogFunction`, which is > metadata. In the future the catalog should probably take in a function > registry or something. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13955: Assignee: Apache Spark > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Apache Spark > > I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly > jar is not uploaded to HDFS. This may be a known issue from the SPARK-11157 work; > creating this ticket to track it. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
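Per the warning quoted above, a hedged configuration sketch: pointing {{spark.yarn.jars}} (or {{spark.yarn.archive}}) at a pre-staged HDFS location avoids the fallback that uploads everything under SPARK_HOME, which here sweeps in unrelated jars like apache-rat. The HDFS paths below are hypothetical.

```scala
import org.apache.spark.SparkConf

// Jars staged once on HDFS; executors fetch them from there instead of
// having the client upload the local SPARK_HOME contents per application.
val conf = new SparkConf()
  .set("spark.yarn.jars", "hdfs://localhost:9000/spark/jars/*")
// or, as a single archive:
//   .set("spark.yarn.archive", "hdfs://localhost:9000/spark/spark-libs.zip")
```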
[jira] [Commented] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212609#comment-15212609 ] Apache Spark commented on SPARK-13955: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11970 > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly > jar is not uploaded to HDFS. This may be a known issue from the work in progress on > SPARK-11157; creating this ticket to track it. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13955: Assignee: (was: Apache Spark) > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly > jar is not uploaded to HDFS. This may be a known issue from the work in progress on > SPARK-11157; creating this ticket to track it. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14169) Add UninterruptibleThread
Shixiong Zhu created SPARK-14169: Summary: Add UninterruptibleThread Key: SPARK-14169 URL: https://issues.apache.org/jira/browse/SPARK-14169 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14169) Add UninterruptibleThread
[ https://issues.apache.org/jira/browse/SPARK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212608#comment-15212608 ] Apache Spark commented on SPARK-14169: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11971 > Add UninterruptibleThread > - > > Key: SPARK-14169 > URL: https://issues.apache.org/jira/browse/SPARK-14169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Extract the workaround for HADOOP-10622 introduced by #11940 into > UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14169) Add UninterruptibleThread
[ https://issues.apache.org/jira/browse/SPARK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14169: Assignee: Apache Spark (was: Shixiong Zhu) > Add UninterruptibleThread > - > > Key: SPARK-14169 > URL: https://issues.apache.org/jira/browse/SPARK-14169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Extract the workaround for HADOOP-10622 introduced by #11940 into > UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14169) Add UninterruptibleThread
[ https://issues.apache.org/jira/browse/SPARK-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14169: Assignee: Shixiong Zhu (was: Apache Spark) > Add UninterruptibleThread > - > > Key: SPARK-14169 > URL: https://issues.apache.org/jira/browse/SPARK-14169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Extract the workaround for HADOOP-10622 introduced by #11940 into > UninterruptibleThread so that we can test and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212590#comment-15212590 ] holdenk commented on SPARK-14141: - So with RDDs there is `toLocalIterator` which you could use to do this (although you should make sure your input is cached first). > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212587#comment-15212587 ] holdenk commented on SPARK-14141: - The more I look at this, the more I think it's not a good fit for Spark. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212563#comment-15212563 ] Luke Miner commented on SPARK-14141: Is there any way to do this process in chunks: read a chunk of data into a dict and then append to a pandas dataframe with the pre-specified datatypes? The big advantage of a pandas dataframe with categorical datatypes is that it can potentially have a much much smaller memory footprint. However, if everything is loaded into a huge dict beforehand, there's much less of an upside. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
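The memory-footprint claim in this thread is easy to check in plain pandas, with no Spark involved. A minimal sketch comparing an object-dtype column against the same low-cardinality data stored as `category`:

```python
import pandas as pd

# A low-cardinality string column: the kind of data where 'category' pays off.
values = ["red", "green", "blue"] * 100_000

as_object = pd.Series(values, dtype="object")
as_category = pd.Series(values, dtype="category")

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

# The categorical encoding stores each distinct string once plus small
# integer codes per row, so it is far smaller than the object column,
# which holds a full Python string object for every row.
print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
assert cat_bytes < obj_bytes
```

This is why building the frame with the final dtypes up front (rather than materializing everything as objects first) matters for the chunked approach discussed above: the peak memory is dominated by whatever intermediate representation exists before the cast.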
[jira] [Updated] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14159: -- Target Version/s: 1.6.2, 2.0.0 (was: 2.0.0) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-14159: --- > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Joseph K. Bradley (was: Apache Spark) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Apache Spark (was: Joseph K. Bradley) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14159. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11965 [https://github.com/apache/spark/pull/11965] > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212548#comment-15212548 ] holdenk commented on SPARK-14141: - So following up, `from_records` doesn't take dtypes, although we could convert to a dict-like structure as an intermediate step and use `from_dict` if the user specified specific types. It's less clear to me that this is what we want to do, but I'll make a WIP PR so we can take a look and see if it looks like a reasonable change or something we'd rather not expose. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
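One way such an API could work internally is a sketch like the following. The helper name and signature here are hypothetical illustrations of the proposed {{toPandas(dtypes=...)}} behaviour, not actual Spark code: build the frame from records first, then cast the requested columns with `astype`:

```python
import pandas as pd

def to_pandas_with_dtypes(rows, columns, dtypes=None):
    """Hypothetical helper mimicking the proposed toPandas(dtypes=...) API:
    build the frame from record tuples, then cast the requested columns."""
    pdf = pd.DataFrame.from_records(rows, columns=columns)
    if dtypes:
        # astype accepts a {column: dtype} mapping, including 'category'.
        pdf = pdf.astype(dtypes)
    return pdf

rows = [(1.0, True, "a"), (2.5, False, "b"), (3.0, True, "a")]
pdf = to_pandas_with_dtypes(
    rows,
    columns=["a", "c", "d"],
    dtypes={"a": "float64", "c": "bool", "d": "category"},
)
print(pdf.dtypes)
```

Note the caveat raised in this thread still applies: the records exist as plain Python objects before the cast, so the intermediate representation, not the final frame, sets the peak memory.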
[jira] [Assigned] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14168: Assignee: Imran Rashid (was: Apache Spark) > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and logs an error when it wasn't. However, > it turns out its OK for there to be memory that wasn't released when an > Iterator isn't read to completion, eg., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if tasks fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. Eg., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212501#comment-15212501 ] Apache Spark commented on SPARK-14168: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/11969 > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and logs an error when it wasn't. However, > it turns out its OK for there to be memory that wasn't released when an > Iterator isn't read to completion, eg., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if tasks fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. Eg., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212497#comment-15212497 ] Imran Rashid edited comment on SPARK-11293 at 3/25/16 10:27 PM: I've seen a few people misled by the error msg, so I'd like to try to downgrade that. I've created a separate ticket, [SPARK-14168 | https://issues.apache.org/jira/browse/SPARK-14168], just for changing the msg, in case there is some fix in store here. was (Author: irashid): I've seen a few people misled by the error msg, so I'd like to try to downgrade that. I've created a separate ticket just for changing the msg, in case there is some fix in store here. > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14168: Assignee: Apache Spark (was: Imran Rashid) > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and logs an error when it wasn't. However, > it turns out its OK for there to be memory that wasn't released when an > Iterator isn't read to completion, eg., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if tasks fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. Eg., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212497#comment-15212497 ] Imran Rashid commented on SPARK-11293: -- I've seen a few people misled by the error msg, so I'd like to try to downgrade that. I've created a separate ticket just for changing the msg, in case there is some fix in store here. > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
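The {{CompletionIterator}} mentioned above is a Scala utility inside Spark; the following is only an illustrative Python analogue of the pattern it implements: wrap an iterator and fire a cleanup callback exactly once, as soon as the underlying iterator is exhausted.

```python
class CompletionIterator:
    """Sketch of the CompletionIterator pattern: run a cleanup callback
    exactly once when the wrapped iterator is fully consumed."""

    def __init__(self, it, on_complete):
        self._it = iter(it)
        self._on_complete = on_complete
        self._done = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            # Fire the cleanup (e.g. releasing sorter memory) only once.
            if not self._done:
                self._done = True
                self._on_complete()
            raise

released = []
it = CompletionIterator(range(3), lambda: released.append("freed"))
assert list(it) == [0, 1, 2]   # consuming to the end triggers cleanup
assert released == ["freed"]   # and it runs exactly once
```

Note the flip side of this design: if the consumer abandons the iterator early, as {{take()}} does, the callback never runs. That unconsumed-iterator case is exactly the benign "leak" that the managed-memory warning discussion in SPARK-14168 is about.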
[jira] [Created] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
Imran Rashid created SPARK-14168: Summary: Managed Memory Leak Msg Should Only Be a Warning Key: SPARK-14168 URL: https://issues.apache.org/jira/browse/SPARK-14168 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor When a task is completed, executors check to see if all managed memory for the task was correctly released, and log an error when it wasn't. However, it turns out it's OK for there to be memory that wasn't released when an Iterator isn't read to completion, e.g., with {{rdd.take()}}. This results in a scary error msg in the executor logs: {noformat} 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = 16259594 bytes, TID = 24 {noformat} Furthermore, if a task fails for any reason, this msg is also triggered. This can lead users to believe that the failure was from the memory leak, when the root cause could be entirely different. E.g., the same error msg appears in executor logs with this clearly broken user code run with {{spark-shell --master 'local-cluster[2,2,1024]'}} {code} sc.parallelize(0 to 1000, 2).map(x => x % 1 -> x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") }.collect {code} We should downgrade the msg to a warning and link to a more detailed explanation. See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from users (and perhaps a true fix) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14091) Improve performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14091. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11911 [https://github.com/apache/spark/pull/11911] > Improve performance of SparkContext.getCallSite() > - > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 2.0.0 > > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated eagerly, causing thread dumps to be computed anyway. This has an > impact when lots of RDDs are created (e.g., close to 3-7 seconds is spent > when 1000+ RDDs are present, which is significant when the entire query > runtime is on the order of 10-20 seconds). > Creating this jira to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
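The fix described in this ticket, evaluating {{Utils.getCallSite()}} only when no local property already supplies the call site, is ordinary laziness. A small Python sketch of the idea (function names here are illustrative stand-ins, not Spark's actual code):

```python
def expensive_call_site():
    # Stand-in for Utils.getCallSite(): pretend walking the stack is costly.
    # A counter tracks how many times the expensive path actually runs.
    expensive_call_site.calls += 1
    return "fn at example.py:1"

expensive_call_site.calls = 0

def get_call_site(local_override=None, compute=expensive_call_site):
    # The original bug: compute() was invoked eagerly even when
    # local_override (the SHORT_FORM/LONG_FORM local property) already
    # supplied the value. The lazy version below only falls back to the
    # expensive computation when no override is set.
    return local_override if local_override is not None else compute()

# With a dummy call site installed, the expensive path is never taken.
assert get_call_site(local_override="dummy") == "dummy"
assert expensive_call_site.calls == 0

# Without an override, it is computed once, on demand.
assert get_call_site() == "fn at example.py:1"
assert expensive_call_site.calls == 1
```

Avoiding the eager computation is what removes the multi-second overhead when thousands of RDDs are created under {{withDummyCallSite}}.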
[jira] [Updated] (SPARK-14091) Improve performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14091: --- Summary: Improve performance of SparkContext.getCallSite() (was: Consider improving performance of SparkContext.getCallSite()) > Improve performance of SparkContext.getCallSite() > - > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated earlier, causing thread dumps to be computed. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can have a significant impact when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14091: --- Assignee: Rajesh Balamohan > Consider improving performance of SparkContext.getCallSite() > > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated earlier, causing thread dumps to be computed. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can have a significant impact when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-14167. --- Resolution: Won't Fix I am closing this issue based on [~joshrosen]'s comment: "I'm actually wary of this change. There's a number of times where I've found redundant returns but have chosen not to remove them because I was afraid of future code refactorings or changes accidentally making the implicit return no longer take effect. If this isn't strictly necessary, I'd prefer to not do this cleanup." > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs
[ https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212442#comment-15212442 ] Gayathri Murali commented on SPARK-13783: - Thanks [~josephkb]. I can go first, as I am almost done making changes. I could definitely review [~yanboliang]'s code and would really appreciate the same help. > Model export/import for spark.ml: GBTs > -- > > Key: SPARK-13783 > URL: https://issues.apache.org/jira/browse/SPARK-13783 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both GBTClassifier and GBTRegressor. The implementation > should reuse the one for DecisionTree*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212441#comment-15212441 ] Apache Spark commented on SPARK-14167: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11968 > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14167: Assignee: Apache Spark > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14167) Remove redundant `return` in Scala code
[ https://issues.apache.org/jira/browse/SPARK-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14167: Assignee: (was: Apache Spark) > Remove redundant `return` in Scala code > --- > > Key: SPARK-14167 > URL: https://issues.apache.org/jira/browse/SPARK-14167 > Project: Spark > Issue Type: Task >Reporter: Dongjoon Hyun >Priority: Trivial > > Spark Scala code takes advantage of the `return` statement for control flow in > many cases, but that does not mean *redundant* `return` statements are needed. > This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212438#comment-15212438 ] holdenk commented on SPARK-13842: - This makes some additional sense when we consider that `StructType` in Scala has an `apply` function which (when given a single column name) returns the corresponding `StructField` - so part of this could be viewed as API parity. > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
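The proposal above can be exercised without Spark. Below is a minimal pure-Python stand-in (not the real `pyspark.sql.types` classes) showing the suggested `__iter__` and `__getitem__` behavior, including the KeyError on a missing field name:

```python
class StructField:
    """Tiny stand-in for pyspark.sql.types.StructField."""
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class StructType:
    """Tiny stand-in for pyspark.sql.types.StructType illustrating the
    proposed dunder methods; attribute names mirror the real class."""
    def __init__(self, fields):
        self.fields = fields
        self.names = [f.name for f in fields]

    def __iter__(self):
        """Iterate the fields upon request."""
        return iter(self.fields)

    def __getitem__(self, key):
        """Return the corresponding StructField by name."""
        fields_dict = dict(zip(self.names, self.fields))
        try:
            return fields_dict[key]
        except KeyError:
            raise KeyError('No field named {}'.format(key))
```

This mirrors the Scala `StructType.apply(fieldName)` parity argument: `schema["abc"]` returns the matching field, and iteration yields fields in declaration order.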
[jira] [Created] (SPARK-14167) Remove redundant `return` in Scala code
Dongjoon Hyun created SPARK-14167: - Summary: Remove redundant `return` in Scala code Key: SPARK-14167 URL: https://issues.apache.org/jira/browse/SPARK-14167 Project: Spark Issue Type: Task Reporter: Dongjoon Hyun Priority: Trivial Spark Scala code takes advantage of the `return` statement for control flow in many cases, but that does not mean *redundant* `return` statements are needed. This issue tries to remove redundant `return` statements in Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs
[ https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212428#comment-15212428 ] Joseph K. Bradley commented on SPARK-13783: --- I'd prefer what [~GayathriMurali] mentioned; that's what is done in spark.mllib. That should be more efficient (taking more advantage of columnar storage). I do want us to save Params for each tree since that will be more robust to future code changes (rather than re-creating them based on the GBT params). However, that may require some code refactoring so that the GBT can get a set of {{jsonParams}} for each tree. Given that, the GBT could store that JSON in another DataFrame. How does that sound? It may make sense to implement export/import for one ensemble model before the other since both might require changes to the single-tree save/load. Would you mind helping to review each other's work? Who would prefer to go first? Thanks! > Model export/import for spark.ml: GBTs > -- > > Key: SPARK-13783 > URL: https://issues.apache.org/jira/browse/SPARK-13783 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both GBTClassifier and GBTRegressor. The implementation > should reuse the one for DecisionTree*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
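A rough sketch of the per-tree {{jsonParams}} idea discussed above, assuming each tree's Params can be rendered as a JSON object. The function names are hypothetical; real code would store these rows in a DataFrame next to the node data rather than in a Python list:

```python
import json

def trees_to_rows(tree_params):
    """Serialize each tree's params to one (treeID, jsonParams) row,
    so the whole ensemble's metadata fits in a single table."""
    return [(tree_id, json.dumps(params, sort_keys=True))
            for tree_id, params in enumerate(tree_params)]

def rows_to_trees(rows):
    """Reconstruct per-tree params, restoring tree order by treeID."""
    return [json.loads(p) for _, p in sorted(rows)]
```

Round-tripping through this format is lossless for JSON-representable params, which is the robustness property the comment is after: trees are restored from their own saved Params rather than re-derived from the GBT's params.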
[jira] [Issue Comment Deleted] (SPARK-6725) Model export/import for Pipeline API (Scala)
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Comment: was deleted (was: Ping! Is anyone interested in picking up the GBT or RandomForest issues to get them into 2.0?) > Model export/import for Pipeline API (Scala) > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212410#comment-15212410 ] holdenk commented on SPARK-14141: - I can take a crack at this, seems pretty reasonable & small. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
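What the proposed {{dtypes}} mapping could do can be sketched with plain columnar data. This stand-in applies per-column converters; a real implementation would call pandas' `astype` on each Series instead, and the function name and supported dtype strings here are invented for illustration:

```python
def apply_dtypes(columns, dtypes):
    """Convert each named column according to a dtype mapping.

    columns: dict of column name -> list of values
    dtypes:  dict of column name -> dtype string (subset supported here)
    Columns without a requested dtype are passed through unchanged.
    """
    converters = {'float64': float, 'int64': int, 'bool': bool, 'str': str}
    out = {}
    for name, values in columns.items():
        conv = converters.get(dtypes.get(name))
        out[name] = [conv(v) for v in values] if conv else list(values)
    return out
```

The memory argument in the request would come from pandas itself (e.g. `category` columns storing codes instead of objects); this sketch only shows where the dtype mapping would plug in.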
[jira] [Updated] (SPARK-13783) Model export/import for spark.ml: GBTs
[ https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13783: -- Target Version/s: 2.0.0 > Model export/import for spark.ml: GBTs > -- > > Key: SPARK-13783 > URL: https://issues.apache.org/jira/browse/SPARK-13783 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both GBTClassifier and GBTRegressor. The implementation > should reuse the one for DecisionTree*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13784) Model export/import for spark.ml: RandomForests
[ https://issues.apache.org/jira/browse/SPARK-13784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13784: -- Target Version/s: 2.0.0 > Model export/import for spark.ml: RandomForests > --- > > Key: SPARK-13784 > URL: https://issues.apache.org/jira/browse/SPARK-13784 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both RandomForestClassifier and RandomForestRegressor. The > implementation should reuse the one for DecisionTree*. > It should augment NodeData with a tree ID so that all nodes can be stored in > a single DataFrame. It should reconstruct the trees in a distributed fashion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
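The NodeData-with-tree-ID layout described above can be illustrated without Spark. A sketch under assumed, hypothetical names (the real version would store the rows in a DataFrame and rebuild trees in a distributed fashion):

```python
from collections import defaultdict

def flatten_forest(forest):
    """Augment each node record with its tree ID so all nodes of all
    trees fit in one table: one (treeID, *node) row per node."""
    rows = []
    for tree_id, nodes in enumerate(forest):
        for node in nodes:
            rows.append((tree_id,) + node)
    return rows

def rebuild_forest(rows):
    """Group rows by tree ID and restore the per-tree node lists."""
    trees = defaultdict(list)
    for tree_id, *node in rows:
        trees[tree_id].append(tuple(node))
    return [trees[i] for i in sorted(trees)]
```

Grouping by tree ID is what makes the reconstruction naturally parallel: each group is independent, so each tree can be rebuilt on a separate task.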
[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API (Scala)
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Target Version/s: 2.0.0 > Model export/import for Pipeline API (Scala) > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API (Scala)
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Summary: Model export/import for Pipeline API (Scala) (was: Model export/import for Pipeline API) > Model export/import for Pipeline API (Scala) > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212402#comment-15212402 ] Joseph K. Bradley commented on SPARK-6725: -- Ping! Is anyone interested in picking up the GBT or RandomForest issues to get them into 2.0? > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! 
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14081: Assignee: Apache Spark > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford >Assignee: Apache Spark > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
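The type-preserving behavior the report asks for can be sketched with plain columnar data. This is a hypothetical helper, not Spark's actual fix: the point is that the fill value is cast to each column's declared type, rather than the column being widened to the fill value's type.

```python
def fill_zero(columns):
    """Null-fill with zero while preserving each column's declared dtype.

    columns: dict of name -> (dtype, values). The zero used for filling
    is chosen per column, so 'float' stays 'float' instead of being
    widened to 'double'.
    """
    zero_of = {'int': 0, 'bigint': 0, 'float': 0.0, 'double': 0.0}
    return {name: (dtype, [zero_of[dtype] if v is None else v for v in vals])
            for name, (dtype, vals) in columns.items()}
```

Compare this with the reported behavior, where a single `fill(0: Double)` value drives a cast of every numeric column to double.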
[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212395#comment-15212395 ] Apache Spark commented on SPARK-14081: -- User 'traviscrawford' has created a pull request for this issue: https://github.com/apache/spark/pull/11967 > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14081: Assignee: (was: Apache Spark) > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14165) NoSuchElementException: None.get when joining DataFrames with Seq of fields of different case
[ https://issues.apache.org/jira/browse/SPARK-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14165: -- Issue Type: Bug (was: Improvement) Hm yeah this is a problem then. I'm not 100% sure which case sensitivity rules apply but seems like it's case-insensitive here. In that case whatever went to get column "abc" in the second table should succeed above. > NoSuchElementException: None.get when joining DataFrames with Seq of fields > of different case > - > > Key: SPARK-14165 > URL: https://issues.apache.org/jira/browse/SPARK-14165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> val left = Seq((1,"a")).toDF("id", "abc") > left: org.apache.spark.sql.DataFrame = [id: int, abc: string] > scala> val right = Seq((1,"a")).toDF("id", "ABC") > right: org.apache.spark.sql.DataFrame = [id: int, ABC: string] > scala> left.join(right, Seq("abc")) > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1426) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1417) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:58) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2299) > at 
org.apache.spark.sql.Dataset.join(Dataset.scala:553) > at org.apache.spark.sql.Dataset.join(Dataset.scala:526) > ... 51 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
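Until the analyzer handles this more gracefully, one way to express the intended join is to reference each DataFrame's own column explicitly instead of a shared-name {{Seq}} (a sketch against the {{left}}/{{right}} frames above; each name must match its own side exactly):

{code}
scala> // Qualify the columns so no case-insensitive shared-name lookup is needed
scala> left.join(right, left("abc") === right("ABC"))
{code}

Note this keeps both columns in the output, unlike the {{Seq}}-based using-join, so one side may need to be dropped afterwards.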
[jira] [Commented] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212369#comment-15212369 ] Apache Spark commented on SPARK-14159: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/11965 > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Joseph K. Bradley (was: Apache Spark) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly
[ https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14159: Assignee: Apache Spark (was: Joseph K. Bradley) > StringIndexerModel sets output column metadata incorrectly > -- > > Key: SPARK-14159 > URL: https://issues.apache.org/jira/browse/SPARK-14159 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > StringIndexerModel.transform sets the output column metadata to use name > inputCol. It should not. Fixing this causes a problem with the metadata > produced by RFormula. > Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and > I modified VectorAttributeRewriter to find and replace all "prefixes" since > attributes collect multiple prefixes from StringIndexer + Interaction. > Note that "prefixes" is no longer accurate since internal strings may be > replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212368#comment-15212368 ] Jacek Laskowski commented on SPARK-14108: - I'd like to see the code to show case it since: {code} scala> sqlContext.emptyDataFrame.count res93: Long = 0 {code} It's simply "can't reproduce" then. > calling count() on empty dataframe throws java.util.NoSuchElementException > -- > > Key: SPARK-14108 > URL: https://issues.apache.org/jira/browse/SPARK-14108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Tested in Hadoop 2.7.2 EMR 4.x >Reporter: Krishna Shekhram >Priority: Minor > > When calling count() on empty dataframe, then spark code still tries to > iterate through the empty iterator and throws > java.util.NoSuchElementException. > Stacktrace : > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) > at scala.collection.IterableLike$class.head(IterableLike.scala:91) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514) > Code Snippet: > This code fails > if(this.df !=null){ > long countOfRows = this.df.count(); > } > If I do this then it works > if(this.df !=null && ! 
this.df.rdd().isEmpty()){ > long countOfRows = this.df.count(); }
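As the discussion suggests, {{count()}} on an empty DataFrame is expected to return 0, so the {{isEmpty}} guard should not be necessary. A minimal defensive wrapper (a hypothetical helper, not Spark API) would only need the null check:

{code}
// Hypothetical helper: a null DataFrame reference counts as zero rows;
// an empty-but-non-null DataFrame is handled by count() itself, which returns 0.
def safeCount(df: org.apache.spark.sql.DataFrame): Long =
  Option(df).map(_.count()).getOrElse(0L)
{code}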
[jira] [Commented] (SPARK-14165) NoSuchElementException: None.get when joining DataFrames with Seq of fields of different case
[ https://issues.apache.org/jira/browse/SPARK-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212365#comment-15212365 ] Jacek Laskowski commented on SPARK-14165: - Right, but: {code} scala> left.join(right, $"abc" === $"ABC") org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: abc#378, abc#386.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:261) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:145) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$12$$anonfun$applyOrElse$6$$anonfun$24.apply(Analyzer.scala:572) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$12$$anonfun$applyOrElse$6$$anonfun$24.apply(Analyzer.scala:572) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$12$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:572) {code} It sees the two columns as the same, doesn't it? 
> NoSuchElementException: None.get when joining DataFrames with Seq of fields > of different case > - > > Key: SPARK-14165 > URL: https://issues.apache.org/jira/browse/SPARK-14165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> val left = Seq((1,"a")).toDF("id", "abc") > left: org.apache.spark.sql.DataFrame = [id: int, abc: string] > scala> val right = Seq((1,"a")).toDF("id", "ABC") > right: org.apache.spark.sql.DataFrame = [id: int, ABC: string] > scala> left.join(right, Seq("abc")) > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1426) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1417) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:58) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2299) > a
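A sketch of a disambiguating variant of the same join, using aliases so each side's column can be named without triggering the ambiguous-reference error (assuming the {{left}}/{{right}} frames from the snippet above):

{code}
scala> // Alias each side, then qualify the columns through the aliases
scala> left.as("l").join(right.as("r"), $"l.abc" === $"r.ABC")
{code}

Whether the qualified names resolve still depends on the case-sensitivity setting in play, which is exactly the ambiguity under discussion here.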
[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212364#comment-15212364 ] Xusen Yin commented on SPARK-13786: --- I'll work on it. > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is not a real need to make Python > Evaluators be MLWritable, as far as I can tell.
[jira] [Issue Comment Deleted] (SPARK-11666) Find the best `k` by cutting bisecting k-means cluster tree without recomputation
[ https://issues.apache.org/jira/browse/SPARK-11666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak KÖSE updated SPARK-11666: --- Comment: was deleted (was: Hi, can you share links for references about that?) > Find the best `k` by cutting bisecting k-means cluster tree without > recomputation > - > > Key: SPARK-11666 > URL: https://issues.apache.org/jira/browse/SPARK-11666 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > For example, scikit-learn's hierarchical clustering supports a feature to > extract a partial tree from the result. We should support a feature like that > in order to reduce compute cost.
[jira] [Created] (SPARK-14166) Add deterministic sampling like in Hive
Ruslan Dautkhanov created SPARK-14166: - Summary: Add deterministic sampling like in Hive Key: SPARK-14166 URL: https://issues.apache.org/jira/browse/SPARK-14166 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Ruslan Dautkhanov Priority: Minor It would be great to have Spark support deterministic sampling too: {quote} set hive.sample.seednumber=12345; SELECT * FROM table_a TABLESAMPLE(BUCKET 17 OUT OF 25 ON individual_id); {quote} Note that sampling is based on hash(individual_id). https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling In this case sampling is deterministic. When we have new data loads, we get very stable samples and use it all the time in Hive. The only reason for "BUCKET x OUT OF y " syntax in Hive is "If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table."
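For comparison, Spark's existing row-level sampling already accepts an explicit seed, which makes a sample repeatable across runs on the same data, though it is not hash-bucketed on a column the way Hive's {{TABLESAMPLE ... ON individual_id}} is (sketch; {{table_a}} is assumed to be a DataFrame):

{code}
// Seeded Bernoulli sampling: sample(withReplacement, fraction, seed).
// Deterministic for a fixed seed and input, but the selected rows shift
// whenever the underlying data changes.
val sampled = table_a.sample(false, 17.0 / 25.0, 12345L)
{code}

The Hive behavior requested here is stronger: the same individual_id always hashes to the same bucket, so samples stay stable under new data loads.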
[jira] [Comment Edited] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212363#comment-15212363 ] Burak KÖSE edited comment on SPARK-14108 at 3/25/16 8:38 PM: - Please give an example test case. was (Author: whisper): Please give a test case. > calling count() on empty dataframe throws java.util.NoSuchElementException > -- > > Key: SPARK-14108 > URL: https://issues.apache.org/jira/browse/SPARK-14108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Tested in Hadoop 2.7.2 EMR 4.x >Reporter: Krishna Shekhram >Priority: Minor > > When calling count() on empty dataframe, then spark code still tries to > iterate through the empty iterator and throws > java.util.NoSuchElementException. > Stacktrace : > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) > at scala.collection.IterableLike$class.head(IterableLike.scala:91) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514) > Code Snippet: > This code fails > if(this.df !=null){ > long countOfRows = this.df.count(); > } > If I do this then it works > if(this.df !=null && ! 
this.df.rdd().isEmpty()){ > long countOfRows = this.df.count(); }
[jira] [Commented] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-14108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212363#comment-15212363 ] Burak KÖSE commented on SPARK-14108: Please give a test case. > calling count() on empty dataframe throws java.util.NoSuchElementException > -- > > Key: SPARK-14108 > URL: https://issues.apache.org/jira/browse/SPARK-14108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Tested in Hadoop 2.7.2 EMR 4.x >Reporter: Krishna Shekhram >Priority: Minor > > When calling count() on empty dataframe, then spark code still tries to > iterate through the empty iterator and throws > java.util.NoSuchElementException. > Stacktrace : > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) > at scala.collection.IterableLike$class.head(IterableLike.scala:91) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515) > at > org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514) > Code Snippet: > This code fails > if(this.df !=null){ > long countOfRows = this.df.count(); > } > If I do this then it works > if(this.df !=null && ! 
this.df.rdd().isEmpty()){ > long countOfRows = this.df.count(); }
[jira] [Updated] (SPARK-14165) NoSuchElementException: None.get when joining DataFrames with Seq of fields of different case
[ https://issues.apache.org/jira/browse/SPARK-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14165: -- Issue Type: Improvement (was: Bug) It's case sensitive right? your tables don't actually both have a column "abc" that your join condition claims. Are you just looking to improve the exception? that's fine. > NoSuchElementException: None.get when joining DataFrames with Seq of fields > of different case > - > > Key: SPARK-14165 > URL: https://issues.apache.org/jira/browse/SPARK-14165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> val left = Seq((1,"a")).toDF("id", "abc") > left: org.apache.spark.sql.DataFrame = [id: int, abc: string] > scala> val right = Seq((1,"a")).toDF("id", "ABC") > right: org.apache.spark.sql.DataFrame = [id: int, ABC: string] > scala> left.join(right, Seq("abc")) > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$62.apply(Analyzer.scala:1444) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1444) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1426) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$29.applyOrElse(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1418) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1417) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:41) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:58) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2299) > at 
org.apache.spark.sql.Dataset.join(Dataset.scala:553) > at org.apache.spark.sql.Dataset.join(Dataset.scala:526) > ... 51 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14131) Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite
[ https://issues.apache.org/jira/browse/SPARK-14131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14131. -- Resolution: Fixed Fix Version/s: 2.0.0 > Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite > --- > > Key: SPARK-14131 > URL: https://issues.apache.org/jira/browse/SPARK-14131 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 > ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we > interrupt some thread running Shell.runCommand, we may hit this issue.
[jira] [Commented] (SPARK-14123) Function related commands
[ https://issues.apache.org/jira/browse/SPARK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212356#comment-15212356 ] Bo Meng commented on SPARK-14123: - I will be working on this. Thanks. > Function related commands > - > > Key: SPARK-14123 > URL: https://issues.apache.org/jira/browse/SPARK-14123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > We should support TOK_CREATEFUNCTION/TOK_DROPFUNCTION. > For now, we can throw exceptions for TOK_CREATEMACRO/TOK_DROPMACRO.
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: Please go through the current example code and list possible duplicates. 
To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > To find out all examples of ml/mllib that don't contain "example on": > {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} > Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** 
TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14164) Improve input layer validation of MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14164: -- Summary: Improve input layer validation of MultilayerPerceptronClassifier (was: Improve input layer validation to MultilayerPerceptronClassifier) > Improve input layer validation of MultilayerPerceptronClassifier > > > Key: SPARK-14164 > URL: https://issues.apache.org/jira/browse/SPARK-14164 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Dongjoon Hyun >Priority: Minor > > This issue improves the input layer validation of > MultilayerPerceptronClassifier and adds related testcases. > {code:title=MultilayerPerceptronClassifier.scala|borderStyle=solid} > -// TODO: how to check ALSO that all elements are greater than 0? > -ParamValidators.arrayLengthGt(1) > +(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1 > {code}
[jira] [Updated] (SPARK-14164) Improve input layer validation of MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-14164:
----------------------------------
Description:
This issue improves an input layer validation and adds related testcases to MultilayerPerceptronClassifier.
{code:title=MultilayerPerceptronClassifier.scala|borderStyle=solid}
-// TODO: how to check ALSO that all elements are greater than 0?
-ParamValidators.arrayLengthGt(1)
+(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
{code}

was:
This issue improves an input layer validation and related testcases to MultilayerPerceptronClassifier.
{code:title=MultilayerPerceptronClassifier.scala|borderStyle=solid}
-// TODO: how to check ALSO that all elements are greater than 0?
-ParamValidators.arrayLengthGt(1)
+(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
{code}

> Improve input layer validation of MultilayerPerceptronClassifier
> ----------------------------------------------------------------
>
> Key: SPARK-14164
> URL: https://issues.apache.org/jira/browse/SPARK-14164
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Dongjoon Hyun
> Priority: Minor
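The behavior of the new Scala validator, {{(t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1}}, can be sketched as a Python analogue: every layer size must be strictly positive, and there must be at least two layers (input and output). The function name {{valid_layers}} is hypothetical, for illustration only.

```python
def valid_layers(layers):
    """Python analogue of the updated MultilayerPerceptronClassifier
    layers validator: all sizes > 0 AND at least 2 layers.

    The old validator (ParamValidators.arrayLengthGt(1)) only checked the
    length, so e.g. [4, 0, 3] passed; the new predicate rejects it."""
    return all(n > 0 for n in layers) and len(layers) > 1
```

This makes clear what the TODO in the removed line was asking for: checking the length alone was not enough, since non-positive layer sizes slipped through.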
[jira] [Assigned] (SPARK-14164) Improve input layer validation to MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14164:
------------------------------------
Assignee: (was: Apache Spark)

> Improve input layer validation to MultilayerPerceptronClassifier
> ----------------------------------------------------------------
>
> Key: SPARK-14164
> URL: https://issues.apache.org/jira/browse/SPARK-14164
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Dongjoon Hyun
> Priority: Minor
[jira] [Assigned] (SPARK-14164) Improve input layer validation to MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14164:
------------------------------------
Assignee: Apache Spark

> Improve input layer validation to MultilayerPerceptronClassifier
> ----------------------------------------------------------------
>
> Key: SPARK-14164
> URL: https://issues.apache.org/jira/browse/SPARK-14164
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Minor