[jira] [Assigned] (SPARK-10348) Improve Spark ML user guide

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10348:


Assignee: Apache Spark  (was: Xiangrui Meng)

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components






[jira] [Commented] (SPARK-10348) Improve Spark ML user guide

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720995#comment-14720995
 ] 

Apache Spark commented on SPARK-10348:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8517

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components






[jira] [Assigned] (SPARK-10348) Improve Spark ML user guide

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10348:


Assignee: Xiangrui Meng  (was: Apache Spark)

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components






[jira] [Created] (SPARK-10348) Improve Spark ML user guide

2015-08-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10348:
-

 Summary: Improve Spark ML user guide
 Key: SPARK-10348
 URL: https://issues.apache.org/jira/browse/SPARK-10348
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


improve ml-guide:

* replace `ML Dataset` by `DataFrame` to simplify the abstraction
* remove links to Scala API doc in the main guide
* change ML algorithms to pipeline components






[jira] [Updated] (SPARK-10331) Update user guide to address minor comments during code review

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10331:
--
Priority: Major  (was: Minor)

> Update user guide to address minor comments during code review
> --
>
> Key: SPARK-10331
> URL: https://issues.apache.org/jira/browse/SPARK-10331
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Clean-up user guides to address some minor comments in:
> https://github.com/apache/spark/pull/8304
> https://github.com/apache/spark/pull/8487






[jira] [Commented] (SPARK-9042) Spark SQL incompatibility with Apache Sentry

2015-08-28 Thread Nitin Kak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720977#comment-14720977
 ] 

Nitin Kak commented on SPARK-9042:
--

I agree. This is a broader issue; it just doesn't break unless Sentry is 
installed. I'm not sure how much effort is going into HiveContext lately. 

Is there a way to highlight this issue to the people who have worked on 
HiveContext?

> Spark SQL incompatibility with Apache Sentry
> 
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the hive warehouse directory. After Sentry is installed, 
> all queries are directed to HiveServer2, which changes the invoking user to 
> "hive" and then accesses the hive warehouse directory. However, HiveContext 
> does not execute the query through HiveServer2, which leads to the issue. Here 
> is an example of executing a hive query through HiveContext:
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with hive query 






[jira] [Comment Edited] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-08-28 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720966#comment-14720966
 ] 

Luvsandondov Lkhamsuren edited comment on SPARK-9963 at 8/29/15 4:31 AM:
-

Thanks @Trevor Cai! I was kind of hoping someone would approve my direction 
(this is my first time) so that I can start implementing. Once we agree on the 
direction, I can definitely implement the part.


was (Author: lkhamsurenl):
Thanks Trevor Cai! I was kind of hoping someone would approve my direction 
(this is my first time) so that I can start implementing. Once we agree on the 
direction, I can definitely implement the part.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.






[jira] [Comment Edited] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-08-28 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720966#comment-14720966
 ] 

Luvsandondov Lkhamsuren edited comment on SPARK-9963 at 8/29/15 4:31 AM:
-

Thanks [~trevorcai]! I was kind of hoping someone would approve my direction 
(this is my first time) so that I can start implementing. Once we agree on the 
direction, I can definitely implement the part.


was (Author: lkhamsurenl):
Thanks @Trevor Cai! I was kind of hoping someone would approve my direction 
(this is my first time) so that I can start implementing. Once we agree on the 
direction, I can definitely implement the part.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.






[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-08-28 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720966#comment-14720966
 ] 

Luvsandondov Lkhamsuren commented on SPARK-9963:


Thanks Trevor Cai! I was kind of hoping someone would approve my direction 
(this is my first time) so that I can start implementing. Once we agree on the 
direction, I can definitely implement the part.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.






[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8517:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>
> The current MLlib user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.






[jira] [Updated] (SPARK-7457) Perf test for ALS.recommendAll

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7457:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Perf test for ALS.recommendAll
> --
>
> Key: SPARK-7457
> URL: https://issues.apache.org/jira/browse/SPARK-7457
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>







[jira] [Resolved] (SPARK-9910) User guide for train validation split

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9910.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8377
[https://github.com/apache/spark/pull/8377]

> User guide for train validation split
> -
>
> Key: SPARK-9910
> URL: https://issues.apache.org/jira/browse/SPARK-9910
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Martin Zapletal
> Fix For: 1.5.0
>
>
> SPARK-8484 adds a TrainValidationSplit transformer which needs user guide 
> docs and example code.






[jira] [Updated] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-28 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-10340:

Assignee: Cheolsoo Park

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose prefixes start with 
> the common prefix in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.
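
For reference, a minimal sketch (not Spark's or Qubole's code) of what one such bulk listing looks like with the AWS SDK for Java v1 called from Scala; the bucket name and prefix below are made up:

{code}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import scala.collection.JavaConverters._

object S3BulkListingSketch {
  def main(args: Array[String]): Unit = {
    val s3 = new AmazonS3Client()  // credentials come from the default provider chain
    // One request covers every partition directory that shares the common prefix;
    // S3 returns keys in pages of at most 1000.
    val request = new ListObjectsRequest()
      .withBucketName("my-warehouse-bucket")     // made-up bucket
      .withPrefix("hive/partitioned_table/")     // common prefix of all input paths
      .withMaxKeys(1000)
    var listing: ObjectListing = s3.listObjects(request)
    val keys = scala.collection.mutable.ArrayBuffer.empty[String]
    keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
    while (listing.isTruncated) {                // page through the remaining blocks
      listing = s3.listNextBatchOfObjects(listing)
      keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
    }
    keys.foreach(println)
  }
}
{code}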






[jira] [Commented] (SPARK-10337) Views are broken

2015-08-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720922#comment-14720922
 ] 

Yin Huai commented on SPARK-10337:
--

I think this will break if the view is created from a data source table that 
does not have a corresponding serde. I guess 100milints was created before we 
added support for saving Hive column and serde info to the metastore. I tried master 
with the following code.

{code}
sqlContext.range(1, 10).write.format("json").saveAsTable("yinJsonTable”)

describe formatted yinJsonTable

# col_name data_type comment
col array from deserializer
...


sqlContext.range(1, 10).write.format("parquet").saveAsTable("yinParquetTable”)

describe formatted yinParquetTable

# col_name data_type comment
id bigint
...
{code}

For JSON, we do not store column info in Hive; you can see that the metastore 
actually stores a column col with type array in it.

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPla

[jira] [Resolved] (SPARK-9803) Add transform and subset to DataFrame

2015-08-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9803.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 1.5.1, 1.6.0

Resolved by https://github.com/apache/spark/pull/8503

> Add transform and subset  to DataFrame
> --
>
> Key: SPARK-9803
> URL: https://issues.apache.org/jira/browse/SPARK-9803
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Hossein Falaki
>Assignee: Felix Cheung
> Fix For: 1.5.1, 1.6.0
>
>
> These three base functions are heavily used with R dataframes. It would be 
> great to have them work with Spark DataFrames:
> * transform
> * subset






[jira] [Updated] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior

2015-08-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-10346:
-
Description: 
Spark doesn't seem to replace an existing column with the same name in mutate (i.e. 
mutate(df, age = df$age + 2) returns a DataFrame with 2 columns of the same 
name 'age'), so we are not doing that for now in transform either.

Though it is clearly stated that it should replace the column with a matching name:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html

"The tags are matched against names(_data), and for those that match, the value 
replace the corresponding variable in _data, and the others are appended to 
_data."

Also the resulting DataFrame might be hard to work with if one is to use select 
with column names, or to register the table to SQL, and so on, since then 2 
columns have the same name.


  was:
Spark doesn't seem to replace an existing column with the same name in mutate (i.e. 
mutate(df, age = df$age + 2) returns a DataFrame with 2 columns of the same 
name 'age'), so we are not doing that for now in transform either.

Though it is clearly stated that it should replace the column with a matching name:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html

"The tags are matched against names(_data), and for those that match, the value 
replace the corresponding variable in _data, and the others are appended to 
_data."

Also the resulting DataFrame might be hard to work with if one is to use select 
with column names and so on.



> SparkR mutate and transform should replace column with same name to match R 
> data.frame behavior
> ---
>
> Key: SPARK-10346
> URL: https://issues.apache.org/jira/browse/SPARK-10346
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
>Reporter: Felix Cheung
>
> Spark doesn't seem to replace an existing column with the same name in mutate (i.e. 
> mutate(df, age = df$age + 2) returns a DataFrame with 2 columns of the same 
> name 'age'), so we are not doing that for now in transform either.
> Though it is clearly stated that it should replace the column with a matching name:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html
> "The tags are matched against names(_data), and for those that match, the 
> value replace the corresponding variable in _data, and the others are 
> appended to _data."
> Also the resulting DataFrame might be hard to work with if one is to use 
> select with column names, or to register the table to SQL, and so on, since 
> then 2 columns have the same name.






[jira] [Commented] (SPARK-8952) JsonFile() of SQLContext display improper warning message for a S3 path

2015-08-28 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720889#comment-14720889
 ] 

Sun Rui commented on SPARK-8952:


I submitted https://issues.apache.org/jira/browse/SPARK-10347 to track this 
issue for a long-term solution.

> JsonFile() of SQLContext display improper warning message for a S3 path
> ---
>
> Key: SPARK-8952
> URL: https://issues.apache.org/jira/browse/SPARK-8952
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Sun Rui
>Assignee: Luciano Resende
> Fix For: 1.6.0, 1.5.1
>
>
> This is an issue reported by Ben Spark.
> {quote}
> Spark 1.4 deployed on AWS EMR 
> "jsonFile" is working though with some warning message
> Warning message:
> In normalizePath(path) :
>   
> path[1]="s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0": 
> No such file or directory
> {quote}






[jira] [Created] (SPARK-10347) Investigate the usage of normalizePath()

2015-08-28 Thread Sun Rui (JIRA)
Sun Rui created SPARK-10347:
---

 Summary: Investigate the usage of normalizePath()
 Key: SPARK-10347
 URL: https://issues.apache.org/jira/browse/SPARK-10347
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Sun Rui
Priority: Minor


Currently normalizePath() is used in several places allowing users to specify 
paths via the use of tilde expansion, or to normalize a relative path to an 
absolute path. However, normalizePath() is also used for paths that are actually 
expected to be URIs. normalizePath() may display warning messages when it does 
not recognize a URI as a local file path, so suppressWarnings() is used to 
suppress the possible warnings.

Worse than warnings, calling normalizePath() on a URI may cause errors, because it 
may turn a user-specified relative path into an absolute path using the local 
current directory. This may not be correct, because the path is actually 
relative to the working directory of the default file system rather than the 
local file system (depending on the Hadoop configuration of Spark).






[jira] [Created] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior

2015-08-28 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-10346:


 Summary: SparkR mutate and transform should replace column with 
same name to match R data.frame behavior
 Key: SPARK-10346
 URL: https://issues.apache.org/jira/browse/SPARK-10346
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.5.0
Reporter: Felix Cheung


Spark doesn't seem to replace an existing column with the same name in mutate (i.e. 
mutate(df, age = df$age + 2) returns a DataFrame with 2 columns of the same 
name 'age'), so we are not doing that for now in transform either.

Though it is clearly stated that it should replace the column with a matching name:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html

"The tags are matched against names(_data), and for those that match, the value 
replace the corresponding variable in _data, and the others are appended to 
_data."

Also the resulting DataFrame might be hard to work with if one is to use select 
with column names and so on.







[jira] [Comment Edited] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-28 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720860#comment-14720860
 ] 

Feynman Liang edited comment on SPARK-7454 at 8/29/15 12:54 AM:


[~mengxr] do you mind assigning to me? Thanks!


was (Author: fliang):
[~mengxr] do you mind assigning to me and linking my PR, thanks!

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-28 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720860#comment-14720860
 ] 

Feynman Liang commented on SPARK-7454:
--

[~mengxr] do you mind assigning to me and linking my PR, thanks!

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-28 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720854#comment-14720854
 ] 

Feynman Liang commented on SPARK-7454:
--

https://github.com/databricks/spark-perf/pull/86

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-08-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720845#comment-14720845
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

The idea behind having `dapplyCollect` was that it might be easier to implement 
as the output doesn't necessarily need to be converted to a valid Spark 
DataFrame on the JVM and could instead be just any R data frame. But I agree 
that adding more keywords is confusing for users and we could avoid this in a 
couple of ways:
(1) implement the type conversion from R to JVM first so we wouldn't need this 
(2) have a slightly different class on the JVM that only supports collect on it 
(i.e. not a DataFrame) and use that to 

Regarding gapply -- SparkR (and dplyr) already have a `group_by` function that 
does the grouping and in SparkR this returns a `GroupedData` object. Right now 
the only function available on the `GroupedData` object is `agg` to perform 
aggregations on it. We could instead support `dapply` on `GroupedData` objects 
and then the syntax would be something like

grouped_df <- group_by(df, df$city)
collect(dapply(grouped_df, function(group) {} ))

cc [~rxin]

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.






[jira] [Created] (SPARK-10345) Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10345:
--

 Summary: Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate
 Key: SPARK-10345
 URL: https://issues.apache.org/jira/browse/SPARK-10345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41759/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/nonblock_op_deduplicate/






[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl

2015-08-28 Thread Trevor Cai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720824#comment-14720824
 ] 

Trevor Cai commented on SPARK-9963:
---

[~lkhamsurenl]: Are you still working on this?
I took a quick look at the code, and it looks to me like it's not trivial to 
convert the binnedFeatures and splits arrays into a single vector mapping 
feature to threshold. The below section of code seems to indicate to me that 
some features don't have a threshold and should always move right, and using 
Double.MAX_VALUE as the threshold in those cases seems like it could 
potentially cause issues.
{code}
override private[tree] def shouldGoLeft(binnedFeature: Int, splits: Array[Split]): Boolean = {
  if (binnedFeature == splits.length) {
    // > last split, so split right
    false
  } else {
    val featureValueUpperBound = splits(binnedFeature).asInstanceOf[ContinuousSplit].threshold
    featureValueUpperBound <= threshold
  }
}
{code}

As a result, if you're still working on this, your second proposal makes more 
sense. If not, I can pick up the issue and implement the second option.

[~josephkb] Does this seem reasonable to you?


> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.






[jira] [Assigned] (SPARK-10344) Add tests for extraStrategies

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10344:


Assignee: Apache Spark  (was: Michael Armbrust)

> Add tests for extraStrategies
> -
>
> Key: SPARK-10344
> URL: https://issues.apache.org/jira/browse/SPARK-10344
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-10344) Add tests for extraStrategies

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10344:


Assignee: Michael Armbrust  (was: Apache Spark)

> Add tests for extraStrategies
> -
>
> Key: SPARK-10344
> URL: https://issues.apache.org/jira/browse/SPARK-10344
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Commented] (SPARK-10344) Add tests for extraStrategies

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720820#comment-14720820
 ] 

Apache Spark commented on SPARK-10344:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8516

> Add tests for extraStrategies
> -
>
> Key: SPARK-10344
> URL: https://issues.apache.org/jira/browse/SPARK-10344
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Created] (SPARK-10344) Add tests for extraStrategies

2015-08-28 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-10344:


 Summary: Add tests for extraStrategies
 Key: SPARK-10344
 URL: https://issues.apache.org/jira/browse/SPARK-10344
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust









[jira] [Updated] (SPARK-10326) Cannot launch YARN job on Windows

2015-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10326:

Target Version/s: 1.5.0  (was: 1.5.1)

> Cannot launch YARN job on Windows 
> --
>
> Key: SPARK-10326
> URL: https://issues.apache.org/jira/browse/SPARK-10326
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.5.0
>
>
> The fix is already in master, and it's one line out of the patch for 
> SPARK-5754; the bug is that a Windows file path cannot be used to create a 
> URI directly, so {{File.toURI()}} needs to be called.
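
For illustration only (this is not the Spark patch, and the path below is made up), the difference looks like this:

{code}
import java.io.File
import java.net.URI

object WindowsPathToUriSketch {
  def main(args: Array[String]): Unit = {
    val winPath = "C:\\spark\\lib\\spark-assembly.jar"   // hypothetical Windows path
    // new URI(winPath) would throw URISyntaxException: backslashes are illegal in a URI,
    // and "C:" would be read as a URI scheme.
    val uri: URI = new File(winPath).toURI               // on Windows: file:/C:/spark/lib/spark-assembly.jar
    println(uri)
  }
}
{code}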






[jira] [Assigned] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10334:


Assignee: Yin Huai  (was: Apache Spark)

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> Looks like the reason is that we correctly apply the project and filter. 
> Then, we create an RDD for the result and then manually create a PhysicalRDD. 
> So, the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong result. But, the query plan is confusing.






[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10339:


Assignee: Apache Spark  (was: Yin Huai)

> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in driver's old gen 
> gradually decreases and eventually there is pretty much no free space in 
> driver's old gen. Finally, all kinds of timeouts happen and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ 
> => Unit)
> {code}
> I did a quick test by deleting SQL metrics from the project and filter 
> operators, and my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> looks like
> {code}
>other operators
>|
>|
> /--|--\
>/   |   \
>   /|\
>  / | \
> project  project ... project
>   ||   |
> filter   filter  ... filter
>   ||   |
> part1part2   ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.






[jira] [Commented] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720749#comment-14720749
 ] 

Apache Spark commented on SPARK-10334:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8515

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> Looks like the reason is that we correctly apply the project and filter. 
> Then, we create an RDD for the result and then manually create a PhysicalRDD. 
> So, the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong result. But, the query plan is confusing.






[jira] [Resolved] (SPARK-10326) Cannot launch YARN job on Windows

2015-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10326.
-
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.5.0

> Cannot launch YARN job on Windows 
> --
>
> Key: SPARK-10326
> URL: https://issues.apache.org/jira/browse/SPARK-10326
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.5.0
>
>
> The fix is already in master, and it's one line out of the patch for 
> SPARK-5754; the bug is that a Windows file path cannot be used to create a 
> URI directly, so {{File.toURI()}} needs to be called.






[jira] [Assigned] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10334:


Assignee: Apache Spark  (was: Yin Huai)

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> Looks like the reason is that we correctly apply the project and filter. 
> Then, we create an RDD for the result and then manually create a PhysicalRDD. 
> So, the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong result. But, the query plan is confusing.






[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10339:


Assignee: Yin Huai  (was: Apache Spark)

> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in driver's old gen 
> gradually decreases and eventually there is pretty much no free space in 
> driver's old gen. Finally, all kinds of timeouts happen and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ 
> => Unit)
> {code}
> I did a quick test by deleting SQL metrics from the project and filter 
> operators, and my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> looks like
> {code}
>other operators
>|
>|
> /--|--\
>/   |   \
>   /|\
>  / | \
> project  project ... project
>   ||   |
> filter   filter  ... filter
>   ||   |
> part1part2   ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.






[jira] [Commented] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720748#comment-14720748
 ] 

Apache Spark commented on SPARK-10339:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8515

> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in driver's old gen 
> gradually decreases and eventually there is pretty much no free space in 
> driver's old gen. Finally, all kinds of timeouts happen and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ 
> => Unit)
> {code}
> I did a quick test by deleting SQL metrics from the project and filter 
> operators, and my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> looks like
> {code}
>other operators
>|
>|
> /--|--\
>/   |   \
>   /|\
>  / | \
> project  project ... project
>   ||   |
> filter   filter  ... filter
>   ||   |
> part1part2   ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.






[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-08-28 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720747#comment-14720747
 ] 

Suresh Thalamati commented on SPARK-9078:
-

@Bob, Reynold

I ran into the same issue when trying to write data frames into an existing 
table in a DB2 database.
If you are not working on the fix, I would like to give it a try. 

Thank you for the analysis of the issue. My understanding is that the fix should do the 
following to address the table-exists problem with the LIMIT syntax (see the sketch below):

-- Add a table-exists method to the JdbcDialect interface, and allow dialects to 
override the method as required for specific databases.
-- The default implementation of the table-exists method should use 
DatabaseMetaData.getTables() to find out whether the table exists. If that particular 
interface is not implemented, use the query "select 1 from $table where 1=0".
-- Add a table-exists method that uses the LIMIT query to the MySQL and Postgres 
dialects.
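
A rough sketch of that direction, for discussion only; the trait and method names below are illustrative and are not Spark's actual JdbcDialect API:

{code}
import java.sql.{Connection, SQLException}

// Illustrative only; not Spark's real JdbcDialect interface.
trait DialectTableExists {
  /** Default: probe via JDBC metadata, falling back to a portable zero-row query. */
  def tableExists(conn: Connection, table: String): Boolean = {
    try {
      val rs = conn.getMetaData.getTables(null, null, table, Array("TABLE"))
      try rs.next() finally rs.close()
    } catch {
      case _: SQLException =>
        // Driver without usable metadata support: run a query that returns no rows.
        val stmt = conn.createStatement()
        try { stmt.executeQuery(s"SELECT 1 FROM $table WHERE 1=0"); true }
        catch { case _: SQLException => false }
        finally { stmt.close() }
    }
  }
}

// Dialects for databases that do support LIMIT can keep the cheaper existing probe.
object MySQLLikeDialect extends DialectTableExists {
  override def tableExists(conn: Connection, table: String): Boolean = {
    val stmt = conn.createStatement()
    try { stmt.executeQuery(s"SELECT 1 FROM $table LIMIT 1"); true }
    catch { case _: SQLException => false }
    finally { stmt.close() }
  }
}
{code}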

* Enhancing the registering of dialects (I think this may have to be a separate 
JIRA to avoid confusion).

@Reynold: I am not understanding your comment on adding an option to pass 
through the jdbc data source. If you can give an example, that would be great. 

Are you referring to something like the following?
 df.write.option("datasource.jdbc.dialects", "org.apache.DerbyDialect").jdbc("jdbc:deryby://

> Use of non-standard LIMIT keyword in JDBC tableExists code
> --
>
> Key: SPARK-9078
> URL: https://issues.apache.org/jira/browse/SPARK-9078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Robert Beauchemin
>Priority: Minor
>
> tableExists in  
> spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
> non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
> table exists in a JDBC data source. This will cause an exception in many/most 
> JDBC databases that don't support the LIMIT keyword. See 
> http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
> To check for table existence or an exception, it could be recrafted around 
> "select 1 from $table where 0 = 1" which isn't the same (it returns an empty 
> resultset rather than the value '1'), but would support more data sources and 
> also support empty tables. Arguably ugly and possibly queries every row on 
> sources that don't support constant folding, but better than failing on JDBC 
> sources that don't support LIMIT. 
> Perhaps "supports LIMIT" could be a field in the JdbcDialect class for 
> databases that support this keyword to override. The ANSI standard is (OFFSET 
> and) FETCH. 
> The standard way to check for table existence would be to use 
> information_schema.tables which is a SQL standard but may not work for other 
> JDBC data sources that support SQL, but not the information_schema. The JDBC 
> DatabaseMetaData interface provides getSchemas()  that allows checking for 
> the information_schema in drivers that support it.






[jira] [Assigned] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-10334:


Assignee: Yin Huai

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> Looks like the reason is that we correctly apply the project and filter. 
> Then, we create an RDD for the result and then manually create a PhysicalRDD. 
> So, the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong result. But, the query plan is confusing.






[jira] [Resolved] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10321.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.2

> OrcRelation doesn't override sizeInBytes
> 
>
> Key: SPARK-10321
> URL: https://issues.apache.org/jira/browse/SPARK-10321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.2, 1.5.0
>
>
> This hurts performance badly because broadcast join can never be enabled.






[jira] [Commented] (SPARK-9957) Spark ML trees should filter out 1-category features

2015-08-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720720#comment-14720720
 ] 

holdenk commented on SPARK-9957:


I can give this a shot :)

> Spark ML trees should filter out 1-category features
> 
>
> Key: SPARK-9957
> URL: https://issues.apache.org/jira/browse/SPARK-9957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark ML trees should filter out 1-category categorical features.
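A rough sketch of the intended filtering on the categorical-features map that 
the tree code carries (the map name mirrors MLlib's categoricalFeaturesInfo and 
the values are made up for illustration):
{code}
// Map of feature index -> number of categories for categorical features.
val categoricalFeaturesInfo: Map[Int, Int] = Map(0 -> 1, 2 -> 3, 5 -> 1)

// A 1-category feature can never produce a useful split, so drop it up front.
val usableCategoricalFeatures = categoricalFeaturesInfo.filter {
  case (featureIndex, numCategories) => numCategories >= 2
}
// usableCategoricalFeatures == Map(2 -> 3); features 0 and 5 are ignored.
{code}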



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10222) Deprecate, retire Bagel in favor of GraphX

2015-08-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720709#comment-14720709
 ] 

holdenk commented on SPARK-10222:
-

For what it's worth, that sounds like a good plan to me. I occasionally get 
people asking me which of the two they should use, so removing Bagel would 
reduce that confusion.

> Deprecate, retire Bagel in favor of GraphX
> --
>
> Key: SPARK-10222
> URL: https://issues.apache.org/jira/browse/SPARK-10222
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Sean Owen
>
> It seems like Bagel has had little or no activity since before even Spark 1.0 
> (?) and is supposed to be superseded by GraphX. 
> Would it be reasonable to deprecate it in 1.6 and remove it in Spark 2.x? I 
> think it's reasonable enough that I'll assert this as a JIRA, but I'm obviously 
> open to discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-10339:


Assignee: Yin Huai

> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in the driver's old gen 
> gradually decreases and eventually there is pretty much no free space left in 
> the driver's old gen. Finally, all kinds of timeouts happen and the cluster 
> dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> I did a quick test: after deleting the SQL metrics from the project and filter 
> operators, my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> looks like
> {code}
>          other operators
>                |
>         /------|------\
>        /       |       \
>   project   project ... project
>      |         |         |
>   filter    filter  ... filter
>      |         |         |
>    part1     part2  ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10296) add preservesParitioning parameter to RDD.map

2015-08-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720706#comment-14720706
 ] 

holdenk commented on SPARK-10296:
-

mapValues won't expose the key that it's operating on (it's possible to have a 
case where you aren't going to change the key, but the change to the value 
depends in part on the key).
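For reference, a key-aware workaround that already exists is mapPartitions with 
preservesPartitioning = true; a minimal sketch (assuming a spark-shell sc, and 
relying on the caller to genuinely leave keys unchanged):
{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))

// Transform values while still seeing the key, without dropping the partitioner.
val adjusted = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, if (k == "a") v * 10 else v) },
  preservesPartitioning = true)

assert(adjusted.partitioner == pairs.partitioner)
{code}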

> add preservesParitioning parameter to RDD.map
> -
>
> Key: SPARK-10296
> URL: https://issues.apache.org/jira/browse/SPARK-10296
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Esteban Donato
>Priority: Minor
>
> It would be nice to add a Boolean preservesPartitioning parameter, defaulting 
> to false, to the RDD.map method, just as in the RDD.mapPartitions method.
> If you agree I can submit a pull request with this enhancement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10341:
---
Target Version/s: 1.5.0  (was: 1.5.1)

> SMJ fail with unable to acquire memory
> --
>
> Key: SPARK-10341
> URL: https://issues.apache.org/jira/browse/SPARK-10341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> In SMJ, the first ExternalSorter could consume all the memory before 
> spilling, then the second can not even acquire the first page.
> {code}
> java.io.IOException: Unable to acquire 16777216 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10175) Enhance spark doap file

2015-08-28 Thread Luciano Resende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720638#comment-14720638
 ] 

Luciano Resende commented on SPARK-10175:
-

ping

> Enhance spark doap file
> ---
>
> Key: SPARK-10175
> URL: https://issues.apache.org/jira/browse/SPARK-10175
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Luciano Resende
> Attachments: SPARK-10175
>
>
> The Spark doap has broken links and is also missing entries related to issue 
> tracker and mailing lists. This affects the list in projects.apache.org and 
> also in the main apache website.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10299) word2vec should allow users to specify the window size

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10299:


Assignee: Apache Spark

> word2vec should allow users to specify the window size
> --
>
> Key: SPARK-10299
> URL: https://issues.apache.org/jira/browse/SPARK-10299
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently word2vec has the window size hard-coded at 5; some users may want 
> different sizes (for example, when using it on n-gram input or similar). The 
> user request comes from 
> http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .
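A sketch of the kind of API the request implies; setWindowSize is hypothetical 
here and only exists if the linked pull request lands:
{code}
import org.apache.spark.mllib.feature.Word2Vec

val word2vec = new Word2Vec()
  .setVectorSize(100)
  .setMinCount(1)
// .setWindowSize(10)   // proposed setter, not in the released API; shown for illustration

// val model = word2vec.fit(tokenizedCorpus)   // tokenizedCorpus: RDD[Seq[String]]
{code}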



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10299) word2vec should allow users to specify the window size

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10299:


Assignee: (was: Apache Spark)

> word2vec should allow users to specify the window size
> --
>
> Key: SPARK-10299
> URL: https://issues.apache.org/jira/browse/SPARK-10299
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Currently word2vec has the window size hard-coded at 5; some users may want 
> different sizes (for example, when using it on n-gram input or similar). The 
> user request comes from 
> http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10299) word2vec should allow users to specify the window size

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720651#comment-14720651
 ] 

Apache Spark commented on SPARK-10299:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8513

> word2vec should allow users to specify the window size
> --
>
> Key: SPARK-10299
> URL: https://issues.apache.org/jira/browse/SPARK-10299
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Currently word2vec has the window size hard-coded at 5; some users may want 
> different sizes (for example, when using it on n-gram input or similar). The 
> user request comes from 
> http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10323) NPE in code-gened In expression

2015-08-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10323.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8492
[https://github.com/apache/spark/pull/8492]

> NPE in code-gened In expression
> ---
>
> Key: SPARK-10323
> URL: https://issues.apache.org/jira/browse/SPARK-10323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> To reproduce the problem, you can run {{null in ('str')}}. Let's also take a 
> look at InSet and other similar expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10338) network/yarn is missing from sbt build

2015-08-28 Thread Nilanjan Raychaudhuri (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nilanjan Raychaudhuri closed SPARK-10338.
-
Resolution: Cannot Reproduce

> network/yarn is missing from sbt build
> --
>
> Key: SPARK-10338
> URL: https://issues.apache.org/jira/browse/SPARK-10338
> Project: Spark
>  Issue Type: Bug
>  Components: Build, YARN
>Affects Versions: 1.5.0
>Reporter: Nilanjan Raychaudhuri
>
> The following command fails with a compilation error about a missing 
> LoggingFactory class. I am running this on the current master version.
> build/sbt -Pyarn -Phadoop-2.3 assembly
> One fix is to add the following to the respective pom.xml file:
> <dependency>
>   <groupId>org.slf4j</groupId>
>   <artifactId>slf4j-api</artifactId>
>   <scope>provided</scope>
> </dependency>
> I am not sure whether this compilation error is specific to sbt or whether 
> mvn also has this problem.
> And for sbt to recognize the network/yarn project we might also have to add 
> the module to the parent pom.xml file.
> wdyt?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10343) Consider nullability of expression in codegen

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10343:
--

 Summary: Consider nullability of expression in codegen
 Key: SPARK-10343
 URL: https://issues.apache.org/jira/browse/SPARK-10343
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Priority: Critical


In codegen, we don't currently consider the nullability of expressions. Once we 
take nullability into account, we can avoid lots of null checks (reducing the 
size of the generated code and also improving performance).

Before that, we should double-check the correctness of the nullability of all 
expressions and schemas, or we will hit NPEs or wrong results.
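An illustration of the null checks at stake, written as ordinary Scala standing 
in for generated code (this is not actual codegen output):
{code}
// Roughly what generated code does today for `a + b`: every input gets a null branch.
def addWithNullChecks(a: java.lang.Integer, b: java.lang.Integer): java.lang.Integer =
  if (a == null || b == null) null
  else java.lang.Integer.valueOf(a.intValue() + b.intValue())

// If the schema guarantees both children are non-nullable, the branches and the
// boxing can be elided, shrinking the generated code and speeding it up.
def addNonNullable(a: Int, b: Int): Int = a + b
{code}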



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10338) network/yarn is missing from sbt build

2015-08-28 Thread Nilanjan Raychaudhuri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720600#comment-14720600
 ] 

Nilanjan Raychaudhuri commented on SPARK-10338:
---

You are right. I tried again with a clean repo.  My sbt cache might have been 
messed up.

> network/yarn is missing from sbt build
> --
>
> Key: SPARK-10338
> URL: https://issues.apache.org/jira/browse/SPARK-10338
> Project: Spark
>  Issue Type: Bug
>  Components: Build, YARN
>Affects Versions: 1.5.0
>Reporter: Nilanjan Raychaudhuri
>
> The following command fails with a compilation error about a missing 
> LoggingFactory class. I am running this on the current master version.
> build/sbt -Pyarn -Phadoop-2.3 assembly
> One fix is to add the following to the respective pom.xml file:
> <dependency>
>   <groupId>org.slf4j</groupId>
>   <artifactId>slf4j-api</artifactId>
>   <scope>provided</scope>
> </dependency>
> I am not sure whether this compilation error is specific to sbt or whether 
> mvn also has this problem.
> And for sbt to recognize the network/yarn project we might also have to add 
> the module to the parent pom.xml file.
> wdyt?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720581#comment-14720581
 ] 

Apache Spark commented on SPARK-10340:
--

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8512

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose keys start with the 
> common prefix, in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10342:
---
Issue Type: Improvement  (was: Story)

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time, and they became worse 
> in 1.5 since we use larger pages.
> In order to increase memory utilization (reduce unnecessary spilling) and also 
> reduce the risk of OOM, we should manage memory in a cooperative way, meaning 
> that every memory consumer should also be responsive to requests from others 
> to release memory (by spilling).
> Memory requests can differ: a hard requirement (the task will crash if memory 
> is not allocated) or a soft requirement (only worse performance if it is not 
> allocated). The costs of spilling also differ. We could introduce some kind of 
> priority to make consumers work together better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10341:


Assignee: Apache Spark  (was: Davies Liu)

> SMJ fail with unable to acquire memory
> --
>
> Key: SPARK-10341
> URL: https://issues.apache.org/jira/browse/SPARK-10341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Critical
>
> In SMJ, the first ExternalSorter could consume all the memory before 
> spilling, then the second can not even acquire the first page.
> {code}
> java.io.IOException: Unable to acquire 16777216 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10336:
--
Fix Version/s: (was: 1.5.0)
   1.5.1

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
> Fix For: 1.5.1
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10340:


Assignee: (was: Apache Spark)

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose keys start with the 
> common prefix, in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10340:


Assignee: Apache Spark

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Apache Spark
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose keys start with the 
> common prefix, in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10342:

Assignee: (was: Davies Liu)

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time, and they became worse 
> in 1.5 since we use larger pages.
> In order to increase memory utilization (reduce unnecessary spilling) and also 
> reduce the risk of OOM, we should manage memory in a cooperative way, meaning 
> that every memory consumer should also be responsive to requests from others 
> to release memory (by spilling).
> Memory requests can differ: a hard requirement (the task will crash if memory 
> is not allocated) or a soft requirement (only worse performance if it is not 
> allocated). The costs of spilling also differ. We could introduce some kind of 
> priority to make consumers work together better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10342:

Issue Type: Story  (was: New Feature)

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time, and they became worse 
> in 1.5 since we use larger pages.
> In order to increase memory utilization (reduce unnecessary spilling) and also 
> reduce the risk of OOM, we should manage memory in a cooperative way, meaning 
> that every memory consumer should also be responsive to requests from others 
> to release memory (by spilling).
> Memory requests can differ: a hard requirement (the task will crash if memory 
> is not allocated) or a soft requirement (only worse performance if it is not 
> allocated). The costs of spilling also differ. We could introduce some kind of 
> priority to make consumers work together better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10342:

Issue Type: New Feature  (was: Bug)

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> We have memory starving problems for a long time, it become worser in 1.5 
> since we use larger page.
> In order to increase the memory usage (reduce unnecessary spilling) also 
> reduce the risk of OOM, we should manage the memory in a cooperative way, it 
> means all the memory consume should be also responsive to release memory 
> (spilling) upon others' requests.
> The requests of memory could be different, hard requirement (will crash if 
> not allocated) or soft requirement (worse performance if not allocated). Also 
> the costs of spilling are also different. We could introduce some kind of 
> priority to make them work together better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720573#comment-14720573
 ] 

DB Tsai commented on SPARK-10336:
-

Thanks for the heads up.

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
> Fix For: 1.5.1
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720559#comment-14720559
 ] 

Xiangrui Meng commented on SPARK-10336:
---

[~dbtsai] Please check the JIRA fields after you merge a PR. While an RC release 
is under vote, we should set the fix version to 1.5.1 instead of 1.5.0. If there 
is a new RC, a batch job will change 1.5.1 back to 1.5.0. For bugs, the affected 
versions should be set as well.

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
> Fix For: 1.5.1
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10342:
--

 Summary: Cooperative memory management
 Key: SPARK-10342
 URL: https://issues.apache.org/jira/browse/SPARK-10342
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


We have had memory starvation problems for a long time, and they became worse in 
1.5 since we use larger pages.

In order to increase memory utilization (reduce unnecessary spilling) and also 
reduce the risk of OOM, we should manage memory in a cooperative way, meaning 
that every memory consumer should also be responsive to requests from others to 
release memory (by spilling).

Memory requests can differ: a hard requirement (the task will crash if memory is 
not allocated) or a soft requirement (only worse performance if it is not 
allocated). The costs of spilling also differ. We could introduce some kind of 
priority to make consumers work together better.
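An illustrative sketch of what a cooperative consumer contract could look like, 
with the hard/soft distinction made explicit. These are not Spark's actual 
interfaces, just a shape for the idea:
{code}
// Every consumer can be asked to give memory back by spilling.
trait CooperativeConsumer {
  /** Bytes this consumer could release if asked. */
  def spillableBytes: Long
  /** Release up to `target` bytes by spilling; returns bytes actually freed. */
  def spill(target: Long): Long
}

sealed trait Requirement
case object Hard extends Requirement  // the task crashes if the request is not met
case object Soft extends Requirement  // the task only runs slower if not met

class CooperativeMemoryManager(
    private var free: Long,
    consumers: Seq[CooperativeConsumer]) {

  /** Try to satisfy a request, asking other consumers to spill if needed. */
  def acquire(bytes: Long, requirement: Requirement): Long = {
    if (free < bytes) {
      // Naive policy: ask the consumers with the most spillable memory first.
      for (c <- consumers.sortBy(-_.spillableBytes) if free < bytes) {
        free += c.spill(bytes - free)
      }
    }
    val granted = math.min(bytes, free)
    if (granted < bytes && requirement == Hard) {
      throw new OutOfMemoryError(s"could not acquire $bytes bytes even after spilling")
    }
    free -= granted
    granted
  }
}
{code}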



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9671) ML 1.5 QA: Programming guide update and migration guide

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9671.
--
   Resolution: Fixed
Fix Version/s: 1.5.1

Issue resolved by pull request 8498
[https://github.com/apache/spark/pull/8498]

> ML 1.5 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-9671
> URL: https://issues.apache.org/jira/browse/SPARK-9671
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.5.1
>
>
> Before the release, we need to update the MLlib Programming Guide.  Updates 
> will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs.
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Possibly reorganize parts of the Pipelines guide if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10336:
--
Affects Version/s: 1.5.0
   1.4.1
  Component/s: ML
   Documentation

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
> Fix For: 1.5.1
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720552#comment-14720552
 ] 

Apache Spark commented on SPARK-10341:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8511

> SMJ fail with unable to acquire memory
> --
>
> Key: SPARK-10341
> URL: https://issues.apache.org/jira/browse/SPARK-10341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> In SMJ, the first ExternalSorter could consume all the memory before 
> spilling, then the second can not even acquire the first page.
> {code}
> java.io.IOException: Unable to acquire 16777216 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10341:


Assignee: Davies Liu  (was: Apache Spark)

> SMJ fail with unable to acquire memory
> --
>
> Key: SPARK-10341
> URL: https://issues.apache.org/jira/browse/SPARK-10341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> In SMJ, the first ExternalSorter could consume all the memory before 
> spilling, then the second can not even acquire the first page.
> {code}
> java.io.IOException: Unable to acquire 16777216 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10341:
--

 Summary: SMJ fail with unable to acquire memory
 Key: SPARK-10341
 URL: https://issues.apache.org/jira/browse/SPARK-10341
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


In SMJ, the first ExternalSorter could consume all the memory before spilling, 
then the second can not even acquire the first page.

{code}
java.io.IOException: Unable to acquire 16777216 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
{code}
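A rough repro sketch, assuming a spark-shell sc/sqlContext and a deliberately 
small executor heap (for example --executor-memory 1g); the sizes are arbitrary 
and whether the error actually triggers depends on the cluster:
{code}
import sqlContext.implicits._

// Keep both sides too big to broadcast, so the planner picks a sort-merge join
// and each side needs its own ExternalSorter at the same time.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = sc.parallelize(1 to 4000000).map(i => (i, s"l$i")).toDF("k", "lv")
val right = sc.parallelize(1 to 4000000).map(i => (i, s"r$i")).toDF("k", "rv")

left.join(right, "k").count()
{code}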



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10338) network/yarn is missing from sbt build

2015-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720527#comment-14720527
 ] 

Sean Owen commented on SPARK-10338:
---

I can't reproduce any problem with the sbt build in master, using the same 
command. I'm not clear what your error is, or why it's related to network/yarn 
or slf4j. I'd have to close as CannotReproduce unless you can elaborate.

> network/yarn is missing from sbt build
> --
>
> Key: SPARK-10338
> URL: https://issues.apache.org/jira/browse/SPARK-10338
> Project: Spark
>  Issue Type: Bug
>  Components: Build, YARN
>Affects Versions: 1.5.0
>Reporter: Nilanjan Raychaudhuri
>
> The following command fails with a compilation error about a missing 
> LoggingFactory class. I am running this on the current master version.
> build/sbt -Pyarn -Phadoop-2.3 assembly
> One fix is to add the following to the respective pom.xml file:
> <dependency>
>   <groupId>org.slf4j</groupId>
>   <artifactId>slf4j-api</artifactId>
>   <scope>provided</scope>
> </dependency>
> I am not sure whether this compilation error is specific to sbt or whether 
> mvn also has this problem.
> And for sbt to recognize the network/yarn project we might also have to add 
> the module to the parent pom.xml file.
> wdyt?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10298) PySpark can't JSON serialize a DataFrame with DecimalType columns.

2015-08-28 Thread Dan McClary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720515#comment-14720515
 ] 

Dan McClary commented on SPARK-10298:
-

I can't seem to replicate this on master.  The following scenarios work:

from decimal import *
import pyspark.sql.types as types
from pyspark.sql import *

v = [Decimal(10.0-i)/Decimal(1.0+1) for i in range(10)]
d = sc.parallelize(v).map(lambda x: Row(v=x)).toDF()
d.write.json("foo")

v2 = [[Decimal(10.0-i)/Decimal(1.0+1)] for i in range(10)]
d2 = sc.parallelize(v2)
df = sqlContext.createDataFrame(d2, types.StructType([types.StructField("a", 
types.DecimalType())]))
df.write.json("foo")


> PySpark can't JSON serialize a DataFrame with DecimalType columns.
> --
>
> Key: SPARK-10298
> URL: https://issues.apache.org/jira/browse/SPARK-10298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Cox
>
> {code}
> In [8]: sc.sql.createDataFrame([[Decimal(123)]], 
> types.StructType([types.StructField("a", types.DecimalType())]))
> Out[8]: DataFrame[a: decimal(10,0)]
> In [9]: _.write.json("foo")
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Aborting task.
> scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Task attempt 
> attempt_201508261426__m_00_0 aborted.
> 15/08/26 14:26:21 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$

[jira] [Created] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-28 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created SPARK-10340:
-

 Summary: Use S3 bulk listing for S3-backed Hive tables
 Key: SPARK-10340
 URL: https://issues.apache.org/jira/browse/SPARK-10340
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1, 1.5.0
Reporter: Cheolsoo Park


AWS S3 provides a bulk listing API. It takes the common prefix of all input 
paths as a parameter and returns all the objects whose keys start with the 
common prefix, in blocks of 1000.

Since SPARK-9926 allows us to list multiple partitions all together, we can 
significantly speed up input split calculation using S3 bulk listing. This 
optimization is particularly useful for queries like {{select * from 
partitioned_table limit 10}}.

This is a common optimization for S3. For example, here is a [blog 
post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] from 
Qubole on this topic.
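A sketch of the bulk-listing call itself with the AWS Java SDK; the bucket and 
prefix are placeholders, and credentials/error handling are omitted:
{code}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest
import scala.collection.JavaConverters._

// One listObjects call returns up to 1000 keys under the common prefix,
// instead of one round trip per partition directory.
val s3 = new AmazonS3Client()
val request = new ListObjectsRequest()
  .withBucketName("my-bucket")
  .withPrefix("warehouse/partitioned_table/")

val keys = scala.collection.mutable.ArrayBuffer.empty[String]
var listing = s3.listObjects(request)
keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
while (listing.isTruncated) {
  listing = s3.listNextBatchOfObjects(listing)
  keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
}
{code}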



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10339:
-
Description: 
I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
When I run the following code, the free memory space in the driver's old gen 
gradually decreases and eventually there is pretty much no free space left in 
the driver's old gen. Finally, all kinds of timeouts happen and the cluster dies.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
{code}

I did a quick test: after deleting the SQL metrics from the project and filter 
operators, my job works fine.

The reason is that for a partitioned table, when we scan it, the actual plan 
looks like
{code}
         other operators
               |
        /------|------\
       /       |       \
  project   project ... project
     |         |         |
  filter    filter  ... filter
     |         |         |
   part1     part2  ... part n
{code}

We create SQL metrics for every filter and project, which causes extremely high 
memory pressure on the driver.

  was:
I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
When I run the following code, the free memory space in the driver's old gen 
gradually decreases and eventually there is no free space left in the driver's 
old gen.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
{code}

I did a quick test: after deleting the SQL metrics from the project and filter 
operators, my job works fine.

The reason is that for a partitioned table, when we scan it, the actual plan 
looks like
{code}
         other operators
               |
        /------|------\
       /       |       \
  project   project ... project
     |         |         |
  filter    filter  ... filter
     |         |         |
   part1     part2  ... part n
{code}

We create SQL metrics for every filter and project, which causes extremely high 
memory pressure on the driver.


> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in the driver's old gen 
> gradually decreases and eventually there is pretty much no free space left in 
> the driver's old gen. Finally, all kinds of timeouts happen and the cluster 
> dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> I did a quick test: after deleting the SQL metrics from the project and filter 
> operators, my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> looks like
> {code}
>          other operators
>                |
>         /------|------\
>        /       |       \
>   project   project ... project
>      |         |         |
>   filter    filter  ... filter
>      |         |         |
>    part1     part2  ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10339:
-
Sprint: Spark 1.5 doc/QA sprint

> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in the driver's old gen 
> gradually decreases and eventually there is no free space left in the 
> driver's old gen.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> I did a quick test: after deleting the SQL metrics from the project and filter 
> operators, my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> looks like
> {code}
>          other operators
>                |
>         /------|------\
>        /       |       \
>   project   project ... project
>      |         |         |
>   filter    filter  ... filter
>      |         |         |
>    part1     part2  ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-10336:

Assignee: Shuo Xiang

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
> Fix For: 1.5.0
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-28 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10339:


 Summary: When scanning a partitioned table having thousands of 
partitions, Driver has a very high memory pressure because of SQL metrics
 Key: SPARK-10339
 URL: https://issues.apache.org/jira/browse/SPARK-10339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Blocker


I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. 
When I run the following code, the free memory in the driver's old gen 
gradually decreases and eventually there is no free space left in the driver's old gen.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ 
=> Unit)
{code}

I did a quick test by deleting the SQL metrics from the project and filter 
operators, and my job worked fine.

The reason is that for a partitioned table, when we scan it, the actual plan 
looks like
{code}
        other operators
              |
              |
       /------|------\
      /       |       \
     /        |        \
project    project ... project
   |          |          |
filter     filter  ... filter
   |          |          |
 part1      part2  ...  part n
{code}

We create SQL metrics for every filter and project, which causes extremely 
high memory pressure on the driver.
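
For a rough sense of the scale, here is a back-of-the-envelope sketch (the 
operator and per-operator metric counts are assumptions for illustration, not 
measurements from this ticket) of how many long-lived driver-side accumulators 
a single scan like the one above would register:

{code}
// Back-of-the-envelope sketch only; the per-operator metric count is an assumption,
// not a measurement from this ticket.
val partitions = 5000          // one Project/Filter branch per partition, as in the plan above
val operatorsPerPartition = 2  // one Project + one Filter
val metricsPerOperator = 2     // e.g. row-count-style metrics per operator
val accumulators = partitions * operatorsPerPartition * metricsPerOperator
println(s"roughly $accumulators long-lived metric accumulators on the driver for one scan")
{code}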



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-10336.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8510
[https://github.com/apache/spark/pull/8510]

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
> Fix For: 1.5.0
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10337) Views are broken

2015-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10337:
-
Description: 
I haven't dug into this yet... but it seems like this should work:

This works:
{code}
SELECT * FROM 100milints
{code}

This seems to work:
{code}
CREATE VIEW testView AS SELECT * FROM 100milints
{code}

This fails:
{code}
SELECT * FROM testView

org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
input columns id; line 1 pos 7
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:126)
at 

[jira] [Created] (SPARK-10338) network/yarn is missing from sbt build

2015-08-28 Thread Nilanjan Raychaudhuri (JIRA)
Nilanjan Raychaudhuri created SPARK-10338:
-

 Summary: network/yarn is missing from sbt build
 Key: SPARK-10338
 URL: https://issues.apache.org/jira/browse/SPARK-10338
 Project: Spark
  Issue Type: Bug
  Components: Build, YARN
Affects Versions: 1.5.0
Reporter: Nilanjan Raychaudhuri


The following command fails with a missing LoggingFactory class compilation 
error. I am running this against the current master.

build/sbt -Pyarn -Phadoop-2.3 assembly

One fix is to add the following to the respective pom.xml file

  
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <scope>provided</scope>
</dependency>

I am not sure whether this compilation error is specific to sbt or whether mvn 
also has this problem.

And for sbt to recognize the network/yarn project we might also have to add the 
module to the parent pom.xml file.

wdyt?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10337) Views are broken

2015-08-28 Thread Ali Ghodsi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720450#comment-14720450
 ] 

Ali Ghodsi commented on SPARK-10337:


[~marmbrus] do you mind updating the first part of the description (the 
code snippet after {{This works:}})? It's blank on my screen. 

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.

[jira] [Assigned] (SPARK-10336) Setting intercept in the example code of logistic regression is not working

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10336:


Assignee: (was: Apache Spark)

> Setting intercept in the example code of logistic regression is not working
> ---
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10336) Setting intercept in the example code of logistic regression is not working

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720447#comment-14720447
 ] 

Apache Spark commented on SPARK-10336:
--

User 'coderxiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8510

> Setting intercept in the example code of logistic regression is not working
> ---
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-28 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-10336:
---
Summary: fitIntercept is a command line option but not set in the LR 
example program.  (was: Setting intercept in the example code of logistic 
regression is not working)

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10336) Setting intercept in the example code of logistic regression is not working

2015-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10336:


Assignee: Apache Spark

> Setting intercept in the example code of logistic regression is not working
> ---
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>Reporter: Shuo Xiang
>Assignee: Apache Spark
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10337) Views are broken

2015-08-28 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-10337:


 Summary: Views are broken
 Key: SPARK-10337
 URL: https://issues.apache.org/jira/browse/SPARK-10337
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Michael Armbrust
Priority: Critical


I haven't dug into this yet... but it seems like this should work:

This works:
{code}
{code}

This seems to work:
{code}
CREATE VIEW testView AS SELECT * FROM 100milints
{code}

This fails:
{code}
SELECT * FROM testView

org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
input columns id; line 1 pos 7
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.Ab

[jira] [Created] (SPARK-10336) Setting intercept in the example code of logistic regression is not working

2015-08-28 Thread Shuo Xiang (JIRA)
Shuo Xiang created SPARK-10336:
--

 Summary: Setting intercept in the example code of logistic 
regression is not working
 Key: SPARK-10336
 URL: https://issues.apache.org/jira/browse/SPARK-10336
 Project: Spark
  Issue Type: Bug
Reporter: Shuo Xiang


The fitIntercept option is parsed from the command line but never applied to the estimator in the example.
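
A minimal sketch of the kind of fix this implies (the {{Params}} holder below is a 
hypothetical stand-in for the example program's parsed arguments; this is not the 
actual PR diff):

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical stand-in for the example program's parsed command-line arguments.
case class Params(regParam: Double = 0.0, fitIntercept: Boolean = true)

val params = Params(regParam = 0.01, fitIntercept = false)

val lr = new LogisticRegression()
  .setRegParam(params.regParam)
  .setFitIntercept(params.fitIntercept) // forward the parsed flag; otherwise the default is silently used
{code}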



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9284) Remove tests' dependency on the assembly

2015-08-28 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-9284.
---
Resolution: Fixed
  Assignee: Marcelo Vanzin

> Remove tests' dependency on the assembly
> 
>
> Key: SPARK-9284
> URL: https://issues.apache.org/jira/browse/SPARK-9284
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> Some tests - in particular tests that have to spawn child processes - 
> currently rely on the generated Spark assembly to run properly.
> This is sub-optimal for a few reasons:
> - Users have to use an unnatural "package everything first, then run tests" 
> approach
> - Sometimes tests are run using old code because the user forgot to rebuild 
> the assembly
> The latter is particularly annoying in {{YarnClusterSuite}}. If you modify 
> some code outside of the {{yarn/}} module, you have to rebuild the whole 
> assembly before that test picks it up.
> We should make all tests run without the need to have an assembly around, 
> making sure that they always pick up the latest code compiled by the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10325) Public Row no longer overrides hashCode / violates hashCode + equals contract

2015-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10325:
-
Assignee: Josh Rosen

> Public Row no longer overrides hashCode / violates hashCode + equals contract
> -
>
> Key: SPARK-10325
> URL: https://issues.apache.org/jira/browse/SPARK-10325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.0
>
>
> It appears that the public {{Row}}'s hashCode is no longer overridden as of 
> Spark 1.5.0:
> {code}
> val x = Row("Hello")
> val y = Row("Hello")
> println(x == y)
> println(x.hashCode)
> println(y.hashCode)
> {code}
> outputs
> {code}
> true
> 1032103993
> 1346393532
> {code}
> This violates the hashCode/equals contract.
> I discovered this because it broke tests in the {{spark-avro}} library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10325) Public Row no longer overrides hashCode / violates hashCode + equals contract

2015-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10325.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8500
[https://github.com/apache/spark/pull/8500]

> Public Row no longer overrides hashCode / violates hashCode + equals contract
> -
>
> Key: SPARK-10325
> URL: https://issues.apache.org/jira/browse/SPARK-10325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Priority: Critical
> Fix For: 1.5.0
>
>
> It appears that the public {{Row}}'s hashCode is no longer overridden as of 
> Spark 1.5.0:
> {code}
> val x = Row("Hello")
> val y = Row("Hello")
> println(x == y)
> println(x.hashCode)
> println(y.hashCode)
> {code}
> outputs
> {code}
> true
> 1032103993
> 1346393532
> {code}
> This violates the hashCode/equals contract.
> I discovered this because it broke tests in the {{spark-avro}} library.
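
For illustration only (this is not the fix that went into pull request 8500), a 
value-based hash that is consistent with the equality shown above can be derived 
from the row's contents:

{code}
import org.apache.spark.sql.Row
import scala.util.hashing.MurmurHash3

// Sketch: derive the hash from the row's values so that equal rows hash equally.
def rowHash(r: Row): Int = MurmurHash3.seqHash(r.toSeq)

val x = Row("Hello")
val y = Row("Hello")
assert(x == y && rowHash(x) == rowHash(y))
{code}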



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10335) GraphX Connected Components fail with large number of iterations

2015-08-28 Thread JJ Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720385#comment-14720385
 ] 

JJ Zhang commented on SPARK-10335:
--

Note that I've changed the return signature of the algorithm to lower the 
amount of data that needs to be checkpointed. You can always join back to the 
graph to get additional info.
I also changed the pregel signature to return the number of iterations. I think 
this may be a good idea for the general pregel API as well, since this is a 
general issue for all iterative graph algorithms.
For a simple test, I used the code below:
{code}
import org.apache.spark.graphx._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
/**
 * @author jj.zhang
 */
object SampleGraph extends java.io.Serializable {
  
  /**
   * Create a graph with some long chains of connection to test 
ConnectedComponents
   */
  def createGraph(sc : SparkContext) : Graph[VertexId, String] = {
val v :RDD[(VertexId, VertexId)] = sc.parallelize(0 until 50, 10).map(r => 
(r.toLong, r.toLong))

val v2 :RDD[(VertexId, VertexId)] = sc.parallelize(1001 until 1020, 
10).map(r => (r.toLong, r.toLong))

val v3 :RDD[(VertexId, VertexId)] = sc.parallelize(2000 until 2200, 
10).map(r => (r.toLong, r.toLong))

val e : RDD[Edge[String]] = sc.parallelize(0 until  49, 10).map(r => 
Edge(r, r + 1, ""))

val e2 : RDD[Edge[String]] = sc.parallelize(1001 until 1019, 10).map(r => 
Edge(r, r + 1, ""))

val e3 : RDD[Edge[String]] = sc.parallelize(2000 until 2199, 10).map(r => 
Edge(r, r + 1, ""))


val g : Graph[VertexId, String] = Graph(v ++ v2 ++ v3, e ++ e2 ++ e3) 
g.partitionBy(PartitionStrategy.EdgePartition2D)
  }
}

val g = SampleGraph.createGraph(sc)
val dir = "hdfs://${myhost}/user/me/run1"
val (g2, count) = RobustConnectedComponents.run(g, 20, dir)
{code}

You should see convergence after 199 iterations, and all the intermediate files 
on your HDFS.

> GraphX Connected Components fail with large number of iterations
> 
>
> Key: SPARK-10335
> URL: https://issues.apache.org/jira/browse/SPARK-10335
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.4.0
> Environment: Tested with Yarn client mode
>Reporter: JJ Zhang
>
> For graphs with long chains of connected vertices, the algorithm fails in 
> practice to converge.
> The driver always runs out of memory prior to final convergence. In my test, 
> for 1.2B vertices/1.2B edges with the longest path requiring more than 90 
> iterations, it invariably fails around iteration 80 with driver memory set at 
> 50G. On top of that, each iteration takes longer than the previous one, and it 
> took an overnight run on a 50-node cluster to reach the failure point.
> This is presumably due to keeping track of the RDD lineage and the DAG 
> computation, so truncating the RDD lineage is the most straightforward solution.
> I tried using checkpoint, but apparently it is not working as expected in 
> version 1.4.0. I will file another ticket for that issue, but hopefully it has 
> been resolved in 1.5.
> The proposed solution is below. It was tested and converged after 99 
> iterations on the same graph mentioned above, in less than an hour.
> {code: title=RobustConnectedComponents.scala|borderStyle=solid}
> import org.apache.spark.Logging
> import scala.reflect.ClassTag
> import org.apache.spark.graphx._
> object RobustConnectedComponents extends Logging with java.io.Serializable {
>   def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], interval: Int = 
> 50, dir : String): (Graph[VertexId, String], Int) = {
> val ccGraph = graph.mapVertices { case (vid, _) => vid }.mapEdges { x 
> =>""}
> def sendMessage(edge: EdgeTriplet[VertexId, String]): Iterator[(VertexId, 
> Long)] = {
>   if (edge.srcAttr < edge.dstAttr) {
> Iterator((edge.dstId, edge.srcAttr))
>   } else if (edge.srcAttr > edge.dstAttr) {
> Iterator((edge.srcId, edge.dstAttr))
>   } else {
> Iterator.empty
>   }
> }
> val initialMessage = Long.MaxValue
>
> 
> var g: Graph[VertexId, String] = ccGraph
> var i = interval
> var count = 0
> while (i == interval) {
>   g = refreshGraph(g, dir, count)
>   g.cache()
>   val (g1, i1) = pregel(g, initialMessage, interval, activeDirection = 
> EdgeDirection.Either)(
> vprog = (id, attr, msg) => math.min(attr, msg),
> sendMsg = sendMessage,
> mergeMsg = (a, b) => math.min(a, b))
>   g.unpersist()
>   g = g1
>   i = i1
>   count = count + i
>   logInfo("Checkpoint reached. iteration so far: " + count)
> }
> logInfo("Final Converge: Total Iteration:" + count)
> (g, count)
>   } // end of connectedComponents
>   
>   def refreshGraph(g

[jira] [Created] (SPARK-10335) GraphX Connected Components fail with large number of iterations

2015-08-28 Thread JJ Zhang (JIRA)
JJ Zhang created SPARK-10335:


 Summary: GraphX Connected Components fail with large number of 
iterations
 Key: SPARK-10335
 URL: https://issues.apache.org/jira/browse/SPARK-10335
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
 Environment: Tested with Yarn client mode
Reporter: JJ Zhang


For graphs with long chains of connected vertices, the algorithm fails in 
practice to converge.

The driver always runs out of memory prior to final convergence. In my test, 
for 1.2B vertices/1.2B edges with the longest path requiring more than 90 
iterations, it invariably fails around iteration 80 with driver memory set at 
50G. On top of that, each iteration takes longer than the previous one, and it 
took an overnight run on a 50-node cluster to reach the failure point.

This is presumably due to keeping track of the RDD lineage and the DAG 
computation, so truncating the RDD lineage is the most straightforward solution.

I tried using checkpoint, but apparently it is not working as expected in version 
1.4.0. I will file another ticket for that issue, but hopefully it has been 
resolved in 1.5.

The proposed solution is below. It was tested and converged after 99 
iterations on the same graph mentioned above, in less than an hour.

{code: title=RobustConnectedComponents.scala|borderStyle=solid}
import org.apache.spark.Logging
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD // needed by refreshGraph below
import scala.reflect.ClassTag

object RobustConnectedComponents extends Logging with java.io.Serializable {

  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], interval: Int = 50, 
dir : String): (Graph[VertexId, String], Int) = {

val ccGraph = graph.mapVertices { case (vid, _) => vid }.mapEdges { x =>""}
def sendMessage(edge: EdgeTriplet[VertexId, String]): Iterator[(VertexId, 
Long)] = {
  if (edge.srcAttr < edge.dstAttr) {
Iterator((edge.dstId, edge.srcAttr))
  } else if (edge.srcAttr > edge.dstAttr) {
Iterator((edge.srcId, edge.dstAttr))
  } else {
Iterator.empty
  }
}
val initialMessage = Long.MaxValue
   

var g: Graph[VertexId, String] = ccGraph
var i = interval
var count = 0
while (i == interval) {
  g = refreshGraph(g, dir, count)
  g.cache()
  val (g1, i1) = pregel(g, initialMessage, interval, activeDirection = 
EdgeDirection.Either)(
vprog = (id, attr, msg) => math.min(attr, msg),
sendMsg = sendMessage,
mergeMsg = (a, b) => math.min(a, b))
  g.unpersist()
  g = g1
  i = i1
  count = count + i
  logInfo("Checkpoint reached. iteration so far: " + count)

}
logInfo("Final Converge: Total Iteration:" + count)
(g, count)
  } // end of connectedComponents


  
  def refreshGraph(g : Graph[VertexId, String], dir:String, count:Int): 
Graph[VertexId, String] = {
val vertFile = dir + "/iter-" + count + "/vertices"
val edgeFile = dir + "/iter-" + count + "/edges"
g.vertices.saveAsObjectFile(vertFile)
g.edges.saveAsObjectFile(edgeFile)

//load back
val v : RDD[(VertexId, VertexId)] = 
g.vertices.sparkContext.objectFile(vertFile)
val e : RDD[Edge[String]]= g.vertices.sparkContext.objectFile(edgeFile)

val newGraph = Graph(v, e)
newGraph
  }

  def pregel[VD: ClassTag, ED: ClassTag, A: ClassTag](graph: Graph[VD, ED],
initialMsg: A,
maxIterations: Int = Int.MaxValue,
activeDirection: EdgeDirection = EdgeDirection.Either)(vprog: (VertexId, 
VD, A) => VD,
  sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
  mergeMsg: (A, A) => A): (Graph[VD, ED], Int) =
{

  var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata,  
initialMsg)).cache()
  // compute the messages
  var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
  var activeMessages = messages.count()
  // Loop
  var prevG: Graph[VD, ED] = null
  var i = 0
  while (activeMessages > 0 && i < maxIterations) {
// Receive the messages and update the vertices.
prevG = g
g = g.joinVertices(messages)(vprog).cache()
val oldMessages = messages
// Send new messages, skipping edges where neither side received a 
message. We must cache
// messages so it can be materialized on the next line, allowing us to 
uncache the previous
// iteration.
messages = g.mapReduceTriplets(
  sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
// The call to count() materializes `messages` and the vertices of `g`. 
This hides oldMessages
// (depended on by the vertices of g) and the vertices of prevG 
(depended on by oldMessages
// and the vertices of g).
activeMessages = messages.count()
// Unpersist the RDDs hidden by newly-materialized RDDs
oldMessages.unpersist(blocking = false)
prevG.unpersistVer

[jira] [Commented] (SPARK-10298) PySpark can't JSON serialize a DataFrame with DecimalType columns.

2015-08-28 Thread Dan McClary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720333#comment-14720333
 ] 

Dan McClary commented on SPARK-10298:
-

I'll take a look at this.

> PySpark can't JSON serialize a DataFrame with DecimalType columns.
> --
>
> Key: SPARK-10298
> URL: https://issues.apache.org/jira/browse/SPARK-10298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Cox
>
> {code}
> In [8]: sc.sql.createDataFrame([[Decimal(123)]], 
> types.StructType([types.StructField("a", types.DecimalType())]))
> Out[8]: DataFrame[a: decimal(10,0)]
> In [9]: _.write.json("foo")
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Aborting task.
> scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/08/26 14:26:21 ERROR DefaultWriterContainer: Task attempt 
> attempt_201508261426__m_00_0 aborted.
> 15/08/26 14:26:21 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spa

[jira] [Created] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-28 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10334:


 Summary: Partitioned table scan's query plan does not show Filter 
and Project on top of the table scan
 Key: SPARK-10334
 URL: https://issues.apache.org/jira/browse/SPARK-10334
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai


{code}
Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
"j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
val df1 = sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
df1.selectExpr("hash(i)", "hash(j)").show
df1.filter("hash(j) = 1").explain
== Physical Plan ==
Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
{code}

It looks like the reason is that we correctly apply the project and filter, then 
we create an RDD for the result and manually create a PhysicalRDD. So the 
Project and Filter on top of the original table scan disappear from the 
physical plan.

See 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175

We will not generate wrong results, but the query plan is confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10334:
-
Target Version/s: 1.6.0, 1.5.1
Priority: Critical  (was: Major)

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> It looks like the reason is that we correctly apply the project and filter, 
> then we create an RDD for the result and manually create a PhysicalRDD. 
> So the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong results, but the query plan is confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-28 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720277#comment-14720277
 ] 

Feynman Liang commented on SPARK-10199:
---

[~vinodkc] would it be possible to get some microbenchmarks? You can surround 
the call to 
[createDataFrame|https://github.com/apache/spark/pull/8507/files#diff-13d1de98ab7ae677f9b345eb90a8b8e8R237]
 with some timing code before and after the change.
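
A minimal timing sketch along those lines (the helper below is illustrative only, 
not code from the PR):

{code}
// Wrap the call under test and report wall-clock time; repeat the measurement a few
// times so JIT warm-up does not dominate the numbers.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// e.g. val df = time("createDataFrame")(sqlContext.createDataFrame(dataRDD, schema))
{code}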

> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10332) spark-submit to yarn doesn't fail if executors is 0

2015-08-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720276#comment-14720276
 ] 

Marcelo Vanzin commented on SPARK-10332:


You can call {{SparkContext::requestExecutors}} even with dynamic allocation 
off, right? Still, I agree that it's a little weird to not at least print a 
nasty warning to the user in that case.

> spark-submit to yarn doesn't fail if executors is 0
> ---
>
> Key: SPARK-10332
> URL: https://issues.apache.org/jira/browse/SPARK-10332
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>
> Running spark-submit on yarn with --num-executors equal to 0 when not 
> using dynamic allocation should error out.  
> In Spark 1.5.0 it continues and ends up hanging.
> yarn.ClientArguments still has the check, so something else must have changed.
> spark-submit   --master yarn --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --num-executors 0 
> spark 1.4.1 errors with:
> java.lang.IllegalArgumentException:
> Number of executors was 0, but must be at least 1
> (or 0 if dynamic executor allocation is enabled).
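
A sketch of the kind of fail-fast check the ticket asks for (the method and 
parameter names are assumptions for illustration, not the actual ClientArguments 
code), mirroring the 1.4.1 error message above:

{code}
// Fail fast instead of hanging when a static allocation of zero executors is requested.
def validateNumExecutors(numExecutors: Int, dynamicAllocationEnabled: Boolean): Unit = {
  require(numExecutors > 0 || dynamicAllocationEnabled,
    s"Number of executors was $numExecutors, but must be at least 1 " +
      "(or 0 if dynamic executor allocation is enabled).")
}
{code}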



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10333) Add user guide for linear-methods.md columns

2015-08-28 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-10333:
-

 Summary: Add user guide for linear-methods.md columns
 Key: SPARK-10333
 URL: https://issues.apache.org/jira/browse/SPARK-10333
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Priority: Minor


Add example code documenting input/output columns, based on feedback on 
https://github.com/apache/spark/pull/8491.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10055) San Francisco Crime Classification

2015-08-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720245#comment-14720245
 ] 

Xiangrui Meng edited comment on SPARK-10055 at 8/28/15 5:05 PM:


Thanks for posting the feedback! ~~For spark-csv, in the master branch there is 
an option to infer schema automatically.~~ Do you mind checking `RFormula`? It 
handles the string indexers and one-hot encoding automatically.


was (Author: mengxr):
Thanks for posting the feedback! For spark-csv, in the master branch there is 
an option to infer schema automatically. Do you mind checking `RFormula`? It 
handles the string indexers and one-hot encoding automatically.

> San Francisco Crime Classification
> --
>
> Key: SPARK-10055
> URL: https://issues.apache.org/jira/browse/SPARK-10055
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>
> Apply ML pipeline API to San Francisco Crime Classification 
> (https://www.kaggle.com/c/sf-crime).
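
For reference, a hedged sketch of the RFormula usage suggested in the comment 
above (the column names are assumptions about the Kaggle dataset, not verified 
here):

{code}
import org.apache.spark.ml.feature.RFormula

// RFormula string-indexes and one-hot encodes categorical terms behind a single formula.
val formula = new RFormula()
  .setFormula("Category ~ DayOfWeek + PdDistrict")
  .setFeaturesCol("features")
  .setLabelCol("label")

// val prepared = formula.fit(df).transform(df)  // df: a DataFrame containing the named columns
{code}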



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10055) San Francisco Crime Classification

2015-08-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720245#comment-14720245
 ] 

Xiangrui Meng edited comment on SPARK-10055 at 8/28/15 5:05 PM:


Thanks for posting the feedback! -For spark-csv, in the master branch there is 
an option to infer schema automatically.- Do you mind checking `RFormula`? It 
handles the string indexers and one-hot encoding automatically.


was (Author: mengxr):
Thanks for posting the feedback! ~~For spark-csv, in the master branch there is 
an option to infer schema automatically.~~ Do you mind checking `RFormula`? It 
handles the string indexers and one-hot encoding automatically.

> San Francisco Crime Classification
> --
>
> Key: SPARK-10055
> URL: https://issues.apache.org/jira/browse/SPARK-10055
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>
> Apply ML pipeline API to San Francisco Crime Classification 
> (https://www.kaggle.com/c/sf-crime).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


