[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820389#comment-15820389 ] Felix Cheung commented on SPARK-18823: -- Yap. I'll start on this shortly. > Assignation by column name variable not available or bug? > - > > Key: SPARK-18823 > URL: https://issues.apache.org/jira/browse/SPARK-18823 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 > Environment: RStudio Server on EC2 instances (AWS EMR service, EMR 4), or Databricks (community.cloud.databricks.com). >Reporter: Vicente Masip > Original Estimate: 24h > Remaining Estimate: 24h > > I really don't know if this is a bug or can be done with some function: > Sometimes it is very important to assign something to a column whose name has > to be accessed through a variable. Outside of SparkR, I have always done this with double > brackets like this: > # df could be the faithful data frame or a data table. > # accessing by variable name: > myname = "waiting" > df[[myname]] <- c(1:nrow(df)) > # or even by column number > df[[2]] <- df$eruptions > The error is not caused by the right side of the "<-" assignment operator. > The problem is that I can't assign to a column by a name held in a variable or > by column number, as I do in these examples outside Spark. It doesn't matter whether I am > modifying or creating the column. Same problem. > I have also tried this, with no results: > df2 <- withColumn(df, "tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
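For comparison, in Spark's Scala API a column name held in an ordinary variable can be passed straight to {{withColumn}}, which is essentially the behaviour the reporter wants from SparkR. A minimal sketch (assuming a local SparkSession; the tiny two-row DataFrame stands in for the faithful dataset):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("colByVar").getOrCreate()
import spark.implicits._

val df = Seq((3.6, 79), (1.8, 54)).toDF("eruptions", "waiting")

// The column name is a plain String variable, so it can be computed at runtime;
// this covers both creating a new column and overwriting an existing one.
val myname = "waiting2"
val df2 = df.withColumn(myname, col("eruptions") * 2)
```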
[jira] [Commented] (SPARK-19179) spark.yarn.access.namenodes description is wrong
[ https://issues.apache.org/jira/browse/SPARK-19179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820371#comment-15820371 ] Saisai Shao commented on SPARK-19179: - Thanks [~tgraves] for pointing out the remaining part; let me handle it. > spark.yarn.access.namenodes description is wrong > > > Key: SPARK-19179 > URL: https://issues.apache.org/jira/browse/SPARK-19179 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.2 >Reporter: Thomas Graves >Priority: Minor > > The description and name of spark.yarn.access.namenodes are off. It > says this is for HDFS NameNodes when really it specifies any Hadoop > filesystems. Spark gets the credentials for those filesystems. > We should at least update the description to be more generic. We could > change the name, but we would have to deprecate it and keep the > current name around, as many people use it.
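As described above, the setting takes a comma-separated list of filesystem URIs rather than only HDFS NameNodes. A hypothetical invocation (the host names are made up for illustration):

```shell
# Request delegation tokens for one HDFS cluster plus an additional
# Hadoop-compatible filesystem (here webhdfs); any URI scheme backed by
# a Hadoop FileSystem implementation should be acceptable.
spark-submit \
  --master yarn \
  --conf spark.yarn.access.namenodes=hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070 \
  --class com.example.MyApp my-app.jar
```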
[jira] [Resolved] (SPARK-12076) countDistinct behaves inconsistently
[ https://issues.apache.org/jira/browse/SPARK-12076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-12076. -- Resolution: Cannot Reproduce I don't think anyone is going to be able to reproduce this. I tried to generate data and test it for a few hours, but I can't reproduce it. I am resolving this as {{Cannot Reproduce}}. Please reopen this if anyone can verify this still exists in the current master. > countDistinct behaves inconsistently > > > Key: SPARK-12076 > URL: https://issues.apache.org/jira/browse/SPARK-12076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Paul Zaczkieiwcz >Priority: Minor > > Assume: > {code:java} > val slicePlayed:DataFrame = _ > val joinKeys:DataFrame = _ > {code} > Also assume that all columns beginning with "cdnt_" are from {{slicePlayed}} > and all columns beginning with "join_" are from {{joinKeys}}. The following > queries can return different values for slice_count_distinct: > {code:java} > slicePlayed.join( > joinKeys, > ( > $"join_session_id" === $"cdnt_session_id" && > $"join_asset_id" === $"cdnt_asset_id" && > $"join_euid" === $"cdnt_euid" > ), > "inner" > ).groupBy( > $"cdnt_session_id".as("slice_played_session_id"), > $"cdnt_asset_id".as("slice_played_asset_id"), > $"cdnt_euid".as("slice_played_euid") > ).agg( > countDistinct($"cdnt_slice_number").as("slice_count_distinct"), > count($"cdnt_slice_number").as("slice_count_total"), > min($"cdnt_slice_number").as("min_slice_number"), > max($"cdnt_slice_number").as("max_slice_number") > ).show(false) > {code} > {code:java} > slicePlayed.join( > joinKeys, > ( > $"join_session_id" === $"cdnt_session_id" && > $"join_asset_id" === $"cdnt_asset_id" && > $"join_euid" === $"cdnt_euid" > ), > "inner" > ).groupBy( > $"cdnt_session_id".as("slice_played_session_id"), > $"cdnt_asset_id".as("slice_played_asset_id"), > $"cdnt_euid".as("slice_played_euid") > ).agg( > 
min($"cdnt_event_time").as("slice_start_time"), > min($"cdnt_playing_owner_id").as("slice_played_playing_owner_id"), > min($"cdnt_user_ip").as("slice_played_user_ip"), > min($"cdnt_user_agent").as("slice_played_user_agent"), > min($"cdnt_referer").as("slice_played_referer"), > max($"cdnt_event_time").as("slice_end_time"), > countDistinct($"cdnt_slice_number").as("slice_count_distinct"), > count($"cdnt_slice_number").as("slice_count_total"), > min($"cdnt_slice_number").as("min_slice_number"), > max($"cdnt_slice_number").as("max_slice_number"), > min($"cdnt_is_live").as("is_live") > ).show(false) > {code} > The +only+ difference between the two queries is that I'm adding more > columns to the {{agg}} method. > I can't reproduce this by manually creating a DataFrame via > {{sc.parallelize}}. The original sources of the DataFrames are parquet > files. > The explain plans for the two queries are slightly different. > {code} > == Physical Plan == > TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], > functions=[(count(cdnt_slice_number#24L),mode=Final,isDistinct=false),(min(cdnt_slice_number#24L),mode=Final,isDistinct=false),(max(cdnt_slice_number#24L),mode=Final,isDistinct=false),(count(cdnt_slice_number#24L),mode=Complete,isDistinct=true)], > > output=[slice_played_session_id#780,slice_played_asset_id#781,slice_played_euid#782,slice_count_distinct#783L,slice_count_total#784L,min_slice_number#785L,max_slice_number#786L]) > > TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L], > > functions=[(count(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false),(min(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false),(max(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false)], > > output=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L,currentCount#795L,min#797L,max#799L]) > > TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L], > > 
functions=[(count(cdnt_slice_number#24L),mode=Partial,isDistinct=false),(min(cdnt_slice_number#24L),mode=Partial,isDistinct=false),(max(cdnt_slice_number#24L),mode=Partial,isDistinct=false)], > > output=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L,currentCount#795L,min#797L,max#799L]) >TungstenProject > [cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L] > SortMergeJoin [cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], > [join_session_id#41,join_asset_id#42,join_euid#43] > TungstenSort [cdnt_session_id#23 ASC,cdnt_asset_id#5 ASC,cdnt_euid#13 > ASC], false, 0 > TungstenExchange > hashpartitioning(cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13) >ConvertToUnsafe >
[jira] [Assigned] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18693: Assignee: Apache Spark > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh >Assignee: Apache Spark > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
[jira] [Assigned] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18693: Assignee: (was: Apache Spark) > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820273#comment-15820273 ] Apache Spark commented on SPARK-18693: -- User 'imatiach-msft' has created a pull request for this issue: https://github.com/apache/spark/pull/16557 > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
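The asymmetry described in the issue can be seen directly in the API: the estimator accepts a weight column, but the evaluator has no counterpart. A sketch of the desired usage ({{setWeightCol}} on the evaluator is the proposed addition tracked here, not an existing method at the time of this issue):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Training already honours the weight column...
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setWeightCol("weight")

// ...but evaluation ignores it, so CrossValidator selects models on
// unweighted metrics. The proposal is an analogous setter:
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  // .setWeightCol("weight")  // proposed API, per this issue
```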
[jira] [Resolved] (SPARK-16848) Check schema validation for user-specified schema in jdbc and table APIs
[ https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-16848. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.2.0 > Check schema validation for user-specified schema in jdbc and table APIs > > > Key: SPARK-16848 > URL: https://issues.apache.org/jira/browse/SPARK-16848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > Currently, > both APIs below: > {code} > spark.read.schema(StructType(Nil)).jdbc(...) > {code} > and > {code} > spark.read.schema(StructType(Nil)).table("usrdb.test") > {code} > ignore the user-specified schema. > It'd make sense to throw an exception rather than silently cause confusion for users.
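A sketch of the behaviour in question (assuming a local session; `usrdb.test` is the hypothetical table from the issue, and the user-specified schema is silently dropped rather than validated):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Before this fix, the empty user-specified schema is silently ignored and
// the table's own schema is used; after the fix, an exception is expected
// instead of this confusing no-op.
val df = spark.read.schema(StructType(Nil)).table("usrdb.test")
```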
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:59 AM: - [~debasish83] Making VF-LBFGS/VF-OWLQN support a generic distributed-vector interface (moving them into breeze?) so that they can run on multiple distributed platforms (not only Spark) will, I think, make optimizing for the Spark platform difficult. When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations need to couple Spark features and the optimizer algorithm closely; an abstract interface over distributed vectors (vector-space operations such as dot, add, scale, plus persist/unpersist operations and so on) does not seem to be enough. Two simple problems show the complexity of a general interface: 1. Look at this VF-OWLQN implementation based on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala OWLQN internally computes the pseudo-gradient for the L1 regularizer; see the function `calculateComponentWithL1`. When computing the pseudo-gradient over an RDD, it also uses an accumulator (Spark-specific) to calculate the adjusted function value. Would the abstract interface then have to contain something like Spark's accumulators? 2. The persist/unpersist/checkpoint problem in Spark. Because of Spark's lazy evaluation, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints taking no effect, and so on). For a look at this complexity, see the VF-LBFGS implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala It uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. The subtlety is that Spark computes lazily: when you persist an RDD, it is not persisted immediately but only when an action is run on it. If the internal code calls `unpersist` too early, an RDD that has not yet been computed and actually persisted (although the persist API was called) has already been unpersisted, and that situation causes recomputation of the whole RDD lineage. This behaviour may be quite different from other distributed platforms, so can a general interface really handle this correctly while staying efficient at the same time? Given the detail problems I list above (only a small sample), in my opinion breeze can provide the following base classes and/or abstract interfaces: * FirstOrderMinimizer * DiffFunction interface * LineSearch implementations (including StrongWolfeLineSearch and BacktrackingLineSearch) * DistributedVector abstract interface *BUT*, the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in Spark MLlib, for better optimization.
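The persist-then-unpersist ordering described in point 2 can be sketched as follows (a hypothetical iteration loop, not code from spark-vlbfgs; the key is to force materialization of the new RDD with an action before unpersisting its predecessor):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of the "persist current step's RDDs, then unpersist previous
// step's RDDs" pattern; `step` stands in for one optimizer iteration.
def iterate(initial: RDD[Double], step: RDD[Double] => RDD[Double], n: Int): RDD[Double] = {
  var prev = initial
  var i = 0
  while (i < n) {
    val next = step(prev).persist(StorageLevel.MEMORY_AND_DISK)
    next.count()      // action: forces computation so the persist actually takes effect
    prev.unpersist()  // only now is it safe to drop the previous step's RDD;
                      // unpersisting earlier would force lineage recomputation
    prev = next
    i += 1
  }
  prev
}
```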
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820180#comment-15820180 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:54 AM: - Given the detail problems I list above (only a small sample), in my opinion breeze can provide the following base classes and/or abstract interfaces: * FirstOrderMinimizer * DiffFunction interface * LineSearch implementations (including StrongWolfeLineSearch and BacktrackingLineSearch) * DistributedVector abstract interface *BUT*, the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in Spark MLlib, for better optimization. was (Author: weichenxu123): As the detail problems I list above(I only list a small part problems), in my opinion, breeze can provide the following base class and/or abstract interface: FirstOrderMinimizerlevel DiffFunction interface LineSearch implementation (including StrongWolfeLinsearch and BacktrackingLinesearch) DistributedVector abstract interface BUT, the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in spark mllib, for better optimization. > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819851#comment-15819851 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:43 AM: - [~debasish83] Can L-BFGS-B be computed in a distributed way, efficiently, when scaled to billions of features? If only the interface supports distributed vectors while the internal computation still uses local vectors and/or local matrices, it won't make much sense... Currently VF-LBFGS can turn the L-BFGS two-loop recursion into a distributed computation, but L-BFGS-B seems much more complex than L-BFGS; can it also be computed in parallel? I looked into the L-BFGS-B code in breeze: the core logic that updates the Hessian and computes the descent direction is very complex, and that part cannot reuse the L-BFGS code. So in what way can L-BFGS-B take advantage of `Vector-free LBFGS`? was (Author: weichenxu123): [~debasish83] Can L-BFGS-B be distributed computed when scaled to billions of features in high efficiency ? If only the interface supporting distributed vector, but internal computation still use local vector and/or local matrix, then it seems won't make much sense... Currently VF-LBFGS can turn LBFGS two loop recursion into distributed computing mode, but the L-BFGS-B seems much more complex then L-BFGS, can it also be computed in parallel ? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). 
> Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Comment Edited] (SPARK-19051) test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-19051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819957#comment-15819957 ] Hyukjin Kwon edited comment on SPARK-19051 at 1/12/17 4:36 AM: --- I just wonder if this JIRA says a flaky test or a constantly failing test. was (Author: hyukjin.kwon): I just if this JIRA says a flaky test or a constantly failing test. > test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in > python/run-tests > --- > > Key: SPARK-19051 > URL: https://issues.apache.org/jira/browse/SPARK-19051 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL, Tests >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04, ppc64le, x86_64 >Reporter: Nirman Narang >Priority: Minor > > Full log here. > {code:title=python/run-tests|borderStyle=solid} > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmp7Ie4FN added as a remote repository with the name: repo-1 > :: loading settings :: url = > jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 428ms :: artifacts dl 5ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/16ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmpo5SXug added as a remote repository with 
the name: repo-1 > :: loading settings :: url = > jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 320ms :: artifacts dl 4ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/7ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > ... > [Stage 23:==> (143 + 5) / > 200] > [Stage 23:==> (188 + 5) / > 200] > > > . > [Stage 88:> (122 + 4) / > 200] > [Stage 88:=>(198 + 2) / > 200] > > > [Stage 90:> (0 + 4) / > 200] > [Stage 90:===> (14 + 4) / > 200] > [Stage 90:===> (28 + 4) / > 200] > [Stage 90:===> (42 + 4) / > 200] > [Stage 90:===> (57 + 4) / > 200] > [Stage
[jira] [Updated] (SPARK-19188) Run spark in scala as script file, note not just REPL
[ https://issues.apache.org/jira/browse/SPARK-19188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihao updated SPARK-19188: --- Description: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-shell is a REPL and does not exit with a proper exit value if any error happens. How to implement such a feature? Thanks was: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-shell is a REPL and does not exit if any error happens. How to implement such a feature? Thanks > Run spark in scala as script file, note not just REPL > - > > Key: SPARK-19188 > URL: https://issues.apache.org/jira/browse/SPARK-19188 > Project: Spark > Issue Type: New Feature >Reporter: wangzhihao > Labels: newbie > > Hi, I'm looking for a feature to run Spark/Scala as a script file. The > current spark-shell is a REPL and does not exit with a proper exit value if > any error happens. > How to implement such a feature? Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19188) Run spark in scala as script file, note not just REPL
[ https://issues.apache.org/jira/browse/SPARK-19188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihao updated SPARK-19188: --- Description: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-shell is a REPL and does not exit if any error happens. How to implement such a feature? Thanks was: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-submit is a REPL and does not exit if any error happens. How to implement such a feature? Thanks > Run spark in scala as script file, note not just REPL > - > > Key: SPARK-19188 > URL: https://issues.apache.org/jira/browse/SPARK-19188 > Project: Spark > Issue Type: New Feature >Reporter: wangzhihao > Labels: newbie > > Hi, I'm looking for a feature to run Spark/Scala as a script file. The > current spark-shell is a REPL and does not exit if any error happens. > How to implement such a feature? Thanks
[jira] [Created] (SPARK-19188) Run spark in scala as script file, note not just REPL
wangzhihao created SPARK-19188: -- Summary: Run spark in scala as script file, note not just REPL Key: SPARK-19188 URL: https://issues.apache.org/jira/browse/SPARK-19188 Project: Spark Issue Type: New Feature Reporter: wangzhihao Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-submit is a REPL and does not exit if any error happens. How to implement such a feature? Thanks
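The behavior requested above — batch execution that fails fast with a meaningful exit status instead of a REPL that swallows errors — can be sketched generically with a plain subprocess wrapper. This is a hedged illustration of the desired semantics only, not a Spark API; in practice the child command would be something like `spark-shell -i job.scala`, and the script itself would still need to call `System.exit` on failure.

```python
import subprocess
import sys

def run_and_propagate(cmd):
    """Run a command and return its exit status so a batch driver can fail fast.

    Hypothetical wrapper illustrating the requested semantics: surface the
    child's non-zero exit status instead of swallowing errors the way an
    interactive REPL does.
    """
    return subprocess.run(cmd).returncode

# A failing child process should surface as a non-zero status.
# (Here a Python one-liner stands in for a Spark job for illustration.)
status = run_and_propagate([sys.executable, "-c", "import sys; sys.exit(3)"])
# A batch driver would then finish with sys.exit(status).
print(status)  # 3
```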
[jira] [Updated] (SPARK-19133) SparkR glm Gamma family results in error
[ https://issues.apache.org/jira/browse/SPARK-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19133: - Affects Version/s: 2.0.0 Target Version/s: 2.0.3, 2.1.1, 2.2.0 (was: 2.2.0) Fix Version/s: 2.1.1 2.0.3 > SparkR glm Gamma family results in error > > > Key: SPARK-19133 > URL: https://issues.apache.org/jira/browse/SPARK-19133 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > > glm(y~1,family=Gamma, data = dy) > 17/01/09 06:10:47 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489) > at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.IllegalArgumentException: glm_e3483764cdf9 parameter > family given invalid value Gamma. > at org.apache.spark.ml.param.Param.validate(params.scala:77) > at org.apache.spark.ml.param.ParamPair.(params.scala:528) > at
[jira] [Created] (SPARK-19187) querying from parquet partitioned table throws FileNotFoundException when some partitions' hdfs locations do not exist
roncenzhao created SPARK-19187: -- Summary: querying from parquet partitioned table throws FileNotFoundException when some partitions' hdfs locations do not exist Key: SPARK-19187 URL: https://issues.apache.org/jira/browse/SPARK-19187 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: roncenzhao Hi, all. When some partitions of a Parquet partitioned table point to HDFS paths that do not exist, querying the table throws FileNotFoundException. The error stack is: ` TaskSetManager: Lost task 522.0 in stage 1.0 (TID 523, sd-hadoop-datanode-50-135.idc.vip.com): java.io.FileNotFoundException: File does not exist: hdfs://bipcluster/bip/external_table/vipdw/dw_log_app_pageinfo_clean_spark_parquet/dt=20161223/hm=1730 at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at 
scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
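The failure mode above — a partition whose directory was dropped out-of-band still being listed and scanned — suggests a generic mitigation: pre-filter partition locations to those that still exist before handing them to a reader. The sketch below illustrates that idea on the local filesystem only; it is not Spark's partition-discovery code, and a real HDFS check would go through the Hadoop FileSystem API instead of `os.path`.

```python
import os
import tempfile

def existing_partitions(paths):
    """Keep only partition directories that still exist.

    Generic workaround sketch (local-filesystem stand-in for an HDFS
    existence check): filtering partition locations up front means a
    dropped directory cannot surface as FileNotFoundException mid-query.
    """
    return [p for p in paths if os.path.isdir(p)]

base = tempfile.mkdtemp()
good = os.path.join(base, "dt=20161223")
os.mkdir(good)
missing = os.path.join(base, "dt=20161224")  # registered but never created

print(existing_partitions([good, missing]))  # only the existing directory
```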
[jira] [Created] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception
Adrian Schulewitz created SPARK-19186: - Summary: Hash symbol in middle of Sybase database table name causes Spark Exception Key: SPARK-19186 URL: https://issues.apache.org/jira/browse/SPARK-19186 Project: Spark Issue Type: Bug Affects Versions: 2.1.0 Reporter: Adrian Schulewitz If I use a table name without a '#' symbol in the middle then no exception occurs, but with one an exception is thrown. According to the Sybase 15 documentation a '#' is a legal character.

val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF"
val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via SQL").setMaster("local[2]")
val sess = SparkSession
  .builder()
  .appName("MUREX DMart Simple SQL Reader")
  .config(conf)
  .getOrCreate()
import sess.implicits._
val df = sess.read
  .format("jdbc")
  .option("url", "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56")
  .option("driver", "net.sourceforge.jtds.jdbc.Driver")
  .option("dbtable", "CTP#ADR_TYPE_DBF")
  .option("UDT_DEALCRD_REP", "mxdmart56")
  .option("user", "INSTAL")
  .option("password", "INSTALL")
  .load()
df.createOrReplaceTempView("trades")
val resultsDF = sess.sql(testSql)
resultsDF.show()

17/01/12 14:30:01 INFO SharedState: Warehouse path is 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'. 
17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM CTP#ADR_TYPE_DBF Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 
'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 17) == SQL == SELECT * FROM CTP#ADR_TYPE_DBF -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at com.anz.murex.hcp.poc.hcp.api.MurexDatamartSqlReader$.main(MurexDatamartSqlReader.scala:94) at com.anz.murex.hcp.poc.hcp.api.MurexDatamartSqlReader.main(MurexDatamartSqlReader.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at
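Note that the parser's expected-token list above includes BACKQUOTED_IDENTIFIER, so one plausible workaround is to backquote the table name inside the SQL string before parsing. The helper below is a hypothetical illustration of that quoting idea (the function name and the bare-identifier pattern are assumptions, and the exact escaping rules should be verified against the SQL dialect in use):

```python
import re

def quote_identifier(name):
    """Backquote an identifier so a parser that accepts BACKQUOTED_IDENTIFIER
    tokens can consume names containing characters such as '#'.

    Hypothetical helper: plain identifiers pass through unchanged; anything
    else is wrapped in backticks, with embedded backticks doubled (a common
    escaping convention -- verify for your dialect).
    """
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name  # plain identifier, no quoting needed
    return "`" + name.replace("`", "``") + "`"

print(quote_identifier("CTP#ADR_TYPE_DBF"))  # `CTP#ADR_TYPE_DBF`
print(quote_identifier("trades"))            # trades
```

So the failing statement would instead be built as `"SELECT * FROM " + quote_identifier("CTP#ADR_TYPE_DBF")`.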
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 3:02 AM: - [~debasish83] I think making VF-LBFGS/VF-OWLQN support a generic distributed vector interface (moving it into breeze?) so that they run on multiple distributed platforms (not only Spark) would make it hard to optimize for the Spark platform. When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations need to couple Spark features closely with the optimizer algorithm; an abstract distributed-vector interface (vector-space operators such as dot, add, and scale, plus persist/unpersist operators and so on) does not seem to be enough. Two simple problems show the complexity of a general interface: 1. Look at this VF-OWLQN implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala OWLQN internally computes the pseudo-gradient for the L1 regularizer; see the function `calculateComponentWithL1`. When computing the pseudo-gradient over an RDD, it also uses an accumulator (a Spark-only feature) to compute the adjusted fnValue, so would the abstract interface have to expose something like Spark's accumulators? 2. The persist/unpersist/checkpoint problem in Spark. Because of Spark's lazy evaluation, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints that take no effect, and so on). To see this complexity, look at the VF-LBFGS implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala It uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. 
The complexity is that Spark always computes lazily: when you persist an RDD, it is not persisted immediately but only once an RDD action runs. If the internal code calls `unpersist` too early, an RDD that has not yet been computed (and so was never actually persisted, although the persist API was called) is unpersisted anyway, and that situation forces recomputation of the whole RDD lineage. This behavior may differ greatly from other distributed platforms, so can a general interface really handle it correctly while staying efficient? was (Author: weichenxu123): [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. 
Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL:
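The persist-then-unpersist hazard discussed in the comment above can be illustrated without Spark at all. The sketch below is a pure-Python stand-in (a hypothetical `Lazy` class, not Spark's API): "persist" only records intent, and nothing is cached until a forcing action runs, so the safe order is to materialize the current step before releasing the previous one.

```python
class Lazy:
    """Minimal stand-in for a lazily evaluated, cacheable dataset."""

    def __init__(self, compute):
        self.compute = compute     # thunk that builds this dataset
        self.cache = None
        self.persisted = False
        self.computations = 0      # how many times we actually recomputed

    def persist(self):             # records intent only, like RDD.persist
        self.persisted = True
        return self

    def unpersist(self):           # drops both the intent and any cached value
        self.persisted = False
        self.cache = None
        return self

    def force(self):               # like running an RDD action
        if self.cache is not None:
            return self.cache
        self.computations += 1
        value = self.compute()
        if self.persisted:
            self.cache = value
        return value

base = Lazy(lambda: list(range(5))).persist()
step = Lazy(lambda: [x * 2 for x in base.force()]).persist()

# Safe order: materialize the current step, THEN release the previous one.
step.force()
base.unpersist()
step.force()
print(base.computations)  # 1 -- the lineage was never recomputed
```

Reversing the two middle lines (unpersisting `base` before forcing `step`) would clear a cache that was never filled, and `step.force()` would have to recompute the whole chain, which is exactly the lineage-recomputation problem described above.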
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 2:55 AM: - [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. 
If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? was (Author: weichenxu123): [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. 
Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? [~sethah] Do you consider this detail problems when you designing the general optimizer interface ? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >
[jira] [Commented] (SPARK-14901) java exception when showing join
[ https://issues.apache.org/jira/browse/SPARK-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819984#comment-15819984 ] Brent Elmer commented on SPARK-14901: - Netezza is an IBM product so there is no place to download it from. I don't know if the problem only occurs when using Netezza or not. I wouldn't have a reproducer any smaller than the snippet of code in the bug report. Brent > java exception when showing join > > > Key: SPARK-14901 > URL: https://issues.apache.org/jira/browse/SPARK-14901 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Brent Elmer > > I am using pyspark with netezza. I am getting a java exception when trying > to show the first row of a join. I can show the first row for of the two > dataframes separately but not the result of a join. I get the same error for > any action I take(first, collect, show). Am I doing something wrong? > from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > dispute_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table1', > driver='com.ibm.spark.netezza').load() > dispute_df.printSchema() > comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > comments_df.printSchema() > dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > root > |-- COMMENTID: string (nullable = true) > |-- EXPORTDATETIME: timestamp (nullable = true) > |-- ARTAGS: string (nullable = true) > |-- POTAGS: string (nullable = true) > |-- INVTAG: string (nullable = true) > |-- ACTIONTAG: string (nullable = true) > |-- DISPUTEFLAG: string (nullable = true) > |-- ACTIONFLAG: string (nullable = true) > |-- CUSTOMFLAG1: string (nullable = true) > |-- CUSTOMFLAG2: string (nullable = true) > root > |-- COUNTRY: 
string (nullable = true) > |-- CUSTOMER: string (nullable = true) > |-- INVNUMBER: string (nullable = true) > |-- INVSEQNUMBER: string (nullable = true) > |-- LEDGERCODE: string (nullable = true) > |-- COMMENTTEXT: string (nullable = true) > |-- COMMENTTIMESTAMP: timestamp (nullable = true) > |-- COMMENTLENGTH: long (nullable = true) > |-- FREEINDEX: long (nullable = true) > |-- COMPLETEDFLAG: long (nullable = true) > |-- ACTIONFLAG: long (nullable = true) > |-- FREETEXT: string (nullable = true) > |-- USERNAME: string (nullable = true) > |-- ACTION: string (nullable = true) > |-- COMMENTID: string (nullable = true) > --- > Py4JJavaError Traceback (most recent call last) > in () > 5 comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > 6 comments_df.printSchema() > > 7 dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in first(self) > 802 Row(age=2, name=u'Alice') > 803 """ > --> 804 return self.head() > 805 > 806 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 790 """ > 791 if n is None: > --> 792 rs = self.head(1) > 793 return rs[0] if rs else None > 794 return self.take(n) > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 792 rs = self.head(1) > 793 return rs[0] if rs else None > --> 794 return self.take(n) > 795 > 796 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in take(self, num) > 304 with SCCallSiteSync(self._sc) as css: > 305 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 306 self._jdf, num) > 307 return list(_load_from_socket(port, > 
BatchedSerializer(PickleSerializer( > 308 > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 811 answer = self.gateway_client.send_command(command) > 812 return_value = get_return_value( > --> 813 answer, self.gateway_client, self.target_id, self.name) > 814 > 815 for temp_arg in temp_args: >
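Independent of the connector failure above, the two schemas share a COMMENTID column, and a join that keeps both copies leaves the name ambiguous. A common remedy is to disambiguate the right side's columns before or during the join. The sketch below shows that idea over plain lists of dict rows (a hypothetical `join_on` helper, not the PySpark or Netezza-connector API; the `r_` prefix is an arbitrary choice):

```python
def join_on(left_rows, right_rows, key, right_prefix="r_"):
    """Inner-join two lists of dict rows on a shared key column, prefixing the
    right side's other columns so duplicated names stay unambiguous.

    Pure-Python sketch of the disambiguation idea only.
    """
    index = {}
    for row in right_rows:                       # index right side by key
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):    # inner join: matches only
            merged = dict(row)
            for col, val in match.items():
                if col != key:                   # keep a single key column
                    merged[right_prefix + col] = val
            joined.append(merged)
    return joined

disputes = [{"COMMENTID": "c1", "DISPUTEFLAG": "Y"}]
comments = [{"COMMENTID": "c1", "COMMENTTEXT": "late payment"}]
print(join_on(disputes, comments, "COMMENTID"))
# [{'COMMENTID': 'c1', 'DISPUTEFLAG': 'Y', 'r_COMMENTTEXT': 'late payment'}]
```

In DataFrame terms the analogous moves are joining on the column name (so only one COMMENTID survives) or renaming one side's column before the join.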
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 2:48 AM: - [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. 
If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? [~sethah] Do you consider this detail problems when you designing the general optimizer interface ? was (Author: weichenxu123): [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. 
The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? [~sethah] Do you consider this detail problems when you designing the general optimizer interface ? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee:
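The persist-ordering hazard described in the comment above can be sketched without Spark at all. The toy `LazyDataset` below is a hypothetical stand-in (not Spark's API): like an RDD, `persist()` only marks the dataset, materialization happens at the first action, and unpersisting before any action silently discards the persist request so every later action recomputes the whole lineage.

```python
class LazyDataset:
    """Toy stand-in for an RDD: lazy, with an optional cache."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn   # recomputes the full "lineage"
        self.persist_requested = False
        self.cache = None
        self.compute_calls = 0         # counts full recomputations

    def persist(self):
        # Like RDD.persist: only *marks* the dataset; nothing is materialized yet.
        self.persist_requested = True
        return self

    def unpersist(self):
        self.persist_requested = False
        self.cache = None
        return self

    def collect(self):
        # An action: materializes (and caches, if still requested) the data.
        if self.cache is not None:
            return self.cache
        self.compute_calls += 1
        data = self.compute_fn()
        if self.persist_requested:
            self.cache = data
        return data


# Correct order: persist -> action -> unpersist. One computation, then cache hits.
ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.persist()
ds.collect()
ds.collect()
assert ds.compute_calls == 1

# Hazardous order: persist -> unpersist (too early) -> actions.
# The persist request is gone before anything was materialized,
# so every action recomputes the whole lineage.
ds2 = LazyDataset(lambda: [x * x for x in range(5)])
ds2.persist()
ds2.unpersist()
ds2.collect()
ds2.collect()
assert ds2.compute_calls == 2
```

This is why the "persist current step's RDDs, then unpersist previous step's RDDs" ordering matters: the previous step must already have been materialized by an action before its persist marker is dropped.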
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu commented on SPARK-10078: [~debasish83] But when we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations require coupling Spark-specific features closely with the optimizer algorithm; an abstract interface that only supports a distributed vector (for example, vector-space operators such as dot, add, and scale, plus persist/unpersist operators and so on) does not seem sufficient. Two simple problems illustrate the complexity of designing a general interface: 1. Look at this VF-OWLQN implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala OWLQN internally computes the pseudo-gradient for the L1 regularizer; see the function `calculateComponentWithL1`. When computing the pseudo-gradient over an RDD, it also uses an accumulator (a Spark-only feature) to calculate the adjusted fnValue, so would the abstract interface then have to expose something like Spark's accumulators? 2. The persist/unpersist/checkpoint problem in Spark. Because of Spark's lazy evaluation, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints taking no effect, and so on). To see this complexity, look at the VF-LBFGS implementation on Spark: it uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. The complexity is that Spark always evaluates lazily: when you persist an RDD, nothing is materialized immediately; persistence is postponed until an RDD action is called. If the internal code calls `unpersist` too early, an RDD that has not yet been computed and persisted can already be unpersisted.
This behavior may differ greatly from other distributed platforms, so can a general interface really handle this problem correctly while remaining efficient? [~sethah] Did you consider these detailed problems when designing the general optimizer interface? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
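The accumulator pattern the comment refers to — a side channel that collects the L1 adjustment to the objective while the gradient itself is reduced across partitions — can be sketched in plain Python. `Accumulator` and `gradient_with_l1_adjustment` below are hypothetical stand-ins for illustration, not the spark-vlbfgs code; the gradient math is a trivial placeholder.

```python
import threading


class Accumulator:
    """Minimal stand-in for a Spark accumulator: add-only, thread-safe."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def add(self, v):
        with self._lock:
            self._value += v

    @property
    def value(self):
        return self._value


def gradient_with_l1_adjustment(partitions, reg):
    """Reduce per-partition 'gradients' while accumulating an L1 penalty.

    `partitions` is a list of lists of weights (a hypothetical layout);
    the accumulator mirrors how an adjusted function value can be
    collected as a side effect of the same pass over the data.
    """
    l1_acc = Accumulator()
    total = 0.0
    for part in partitions:
        grad = sum(part)  # placeholder for the real per-partition gradient
        l1_acc.add(reg * sum(abs(w) for w in part))  # side channel
        total += grad
    return total, l1_acc.value


grad, l1 = gradient_with_l1_adjustment([[1.0, -2.0], [3.0]], reg=0.1)
assert grad == 2.0
assert abs(l1 - 0.6) < 1e-9
```

The design question raised in the comment is exactly whether a platform-neutral optimizer interface could express this side channel, since accumulators are a Spark-specific feature.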
[jira] [Commented] (SPARK-19051) test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-19051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819957#comment-15819957 ] Hyukjin Kwon commented on SPARK-19051: -- I just wonder whether this JIRA describes a flaky test or a constantly failing test. > test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in > python/run-tests > --- > > Key: SPARK-19051 > URL: https://issues.apache.org/jira/browse/SPARK-19051 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL, Tests >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04, ppc64le, x86_64 >Reporter: Nirman Narang >Priority: Minor > > Full log here. > {code:title=python/run-tests|borderStyle=solid} > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmp7Ie4FN added as a remote repository with the name: repo-1 > :: loading settings :: url = > jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 428ms :: artifacts dl 5ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/16ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmpo5SXug added as a remote repository with the name: repo-1 > :: loading settings :: url = > 
jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 320ms :: artifacts dl 4ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/7ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > ... > [Stage 23:==> (143 + 5) / > 200] > [Stage 23:==> (188 + 5) / > 200] > > > . > [Stage 88:> (122 + 4) / > 200] > [Stage 88:=>(198 + 2) / > 200] > > > [Stage 90:> (0 + 4) / > 200] > [Stage 90:===> (14 + 4) / > 200] > [Stage 90:===> (28 + 4) / > 200] > [Stage 90:===> (42 + 4) / > 200] > [Stage 90:===> (57 + 4) / > 200] > [Stage 90:> (73 + 4) / > 200] > [Stage 90:===> (86 + 4) / > 200]
[jira] [Resolved] (SPARK-17923) dateFormat unexpected kwarg to df.write.csv
[ https://issues.apache.org/jira/browse/SPARK-17923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17923. -- Resolution: Duplicate This should be fixed in 2.0.1 and 2.1.0. > dateFormat unexpected kwarg to df.write.csv > --- > > Key: SPARK-17923 > URL: https://issues.apache.org/jira/browse/SPARK-17923 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > Calling like this: > {code}writer.csv(path, header=header, sep=sep, compression=compression, > dateFormat=date_format){code} > Getting the following error: > {code}TypeError: csv() got an unexpected keyword argument 'dateFormat'{code} > This error comes after being called with {code}date_format='-MM-dd'{code} > as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
[ https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19184: Assignee: Apache Spark > Improve numerical stability for method tallSkinnyQR. > > > Key: SPARK-19184 > URL: https://issues.apache.org/jira/browse/SPARK-19184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Huamin Li >Assignee: Apache Spark >Priority: Minor > Labels: None > > In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github > Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). > When the upper triangular matrix R is ill-conditioned, computing the inverse > of R can result in catastrophic cancellation. Instead, we should consider > using a forward solve for solving Q such that Q * R = A. > I first create a 4 by 4 RowMatrix A = > (1,1,1,1;0,1E-5,0,0;0,0,1E-10,1;0,0,0,1E-14), and then I apply method > tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this > case, A is ill-conditioned and so is R. > See codes in Spark Shell: > {code:none} > import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} > import org.apache.spark.mllib.linalg.distributed.RowMatrix > // Create RowMatrix A. > val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), > Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) > val denseMat = new RowMatrix(sc.parallelize(mat, 2)) > // Apply tallSkinnyQR to A. > val result = denseMat.tallSkinnyQR(true) > // Print the calculated Q and R. > result.Q.rows.collect.foreach(println) > result.R > // Calculate Q*R. Ideally, this should be close to A. > val reconstruct = result.Q.multiply(result.R) > reconstruct.rows.collect.foreach(println) > // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
> result.Q.computeGramianMatrix() > System.exit(0) > {code} > it will output the following results: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,1.5416524685312E13] > [0.0,0.,0.0,8011776.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.15416524685312] > [0.0,9.999E-6,0.,1.0008011776] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 1.5416524685312E13 > 0.0 0.9998 0.0 8011775.9 > 0.0 0.0 1.0 0.0 > 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 > {code} > With forward solve for solving Q such that Q * R = A rather than computing > the inverse of R, it will output the following results instead: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,0.0] > [0.0,1.0,0.0,0.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.0] > [0.0,1.0E-5,1.0,1.0] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 0.0 > 0.0 1.0 0.0 0.0 > 0.0 0.0 1.0 0.0 > 0.0 0.0 0.0 1.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
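The fix proposed above — recovering Q from Q * R = A instead of forming inv(R) — amounts to column-wise substitution against the upper-triangular R. A toy pure-Python, list-of-lists sketch (not the actual RowMatrix implementation, which works on distributed rows): since R is upper triangular, column j of A involves only columns 0..j of Q, so Q can be solved column by column.

```python
def solve_q(A, R):
    """Solve Q from Q @ R = A, with R square, upper triangular, nonsingular.

    Column j of A is sum_{k<=j} Q[:,k] * R[k][j], so each column of Q
    follows from the previous ones without ever forming inv(R).
    """
    m = len(A)   # rows of A (and of Q)
    n = len(R)   # columns of A, dimension of R
    Q = [[0.0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            s = A[i][j] - sum(Q[i][k] * R[k][j] for k in range(j))
            Q[i][j] = s / R[j][j]
    return Q


# Tiny check with a known factorization: Q0 @ R == A for Q0 below.
R = [[2.0, 1.0], [0.0, 3.0]]
A = [[2.0, 1.0], [0.0, 3.0], [2.0, 4.0]]
Q = solve_q(A, R)
assert Q == [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

Avoiding the explicit inverse is the standard remedy when R is ill-conditioned: substitution propagates far less rounding error than multiplying by inv(R).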
[jira] [Assigned] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
[ https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19184: Assignee: (was: Apache Spark) > Improve numerical stability for method tallSkinnyQR. > > > Key: SPARK-19184 > URL: https://issues.apache.org/jira/browse/SPARK-19184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Huamin Li >Priority: Minor > Labels: None > > In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github > Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). > When the upper triangular matrix R is ill-conditioned, computing the inverse > of R can result in catastrophic cancellation. Instead, we should consider > using a forward solve for solving Q such that Q * R = A. > I first create a 4 by 4 RowMatrix A = > (1,1,1,1;0,1E-5,0,0;0,0,1E-10,1;0,0,0,1E-14), and then I apply method > tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this > case, A is ill-conditioned and so is R. > See codes in Spark Shell: > {code:none} > import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} > import org.apache.spark.mllib.linalg.distributed.RowMatrix > // Create RowMatrix A. > val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), > Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) > val denseMat = new RowMatrix(sc.parallelize(mat, 2)) > // Apply tallSkinnyQR to A. > val result = denseMat.tallSkinnyQR(true) > // Print the calculated Q and R. > result.Q.rows.collect.foreach(println) > result.R > // Calculate Q*R. Ideally, this should be close to A. > val reconstruct = result.Q.multiply(result.R) > reconstruct.rows.collect.foreach(println) > // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
> result.Q.computeGramianMatrix() > System.exit(0) > {code} > it will output the following results: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,1.5416524685312E13] > [0.0,0.,0.0,8011776.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.15416524685312] > [0.0,9.999E-6,0.,1.0008011776] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 1.5416524685312E13 > 0.0 0.9998 0.0 8011775.9 > 0.0 0.0 1.0 0.0 > 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 > {code} > With forward solve for solving Q such that Q * R = A rather than computing > the inverse of R, it will output the following results instead: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,0.0] > [0.0,1.0,0.0,0.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.0] > [0.0,1.0E-5,1.0,1.0] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 0.0 > 0.0 1.0 0.0 0.0 > 0.0 0.0 1.0 0.0 > 0.0 0.0 0.0 1.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
[ https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819941#comment-15819941 ] Apache Spark commented on SPARK-19184: -- User 'hl475' has created a pull request for this issue: https://github.com/apache/spark/pull/16556 > Improve numerical stability for method tallSkinnyQR. > > > Key: SPARK-19184 > URL: https://issues.apache.org/jira/browse/SPARK-19184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Huamin Li >Priority: Minor > Labels: None > > In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github > Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). > When the upper triangular matrix R is ill-conditioned, computing the inverse > of R can result in catastrophic cancellation. Instead, we should consider > using a forward solve for solving Q such that Q * R = A. > I first create a 4 by 4 RowMatrix A = > (1,1,1,1;0,1E-5,0,0;0,0,1E-10,1;0,0,0,1E-14), and then I apply method > tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this > case, A is ill-conditioned and so is R. > See codes in Spark Shell: > {code:none} > import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} > import org.apache.spark.mllib.linalg.distributed.RowMatrix > // Create RowMatrix A. > val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), > Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) > val denseMat = new RowMatrix(sc.parallelize(mat, 2)) > // Apply tallSkinnyQR to A. > val result = denseMat.tallSkinnyQR(true) > // Print the calculated Q and R. > result.Q.rows.collect.foreach(println) > result.R > // Calculate Q*R. Ideally, this should be close to A. > val reconstruct = result.Q.multiply(result.R) > reconstruct.rows.collect.foreach(println) > // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
> result.Q.computeGramianMatrix() > System.exit(0) > {code} > it will output the following results: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,1.5416524685312E13] > [0.0,0.,0.0,8011776.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.15416524685312] > [0.0,9.999E-6,0.,1.0008011776] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 1.5416524685312E13 > 0.0 0.9998 0.0 8011775.9 > 0.0 0.0 1.0 0.0 > 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 > {code} > With forward solve for solving Q such that Q * R = A rather than computing > the inverse of R, it will output the following results instead: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,0.0] > [0.0,1.0,0.0,0.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.0] > [0.0,1.0E-5,1.0,1.0] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 0.0 > 0.0 1.0 0.0 0.0 > 0.0 0.0 1.0 0.0 > 0.0 0.0 0.0 1.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15407) Floor division
[ https://issues.apache.org/jira/browse/SPARK-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819926#comment-15819926 ] Hyukjin Kwon commented on SPARK-15407: -- Hi [~hvanhovell], I just wonder if we should resolve this as {{Won't Fix}} or {{Not A Problem}}. > Floor division > -- > > Key: SPARK-15407 > URL: https://issues.apache.org/jira/browse/SPARK-15407 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Joao Ferreira > > I'm unable to perform floor division on DataFrame columns: > df.withColumn("d_date", df.d_date // 100) > Will produce: > TypeError: unsupported operand type(s) for //: 'Column' and 'int' > My column is of the int type. Basic operations such as > sum/division/subtraction work fine. > The inner workings of PySpark mention a floor function inside > sql/functions... but the actual code to perform this operation is missing. > Am I missing anything? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
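The semantics the reporter wants are plain Python floor division, which floors toward negative infinity (unlike truncation). A minimal sketch of the intended `d_date // 100` computation on ordinary ints:

```python
import math

d_date = 20160519                        # yyyymmdd-style integer
assert d_date // 100 == 201605           # floor division drops the day
assert d_date // 100 == math.floor(d_date / 100)

# // floors toward -inf; int() truncation would give -3 here.
assert -7 // 2 == -4
```

In PySpark at the time, where `//` was not defined on Column, a workaround along the lines of `pyspark.sql.functions.floor(df.d_date / 100)` would express the same intent (untested here, stated as an assumption).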
[jira] [Resolved] (SPARK-15251) Cannot apply PythonUDF to aggregated column
[ https://issues.apache.org/jira/browse/SPARK-15251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15251. -- Resolution: Cannot Reproduce {code} >>> def timesTwo(x): ... return x * 2 ... >>> sqlContext.udf.register("timesTwo", timesTwo) >>> data = [(1, 'a'), (2, 'b')] >>> rdd = sc.parallelize(data) >>> df = sqlContext.createDataFrame(rdd, ["x", "y"]) >>> >>> df.registerTempTable("my_data") >>> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() +---+ | t| +---+ | 6| +---+ {code} It seems this was fixed at some point, as I can't reproduce it in the current master. It'd be great if anyone could point out the PR so the fix can be backported. > Cannot apply PythonUDF to aggregated column > --- > > Key: SPARK-15251 > URL: https://issues.apache.org/jira/browse/SPARK-15251 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Matthew Livesey > > In scala it is possible to define a UDF and apply it to an aggregated value in > an expression, for example: > {code} > def timesTwo(x: Int): Int = x * 2 > sqlContext.udf.register("timesTwo", timesTwo _) > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > case class Data(x: Int, y: String) > val data = List(Data(1, "a"), Data(2, "b")) > val rdd = sc.parallelize(data) > val df = rdd.toDF > df.registerTempTable("my_data") > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > +---+ > | t| > +---+ > | 6| > +---+ > {code} > Performing the same computation in pyspark: > {code} > def timesTwo(x): > return x * 2 > sqlContext.udf.register("timesTwo", timesTwo) > data = [(1, 'a'), (2, 'b')] > rdd = sc.parallelize(data) > df = sqlContext.createDataFrame(rdd, ["x", "y"]) > df.registerTempTable("my_data") > sqlContext.sql("SELECT 
timesTwo(Sum(x)) t FROM my_data").show() > {code} > Gives the following: > {code} > AnalysisException: u"expression 'pythonUDF' is neither present in the group > by, nor is it an aggregate function. Add to group by or wrap in first() (or > first_value) if you don't care which value you get.;" > {code} > Using a lambda rather than a named function gives the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
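The expected semantics of `SELECT timesTwo(Sum(x))` — aggregate first, then apply the UDF once to the aggregate — can be stated in plain Python, independent of either engine (a sketch of the query's meaning, not of Spark's execution):

```python
def times_two(x):
    # Same logic as the JIRA's timesTwo UDF.
    return x * 2

rows = [(1, 'a'), (2, 'b')]  # the (x, y) data from the report

# SELECT timesTwo(Sum(x)) t FROM my_data:
# Sum(x) aggregates over all rows; the UDF is applied once to the result.
t = times_two(sum(x for x, _ in rows))
assert t == 6  # matches the "| 6|" output shown above
```

The reported AnalysisException arose because older PySpark lifted the Python UDF out of the aggregate expression, effectively inverting this order of evaluation.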
[jira] [Commented] (SPARK-14901) java exception when showing join
[ https://issues.apache.org/jira/browse/SPARK-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819920#comment-15819920 ] Hyukjin Kwon commented on SPARK-14901: -- Would it be possible to provide a self-contained reproducer? I can't reproduce this as below within Spark: {code} dispute_df = spark.range(10).toDF("COMMENTID") comments_df = spark.range(5).toDF("COMMENTID") dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first() {code} Do you mind if I ask where I can download Netezza? > java exception when showing join > > > Key: SPARK-14901 > URL: https://issues.apache.org/jira/browse/SPARK-14901 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Brent Elmer > > I am using pyspark with netezza. I am getting a java exception when trying > to show the first row of a join. I can show the first row of each of the two > dataframes separately but not the result of a join. I get the same error for > any action I take (first, collect, show). Am I doing something wrong? 
> from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > dispute_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table1', > driver='com.ibm.spark.netezza').load() > dispute_df.printSchema() > comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > comments_df.printSchema() > dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > root > |-- COMMENTID: string (nullable = true) > |-- EXPORTDATETIME: timestamp (nullable = true) > |-- ARTAGS: string (nullable = true) > |-- POTAGS: string (nullable = true) > |-- INVTAG: string (nullable = true) > |-- ACTIONTAG: string (nullable = true) > |-- DISPUTEFLAG: string (nullable = true) > |-- ACTIONFLAG: string (nullable = true) > |-- CUSTOMFLAG1: string (nullable = true) > |-- CUSTOMFLAG2: string (nullable = true) > root > |-- COUNTRY: string (nullable = true) > |-- CUSTOMER: string (nullable = true) > |-- INVNUMBER: string (nullable = true) > |-- INVSEQNUMBER: string (nullable = true) > |-- LEDGERCODE: string (nullable = true) > |-- COMMENTTEXT: string (nullable = true) > |-- COMMENTTIMESTAMP: timestamp (nullable = true) > |-- COMMENTLENGTH: long (nullable = true) > |-- FREEINDEX: long (nullable = true) > |-- COMPLETEDFLAG: long (nullable = true) > |-- ACTIONFLAG: long (nullable = true) > |-- FREETEXT: string (nullable = true) > |-- USERNAME: string (nullable = true) > |-- ACTION: string (nullable = true) > |-- COMMENTID: string (nullable = true) > --- > Py4JJavaError Traceback (most recent call last) > in () > 5 comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > 6 comments_df.printSchema() > > 7 
dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in first(self) > 802 Row(age=2, name=u'Alice') > 803 """ > --> 804 return self.head() > 805 > 806 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 790 """ > 791 if n is None: > --> 792 rs = self.head(1) > 793 return rs[0] if rs else None > 794 return self.take(n) > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 792 rs = self.head(1) > 793 return rs[0] if rs else None > --> 794 return self.take(n) > 795 > 796 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in take(self, num) > 304 with SCCallSiteSync(self._sc) as css: > 305 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 306 self._jdf, num) > 307 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 308 > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 811 answer = self.gateway_client.send_command(command) > 812 return_value = get_return_value( > --> 813 answer, self.gateway_client,
[jira] [Created] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
Kalvin Chau created SPARK-19185: --- Summary: ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing Key: SPARK-19185 URL: https://issues.apache.org/jira/browse/SPARK-19185 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.2 Environment: Spark 2.0.2 Spark Streaming Kafka 010 Mesos 0.28.0 - client mode spark.executor.cores 1 spark.mesos.extra.cores 1 Reporter: Kalvin Chau We've been running into ConcurrentModificationExceptions ("KafkaConsumer is not safe for multi-threaded access") with the CachedKafkaConsumer. I've been working through debugging this issue and, after looking through some of the Spark source code, I think this is a bug. Our setup is: Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using Spark-Streaming-Kafka-010; spark.executor.cores 1; spark.mesos.extra.cores 1; batch interval 10s, window interval 180s, and slide interval 30s. We would see the exception when, in one executor, two task worker threads were assigned the same Topic+Partition but a different set of offsets. They would both get the same CachedKafkaConsumer, and whichever task thread went first would seek and poll for all the records, while the second thread would try to seek to its offset but fail because it was unable to acquire the lock.
Time0 E0 Task0 - TopicPartition("abc", 0) X to Y Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z Time1 E0 Task0 - Seeks and starts to poll Time1 E0 Task1 - Attempts to seek, but fails Here are some relevant logs: {code} 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing topic test-topic, partition 2 offsets 4394204414 -> 4394238058 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing topic test-topic, partition 2 offsets 4394238058 -> 4394257712 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 4394204414 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 4394238058 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: Initial fetch for spark-executor-consumer test-topic 2 4394238058 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: Seeking to test-topic-2 4394238058 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting block rdd_199_2 failed due to an exception 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block rdd_199_2 could not be removed as it was not found on disk or in memory 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in task 49.0 in stage 45.0 (TID 3201) java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at
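The failure mode described above — two tasks on one executor sharing a cached consumer whose cache key contains no offset information — can be reproduced outside Spark. The class and function names below are hypothetical stand-ins for CachedKafkaConsumer and KafkaConsumer, not Spark's or Kafka's actual code; only the shape of the bug is preserved:

```python
import threading

class FakeConsumer:
    """Stand-in for KafkaConsumer's single-thread guard: a second thread
    touching the consumer while another holds it raises an error, mimicking
    ConcurrentModificationException."""
    def __init__(self):
        self._owner = None
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            me = threading.get_ident()
            if self._owner not in (None, me):
                raise RuntimeError(
                    "KafkaConsumer is not safe for multi-threaded access")
            self._owner = me

    def release(self):
        with self._lock:
            self._owner = None

# Cache keyed by (group, topic, partition) only -- offsets are not part of
# the key, so two tasks with different offset ranges share one consumer.
_cache = {}

def get_cached_consumer(group, topic, partition):
    key = (group, topic, partition)
    if key not in _cache:
        _cache[key] = FakeConsumer()
    return _cache[key]

# Task0 (offsets X..Y) and Task1 (offsets Y..Z) hit the same cache entry:
task0 = get_cached_consumer("g", "abc", 0)
task1 = get_cached_consumer("g", "abc", 0)
assert task0 is task1

task0.acquire()            # Task0 is mid-seek/poll...
errors = []
def task1_seek():
    try:
        task1.acquire()    # ...so Task1's seek fails
    except RuntimeError as e:
        errors.append(str(e))
t = threading.Thread(target=task1_seek)
t.start()
t.join()
print(errors[0])  # KafkaConsumer is not safe for multi-threaded access
```

Keying the cache (or the task assignment) so that concurrent offset ranges for the same partition never share one consumer is the direction the eventual fix takes; the sketch only shows why the current key collides.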
[jira] [Created] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
Huamin Li created SPARK-19184: - Summary: Improve numerical stability for method tallSkinnyQR. Key: SPARK-19184 URL: https://issues.apache.org/jira/browse/SPARK-19184 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.2.0 Reporter: Huamin Li Priority: Minor In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). When the upper triangular matrix R is ill-conditioned, computing the inverse of R can result in catastrophic cancellation. Instead, we should consider using a forward solve for solving Q such that Q * R = A. I first create a 4 by 4 RowMatrix A = (1,1,1,1; 0,1E-5,1,1; 0,0,1E-10,1; 0,0,0,1E-14), and then I apply method tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this case, A is ill-conditioned and so is R. See the code in the Spark shell: {code:none} import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} import org.apache.spark.mllib.linalg.distributed.RowMatrix // Create RowMatrix A. val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) val denseMat = new RowMatrix(sc.parallelize(mat, 2)) // Apply tallSkinnyQR to A. val result = denseMat.tallSkinnyQR(true) // Print the calculated Q and R. result.Q.rows.collect.foreach(println) result.R // Calculate Q*R. Ideally, this should be close to A. val reconstruct = result.Q.multiply(result.R) reconstruct.rows.collect.foreach(println) // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
result.Q.computeGramianMatrix() System.exit(0) {code} it will output the following results: {code:none} scala> result.Q.rows.collect.foreach(println) [1.0,0.0,0.0,1.5416524685312E13] [0.0,0.,0.0,8011776.0] [0.0,0.0,1.0,0.0] [0.0,0.0,0.0,1.0] scala> result.R 1.0 1.0 1.0 1.0 0.0 1.0E-5 1.0 1.0 0.0 0.0 1.0E-10 1.0 0.0 0.0 0.0 1.0E-14 scala> reconstruct.rows.collect.foreach(println) [1.0,1.0,1.0,1.15416524685312] [0.0,9.999E-6,0.,1.0008011776] [0.0,0.0,1.0E-10,1.0] [0.0,0.0,0.0,1.0E-14] scala> result.Q.computeGramianMatrix() 1.0 0.0 0.0 1.5416524685312E13 0.0 0.9998 0.0 8011775.9 0.0 0.0 1.0 0.0 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 {code} With forward solve for solving Q such that Q * R = A rather than computing the inverse of R, it will output the following results instead: {code:none} scala> result.Q.rows.collect.foreach(println) [1.0,0.0,0.0,0.0] [0.0,1.0,0.0,0.0] [0.0,0.0,1.0,0.0] [0.0,0.0,0.0,1.0] scala> result.R 1.0 1.0 1.0 1.0 0.0 1.0E-5 1.0 1.0 0.0 0.0 1.0E-10 1.0 0.0 0.0 0.0 1.0E-14 scala> reconstruct.rows.collect.foreach(println) [1.0,1.0,1.0,1.0] [0.0,1.0E-5,1.0,1.0] [0.0,0.0,1.0E-10,1.0] [0.0,0.0,0.0,1.0E-14] scala> result.Q.computeGramianMatrix() 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
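The gap between Q = A * inv(R) and a triangular solve can be reproduced locally with NumPy on the same 4x4 example. This is a sketch of the numerics only, not of the distributed RowMatrix code; since this A is already upper triangular, the exact answer is Q = I, R = A:

```python
import numpy as np

# The ill-conditioned example from the report above.
A = np.array([[1.0, 1.0,  1.0,   1.0],
              [0.0, 1e-5, 1.0,   1.0],
              [0.0, 0.0,  1e-10, 1.0],
              [0.0, 0.0,  0.0,   1e-14]])
R = A.copy()

def solve_q(A, R):
    """Forward-solve each row q of Q from q @ R = a (i.e. R^T q = a)
    instead of forming inv(R) explicitly."""
    n = R.shape[1]
    Q = np.zeros_like(A)
    for i, a in enumerate(A):
        for j in range(n):
            Q[i, j] = (a[j] - Q[i, :j] @ R[:j, j]) / R[j, j]
    return Q

Q_inv = A @ np.linalg.inv(R)   # explicit inverse: catastrophic cancellation
Q_fwd = solve_q(A, R)          # triangular solve: stable on this example

err_inv = np.abs(Q_inv - np.eye(4)).max()
err_fwd = np.abs(Q_fwd - np.eye(4)).max()
print(err_fwd, err_inv)  # err_fwd is ~0; err_inv blows up (~1e13 in the
                         # JIRA report; the exact value varies by BLAS)
```

The forward solve never forms the huge entries of inv(R) (which reach ~1e28 here), so the cancellation that produces the 1.5e13-scale garbage in Q'Q never happens.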
[jira] [Commented] (SPARK-13303) Spark fails with pandas import error when pandas is not explicitly imported by user
[ https://issues.apache.org/jira/browse/SPARK-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819894#comment-15819894 ] Hyukjin Kwon commented on SPARK-13303: -- +1 > Spark fails with pandas import error when pandas is not explicitly imported > by user > --- > > Key: SPARK-13303 > URL: https://issues.apache.org/jira/browse/SPARK-13303 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: The python installation used by the driver (edge node) > has pandas installed on it, while the data nodes do not have pandas > installed in the python runtimes used. Pandas is never explicitly imported by > pi.py. >Reporter: Juliet Hougland > > Running `spark-submit pi.py` results in: > File > "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 98, in main > command = pickleSer._read_with_length(infile) > File > "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 164, in _read_with_length > return self.loads(obj) > File > "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 422, in loads > return pickle.loads(obj) > ImportError: No module named pandas.algos > at > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138) > at > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:179) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > This is unexpected, and it is hard for users to unravel why they see this error, > as they themselves have not explicitly done anything with pandas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
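A quick way to see which Python runtimes actually have pandas is to run a small probe on each; on a live cluster one would map it over an RDD (the Spark call is sketched in a comment, since it needs a running SparkContext), while locally it just probes the current interpreter:

```python
import importlib.util
import socket

def probe(_):
    """Probe one Python runtime: is pandas importable here, and on which
    host? Intended to be mapped over cluster partitions."""
    return socket.gethostname(), importlib.util.find_spec("pandas") is not None

# On a live cluster one would run something like (hypothetical usage):
#   sc.parallelize(range(100), 100).map(probe).distinct().collect()
# to list which worker hosts lack pandas. Locally it probes this runtime:
host, has_pandas = probe(0)
print(host, has_pandas)
```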
[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819883#comment-15819883 ] Hyukjin Kwon commented on SPARK-12717: -- It still happens in the current master. > pyspark broadcast fails when using multiple threads > --- > > Key: SPARK-12717 > URL: https://issues.apache.org/jira/browse/SPARK-12717 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: Linux, python 2.6 or python 2.7. >Reporter: Edward Walker >Priority: Critical > > The following multi-threaded program that uses broadcast variables > consistently throws exceptions like: *Exception("Broadcast variable '18' not > loaded!",)* --- even when run with "--master local[10]". > {code:title=bug_spark.py|borderStyle=solid} > try: > > import pyspark > > except: > > pass > > from optparse import OptionParser > > > > def my_option_parser(): > > op = OptionParser() > > op.add_option("--parallelism", dest="parallelism", type="int", > default=20) > return op > > > > def do_process(x, w): > > return x * w.value > > > > def func(name, rdd, conf): > > new_rdd = rdd.map(lambda x : do_process(x, conf)) > > total = new_rdd.reduce(lambda x, y : x + y) > > count = rdd.count() > > print name, 1.0 * total / count > > > > if __name__ == "__main__": > > import threading > > op = my_option_parser() > > options, args = op.parse_args() > > sc = pyspark.SparkContext(appName="Buggy") > > data_rdd = sc.parallelize(range(0,1000), 1) > > confs = [ sc.broadcast(i) for i in xrange(options.parallelism) ] > > threads = [ threading.Thread(target=func, args=["thread_" + str(i), > data_rdd, confs[i]]) for i in xrange(options.parallelism) ] > > for t in threads: > > t.start() > > for t in threads: > > t.join() > {code} > Abridged run output: > {code:title=abridge_run.txt|borderStyle=solid} > % spark-submit --master local[10] bug_spark.py --parallelism 20 > [snip] > 16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 
(TID 9) >
[jira] [Resolved] (SPARK-11428) Schema Merging Broken for Some Queries
[ https://issues.apache.org/jira/browse/SPARK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-11428. -- Resolution: Duplicate I am pretty sure that it is a duplicate of SPARK-11103. Please reopen this if anyone meets the same issue. > Schema Merging Broken for Some Queries > -- > > Key: SPARK-11428 > URL: https://issues.apache.org/jira/browse/SPARK-11428 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.5.1 > Environment: AWS, >Reporter: Brad Willard > Labels: dataframe, parquet, pyspark, schema, sparksql > > I have data being written into parquet format via spark streaming. The data > can change slightly so schema merging is required. I load a dataframe like > this > {code} > urls = [ > "/streaming/parquet/events/key=2015-10-30*", > "/streaming/parquet/events/key=2015-10-29*" > ] > sdf = sql_context.read.option("mergeSchema", "true").parquet(*urls) > sdf.registerTempTable('events') > {code} > If I print the schema you can see the contested column > {code} > sdf.printSchema() > root > |-- _id: string (nullable = true) > ... > |-- d__device_s: string (nullable = true) > |-- d__isActualPageLoad_s: string (nullable = true) > |-- d__landing_s: string (nullable = true) > |-- d__lang_s: string (nullable = true) > |-- d__os_s: string (nullable = true) > |-- d__performance_i: long (nullable = true) > |-- d__product_s: string (nullable = true) > |-- d__refer_s: string (nullable = true) > |-- d__rk_i: long (nullable = true) > |-- d__screen_s: string (nullable = true) > |-- d__submenuName_s: string (nullable = true) > {code} > The column that's in one but not the other file is d__product_s > So I'm able to run this query and it works fine. 
> {code} > sql_context.sql(''' > select > distinct(d__product_s) > from > events > where > n = 'view' > ''').collect() > [Row(d__product_s=u'website'), > Row(d__product_s=u'store'), > Row(d__product_s=None), > Row(d__product_s=u'page')] > {code} > However if I instead use that column in the where clause things break. > {code} > sql_context.sql(''' > select > * > from > events > where > n = 'view' and d__product_s = 'page' > ''').take(1) > --- > Py4JJavaError Traceback (most recent call last) > in () > 6 where > 7 n = 'frontsite_view' and d__product_s = 'page' > > 8 ''').take(1) > /root/spark/python/pyspark/sql/dataframe.pyc in take(self, num) > 303 with SCCallSiteSync(self._sc) as css: > 304 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 305 self._jdf, num) > 306 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 307 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /root/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) > 34 def deco(*a, **kw): > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > 38 s = e.java_exception.toString() > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 15.0 failed 30 times, most recent failure: Lost task 0.29 in stage > 15.0 (TID 6536, 10.X.X.X): java.lang.IllegalArgumentException: Column > [d__product_s] was not found in schema! > at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at >
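The failure pattern — a filter on a merged-schema column pushed down to a parquet file whose own footer schema lacks that column — can be illustrated with a small simulation. The function names here are hypothetical; in the real stack the check lives in parquet's SchemaCompatibilityValidator:

```python
# Each file contributes its own footer schema; the merged (logical) schema
# that printSchema() shows is their union.
file_schemas = [
    {"_id", "n"},                    # older file: no d__product_s column
    {"_id", "n", "d__product_s"},    # newer file
]
merged = set().union(*file_schemas)
assert "d__product_s" in merged      # the merged schema has the column...

def scan(schema, predicate_col):
    """Sketch of per-file predicate validation, loosely modeled on
    SchemaCompatibilityValidator.validateColumn."""
    if predicate_col not in schema:
        raise ValueError(
            "Column [%s] was not found in schema!" % predicate_col)
    return []

# ...but pushing the filter down to every file trips over the older one:
try:
    for s in file_schemas:
        scan(s, "d__product_s")
except ValueError as e:
    print(e)   # Column [d__product_s] was not found in schema!
```

This is why `SELECT DISTINCT` on the column works (no pushdown into the old file's footer check) while a WHERE clause on the same column aborts the job.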
[jira] [Resolved] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
[ https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8128. - Resolution: Duplicate I am pretty sure that it duplicates SPARK-11103. Please reopen this if anyone meets the same problem. > Schema Merging Broken: Dataframe Fails to Recognize Column in Schema > > > Key: SPARK-8128 > URL: https://issues.apache.org/jira/browse/SPARK-8128 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Brad Willard > > I'm loading a folder of about 600 parquet files into one dataframe, so > schema merging is involved. There is a bug in the schema merging: printing > the schema shows all the attributes, but running a query that filters on one > of those attributes errors saying it's not in the schema. The query is > incorrectly going to one of the parquet files that does not have that > attribute. > sdf = sql_context.parquet('/parquet/big_data_folder') > sdf.printSchema() > root > \|-- _id: string (nullable = true) > \|-- addedOn: string (nullable = true) > \|-- attachment: string (nullable = true) > ... > \|-- items: array (nullable = true) > \||-- element: struct (containsNull = true) > \|||-- _id: string (nullable = true) > \|||-- addedOn: string (nullable = true) > \|||-- authorId: string (nullable = true) > \|||-- mediaProcessingState: long (nullable = true) > \|-- mediaProcessingState: long (nullable = true) > \|-- title: string (nullable = true) > \|-- key: string (nullable = true) > sdf.filter(sdf.mediaProcessingState == 3).count() > causes this exception > Py4JJavaError: An error occurred while calling o67.count. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task > 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in > stage 4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: > Column [mediaProcessingState] was not found in schema! > at parquet.Preconditions.checkArgument(Preconditions.java:47) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) > at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) > at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) > at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) > at > parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) > at > parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) > at > parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) > at > parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819851#comment-15819851 ] Weichen Xu commented on SPARK-10078: [~debasish83] Can L-BFGS-B be computed in a distributed way, with high efficiency, when scaled to billions of features? If only the interface supports distributed vectors, while the internal computation still uses local vectors and/or local matrices, then it won't make much sense... Currently VF-LBFGS can turn the L-BFGS two-loop recursion into a distributed computation, but L-BFGS-B seems much more complex than L-BFGS; can it also be computed in parallel? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
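For context, the computation that VF-LBFGS distributes is the standard L-BFGS two-loop recursion. A plain single-machine NumPy sketch (not the VF-LBFGS code, which replaces these dot products with distributed ones) looks like this:

```python
import numpy as np

def two_loop(grad, pairs):
    """Classic L-BFGS two-loop recursion: approximate H^{-1} @ grad from
    curvature pairs (s_k, y_k), oldest pair first."""
    q = grad.astype(float).copy()
    rhos, alphas = [], []
    for s, y in reversed(pairs):           # newest pair first
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    s_n, y_n = pairs[-1]                   # gamma * I as the initial Hessian
    r = (np.dot(s_n, y_n) / np.dot(y_n, y_n)) * q
    for (s, y), rho, alpha in zip(pairs, reversed(rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

# Sanity check on a quadratic with Hessian 2I (so y = 2s): the recursion
# returns grad / 2, i.e. the exact Newton direction.
g = np.array([3.0, -1.0, 2.0])
s = np.array([1.0, 2.0, 0.5])
print(two_loop(g, [(s, 2 * s)]))   # ~ [1.5, -0.5, 1.0]
```

Everything in the recursion is dot products and axpy updates, which is exactly what makes it amenable to the map-reduce formulation in the linked paper; the box constraints of L-BFGS-B add an active-set subproblem on top, which is the part the commenter is questioning.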
[jira] [Commented] (SPARK-19164) Review of UserDefinedFunction._broadcast
[ https://issues.apache.org/jira/browse/SPARK-19164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819820#comment-15819820 ] Reynold Xin commented on SPARK-19164: - Which one should I review? I see that you opened a bunch of WIP PRs. > Review of UserDefinedFunction._broadcast > > > Key: SPARK-19164 > URL: https://issues.apache.org/jira/browse/SPARK-19164 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > It doesn't look like {{UserDefinedFunction._broadcast}} is used at all. If > this is a valid observation, it could be removed along with the corresponding > {{\_\_del\_\_}} method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl"
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819716#comment-15819716 ] Xiaochen Ouyang commented on SPARK-19115: - May I ask whether a later version of Spark supports the following command: create external table if not exists gen_tbl like src_tbl location '/warehouse/data/gen_tbl'? Do you have a plan to support this command in the future? > SparkSQL unsupports the command " create external table if not exist new_tbl > like old_tbl" > --- > > Key: SPARK-19115 > URL: https://issues.apache.org/jira/browse/SPARK-19115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: spark2.0.1 hive1.2.1 >Reporter: Xiaochen Ouyang > > Spark 2.0.1 does not support the command "create external table if not exist > new_tbl like old_tbl". We tried to modify the sqlbase.g4 file, changing > "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier > LIKE source=tableIdentifier > #createTableLike" > to > "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier > LIKE source=tableIdentifier > #createTableLike". > We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar"; > after that, we found we could run the command "create external table if not exist > new_tbl like old_tbl" successfully. Unfortunately, the generated > table's type is MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore > database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
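Until the LIKE grammar supports EXTERNAL, one workaround is to expand the LIKE into explicit DDL built from the source table's schema (e.g. from DESCRIBE old_tbl). This is a string-building sketch with hypothetical column input, not a full DDL generator; partitioning, formats, and SerDe clauses are omitted:

```python
def create_external_like(target, columns, location):
    """Build an explicit CREATE EXTERNAL TABLE statement instead of the
    unsupported CREATE EXTERNAL TABLE ... LIKE form.
    `columns` is a list of (name, type) pairs."""
    cols = ", ".join("%s %s" % (name, typ) for name, typ in columns)
    return ("CREATE EXTERNAL TABLE IF NOT EXISTS %s (%s) LOCATION '%s'"
            % (target, cols, location))

ddl = create_external_like(
    "gen_tbl", [("id", "INT"), ("name", "STRING")], "/warehouse/data/gen_tbl")
print(ddl)
```

The resulting statement would then be passed to `spark.sql(ddl)` (or the Hive CLI); because EXTERNAL appears explicitly, the metastore records the table as EXTERNAL_TABLE rather than MANAGED_TABLE.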
[jira] [Assigned] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19180: Assignee: Apache Spark > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai >Assignee: Apache Spark > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19180: Assignee: (was: Apache Spark) > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819634#comment-15819634 ] Apache Spark commented on SPARK-19180: -- User 'yucai' has created a pull request for this issue: https://github.com/apache/spark/pull/16555 > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
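The bug class is easy to see at the byte level: a short occupies 2 bytes, so the destination base offset for row `rowId` must be `rowId * 2`, not `rowId * 4`. A plain-Python illustration (not the actual Platform.copyMemory code in OffHeapColumnVector):

```python
import struct

def put_shorts(buf, row_id, values, offset_multiplier):
    """Sketch of putShorts: copy little-endian 16-bit values into a flat
    byte buffer starting at row_id * offset_multiplier. A short is 2 bytes,
    so the multiplier must be 2; the reported bug used 4."""
    base = row_id * offset_multiplier
    for i, v in enumerate(values):
        struct.pack_into('<h', buf, base + 2 * i, v)

def get_short(buf, row_id):
    # Readers (correctly) assume a 2-byte stride.
    return struct.unpack_from('<h', buf, row_id * 2)[0]

buf_ok = bytearray(16)
put_shorts(buf_ok, 2, [7, 8], offset_multiplier=2)   # correct
print(get_short(buf_ok, 2), get_short(buf_ok, 3))    # 7 8

buf_bad = bytearray(16)
put_shorts(buf_bad, 2, [7, 8], offset_multiplier=4)  # the bug: 4 * rowId
print(get_short(buf_bad, 2), get_short(buf_bad, 3))  # 0 0 -- the data
                                                     # landed at byte 8,
                                                     # i.e. at row 4
```

With the wrong multiplier the write starts two rows too far into the buffer, so reads at the intended rows see stale zeros while later rows get silently overwritten.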
[jira] [Commented] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-19183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819623#comment-15819623 ] Apache Spark commented on SPARK-19183: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/16554 > Add deleteWithJob hook to internal commit protocol API > -- > > Key: SPARK-19183 > URL: https://issues.apache.org/jira/browse/SPARK-19183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang > > Currently in SQL we implement overwrites by calling fs.delete() directly on > the original data. This is not ideal since the original files end up > deleted even if the job aborts. We should extend the commit protocol to allow > file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-19183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19183: Assignee: (was: Apache Spark) > Add deleteWithJob hook to internal commit protocol API > -- > > Key: SPARK-19183 > URL: https://issues.apache.org/jira/browse/SPARK-19183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang > > Currently in SQL we implement overwrites by calling fs.delete() directly on > the original data. This is not ideal since the original files end up > deleted even if the job aborts. We should extend the commit protocol to allow > file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-19183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19183: Assignee: Apache Spark > Add deleteWithJob hook to internal commit protocol API > -- > > Key: SPARK-19183 > URL: https://issues.apache.org/jira/browse/SPARK-19183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > > Currently in SQL we implement overwrites by calling fs.delete() directly on > the original data. This is not ideal since the original files end up > deleted even if the job aborts. We should extend the commit protocol to allow > file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
Eric Liang created SPARK-19183: -- Summary: Add deleteWithJob hook to internal commit protocol API Key: SPARK-19183 URL: https://issues.apache.org/jira/browse/SPARK-19183 Project: Spark Issue Type: Improvement Components: SQL Reporter: Eric Liang Currently in SQL we implement overwrites by calling fs.delete() directly on the original data. This is not ideal since the original files end up deleted even if the job aborts. We should extend the commit protocol to allow file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
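The proposed hook can be sketched as a protocol that records deletions instead of performing them eagerly, so an abort leaves the original files intact. The names below are hypothetical stand-ins (the real API is Spark's internal FileCommitProtocol), and a dict stands in for the filesystem:

```python
class CommitProtocol:
    """Deletes requested during the job are deferred and only applied on a
    successful commit; abort discards them, leaving the originals in place."""
    def __init__(self, fs):
        self.fs = fs                  # dict path -> bytes, a toy filesystem
        self._pending_deletes = []

    def delete_with_job(self, path):
        self._pending_deletes.append(path)   # record, don't delete yet

    def commit_job(self):
        for p in self._pending_deletes:      # apply only on success
            self.fs.pop(p, None)
        self._pending_deletes.clear()

    def abort_job(self):
        self._pending_deletes.clear()        # originals untouched

# Overwrite that aborts: the original data survives.
fs = {"part-00000": b"old data"}
proto = CommitProtocol(fs)
proto.delete_with_job("part-00000")
proto.abort_job()
assert fs == {"part-00000": b"old data"}

# Overwrite that commits: the original is removed only at commit time.
proto.delete_with_job("part-00000")
proto.commit_job()
assert fs == {}
```

Contrast this with calling `fs.delete()` directly at write time: there, the first assert would already fail, because the original files would be gone even though the job aborted.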
[jira] [Updated] (SPARK-19182) Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-19182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19182: - Summary: Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs (was: Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs) > Optimize the lock in StreamingJobProgressListener to not block UI when > generating Streaming jobs > > > Key: SPARK-19182 > URL: https://issues.apache.org/jira/browse/SPARK-19182 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Shixiong Zhu >Priority: Minor > > When DStreamGraph is generating a job, it will hold a lock and block other > APIs. Because StreamingJobProgressListener (numInactiveReceivers, > streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to > access some information, the UI may hang if generating a job is very slow > (e.g., talking to a slow Kafka cluster to fetch metadata). > It's better to optimize the locks in DStreamGraph and > StreamingJobProgressListener so that the UI is not blocked by job generation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19182) Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-19182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19182: - Description: When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener (numInactiveReceivers, streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to the slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener to make the UI not block by job generation. was: When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to the slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener to make the UI not block by job generation. > Optimize the lock in StreamingJobProgressListener to not block when > generating Streaming jobs > - > > Key: SPARK-19182 > URL: https://issues.apache.org/jira/browse/SPARK-19182 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Shixiong Zhu >Priority: Minor > > When DStreamGraph is generating a job, it will hold a lock and block other > APIs. Because StreamingJobProgressListener (numInactiveReceivers, > streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to > access some information, the UI may hang if generating a job is very slow > (e.g., talking to the slow Kafka cluster to fetch metadata). > It's better to optimize the locks in DStreamGraph and > StreamingJobProgressListener to make the UI not block by job generation. 
[jira] [Created] (SPARK-19182) Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs
Shixiong Zhu created SPARK-19182: Summary: Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs Key: SPARK-19182 URL: https://issues.apache.org/jira/browse/SPARK-19182 Project: Spark Issue Type: Improvement Components: DStreams Reporter: Shixiong Zhu Priority: Minor When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to a slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener so that the UI is not blocked by job generation.
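[Editor's note] The issue describes a classic lock-contention pattern: readers (the UI) take the same lock the slow producer (job generation) holds. A common remedy is to publish an immutable snapshot that readers access lock-free. A minimal Java sketch, assuming a hypothetical `SnapshotListener` class — this is an illustration of the pattern, not Spark's actual fix:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch (not Spark's classes): rather than having UI readers
// block on the lock the job generator holds for a whole batch, the generator
// publishes an immutable snapshot that readers fetch with one volatile read.
public class SnapshotListener {
    // Readers always see the most recently published snapshot.
    private final AtomicReference<List<Integer>> streamIds =
        new AtomicReference<>(Collections.emptyList());

    // Called by the (possibly slow) generator thread once per batch;
    // the expensive work happens BEFORE this cheap publish step.
    public void publish(List<Integer> ids) {
        streamIds.set(List.copyOf(ids)); // immutable defensive copy
    }

    // Called by UI threads; never waits on the generator's lock.
    public List<Integer> currentStreamIds() {
        return streamIds.get();
    }

    public static void main(String[] args) {
        SnapshotListener listener = new SnapshotListener();
        listener.publish(List.of(1, 2, 3));
        System.out.println(listener.currentStreamIds()); // [1, 2, 3]
    }
}
```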
[jira] [Resolved] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-19132. - Resolution: Fixed Fix Version/s: 2.2.0 > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819446#comment-15819446 ] Shivaram Venkataraman commented on SPARK-18823: --- Yeah I think it makes sense to not handle the case where we take a local vector. However adding support for `[` and `[[` to support literals and existing columns would be good. This is the only item remaining from what is summarized as #1 above I think ? > Assignation by column name variable not available or bug? > - > > Key: SPARK-18823 > URL: https://issues.apache.org/jira/browse/SPARK-18823 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 > Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr > 4. Or databricks (community.cloud.databricks.com) . >Reporter: Vicente Masip > Original Estimate: 24h > Remaining Estimate: 24h > > I really don't know if this is a bug or can be done with some function: > Sometimes is very important to assign something to a column which name has to > be access trough a variable. Normally, I have always used it with doble > brackets likes this out of SparkR problems: > # df could be faithful normal data frame or data table. > # accesing by variable name: > myname = "waiting" > df[[myname]] <- c(1:nrow(df)) > # or even column number > df[[2]] <- df$eruptions > The error is not caused by the right side of the "<-" operator of assignment. > The problem is that I can't assign to a column name using a variable or > column number as I do in this examples out of spark. Doesn't matter if I am > modifying or creating column. Same problem. > I have also tried to use this with no results: > val df2 = withColumn(df,"tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819400#comment-15819400 ] yucai commented on SPARK-19180: --- Hi Owen, Thanks a lot for comments, it is using unsafe API for OffHeapColumn, which should have no align. See codes: {code} @Override public void putShorts(int rowId, int count, short value) { long offset = data + 2 * rowId; -for (int i = 0; i < count; ++i, offset += 4) { +for (int i = 0; i < count; ++i, offset += 2) { Platform.putShort(null, offset, value); } } {code} And also, my testing: {code} scala> val column = ColumnVector.allocate(1024, ShortType, MemoryMode.OFF_HEAP) column: org.apache.spark.sql.execution.vectorized.ColumnVector = org.apache.spark.sql.execution.vectorized.OffHeapColumnVector@56fc2cea scala> column.putShorts(0, 4, 8.toShort) scala> column.getShort(1) res5: Short = 18432 scala> scala> val column = ColumnVector.allocate(1024, ShortType, MemoryMode.ON_HEAP) column: org.apache.spark.sql.execution.vectorized.ColumnVector = org.apache.spark.sql.execution.vectorized.OnHeapColumnVector@7fb8d720 scala> column.putShorts(0, 4, 8.toShort) scala> column.getShort(1) res7: Short = 8 {code} > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
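[Editor's note] yucai's point can be checked without Spark at all: a short is 2 bytes, so element i of a packed short array lives at byte offset 2 * i. Writing with a 4-byte stride leaves every odd element untouched. A self-contained demonstration using plain `java.nio.ByteBuffer` (not Spark's `Platform` unsafe API, so the garbage value differs from the 18432 seen above, but the corruption is the same):

```java
import java.nio.ByteBuffer;

// Standalone illustration of the putShorts stride bug: shorts are 2 bytes,
// so a writer that advances 4 bytes per element skips half the slots.
public class ShortStride {
    // Correct layout: element i is at byte offset 2 * i.
    static short readAtIndex(ByteBuffer buf, int i) {
        return buf.getShort(2 * i);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        // Buggy writer: 4-byte stride, like the old "offset += 4" loop.
        for (int i = 0; i < 4; i++) {
            buf.putShort(4 * i, (short) 8);
        }
        // Element 1 (byte offset 2) was never written.
        System.out.println(readAtIndex(buf, 1)); // 0, not 8

        // Fixed writer: 2-byte stride fills every element.
        for (int i = 0; i < 4; i++) {
            buf.putShort(2 * i, (short) 8);
        }
        System.out.println(readAtIndex(buf, 1)); // 8
    }
}
```

This mirrors the OFF_HEAP vs ON_HEAP difference in the repro above: the on-heap path indexes a short[] directly, so only the raw-byte-offset path exhibits the bug.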
[jira] [Resolved] (SPARK-18801) Support resolve a nested view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18801. --- Resolution: Fixed Assignee: Jiang Xingbo > Support resolve a nested view > - > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, that > means: > 1. The new approach should be able to resolve the views that created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1. Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views, note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with CTE/Windows query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19130) SparkR should support setting and adding new column with singular value implicitly
[ https://issues.apache.org/jira/browse/SPARK-19130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-19130. --- Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 2.2.0 2.1.1 Resolved by https://github.com/apache/spark/pull/16510 > SparkR should support setting and adding new column with singular value > implicitly > -- > > Key: SPARK-19130 > URL: https://issues.apache.org/jira/browse/SPARK-19130 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > for parity with framework like dplyr -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19177) SparkR Data Frame operation between columns elements
[ https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819260#comment-15819260 ] Shivaram Venkataraman commented on SPARK-19177: --- Thanks [~masip85] - Can you include a small code snippet that shows the problem ? > SparkR Data Frame operation between columns elements > > > Key: SPARK-19177 > URL: https://issues.apache.org/jira/browse/SPARK-19177 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 >Reporter: Vicente Masip >Priority: Minor > Labels: schema, sparkR, struct > > I have commented this in other thread, but I think it can be important to > clarify that: > What happen when you are working with 50 columns and gapply? Do I rewrite 50 > columns scheme with it's new column from gapply operation? I think there is > no alternative because structFields cannot be appended to structType. Any > suggestions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
[ https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819186#comment-15819186 ] Jose Soltren commented on SPARK-19181: -- SPARK-2208 disabled a similar metric previously. > SparkListenerSuite.local metrics fails when average executorDeserializeTime > is too short. > - > > Key: SPARK-19181 > URL: https://issues.apache.org/jira/browse/SPARK-19181 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren >Priority: Minor > > https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249 > The "local metrics" test asserts that tasks should take more than 1ms on > average to complete, even though a code comment notes that this is a small > test and tasks may finish faster. I've been seeing some "failures" here on > fast systems that finish these tasks quite quickly. > There are a few ways forward here: > 1. Disable this test. > 2. Relax this check. > 3. Implement sub-millisecond granularity for task times throughout Spark. > 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task > reference a partition that implements a custom Externalizable.readExternal, > which always waits 1ms before returning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
Jose Soltren created SPARK-19181: Summary: SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short. Key: SPARK-19181 URL: https://issues.apache.org/jira/browse/SPARK-19181 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.0 Reporter: Jose Soltren Priority: Minor https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249 The "local metrics" test asserts that tasks should take more than 1ms on average to complete, even though a code comment notes that this is a small test and tasks may finish faster. I've been seeing some "failures" here on fast systems that finish these tasks quite quickly. There are a few ways forward here: 1. Disable this test. 2. Relax this check. 3. Implement sub-millisecond granularity for task times throughout Spark. 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task reference a partition that implements a custom Externalizable.readExternal, which always waits 1ms before returning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
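[Editor's note] Option 3 in the list above hinges on clock granularity: `System.currentTimeMillis()` routinely reports 0 elapsed for a sub-millisecond task, while `System.nanoTime()` still resolves it. A small sketch of the measurement difference (the loop is a stand-in for a fast task, not Spark's deserialization path):

```java
// Demonstrates why a millisecond-granularity timer can report 0 for a task
// that a nanosecond timer resolves, which is what trips the assertion in
// the "local metrics" test on fast machines.
public class TaskTiming {
    public static void main(String[] args) {
        long startMs = System.currentTimeMillis();
        long startNs = System.nanoTime();

        long acc = 0;
        for (int i = 0; i < 10_000; i++) {
            acc += i; // stand-in for a tiny task body
        }

        long elapsedMs = System.currentTimeMillis() - startMs;
        long elapsedNs = System.nanoTime() - startNs;

        // On a fast machine elapsedMs is frequently 0 even though elapsedNs
        // is positive; averaging such tasks yields < 1ms and fails the test.
        System.out.println("ms=" + elapsedMs + " ns=" + elapsedNs + " acc=" + acc);
    }
}
```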
[jira] [Assigned] (SPARK-9435) Java UDFs don't work with GROUP BY expressions
[ https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9435: --- Assignee: Apache Spark > Java UDFs don't work with GROUP BY expressions > -- > > Key: SPARK-9435 > URL: https://issues.apache.org/jira/browse/SPARK-9435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: All >Reporter: James Aley >Assignee: Apache Spark > Attachments: IncMain.java, points.txt > > > If you define a UDF in Java, for example by implementing the UDF1 interface, > then try to use that UDF on a column in both the SELECT and GROUP BY clauses > of a query, you'll get an error like this: > {code} > "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)" > org.apache.spark.sql.AnalysisException: expression 'y' is neither present in > the group by, nor is it an aggregate function. Add to group by or wrap in > first() if you don't care which value you get. > {code} > We put together a minimal reproduction in the attached Java file, which makes > use of the data in the text file attached. > I'm guessing there's some kind of issue with the equality implementation, so > Spark can't tell that those two expressions are the same maybe? If you do the > same thing from Scala, it works fine. > Note for context: we ran into this issue while working around SPARK-9338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9435) Java UDFs don't work with GROUP BY expressions
[ https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9435: --- Assignee: (was: Apache Spark) > Java UDFs don't work with GROUP BY expressions > -- > > Key: SPARK-9435 > URL: https://issues.apache.org/jira/browse/SPARK-9435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: All >Reporter: James Aley > Attachments: IncMain.java, points.txt > > > If you define a UDF in Java, for example by implementing the UDF1 interface, > then try to use that UDF on a column in both the SELECT and GROUP BY clauses > of a query, you'll get an error like this: > {code} > "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)" > org.apache.spark.sql.AnalysisException: expression 'y' is neither present in > the group by, nor is it an aggregate function. Add to group by or wrap in > first() if you don't care which value you get. > {code} > We put together a minimal reproduction in the attached Java file, which makes > use of the data in the text file attached. > I'm guessing there's some kind of issue with the equality implementation, so > Spark can't tell that those two expressions are the same maybe? If you do the > same thing from Scala, it works fine. > Note for context: we ran into this issue while working around SPARK-9338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9435) Java UDFs don't work with GROUP BY expressions
[ https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819117#comment-15819117 ] Apache Spark commented on SPARK-9435: - User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16553 > Java UDFs don't work with GROUP BY expressions > -- > > Key: SPARK-9435 > URL: https://issues.apache.org/jira/browse/SPARK-9435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: All >Reporter: James Aley > Attachments: IncMain.java, points.txt > > > If you define a UDF in Java, for example by implementing the UDF1 interface, > then try to use that UDF on a column in both the SELECT and GROUP BY clauses > of a query, you'll get an error like this: > {code} > "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)" > org.apache.spark.sql.AnalysisException: expression 'y' is neither present in > the group by, nor is it an aggregate function. Add to group by or wrap in > first() if you don't care which value you get. > {code} > We put together a minimal reproduction in the attached Java file, which makes > use of the data in the text file attached. > I'm guessing there's some kind of issue with the equality implementation, so > Spark can't tell that those two expressions are the same maybe? If you do the > same thing from Scala, it works fine. > Note for context: we ran into this issue while working around SPARK-9338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
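[Editor's note] The reporter's guess about "the equality implementation" can be illustrated without Spark: an analyzer can only match `inc(y)` in the SELECT clause against `inc(y)` in the GROUP BY clause if the two expression objects compare equal, and objects wrapping a function value fall back to identity equality by default. A hypothetical model of that failure mode (these are not Spark's expression classes):

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch of the suspected root cause: structural equals/hashCode
// lets two occurrences of the same UDF call be matched; wrapping a raw
// function object (which only has identity equality) breaks the match.
public class ExprEquality {
    static final class UdfExpr {
        final String name;    // e.g. "inc"
        final String column;  // e.g. "y"
        UdfExpr(String name, String column) {
            this.name = name;
            this.column = column;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof UdfExpr)) return false;
            UdfExpr e = (UdfExpr) o;
            return name.equals(e.name) && column.equals(e.column);
        }
        @Override public int hashCode() { return Objects.hash(name, column); }
    }

    public static void main(String[] args) {
        UdfExpr inSelect = new UdfExpr("inc", "y");
        UdfExpr inGroupBy = new UdfExpr("inc", "y");
        // Structural equality: the two occurrences are recognized as the same.
        System.out.println(inSelect.equals(inGroupBy)); // true

        // Separately-created lambdas use identity equality, so an expression
        // that compares the wrapped function objects would never match.
        Function<Integer, Integer> f1 = x -> x + 1;
        Function<Integer, Integer> f2 = x -> x + 1;
        System.out.println(f1.equals(f2)); // false
    }
}
```

This would also be consistent with the Scala path working: Scala case-class expressions get structural equality for free.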
[jira] [Commented] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819073#comment-15819073 ] Sean Owen commented on SPARK-19180: --- Are you sure? most stuff is int aligned in the JVM. You might be right but just making sure it is not merely because a short is 2 bytes > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819062#comment-15819062 ] Seth Hendrickson commented on SPARK-17136: -- I'm interested in working on this task including both driving the discussion and submitting an initial PR when it is time. I have the beginnings of a design document constructed [here|https://docs.google.com/document/d/1ynyTwlNw4b6DovG6m8okd3fD2PVZKCEq5rFfsg5Ba1k/edit?usp=sharing], and I'd like to open it up for community feedback and input. We do see requests from time to time for users to use their own optimizers in Spark ML algorithms and we have not supported it in Spark ML. With fairly minimal added code, we can make Spark ML optimizers pluggable which provides a tangible benefit to users. Potentially, we can design an API that has benefits beyond just that, and I'm interested to hear some of the other needs/wants people have. cc [~dbtsai] [~yanboliang] [~WeichenXu123] [~josephkb] [~srowen] > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
yucai created SPARK-19180: - Summary: the offset of short is 4 in OffHeapColumnVector's putShorts Key: SPARK-19180 URL: https://issues.apache.org/jira/browse/SPARK-19180 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: yucai Fix For: 2.2.0 The byte offset used for shorts in OffHeapColumnVector's putShorts is 4 per element; since a short is 2 bytes, it should be 2.
[jira] [Resolved] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts
[ https://issues.apache.org/jira/browse/SPARK-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17568. Resolution: Fixed Assignee: Bryan Cutler Fix Version/s: 2.2.0 > Add spark-submit option for user to override ivy settings used to resolve > packages/artifacts > > > Key: SPARK-17568 > URL: https://issues.apache.org/jira/browse/SPARK-17568 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.2.0 > > > The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven > coordinates to package jars. Currently, the IvySettings are hard-coded with > Maven Central as the last repository in the chain of resolvers. > At IBM, we have heard from several enterprise clients that are frustrated > with lack of control over their local Spark installations. These clients want > to ensure that certain artifacts can be excluded or patched due to security > or license issues. For example, a package may use a vulnerable SSL protocol; > or a package may link against an AGPL library written by a litigious > competitor. > While additional repositories and exclusions can be added on the spark-submit > command line, this falls short of what is needed. With Maven Central always > as a fall-back repository, it is difficult to ensure only approved artifacts > are used and it is often the exclusions that site admins are not aware of > that can cause problems. Also, known exclusions are better handled through a > centralized managed repository rather than as command line arguments. > To resolve these issues, we propose the following change: allow the user to > specify an Ivy Settings XML file to pass in as an optional argument to > {{spark-submit}} (or specify in a config file) to define alternate > repositories used to resolve artifacts instead of the hard-coded defaults. 
> The use case for this would be to define a managed repository (such as Nexus) > in the settings file so that all requests for artifacts go through one > location only. > Example usage: > {noformat} > $SPARK_HOME/bin/spark-submit --conf > spark.ivy.settings=/path/to/ivysettings.xml myapp.jar > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818849#comment-15818849 ] Dan commented on SPARK-18075: - If he is running into the same issue with spark-shell, which is one of the official ways to run a Spark application, then supposedly it is a real bug and fixing that would also fix the issue that occurs when running from an IDE :-) > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with
[jira] [Assigned] (SPARK-19152) DataFrameWriter.saveAsTable should work with hive format with append mode
[ https://issues.apache.org/jira/browse/SPARK-19152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19152: Assignee: Apache Spark > DataFrameWriter.saveAsTable should work with hive format with append mode > - > > Key: SPARK-19152 > URL: https://issues.apache.org/jira/browse/SPARK-19152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19152) DataFrameWriter.saveAsTable should work with hive format with append mode
[ https://issues.apache.org/jira/browse/SPARK-19152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818731#comment-15818731 ] Apache Spark commented on SPARK-19152: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16552 > DataFrameWriter.saveAsTable should work with hive format with append mode > - > > Key: SPARK-19152 > URL: https://issues.apache.org/jira/browse/SPARK-19152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19152) DataFrameWriter.saveAsTable should work with hive format with append mode
[ https://issues.apache.org/jira/browse/SPARK-19152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19152: Assignee: (was: Apache Spark) > DataFrameWriter.saveAsTable should work with hive format with append mode > - > > Key: SPARK-19152 > URL: https://issues.apache.org/jira/browse/SPARK-19152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818720#comment-15818720 ] Sean Owen commented on SPARK-18075: --- Yes, spark-shell is submitted the same way. If you wrote some code that did its work given an existing SparkContext/SparkSession and then invoked it in the shell, it should be fine. I think this was about launching a Spark job by running a class directly as if it were any other program. That also can work, but, may require additional work to accomplish comparable setup. > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at >
[jira] [Commented] (SPARK-19090) Dynamic Resource Allocation not respecting spark.executor.cores
[ https://issues.apache.org/jira/browse/SPARK-19090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818716#comment-15818716 ] nirav patel commented on SPARK-19090: - [~q79969786] As I mentioned in previous comment it does work for me when I set parameter on command line but it doesn't work when I set it via SparkConf in my application class. > Dynamic Resource Allocation not respecting spark.executor.cores > --- > > Key: SPARK-19090 > URL: https://issues.apache.org/jira/browse/SPARK-19090 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2, 1.6.1, 2.0.1 >Reporter: nirav patel > > When enabling dynamic scheduling with yarn I see that all executors are using > only 1 core even if I specify "spark.executor.cores" to 6. If dynamic > scheduling is disabled then each executors will have 6 cores. i.e. it > respects "spark.executor.cores". I have tested this against spark 1.5 . I > think it will be the same behavior with 2.x as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
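For readers following this thread: the reporter observes that {{spark.executor.cores}} takes effect when passed on the command line but is ignored when set through SparkConf in the application, with dynamic allocation enabled. A minimal sketch of the command-line form that reportedly works (the master, class, and jar names here are hypothetical placeholders):

```shell
# Hedged sketch: with dynamic allocation on YARN, pass executor cores on the
# spark-submit command line rather than via SparkConf, per the report above.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.cores=6 \
  --class com.example.MyApp \
  myapp.jar
```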
[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818648#comment-15818648 ] Shuai Lin commented on SPARK-17101: --- Seems this issue has already been resolved by https://github.com/apache/spark/pull/14680 ? cc [~rxin] > Provide consistent format identifiers for TextFileFormat and ParquetFileFormat > -- > > Key: SPARK-17101 > URL: https://issues.apache.org/jira/browse/SPARK-17101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Trivial > > Define the format identifier that is used in {{Optimized Logical Plan}} in > {{explain}} for {{text}} file format. > {code} > scala> spark.read.text("people.csv").cache.explain(extended = true) > ... > == Optimized Logical Plan == > InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) >+- *FileScan text [value#24] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > == Physical Plan == > InMemoryTableScan [value#24] >+- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *FileScan text [value#24] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} > When you {{explain}} csv format you can see {{Format: CSV}}. 
> {code} > scala> spark.read.csv("people.csv").cache.explain(extended = true) > == Parsed Logical Plan == > Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv > == Analyzed Logical Plan == > _c0: string, _c1: string, _c2: string, _c3: string > Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv > == Optimized Logical Plan == > InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, > StorageLevel(disk, memory, deserialized, 1 replicas) >+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: > CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<_c0:string,_c1:string,_c2:string,_c3:string> > == Physical Plan == > InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42] >+- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, > StorageLevel(disk, memory, deserialized, 1 replicas) > +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, > Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<_c0:string,_c1:string,_c2:string,_c3:string> > {code} > The custom format is defined for JSON, too. 
> {code} > scala> spark.read.json("people.csv").cache.explain(extended = true) > == Parsed Logical Plan == > Relation[_corrupt_record#93] json > == Analyzed Logical Plan == > _corrupt_record: string > Relation[_corrupt_record#93] json > == Optimized Logical Plan == > InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) >+- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_corrupt_record:string> > == Physical Plan == > InMemoryTableScan [_corrupt_record#93] >+- InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_corrupt_record:string> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
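The inconsistency above comes down to string rendering: a format that supplies a clean identifier prints as {{Format: CSV}} in {{explain}}, while one that falls back on the default Object representation prints as {{ClassName@hexhash}}. A self-contained sketch of that mechanism — the class names here are illustrative stand-ins, not Spark's actual {{FileFormat}} implementations:

```scala
// Illustrative only: mimics why explain() prints "Format: CSV" for one
// format but "Format: ...TextFileFormat@262e2c8c" for another.
trait FileFormatLike {
  def shortName: String
}

class TextLikeFormat extends FileFormatLike {
  def shortName: String = "text"
  // No toString override: stringifying this instance yields the default
  // "ClassName@hexhash" form, like the plan output quoted above.
}

class CsvLikeFormat extends FileFormatLike {
  def shortName: String = "csv"
  // Overriding toString gives the clean identifier the issue asks for.
  override def toString: String = shortName.toUpperCase
}

object FormatDemo {
  def render(f: FileFormatLike): String = s"Format: $f"

  def main(args: Array[String]): Unit = {
    println(render(new CsvLikeFormat))   // Format: CSV
    println(render(new TextLikeFormat))  // default ClassName@hash rendering
  }
}
```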
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818645#comment-15818645 ] Nick Orka commented on SPARK-18075: --- This is really cool conversation, but how about if I run it in spark-shell. Is it supposed to make same setup as spark-submit does? The cool thing is that this doesn't work in spark-shell as well. > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 > MB >
[jira] [Created] (SPARK-19179) spark.yarn.access.namenodes description is wrong
Thomas Graves created SPARK-19179: - Summary: spark.yarn.access.namenodes description is wrong Key: SPARK-19179 URL: https://issues.apache.org/jira/browse/SPARK-19179 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.2 Reporter: Thomas Graves Priority: Minor The description and name of spark.yarn.access.namenodes is off. It says this is for HDFS namenodes when really this is to specify any Hadoop filesystem. It gets the credentials for those filesystems. We should at least update the description on it to be more generic. We could change the name on it but we would have to deprecate it and keep around the current name as many people use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
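As the description notes, this setting accepts a comma-separated list of Hadoop-compatible filesystem URIs for which Spark should obtain delegation tokens, not just HDFS namenodes. A hedged configuration sketch (hostnames, ports, and names are hypothetical):

```shell
# spark.yarn.access.namenodes lists the secure filesystems Spark fetches
# tokens for -- hdfs://, webhdfs://, wasb://, etc., not only HDFS.
spark-submit \
  --master yarn \
  --conf spark.yarn.access.namenodes=hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070 \
  --class com.example.MyApp \
  myapp.jar
```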
[jira] [Commented] (SPARK-19169) columns changed orc table encounters 'IndexOutOfBoundsException' when reading the old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818635#comment-15818635 ] roncenzhao commented on SPARK-19169: I have the two doubts: 1. In the method `HiveTableScanExec.addColumnMetadataToConf(conf)`, we set the `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` into `hadoopConf`. 2. In the method `HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc)`, we set the table's properties which include `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` into jobConf. I think it's the two points that cause this problem. When I remove this two methods, the sql will run successfully. I don't know why we must set the `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` into jobConf. Thanks~ > columns changed orc table encouter 'IndexOutOfBoundsException' when read the > old schema files > - > > Key: SPARK-19169 > URL: https://issues.apache.org/jira/browse/SPARK-19169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: roncenzhao > > We hava an orc table called orc_test_tbl and hava inserted some data into it. > After that, we change the table schema by droping some columns. > When reading the old schema file, we get the follow exception. 
> ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:245) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ``` -- This message was
[jira] [Resolved] (SPARK-19021) Generalize HDFSCredentialProvider to support non-HDFS security FS
[ https://issues.apache.org/jira/browse/SPARK-19021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-19021. --- Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.2.0 > Generailize HDFSCredentialProvider to support non HDFS security FS > -- > > Key: SPARK-19021 > URL: https://issues.apache.org/jira/browse/SPARK-19021 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > Currently Spark can only get token renewal interval from security HDFS > (hdfs://), if Spark runs with other security file systems like webHDFS > (webhdfs://), wasb (wasb://), ADLS, it will ignore these tokens and not get > token renewal intervals from these tokens. These will make Spark unable to > work with these security clusters. So instead of only checking HDFS token, we > should generalize to support different {{DelegationTokenIdentifier}}. > This is a follow-up work of SPARK-18840. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19169) columns changed orc table encounters 'IndexOutOfBoundsException' when reading the old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818599#comment-15818599 ] Sean Owen commented on SPARK-19169: --- It sounds like you're saying you read the data with the wrong schema on purpose (?) -- how does this relate to what Spark does? Or, why not let the ORC reader get the schema? > columns changed orc table encouter 'IndexOutOfBoundsException' when read the > old schema files > - > > Key: SPARK-19169 > URL: https://issues.apache.org/jira/browse/SPARK-19169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: roncenzhao > > We hava an orc table called orc_test_tbl and hava inserted some data into it. > After that, we change the table schema by droping some columns. > When reading the old schema file, we get the follow exception. > ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:245) > at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ``` -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818594#comment-15818594 ] Sean Owen commented on SPARK-13198: --- I think that's up to you if you're interested in this? it's not clear what the issue is, but it's also not supported usage. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). 
Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818591#comment-15818591 ] Sean Owen commented on SPARK-18075: --- It's possible in many cases already and always has been. Obviously, the unit tests already work that way and are runnable from an IDE. There is no reason you can't, but also, no reason to expect that it all Just Works without doing some of the setup spark-submit may do for you. What if any of that is necessary depends on what one is running. > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at >
[jira] [Commented] (SPARK-19169) ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818552#comment-15818552 ] roncenzhao commented on SPARK-19169: I do not think this is a misuse of ORC. If we do not set the `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` of the table schema into `hadoopConf`, and instead let the ORC reader get the schema info by reading the ORC file, we can run the SQL successfully. > ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading > old schema files > - > > Key: SPARK-19169 > URL: https://issues.apache.org/jira/browse/SPARK-19169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: roncenzhao > > We have an ORC table called orc_test_tbl and have inserted some data into it. > After that, we change the table schema by dropping some columns. > When reading the old schema file, we get the following exception. > ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at 
org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at 
java.lang.Thread.run(Thread.java:745) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
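The stack trace above bottoms out in `java.util.ArrayList.subList` inside `RecordReaderFactory.getSchemaOnRead`: `subList` throws exactly when the requested column count (`toIndex = 65`, derived from the schema placed in `hadoopConf`) exceeds the number of columns the reader actually has, i.e. when the table schema and the file's stored schema disagree. A minimal JVM-level sketch of that failure mode, with hypothetical names (this is not Hive's actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubListDemo {
    // Hypothetical stand-in for RecordReaderFactory.getSchemaOnRead:
    // project the first `tableCols` columns of the schema read from the file.
    static List<String> projectSchema(List<String> fileSchema, int tableCols) {
        // ArrayList.subList throws IndexOutOfBoundsException when
        // toIndex exceeds the list's size.
        return fileSchema.subList(0, tableCols);
    }

    public static void main(String[] args) {
        List<String> fileSchema = new ArrayList<>(Arrays.asList("c1", "c2", "c3"));
        try {
            // Pretend the conf-side schema still claims 65 columns,
            // like "toIndex = 65" in the report above.
            projectSchema(fileSchema, 65);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException: " + e.getMessage());
        }
    }
}
```

Letting the ORC reader derive the schema from the file itself, as the comment suggests, removes the mismatched `toIndex` entirely.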
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local Spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818540#comment-15818540 ] Wenchen Fan commented on SPARK-18075: - Although it's not a bug, I think this could be a very cool feature: running Spark applications from an IDE. We should read the spark-submit script to see how a Spark application is launched and how we can do the same from an IDE. > UDF doesn't work on non-local Spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've marked all Spark dependencies with PROVIDED scope. I use exactly the same > version of Spark in the app as on the Spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see, all Spark dependencies have provided scope > And this is the code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore
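Beyond the classpath question, the repro's `seq()` UDF has a second cluster-mode pitfall worth noting: it closes over a driver-side mutable `Seq` instance, and Spark ships closures to executors via Java serialization, so each executor mutates its own deserialized copy rather than a shared global counter. A plain-JVM sketch of that copy semantics (no Spark required; the class names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CopyDemo {
    // Mirrors the reporter's stateful `Seq` class: a serializable counter.
    static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        int i = 0;
        int next() { return ++i; }
    }

    // Round-trip through Java serialization, the way a closure travels
    // from the Spark driver to an executor.
    static Counter roundTrip(Counter c) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Counter) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Counter driverSide = new Counter();
        driverSide.next(); // driver counter now at 1

        Counter executorCopy = roundTrip(driverSide);
        executorCopy.next();
        executorCopy.next();

        // The copies advance independently: incrementing the "executor" copy
        // never moves the driver-side counter.
        System.out.println(driverSide.i + " vs " + executorCopy.i); // 1 vs 3
    }
}
```

So even once the application jar reaches the executors, a UDF backed by driver-side mutable state will not produce one global sequence across the cluster.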
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local Spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818526#comment-15818526 ] Michael David Pedersen commented on SPARK-18075: I'm encountering this problem too, in the context of a custom RDD rather than UDFs, but similarly running the Spark driver as part of my application. This is actually a web application (to enable notebook-like workflows), so my motivation is not "just" to use a development setup. I'm pretty sure I've ruled out any obvious Spark version mismatches between the driver and the cluster. I would like to investigate this further. Any ideas about what the cause might be or where to start? > UDF doesn't work on non-local Spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've marked all Spark dependencies with PROVIDED scope. I use exactly the same > version of Spark in the app as on the Spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see, all Spark dependencies have provided scope > And this is the code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19
[jira] [Comment Edited] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818498#comment-15818498 ] Dmytro Bielievtsov edited comment on SPARK-13198 at 1/11/17 2:38 PM: - [~srowen] It looks like a growing number of people need this functionality. As someone who knows the codebase, can you give a rough estimate of the amount of work it might take to make Spark guarantee a proper cleanup, equivalent to a JVM shutdown? Or maybe one could hack around this by somehow restarting the corresponding JVM without exiting the current Python interpreter? If this is a reasonable amount of work, I might try to carve out some of our team's time to work on the corresponding pull request. Thanks! was (Author: belevtsoff): [~srowen] It looks like a growing number of people need this functionality. As someone who knows the codebase, can you give a rough estimate of the amount of work it might take to make Spark guarantee a proper cleanup, equivalent to a JVM shutdown? Or maybe one could hack around this by somehow restarting the corresponding JVM without exiting the current Python interpreter? If this is a reasonable amount of work, I might try to carve out some of our team's time to work on the corresponding pull request. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContexts in sequence, the driver eventually > stops working with an "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. 
> Reproduce by running the following code and loading in ~7MB of parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://localhost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in a heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point, as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally, I can see in the Spark web UI that each time a new context is > created, the number on the "SQL" tab increments by one (i.e. the last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818498#comment-15818498 ] Dmytro Bielievtsov commented on SPARK-13198: [~srowen] It looks like a growing number of people need this functionality. As someone who knows the codebase, can you give a rough estimate of the amount of work it might take to make Spark guarantee a proper cleanup, equivalent to a JVM shutdown? Or maybe one could hack around this by somehow restarting the corresponding JVM without exiting the current Python interpreter? If this is a reasonable amount of work, I might try to carve out some of our team's time to work on the corresponding pull request. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContexts in sequence, the driver eventually > stops working with an "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB of parquet data each > time. 
The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://localhost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in a heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point, as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally, I can see in the Spark web UI that each time a new context is > created, the number on the "SQL" tab increments by one (i.e. the last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
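The heap dump described above (one retained "CoarseMesosSchedulerBackend" per stopped context, dominated by `executorDataMap`) has the classic shape of a deregistration leak: `stop()` returns, but some long-lived structure still holds a strong reference, so the GC can never reclaim the context's data. A small stand-alone analogy of that pattern (hypothetical classes, not Spark's actual internals); the practical workaround in the meantime is to reuse one SparkContext rather than running a create/stop loop:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // Stand-in for a process-wide registry that outlives individual contexts.
    static final List<byte[]> LIVE_BACKENDS = new ArrayList<>();

    // Stand-in for a scheduler backend that registers itself on construction.
    static class Backend {
        final byte[] executorDataMap = new byte[1 << 20]; // ~1MB per "context"
        Backend() { LIVE_BACKENDS.add(executorDataMap); }
        void stop() {
            // A correct stop() would also deregister:
            //   LIVE_BACKENDS.remove(executorDataMap);
            // Without that, the buffer stays strongly reachable and is
            // never garbage collected, no matter how many times we "stop".
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) {
            Backend b = new Backend();
            b.stop(); // "stopped", but its 1MB buffer is still referenced
        }
        System.out.println("buffers still live: " + LIVE_BACKENDS.size());
    }
}
```

Twenty loop iterations leave twenty buffers pinned in memory, mirroring how the reporter's driver heap fills up within about 20 context create/stop cycles.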
[jira] [Commented] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818492#comment-15818492 ] Apache Spark commented on SPARK-19132: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/16551 > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Zhenhua Wang > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19132: Assignee: Zhenhua Wang (was: Apache Spark) > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Zhenhua Wang > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19178) convert string of large numbers to int should return null
[ https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19178: Assignee: Wenchen Fan (was: Apache Spark) > convert string of large numbers to int should return null > - > > Key: SPARK-19178 > URL: https://issues.apache.org/jira/browse/SPARK-19178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19132: Assignee: Apache Spark (was: Zhenhua Wang) > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19178) convert string of large numbers to int should return null
[ https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818477#comment-15818477 ] Apache Spark commented on SPARK-19178: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16550 > convert string of large numbers to int should return null > - > > Key: SPARK-19178 > URL: https://issues.apache.org/jira/browse/SPARK-19178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19178) convert string of large numbers to int should return null
[ https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19178: Assignee: Apache Spark (was: Wenchen Fan) > convert string of large numbers to int should return null > - > > Key: SPARK-19178 > URL: https://issues.apache.org/jira/browse/SPARK-19178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19178) convert string of large numbers to int should return null
Wenchen Fan created SPARK-19178: --- Summary: convert string of large numbers to int should return null Key: SPARK-19178 URL: https://issues.apache.org/jira/browse/SPARK-19178 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
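For context on the intended semantics of this ticket: casting a numeric string whose value does not fit in an int should yield null rather than a truncated or wrapped number. A hedged sketch of such a cast, parsing through long first and range-checking against the int bounds (illustrative only, not Spark's actual `Cast` implementation):

```java
public class SafeCast {
    // Sketch of "string -> int, null on overflow or malformed input".
    static Integer castToInt(String s) {
        final long value;
        try {
            value = Long.parseLong(s.trim());
        } catch (NumberFormatException e) {
            return null; // not a parsable integer
        }
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            return null; // out of int range: null instead of a wrapped value
        }
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(castToInt("123456"));     // 123456
        System.out.println(castToInt("2147483648")); // null (Integer.MAX_VALUE + 1)
        System.out.println(castToInt("abc"));        // null
    }
}
```

Strings beyond the long range also fail `Long.parseLong` and fall into the same null branch, so the sketch covers arbitrarily large numeric strings.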
[jira] [Commented] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818366#comment-15818366 ] Apache Spark commented on SPARK-19151: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16549 > DataFrameWriter.saveAsTable should work with hive format with overwrite mode > > > Key: SPARK-19151 > URL: https://issues.apache.org/jira/browse/SPARK-19151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19151: Assignee: Apache Spark > DataFrameWriter.saveAsTable should work with hive format with overwrite mode > > > Key: SPARK-19151 > URL: https://issues.apache.org/jira/browse/SPARK-19151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19151: Assignee: (was: Apache Spark) > DataFrameWriter.saveAsTable should work with hive format with overwrite mode > > > Key: SPARK-19151 > URL: https://issues.apache.org/jira/browse/SPARK-19151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19175) ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading old schema files
[ https://issues.apache.org/jira/browse/SPARK-19175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818344#comment-15818344 ] Sean Owen commented on SPARK-19175: --- No, continue on the JIRA I left open, SPARK-19169. This sounds like a misuse of ORC though. > ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading > old schema files > - > > Key: SPARK-19175 > URL: https://issues.apache.org/jira/browse/SPARK-19175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: spark2.0.2 >Reporter: roncenzhao > > We have an ORC table called orc_test_tbl and have inserted some data into it. > After that, we change the table schema by dropping some columns. > When reading the old schema file, we get the following exception. > But Hive can read it successfully. > ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245) > at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ``` -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19158) ml.R example fails in yarn-cluster mode due to lack of e1071 package
[ https://issues.apache.org/jira/browse/SPARK-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818300#comment-15818300 ] Apache Spark commented on SPARK-19158:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16548

> ml.R example fails in yarn-cluster mode due to lack of e1071 package
> --------------------------------------------------------------------
>
>                 Key: SPARK-19158
>                 URL: https://issues.apache.org/jira/browse/SPARK-19158
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>            Reporter: Yesha Vora
>
> The ml.R example fails on Spark 2 in yarn-cluster mode:
> {code}
> spark-submit --master yarn-cluster examples/src/main/r/ml/ml.R
> {code}
> {code:title=application log}
> 17/01/03 04:35:30 INFO MemoryStore: Block broadcast_88 stored as values in memory (estimated size 6.8 KB, free 407.6 MB)
> 17/01/03 04:35:30 INFO BufferedStreamThread: Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 17/01/03 04:35:30 ERROR Executor: Exception in task 0.0 in stage 65.0 (TID 65)
> org.apache.spark.SparkException: R computation failed with
>  Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> 	at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:50)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> 17/01/03 04:35:30 INFO CoarseGrainedExecutorBackend: Got assigned task 68
> 17/01/03 04:35:30 INFO Executor: Running task 3.0 in stage 65.0 (TID 68)
> 17/01/03 04:35:30 INFO BufferedStreamThread: Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 17/01/03 04:35:30 ERROR Executor: Exception in task 3.0 in stage 65.0 (TID 68)
> org.apache.spark.SparkException: R computation failed with
>  Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
>  Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> 	at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:50)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> 17/01/03 04:35:30 INFO CoarseGrainedExecutorBackend: Got assigned task 70
> {code}
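A workaround sketch for the failure above, assuming the cluster nodes have R installed and can reach CRAN (this is illustrative only, not necessarily the fix taken in the linked pull request): the executors evaluate `requireNamespace("e1071")` when running the R code, so e1071 must be present on every node that hosts an R worker.

```shell
# Hedged workaround sketch (assumption): install e1071 on each node before
# resubmitting the job. Guard for machines without R so the snippet is safe
# to run anywhere.
if command -v Rscript >/dev/null 2>&1; then
  # Install e1071 only if it is not already available.
  Rscript -e 'if (!requireNamespace("e1071", quietly = TRUE)) install.packages("e1071", repos = "https://cloud.r-project.org")'
else
  echo "Rscript not found on this node; install R first"
fi
```

Once the package is available on all nodes, the example can be resubmitted. Note that the `--master yarn-cluster` form used in the report is deprecated in Spark 2.x; the equivalent is `spark-submit --master yarn --deploy-mode cluster examples/src/main/r/ml/ml.R`.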
[jira] [Assigned] (SPARK-19158) ml.R example fails in yarn-cluster mode due to lack of e1071 package
[ https://issues.apache.org/jira/browse/SPARK-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19158:
------------------------------------

    Assignee:     (was: Apache Spark)

> ml.R example fails in yarn-cluster mode due to lack of e1071 package
> --------------------------------------------------------------------
>
>                 Key: SPARK-19158
>                 URL: https://issues.apache.org/jira/browse/SPARK-19158
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>            Reporter: Yesha Vora