[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820389#comment-15820389 ] Felix Cheung commented on SPARK-18823: -- Yap. I'll start on this shortly. > Assignation by column name variable not available or bug? > - > > Key: SPARK-18823 > URL: https://issues.apache.org/jira/browse/SPARK-18823 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 > Environment: RStudio Server on EC2 instances (AWS EMR service, EMR 4), or Databricks (community.cloud.databricks.com). >Reporter: Vicente Masip > Original Estimate: 24h > Remaining Estimate: 24h > > I really don't know if this is a bug or can be done with some function: > Sometimes it is very important to assign something to a column whose name has > to be accessed through a variable. Outside of SparkR, I have always done this with double > brackets like this: > # df could be the faithful data frame or a data table. > # accessing by variable name: > myname = "waiting" > df[[myname]] <- c(1:nrow(df)) > # or even by column number > df[[2]] <- df$eruptions > The error is not caused by the right side of the "<-" assignment operator. > The problem is that I can't assign to a column by a name held in a variable or > by column number, as I do in these examples outside Spark. It doesn't matter whether I am > modifying or creating the column. Same problem. > I have also tried this, with no results: > df2 <- withColumn(df, "tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
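For comparison, in Spark's Scala API a column name held in an ordinary variable can be passed straight to {{withColumn}}, which is essentially the behaviour the reporter wants from SparkR. A minimal sketch (assuming a local SparkSession; the tiny two-row DataFrame stands in for the faithful dataset):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("colByVar").getOrCreate()
import spark.implicits._

val df = Seq((3.6, 79), (1.8, 54)).toDF("eruptions", "waiting")

// The column name is a plain String variable, so it can be computed at runtime;
// this covers both creating a new column and overwriting an existing one.
val myname = "waiting2"
val df2 = df.withColumn(myname, col("eruptions") * 2)
```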
[jira] [Commented] (SPARK-19179) spark.yarn.access.namenodes description is wrong
[ https://issues.apache.org/jira/browse/SPARK-19179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820371#comment-15820371 ] Saisai Shao commented on SPARK-19179: - Thanks [~tgraves] for pointing out the remaining part; let me handle it. > spark.yarn.access.namenodes description is wrong > > > Key: SPARK-19179 > URL: https://issues.apache.org/jira/browse/SPARK-19179 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.2 >Reporter: Thomas Graves >Priority: Minor > > The description and name of spark.yarn.access.namenodes are off. It > says this is for HDFS NameNodes when really it specifies any Hadoop > filesystems. Spark gets the credentials for those filesystems. > We should at least update the description to be more generic. We could > change the name, but we would have to deprecate it and keep the > current name around, as many people use it.
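As described above, the setting takes a comma-separated list of filesystem URIs rather than only HDFS NameNodes. A hypothetical invocation (the host names are made up for illustration):

```shell
# Request delegation tokens for one HDFS cluster plus an additional
# Hadoop-compatible filesystem (here webhdfs); any URI scheme backed by
# a Hadoop FileSystem implementation should be acceptable.
spark-submit \
  --master yarn \
  --conf spark.yarn.access.namenodes=hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070 \
  --class com.example.MyApp my-app.jar
```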
[jira] [Resolved] (SPARK-12076) countDistinct behaves inconsistently
[ https://issues.apache.org/jira/browse/SPARK-12076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-12076. -- Resolution: Cannot Reproduce I don't think anyone is going to be able to reproduce this. I tried to generate data and test it for a few hours, but I can't reproduce it. I am resolving this as {{Cannot Reproduce}}. Please reopen this if anyone can verify this still exists in the current master. > countDistinct behaves inconsistently > > > Key: SPARK-12076 > URL: https://issues.apache.org/jira/browse/SPARK-12076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Paul Zaczkieiwcz >Priority: Minor > > Assume: > {code:java} > val slicePlayed:DataFrame = _ > val joinKeys:DataFrame = _ > {code} > Also assume that all columns beginning with "cdnt_" are from {{slicePlayed}} > and all columns beginning with "join_" are from {{joinKeys}}. The following > queries can return different values for slice_count_distinct: > {code:java} > slicePlayed.join( > joinKeys, > ( > $"join_session_id" === $"cdnt_session_id" && > $"join_asset_id" === $"cdnt_asset_id" && > $"join_euid" === $"cdnt_euid" > ), > "inner" > ).groupBy( > $"cdnt_session_id".as("slice_played_session_id"), > $"cdnt_asset_id".as("slice_played_asset_id"), > $"cdnt_euid".as("slice_played_euid") > ).agg( > countDistinct($"cdnt_slice_number").as("slice_count_distinct"), > count($"cdnt_slice_number").as("slice_count_total"), > min($"cdnt_slice_number").as("min_slice_number"), > max($"cdnt_slice_number").as("max_slice_number") > ).show(false) > {code} > {code:java} > slicePlayed.join( > joinKeys, > ( > $"join_session_id" === $"cdnt_session_id" && > $"join_asset_id" === $"cdnt_asset_id" && > $"join_euid" === $"cdnt_euid" > ), > "inner" > ).groupBy( > $"cdnt_session_id".as("slice_played_session_id"), > $"cdnt_asset_id".as("slice_played_asset_id"), > $"cdnt_euid".as("slice_played_euid") > ).agg( > 
min($"cdnt_event_time").as("slice_start_time"), > min($"cdnt_playing_owner_id").as("slice_played_playing_owner_id"), > min($"cdnt_user_ip").as("slice_played_user_ip"), > min($"cdnt_user_agent").as("slice_played_user_agent"), > min($"cdnt_referer").as("slice_played_referer"), > max($"cdnt_event_time").as("slice_end_time"), > countDistinct($"cdnt_slice_number").as("slice_count_distinct"), > count($"cdnt_slice_number").as("slice_count_total"), > min($"cdnt_slice_number").as("min_slice_number"), > max($"cdnt_slice_number").as("max_slice_number"), > min($"cdnt_is_live").as("is_live") > ).show(false) > {code} > The +only+ difference between the two queries is that I'm adding more > columns to the {{agg}} method. > I can't reproduce this by manually creating a DataFrame via > {{sc.parallelize}}. The original sources of the DataFrames are parquet > files. > The explain plans for the two queries are slightly different. > {code} > == Physical Plan == > TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], > functions=[(count(cdnt_slice_number#24L),mode=Final,isDistinct=false),(min(cdnt_slice_number#24L),mode=Final,isDistinct=false),(max(cdnt_slice_number#24L),mode=Final,isDistinct=false),(count(cdnt_slice_number#24L),mode=Complete,isDistinct=true)], > > output=[slice_played_session_id#780,slice_played_asset_id#781,slice_played_euid#782,slice_count_distinct#783L,slice_count_total#784L,min_slice_number#785L,max_slice_number#786L]) > > TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L], > > functions=[(count(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false),(min(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false),(max(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false)], > > output=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L,currentCount#795L,min#797L,max#799L]) > > TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L], > > 
functions=[(count(cdnt_slice_number#24L),mode=Partial,isDistinct=false),(min(cdnt_slice_number#24L),mode=Partial,isDistinct=false),(max(cdnt_slice_number#24L),mode=Partial,isDistinct=false)], > > output=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L,currentCount#795L,min#797L,max#799L]) >TungstenProject > [cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L] > SortMergeJoin [cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], > [join_session_id#41,join_asset_id#42,join_euid#43] > TungstenSort [cdnt_session_id#23 ASC,cdnt_asset_id#5 ASC,cdnt_euid#13 > ASC], false, 0 > TungstenExchange > hashpartitioning(cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13) >ConvertToUnsafe >
[jira] [Assigned] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18693: Assignee: Apache Spark > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh >Assignee: Apache Spark > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
[jira] [Assigned] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18693: Assignee: (was: Apache Spark) > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820273#comment-15820273 ] Apache Spark commented on SPARK-18693: -- User 'imatiach-msft' has created a pull request for this issue: https://github.com/apache/spark/pull/16557 > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
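The asymmetry described in the issue can be seen directly in the API: the estimator accepts a weight column, but the evaluator has no counterpart. A sketch of the desired usage ({{setWeightCol}} on the evaluator is the proposed addition tracked here, not an existing method at the time of this issue):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Training already honours the weight column...
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setWeightCol("weight")

// ...but evaluation ignores it, so CrossValidator selects models on
// unweighted metrics. The proposal is an analogous setter:
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  // .setWeightCol("weight")  // proposed API, per this issue
```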
[jira] [Resolved] (SPARK-16848) Check schema validation for user-specified schema in jdbc and table APIs
[ https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-16848. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.2.0 > Check schema validation for user-specified schema in jdbc and table APIs > > > Key: SPARK-16848 > URL: https://issues.apache.org/jira/browse/SPARK-16848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > Currently, > both APIs below: > {code} > spark.read.schema(StructType(Nil)).jdbc(...) > {code} > and > {code} > spark.read.schema(StructType(Nil)).table("usrdb.test") > {code} > ignore the user-specified schema. > It'd make sense to throw an exception rather than silently cause confusion for users.
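A sketch of the behaviour in question (assuming a local session; `usrdb.test` is the hypothetical table from the issue, and the user-specified schema is silently dropped rather than validated):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Before this fix, the empty user-specified schema is silently ignored and
// the table's own schema is used; after the fix, an exception is expected
// instead of this confusing no-op.
val df = spark.read.schema(StructType(Nil)).table("usrdb.test")
```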
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:59 AM: - [~debasish83] Making VF-LBFGS/VF-OWLQN support a generic distributed-vector interface (moving them into breeze?) so that they can run on multiple distributed platforms (not only Spark) will, I think, make optimizing for the Spark platform difficult. When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations need to couple Spark features and the optimizer algorithm closely; an abstract interface over distributed vectors (vector-space operations such as dot, add, scale, plus persist/unpersist operations and so on) does not seem to be enough. Two simple problems show the complexity of a general interface: 1. Look at this VF-OWLQN implementation based on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala OWLQN internally computes the pseudo-gradient for the L1 regularizer; see the function `calculateComponentWithL1`. When computing the pseudo-gradient over an RDD, it also uses an accumulator (Spark-specific) to calculate the adjusted function value. Would the abstract interface then have to contain something like Spark's accumulators? 2. The persist/unpersist/checkpoint problem in Spark. Because of Spark's lazy evaluation, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints taking no effect, and so on). For a look at this complexity, see the VF-LBFGS implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala It uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. The subtlety is that Spark computes lazily: when you persist an RDD, it is not persisted immediately but only when an action is run on it. If the internal code calls `unpersist` too early, an RDD that has not yet been computed and actually persisted (although the persist API was called) has already been unpersisted, and that situation causes recomputation of the whole RDD lineage. This behaviour may be quite different from other distributed platforms, so can a general interface really handle this correctly while staying efficient at the same time? Given the detail problems I list above (only a small sample), in my opinion breeze can provide the following base classes and/or abstract interfaces: * FirstOrderMinimizer * DiffFunction interface * LineSearch implementations (including StrongWolfeLineSearch and BacktrackingLineSearch) * DistributedVector abstract interface *BUT*, the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in Spark MLlib, for better optimization.
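The persist-then-unpersist ordering described in point 2 can be sketched as follows (a hypothetical iteration loop, not code from spark-vlbfgs; the key is to force materialization of the new RDD with an action before unpersisting its predecessor):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of the "persist current step's RDDs, then unpersist previous
// step's RDDs" pattern; `step` stands in for one optimizer iteration.
def iterate(initial: RDD[Double], step: RDD[Double] => RDD[Double], n: Int): RDD[Double] = {
  var prev = initial
  var i = 0
  while (i < n) {
    val next = step(prev).persist(StorageLevel.MEMORY_AND_DISK)
    next.count()      // action: forces computation so the persist actually takes effect
    prev.unpersist()  // only now is it safe to drop the previous step's RDD;
                      // unpersisting earlier would force lineage recomputation
    prev = next
    i += 1
  }
  prev
}
```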
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820180#comment-15820180 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:54 AM: - Given the detail problems I list above (only a small sample), in my opinion breeze can provide the following base classes and/or abstract interfaces: * FirstOrderMinimizer * DiffFunction interface * LineSearch implementations (including StrongWolfeLineSearch and BacktrackingLineSearch) * DistributedVector abstract interface *BUT*, the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in Spark MLlib, for better optimization. was (Author: weichenxu123): As the detail problems I list above(I only list a small part problems), in my opinion, breeze can provide the following base class and/or abstract interface: FirstOrderMinimizerlevel DiffFunction interface LineSearch implementation (including StrongWolfeLinsearch and BacktrackingLinesearch) DistributedVector abstract interface BUT, the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in spark mllib, for better optimization. > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819851#comment-15819851 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:43 AM: - [~debasish83] Can L-BFGS-B be computed in a distributed way, efficiently, when scaled to billions of features? If only the interface supports distributed vectors while the internal computation still uses local vectors and/or local matrices, it won't make much sense... Currently VF-LBFGS can turn the L-BFGS two-loop recursion into a distributed computation, but L-BFGS-B seems much more complex than L-BFGS; can it also be computed in parallel? I looked into the L-BFGS-B code in breeze: the core logic that updates the Hessian and computes the descent direction is very complex, and that part cannot reuse the L-BFGS code. So in what way can L-BFGS-B take advantage of `Vector-free LBFGS`? was (Author: weichenxu123): [~debasish83] Can L-BFGS-B be distributed computed when scaled to billions of features in high efficiency ? If only the interface supporting distributed vector, but internal computation still use local vector and/or local matrix, then it seems won't make much sense... Currently VF-LBFGS can turn LBFGS two loop recursion into distributed computing mode, but the L-BFGS-B seems much more complex then L-BFGS, can it also be computed in parallel ? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). 
> Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Comment Edited] (SPARK-19051) test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-19051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819957#comment-15819957 ] Hyukjin Kwon edited comment on SPARK-19051 at 1/12/17 4:36 AM: --- I just wonder if this JIRA says a flaky test or a constantly failing test. was (Author: hyukjin.kwon): I just if this JIRA says a flaky test or a constantly failing test. > test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in > python/run-tests > --- > > Key: SPARK-19051 > URL: https://issues.apache.org/jira/browse/SPARK-19051 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL, Tests >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04, ppc64le, x86_64 >Reporter: Nirman Narang >Priority: Minor > > Full log here. > {code:title=python/run-tests|borderStyle=solid} > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmp7Ie4FN added as a remote repository with the name: repo-1 > :: loading settings :: url = > jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 428ms :: artifacts dl 5ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/16ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmpo5SXug added as a remote repository with 
the name: repo-1 > :: loading settings :: url = > jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 320ms :: artifacts dl 4ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/7ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > ... > [Stage 23:==> (143 + 5) / > 200] > [Stage 23:==> (188 + 5) / > 200] > > > . > [Stage 88:> (122 + 4) / > 200] > [Stage 88:=>(198 + 2) / > 200] > > > [Stage 90:> (0 + 4) / > 200] > [Stage 90:===> (14 + 4) / > 200] > [Stage 90:===> (28 + 4) / > 200] > [Stage 90:===> (42 + 4) / > 200] > [Stage 90:===> (57 + 4) / > 200] > [Stage
[jira] [Updated] (SPARK-19188) Run spark in scala as script file, note not just REPL
[ https://issues.apache.org/jira/browse/SPARK-19188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihao updated SPARK-19188: --- Description: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-shell is a REPL and does not exit with a proper exit value if any error happens. How to implement such a feature? Thanks was: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-shell is a REPL and does not exit if any error happens. How to implement such a feature? Thanks > Run spark in scala as script file, note not just REPL > - > > Key: SPARK-19188 > URL: https://issues.apache.org/jira/browse/SPARK-19188 > Project: Spark > Issue Type: New Feature >Reporter: wangzhihao > Labels: newbie > > Hi, I'm looking for a feature to run Spark/Scala as a script file. The > current spark-shell is a REPL and does not exit with a proper exit value if > any error happens. > How to implement such a feature? Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19188) Run spark in scala as script file, note not just REPL
[ https://issues.apache.org/jira/browse/SPARK-19188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihao updated SPARK-19188: --- Description: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-shell is a REPL and does not exit if any error happens. How to implement such a feature? Thanks was: Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-submit is a REPL and does not exit if any error happens. How to implement such a feature? Thanks > Run spark in scala as script file, note not just REPL > - > > Key: SPARK-19188 > URL: https://issues.apache.org/jira/browse/SPARK-19188 > Project: Spark > Issue Type: New Feature >Reporter: wangzhihao > Labels: newbie > > Hi, I'm looking for a feature to run Spark/Scala as a script file. The > current spark-shell is a REPL and does not exit if any error happens. > How to implement such a feature? Thanks
[jira] [Created] (SPARK-19188) Run spark in scala as script file, note not just REPL
wangzhihao created SPARK-19188: -- Summary: Run spark in scala as script file, note not just REPL Key: SPARK-19188 URL: https://issues.apache.org/jira/browse/SPARK-19188 Project: Spark Issue Type: New Feature Reporter: wangzhihao Hi, I'm looking for a feature to run Spark/Scala as a script file. The current spark-submit is a REPL and does not exit if any error happens. How to implement such a feature? Thanks
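The behavior requested above — batch execution that fails fast with a meaningful exit status instead of a REPL that swallows errors — can be sketched generically with a plain subprocess wrapper. This is a hedged illustration of the desired semantics only, not a Spark API; in practice the child command would be something like `spark-shell -i job.scala`, and the script itself would still need to call `System.exit` on failure.

```python
import subprocess
import sys

def run_and_propagate(cmd):
    """Run a command and return its exit status so a batch driver can fail fast.

    Hypothetical wrapper illustrating the requested semantics: surface the
    child's non-zero exit status instead of swallowing errors the way an
    interactive REPL does.
    """
    return subprocess.run(cmd).returncode

# A failing child process should surface as a non-zero status.
# (Here a Python one-liner stands in for a Spark job for illustration.)
status = run_and_propagate([sys.executable, "-c", "import sys; sys.exit(3)"])
# A batch driver would then finish with sys.exit(status).
print(status)  # 3
```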
[jira] [Updated] (SPARK-19133) SparkR glm Gamma family results in error
[ https://issues.apache.org/jira/browse/SPARK-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19133: - Affects Version/s: 2.0.0 Target Version/s: 2.0.3, 2.1.1, 2.2.0 (was: 2.2.0) Fix Version/s: 2.1.1 2.0.3 > SparkR glm Gamma family results in error > > > Key: SPARK-19133 > URL: https://issues.apache.org/jira/browse/SPARK-19133 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > > glm(y~1,family=Gamma, data = dy) > 17/01/09 06:10:47 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489) > at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.IllegalArgumentException: glm_e3483764cdf9 parameter > family given invalid value Gamma. > at org.apache.spark.ml.param.Param.validate(params.scala:77) > at org.apache.spark.ml.param.ParamPair.(params.scala:528) > at
[jira] [Created] (SPARK-19187) querying from parquet partitioned table throws FileNotFoundException when some partitions' hdfs locations do not exist
roncenzhao created SPARK-19187: -- Summary: querying from parquet partitioned table throws FileNotFoundException when some partitions' hdfs locations do not exist Key: SPARK-19187 URL: https://issues.apache.org/jira/browse/SPARK-19187 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: roncenzhao Hi, all. When some partitions of a Parquet partitioned table point to HDFS paths that do not exist, querying the table throws FileNotFoundException. The error stack is: ` TaskSetManager: Lost task 522.0 in stage 1.0 (TID 523, sd-hadoop-datanode-50-135.idc.vip.com): java.io.FileNotFoundException: File does not exist: hdfs://bipcluster/bip/external_table/vipdw/dw_log_app_pageinfo_clean_spark_parquet/dt=20161223/hm=1730 at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at 
scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
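The failure mode above — a partition whose directory was dropped out-of-band still being listed and scanned — suggests a generic mitigation: pre-filter partition locations to those that still exist before handing them to a reader. The sketch below illustrates that idea on the local filesystem only; it is not Spark's partition-discovery code, and a real HDFS check would go through the Hadoop FileSystem API instead of `os.path`.

```python
import os
import tempfile

def existing_partitions(paths):
    """Keep only partition directories that still exist.

    Generic workaround sketch (local-filesystem stand-in for an HDFS
    existence check): filtering partition locations up front means a
    dropped directory cannot surface as FileNotFoundException mid-query.
    """
    return [p for p in paths if os.path.isdir(p)]

base = tempfile.mkdtemp()
good = os.path.join(base, "dt=20161223")
os.mkdir(good)
missing = os.path.join(base, "dt=20161224")  # registered but never created

print(existing_partitions([good, missing]))  # only the existing directory
```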
[jira] [Created] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception
Adrian Schulewitz created SPARK-19186: - Summary: Hash symbol in middle of Sybase database table name causes Spark Exception Key: SPARK-19186 URL: https://issues.apache.org/jira/browse/SPARK-19186 Project: Spark Issue Type: Bug Affects Versions: 2.1.0 Reporter: Adrian Schulewitz If I use a table name without a '#' symbol in the middle then no exception occurs, but with one an exception is thrown. According to the Sybase 15 documentation a '#' is a legal character.

val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF"
val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via SQL").setMaster("local[2]")
val sess = SparkSession
  .builder()
  .appName("MUREX DMart Simple SQL Reader")
  .config(conf)
  .getOrCreate()
import sess.implicits._
val df = sess.read
  .format("jdbc")
  .option("url", "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56")
  .option("driver", "net.sourceforge.jtds.jdbc.Driver")
  .option("dbtable", "CTP#ADR_TYPE_DBF")
  .option("UDT_DEALCRD_REP", "mxdmart56")
  .option("user", "INSTAL")
  .option("password", "INSTALL")
  .load()
df.createOrReplaceTempView("trades")
val resultsDF = sess.sql(testSql)
resultsDF.show()

17/01/12 14:30:01 INFO SharedState: Warehouse path is 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'. 
17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM CTP#ADR_TYPE_DBF Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 
'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 17) == SQL == SELECT * FROM CTP#ADR_TYPE_DBF -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at com.anz.murex.hcp.poc.hcp.api.MurexDatamartSqlReader$.main(MurexDatamartSqlReader.scala:94) at com.anz.murex.hcp.poc.hcp.api.MurexDatamartSqlReader.main(MurexDatamartSqlReader.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at
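Note that the parser's expected-token list above includes BACKQUOTED_IDENTIFIER, so one plausible workaround is to backquote the table name inside the SQL string before parsing. The helper below is a hypothetical illustration of that quoting idea (the function name and the bare-identifier pattern are assumptions, and the exact escaping rules should be verified against the SQL dialect in use):

```python
import re

def quote_identifier(name):
    """Backquote an identifier so a parser that accepts BACKQUOTED_IDENTIFIER
    tokens can consume names containing characters such as '#'.

    Hypothetical helper: plain identifiers pass through unchanged; anything
    else is wrapped in backticks, with embedded backticks doubled (a common
    escaping convention -- verify for your dialect).
    """
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name  # plain identifier, no quoting needed
    return "`" + name.replace("`", "``") + "`"

print(quote_identifier("CTP#ADR_TYPE_DBF"))  # `CTP#ADR_TYPE_DBF`
print(quote_identifier("trades"))            # trades
```

So the failing statement would instead be built as `"SELECT * FROM " + quote_identifier("CTP#ADR_TYPE_DBF")`.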
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 3:02 AM: - [~debasish83] I think making VF-LBFGS/VF-OWLQN support a generic distributed vector interface (moving it into breeze?) so that they run on multiple distributed platforms (not only Spark) would make it hard to optimize for the Spark platform. When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations need to couple Spark features closely with the optimizer algorithm; an abstract distributed-vector interface (vector-space operators such as dot, add, and scale, plus persist/unpersist operators and so on) does not seem to be enough. Two simple problems show the complexity of a general interface: 1. Look at this VF-OWLQN implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala OWLQN internally computes the pseudo-gradient for the L1 regularizer; see the function `calculateComponentWithL1`. When computing the pseudo-gradient over an RDD, it also uses an accumulator (a Spark-only feature) to compute the adjusted fnValue, so would the abstract interface have to expose something like Spark's accumulators? 2. The persist/unpersist/checkpoint problem in Spark. Because of Spark's lazy evaluation, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints that take no effect, and so on). To see this complexity, look at the VF-LBFGS implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala It uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. 
The complexity is that Spark always computes lazily: when you persist an RDD, it is not persisted immediately but only once an RDD action runs. If the internal code calls `unpersist` too early, an RDD that has not yet been computed (and so was never actually persisted, although the persist API was called) is unpersisted anyway, and that situation forces recomputation of the whole RDD lineage. This behavior may differ greatly from other distributed platforms, so can a general interface really handle it correctly while staying efficient? was (Author: weichenxu123): [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. 
Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL:
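The persist-then-unpersist hazard discussed in the comment above can be illustrated without Spark at all. The sketch below is a pure-Python stand-in (a hypothetical `Lazy` class, not Spark's API): "persist" only records intent, and nothing is cached until a forcing action runs, so the safe order is to materialize the current step before releasing the previous one.

```python
class Lazy:
    """Minimal stand-in for a lazily evaluated, cacheable dataset."""

    def __init__(self, compute):
        self.compute = compute     # thunk that builds this dataset
        self.cache = None
        self.persisted = False
        self.computations = 0      # how many times we actually recomputed

    def persist(self):             # records intent only, like RDD.persist
        self.persisted = True
        return self

    def unpersist(self):           # drops both the intent and any cached value
        self.persisted = False
        self.cache = None
        return self

    def force(self):               # like running an RDD action
        if self.cache is not None:
            return self.cache
        self.computations += 1
        value = self.compute()
        if self.persisted:
            self.cache = value
        return value

base = Lazy(lambda: list(range(5))).persist()
step = Lazy(lambda: [x * 2 for x in base.force()]).persist()

# Safe order: materialize the current step, THEN release the previous one.
step.force()
base.unpersist()
step.force()
print(base.computations)  # 1 -- the lineage was never recomputed
```

Reversing the two middle lines (unpersisting `base` before forcing `step`) would clear a cache that was never filled, and `step.force()` would have to recompute the whole chain, which is exactly the lineage-recomputation problem described above.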
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 2:55 AM: - [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. 
If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? was (Author: weichenxu123): [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. 
Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? [~sethah] Do you consider this detail problems when you designing the general optimizer interface ? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >
[jira] [Commented] (SPARK-14901) java exception when showing join
[ https://issues.apache.org/jira/browse/SPARK-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819984#comment-15819984 ] Brent Elmer commented on SPARK-14901: - Netezza is an IBM product so there is no place to download it from. I don't know if the problem only occurs when using Netezza or not. I wouldn't have a reproducer any smaller than the snippet of code in the bug report. Brent > java exception when showing join > > > Key: SPARK-14901 > URL: https://issues.apache.org/jira/browse/SPARK-14901 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Brent Elmer > > I am using pyspark with netezza. I am getting a java exception when trying > to show the first row of a join. I can show the first row for of the two > dataframes separately but not the result of a join. I get the same error for > any action I take(first, collect, show). Am I doing something wrong? > from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > dispute_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table1', > driver='com.ibm.spark.netezza').load() > dispute_df.printSchema() > comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > comments_df.printSchema() > dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > root > |-- COMMENTID: string (nullable = true) > |-- EXPORTDATETIME: timestamp (nullable = true) > |-- ARTAGS: string (nullable = true) > |-- POTAGS: string (nullable = true) > |-- INVTAG: string (nullable = true) > |-- ACTIONTAG: string (nullable = true) > |-- DISPUTEFLAG: string (nullable = true) > |-- ACTIONFLAG: string (nullable = true) > |-- CUSTOMFLAG1: string (nullable = true) > |-- CUSTOMFLAG2: string (nullable = true) > root > |-- COUNTRY: 
string (nullable = true) > |-- CUSTOMER: string (nullable = true) > |-- INVNUMBER: string (nullable = true) > |-- INVSEQNUMBER: string (nullable = true) > |-- LEDGERCODE: string (nullable = true) > |-- COMMENTTEXT: string (nullable = true) > |-- COMMENTTIMESTAMP: timestamp (nullable = true) > |-- COMMENTLENGTH: long (nullable = true) > |-- FREEINDEX: long (nullable = true) > |-- COMPLETEDFLAG: long (nullable = true) > |-- ACTIONFLAG: long (nullable = true) > |-- FREETEXT: string (nullable = true) > |-- USERNAME: string (nullable = true) > |-- ACTION: string (nullable = true) > |-- COMMENTID: string (nullable = true) > --- > Py4JJavaError Traceback (most recent call last) > in () > 5 comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > 6 comments_df.printSchema() > > 7 dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in first(self) > 802 Row(age=2, name=u'Alice') > 803 """ > --> 804 return self.head() > 805 > 806 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 790 """ > 791 if n is None: > --> 792 rs = self.head(1) > 793 return rs[0] if rs else None > 794 return self.take(n) > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 792 rs = self.head(1) > 793 return rs[0] if rs else None > --> 794 return self.take(n) > 795 > 796 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in take(self, num) > 304 with SCCallSiteSync(self._sc) as css: > 305 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 306 self._jdf, num) > 307 return list(_load_from_socket(port, > 
BatchedSerializer(PickleSerializer( > 308 > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 811 answer = self.gateway_client.send_command(command) > 812 return_value = get_return_value( > --> 813 answer, self.gateway_client, self.target_id, self.name) > 814 > 815 for temp_arg in temp_args: >
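Independent of the connector failure above, the two schemas share a COMMENTID column, and a join that keeps both copies leaves the name ambiguous. A common remedy is to disambiguate the right side's columns before or during the join. The sketch below shows that idea over plain lists of dict rows (a hypothetical `join_on` helper, not the PySpark or Netezza-connector API; the `r_` prefix is an arbitrary choice):

```python
def join_on(left_rows, right_rows, key, right_prefix="r_"):
    """Inner-join two lists of dict rows on a shared key column, prefixing the
    right side's other columns so duplicated names stay unambiguous.

    Pure-Python sketch of the disambiguation idea only.
    """
    index = {}
    for row in right_rows:                       # index right side by key
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):    # inner join: matches only
            merged = dict(row)
            for col, val in match.items():
                if col != key:                   # keep a single key column
                    merged[right_prefix + col] = val
            joined.append(merged)
    return joined

disputes = [{"COMMENTID": "c1", "DISPUTEFLAG": "Y"}]
comments = [{"COMMENTID": "c1", "COMMENTTEXT": "late payment"}]
print(join_on(disputes, comments, "COMMENTID"))
# [{'COMMENTID': 'c1', 'DISPUTEFLAG': 'Y', 'r_COMMENTTEXT': 'late payment'}]
```

In DataFrame terms the analogous moves are joining on the column name (so only one COMMENTID survives) or renaming one side's column before the join.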
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu edited comment on SPARK-10078 at 1/12/17 2:48 AM: - [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. 
If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? [~sethah] Do you consider this detail problems when you designing the general optimizer interface ? was (Author: weichenxu123): [~debasish83] But when we implement VF-LBFGS/VF-OWLQN base on spark, we found that many optimizations need to combine spark features and the optimizer algorithm closely, make a abstract interface supporting distributed vector (for example, Vector space operator include dot, add, scale, persist/unpersist operators and so on...) seems not enough. I give two simple problem to show the complexity when considering general interface: 1. Look this VF-OWLQN implementation based on spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala We know that OWLQN internal will help compute the pseudo-gradient for L1 reg, look the code function `calculateComponentWithL1`, here when computing pseudo-gradient using RDD, it also use an accumulator(only spark have) to calculate the adjusted fnValue, so that will the abstract interface containing something about `accumulator` in spark ? 2. About persist, unpersist, checkpoint problem in spark. Because of spark lazy computation feature, improper persist/unpersist/checkpoint order may cause serious problem (may cause RDD recomputation, checkpoint take no effect and so on), about this complexity, we can take a look into the VF-BFGS implementation on spark: it use the pattern "persist current step RDDs, then unpersist previous step RDDs" like many other algos in spark mllib. 
The complexity is at, spark always do lazy computation, when you persist RDD, it do not persist immediately, but postponed to RDD.action called. If the internal code call `unpersist` too early, it will cause the problem that an RDD haven't been computed and haven't been actually persisted(although the persist API called), but already been unpersisted, so that such awful situation will cause the whole RDD lineage recomputation. This feature may be much different than other distributed platform, so that a general interface can really handle this problem correctly and still keep high efficient in the same time? [~sethah] Do you consider this detail problems when you designing the general optimizer interface ? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee:
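The persist-ordering hazard described in the comment above can be sketched without Spark at all. The toy `LazyDataset` below is a hypothetical stand-in (not Spark's API): like an RDD, `persist()` only marks the dataset, materialization happens at the first action, and unpersisting before any action silently discards the persist request so every later action recomputes the whole lineage.

```python
class LazyDataset:
    """Toy stand-in for an RDD: lazy, with an optional cache."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn   # recomputes the full "lineage"
        self.persist_requested = False
        self.cache = None
        self.compute_calls = 0         # counts full recomputations

    def persist(self):
        # Like RDD.persist: only *marks* the dataset; nothing is materialized yet.
        self.persist_requested = True
        return self

    def unpersist(self):
        self.persist_requested = False
        self.cache = None
        return self

    def collect(self):
        # An action: materializes (and caches, if still requested) the data.
        if self.cache is not None:
            return self.cache
        self.compute_calls += 1
        data = self.compute_fn()
        if self.persist_requested:
            self.cache = data
        return data


# Correct order: persist -> action -> unpersist. One computation, then cache hits.
ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.persist()
ds.collect()
ds.collect()
assert ds.compute_calls == 1

# Hazardous order: persist -> unpersist (too early) -> actions.
# The persist request is gone before anything was materialized,
# so every action recomputes the whole lineage.
ds2 = LazyDataset(lambda: [x * x for x in range(5)])
ds2.persist()
ds2.unpersist()
ds2.collect()
ds2.collect()
assert ds2.compute_calls == 2
```

This is why the "persist current step's RDDs, then unpersist previous step's RDDs" ordering matters: the previous step must already have been materialized by an action before its persist marker is dropped.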
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ] Weichen Xu commented on SPARK-10078: [~debasish83] But when we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations require coupling Spark-specific features closely with the optimizer algorithm; an abstract interface that only supports a distributed vector (for example, vector-space operators such as dot, add, and scale, plus persist/unpersist operators and so on) does not seem sufficient. Two simple problems illustrate the complexity of designing a general interface: 1. Look at this VF-OWLQN implementation on Spark: https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala OWLQN internally computes the pseudo-gradient for the L1 regularizer; see the function `calculateComponentWithL1`. When computing the pseudo-gradient over an RDD, it also uses an accumulator (a Spark-only feature) to calculate the adjusted fnValue, so would the abstract interface then have to expose something like Spark's accumulators? 2. The persist/unpersist/checkpoint problem in Spark. Because of Spark's lazy evaluation, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints taking no effect, and so on). To see this complexity, look at the VF-LBFGS implementation on Spark: it uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. The complexity is that Spark always evaluates lazily: when you persist an RDD, nothing is materialized immediately; persistence is postponed until an RDD action is called. If the internal code calls `unpersist` too early, an RDD that has not yet been computed and persisted can already be unpersisted.
This behavior may differ greatly from other distributed platforms, so can a general interface really handle this problem correctly while remaining efficient? [~sethah] Did you consider these detailed problems when designing the general optimizer interface? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
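The accumulator pattern the comment refers to — a side channel that collects the L1 adjustment to the objective while the gradient itself is reduced across partitions — can be sketched in plain Python. `Accumulator` and `gradient_with_l1_adjustment` below are hypothetical stand-ins for illustration, not the spark-vlbfgs code; the gradient math is a trivial placeholder.

```python
import threading


class Accumulator:
    """Minimal stand-in for a Spark accumulator: add-only, thread-safe."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def add(self, v):
        with self._lock:
            self._value += v

    @property
    def value(self):
        return self._value


def gradient_with_l1_adjustment(partitions, reg):
    """Reduce per-partition 'gradients' while accumulating an L1 penalty.

    `partitions` is a list of lists of weights (a hypothetical layout);
    the accumulator mirrors how an adjusted function value can be
    collected as a side effect of the same pass over the data.
    """
    l1_acc = Accumulator()
    total = 0.0
    for part in partitions:
        grad = sum(part)  # placeholder for the real per-partition gradient
        l1_acc.add(reg * sum(abs(w) for w in part))  # side channel
        total += grad
    return total, l1_acc.value


grad, l1 = gradient_with_l1_adjustment([[1.0, -2.0], [3.0]], reg=0.1)
assert grad == 2.0
assert abs(l1 - 0.6) < 1e-9
```

The design question raised in the comment is exactly whether a platform-neutral optimizer interface could express this side channel, since accumulators are a Spark-specific feature.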
[jira] [Commented] (SPARK-19051) test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-19051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819957#comment-15819957 ] Hyukjin Kwon commented on SPARK-19051: -- I just wonder whether this JIRA describes a flaky test or a constantly failing test. > test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) fails in > python/run-tests > --- > > Key: SPARK-19051 > URL: https://issues.apache.org/jira/browse/SPARK-19051 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL, Tests >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04, ppc64le, x86_64 >Reporter: Nirman Narang >Priority: Minor > > Full log here. > {code:title=python/run-tests|borderStyle=solid} > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmp7Ie4FN added as a remote repository with the name: repo-1 > :: loading settings :: url = > jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 428ms :: artifacts dl 5ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/16ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache > The jars for the packages stored in: /var/lib/jenkins/.ivy2/jars > file:/tmp/tmpo5SXug added as a remote repository with the name: repo-1 > :: loading settings :: url = > 
jar:file:/var/lib/jenkins/workspace/Sparkv2.0.1/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > a#mylib added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found a#mylib;0.1 in repo-1 > :: resolution report :: resolve 320ms :: artifacts dl 4ms > :: modules in use: > a#mylib;0.1 from repo-1 in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 1 | 0 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 0 artifacts copied, 1 already retrieved (0kB/7ms) > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > .Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > Picked up _JAVA_OPTIONS: -Xms1024m -Xmx4096m > ... > [Stage 23:==> (143 + 5) / > 200] > [Stage 23:==> (188 + 5) / > 200] > > > . > [Stage 88:> (122 + 4) / > 200] > [Stage 88:=>(198 + 2) / > 200] > > > [Stage 90:> (0 + 4) / > 200] > [Stage 90:===> (14 + 4) / > 200] > [Stage 90:===> (28 + 4) / > 200] > [Stage 90:===> (42 + 4) / > 200] > [Stage 90:===> (57 + 4) / > 200] > [Stage 90:> (73 + 4) / > 200] > [Stage 90:===> (86 + 4) / > 200]
[jira] [Resolved] (SPARK-17923) dateFormat unexpected kwarg to df.write.csv
[ https://issues.apache.org/jira/browse/SPARK-17923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17923. -- Resolution: Duplicate This should be fixed in 2.0.1 and 2.1.0. > dateFormat unexpected kwarg to df.write.csv > --- > > Key: SPARK-17923 > URL: https://issues.apache.org/jira/browse/SPARK-17923 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > Calling like this: > {code}writer.csv(path, header=header, sep=sep, compression=compression, > dateFormat=date_format){code} > Getting the following error: > {code}TypeError: csv() got an unexpected keyword argument 'dateFormat'{code} > This error comes after being called with {code}date_format='-MM-dd'{code} > as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
[ https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19184: Assignee: Apache Spark > Improve numerical stability for method tallSkinnyQR. > > > Key: SPARK-19184 > URL: https://issues.apache.org/jira/browse/SPARK-19184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Huamin Li >Assignee: Apache Spark >Priority: Minor > Labels: None > > In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github > Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). > When the upper triangular matrix R is ill-conditioned, computing the inverse > of R can result in catastrophic cancellation. Instead, we should consider > using a forward solve for solving Q such that Q * R = A. > I first create a 4 by 4 RowMatrix A = > (1,1,1,1;0,1E-5,0,0;0,0,1E-10,1;0,0,0,1E-14), and then I apply method > tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this > case, A is ill-conditioned and so is R. > See codes in Spark Shell: > {code:none} > import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} > import org.apache.spark.mllib.linalg.distributed.RowMatrix > // Create RowMatrix A. > val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), > Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) > val denseMat = new RowMatrix(sc.parallelize(mat, 2)) > // Apply tallSkinnyQR to A. > val result = denseMat.tallSkinnyQR(true) > // Print the calculated Q and R. > result.Q.rows.collect.foreach(println) > result.R > // Calculate Q*R. Ideally, this should be close to A. > val reconstruct = result.Q.multiply(result.R) > reconstruct.rows.collect.foreach(println) > // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
> result.Q.computeGramianMatrix() > System.exit(0) > {code} > it will output the following results: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,1.5416524685312E13] > [0.0,0.,0.0,8011776.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.15416524685312] > [0.0,9.999E-6,0.,1.0008011776] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 1.5416524685312E13 > 0.0 0.9998 0.0 8011775.9 > 0.0 0.0 1.0 0.0 > 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 > {code} > With forward solve for solving Q such that Q * R = A rather than computing > the inverse of R, it will output the following results instead: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,0.0] > [0.0,1.0,0.0,0.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.0] > [0.0,1.0E-5,1.0,1.0] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 0.0 > 0.0 1.0 0.0 0.0 > 0.0 0.0 1.0 0.0 > 0.0 0.0 0.0 1.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
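The fix proposed above — recovering Q from Q * R = A instead of forming inv(R) — amounts to column-wise substitution against the upper-triangular R. A toy pure-Python, list-of-lists sketch (not the actual RowMatrix implementation, which works on distributed rows): since R is upper triangular, column j of A involves only columns 0..j of Q, so Q can be solved column by column.

```python
def solve_q(A, R):
    """Solve Q from Q @ R = A, with R square, upper triangular, nonsingular.

    Column j of A is sum_{k<=j} Q[:,k] * R[k][j], so each column of Q
    follows from the previous ones without ever forming inv(R).
    """
    m = len(A)   # rows of A (and of Q)
    n = len(R)   # columns of A, dimension of R
    Q = [[0.0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            s = A[i][j] - sum(Q[i][k] * R[k][j] for k in range(j))
            Q[i][j] = s / R[j][j]
    return Q


# Tiny check with a known factorization: Q0 @ R == A for Q0 below.
R = [[2.0, 1.0], [0.0, 3.0]]
A = [[2.0, 1.0], [0.0, 3.0], [2.0, 4.0]]
Q = solve_q(A, R)
assert Q == [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

Avoiding the explicit inverse is the standard remedy when R is ill-conditioned: substitution propagates far less rounding error than multiplying by inv(R).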
[jira] [Assigned] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
[ https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19184: Assignee: (was: Apache Spark) > Improve numerical stability for method tallSkinnyQR. > > > Key: SPARK-19184 > URL: https://issues.apache.org/jira/browse/SPARK-19184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Huamin Li >Priority: Minor > Labels: None > > In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github > Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). > When the upper triangular matrix R is ill-conditioned, computing the inverse > of R can result in catastrophic cancellation. Instead, we should consider > using a forward solve for solving Q such that Q * R = A. > I first create a 4 by 4 RowMatrix A = > (1,1,1,1;0,1E-5,0,0;0,0,1E-10,1;0,0,0,1E-14), and then I apply method > tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this > case, A is ill-conditioned and so is R. > See codes in Spark Shell: > {code:none} > import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} > import org.apache.spark.mllib.linalg.distributed.RowMatrix > // Create RowMatrix A. > val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), > Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) > val denseMat = new RowMatrix(sc.parallelize(mat, 2)) > // Apply tallSkinnyQR to A. > val result = denseMat.tallSkinnyQR(true) > // Print the calculated Q and R. > result.Q.rows.collect.foreach(println) > result.R > // Calculate Q*R. Ideally, this should be close to A. > val reconstruct = result.Q.multiply(result.R) > reconstruct.rows.collect.foreach(println) > // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
> result.Q.computeGramianMatrix() > System.exit(0) > {code} > it will output the following results: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,1.5416524685312E13] > [0.0,0.,0.0,8011776.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.15416524685312] > [0.0,9.999E-6,0.,1.0008011776] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 1.5416524685312E13 > 0.0 0.9998 0.0 8011775.9 > 0.0 0.0 1.0 0.0 > 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 > {code} > With forward solve for solving Q such that Q * R = A rather than computing > the inverse of R, it will output the following results instead: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,0.0] > [0.0,1.0,0.0,0.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.0] > [0.0,1.0E-5,1.0,1.0] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 0.0 > 0.0 1.0 0.0 0.0 > 0.0 0.0 1.0 0.0 > 0.0 0.0 0.0 1.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
[ https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819941#comment-15819941 ] Apache Spark commented on SPARK-19184: -- User 'hl475' has created a pull request for this issue: https://github.com/apache/spark/pull/16556 > Improve numerical stability for method tallSkinnyQR. > > > Key: SPARK-19184 > URL: https://issues.apache.org/jira/browse/SPARK-19184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Huamin Li >Priority: Minor > Labels: None > > In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github > Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). > When the upper triangular matrix R is ill-conditioned, computing the inverse > of R can result in catastrophic cancellation. Instead, we should consider > using a forward solve for solving Q such that Q * R = A. > I first create a 4 by 4 RowMatrix A = > (1,1,1,1;0,1E-5,0,0;0,0,1E-10,1;0,0,0,1E-14), and then I apply method > tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this > case, A is ill-conditioned and so is R. > See codes in Spark Shell: > {code:none} > import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} > import org.apache.spark.mllib.linalg.distributed.RowMatrix > // Create RowMatrix A. > val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), > Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) > val denseMat = new RowMatrix(sc.parallelize(mat, 2)) > // Apply tallSkinnyQR to A. > val result = denseMat.tallSkinnyQR(true) > // Print the calculated Q and R. > result.Q.rows.collect.foreach(println) > result.R > // Calculate Q*R. Ideally, this should be close to A. > val reconstruct = result.Q.multiply(result.R) > reconstruct.rows.collect.foreach(println) > // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
> result.Q.computeGramianMatrix() > System.exit(0) > {code} > it will output the following results: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,1.5416524685312E13] > [0.0,0.,0.0,8011776.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.15416524685312] > [0.0,9.999E-6,0.,1.0008011776] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 1.5416524685312E13 > 0.0 0.9998 0.0 8011775.9 > 0.0 0.0 1.0 0.0 > 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 > {code} > With forward solve for solving Q such that Q * R = A rather than computing > the inverse of R, it will output the following results instead: > {code:none} > scala> result.Q.rows.collect.foreach(println) > [1.0,0.0,0.0,0.0] > [0.0,1.0,0.0,0.0] > [0.0,0.0,1.0,0.0] > [0.0,0.0,0.0,1.0] > scala> result.R > 1.0 1.0 1.0 1.0 > 0.0 1.0E-5 1.0 1.0 > 0.0 0.0 1.0E-10 1.0 > 0.0 0.0 0.0 1.0E-14 > scala> reconstruct.rows.collect.foreach(println) > [1.0,1.0,1.0,1.0] > [0.0,1.0E-5,1.0,1.0] > [0.0,0.0,1.0E-10,1.0] > [0.0,0.0,0.0,1.0E-14] > scala> result.Q.computeGramianMatrix() > 1.0 0.0 0.0 0.0 > 0.0 1.0 0.0 0.0 > 0.0 0.0 1.0 0.0 > 0.0 0.0 0.0 1.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15407) Floor division
[ https://issues.apache.org/jira/browse/SPARK-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819926#comment-15819926 ] Hyukjin Kwon commented on SPARK-15407: -- Hi [~hvanhovell], I just wonder if we should resolve this as {{Won't Fix}} or {{Not A Problem}}. > Floor division > -- > > Key: SPARK-15407 > URL: https://issues.apache.org/jira/browse/SPARK-15407 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Joao Ferreira > > I'm unable to perform floor division on DataFrame columns: > df.withColumn("d_date", df.d_date // 100) > Will produce: > TypeError: unsupported operand type(s) for //: 'Column' and 'int' > My column is of the int type. Basic operations such as > sum/division/subtraction work fine. > The inner workings of PySpark mention a floor function inside > sql/functions... but the actual code to perform this operation is missing. > Am I missing anything? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
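The semantics the reporter wants are plain Python floor division, which floors toward negative infinity (unlike truncation). A minimal sketch of the intended `d_date // 100` computation on ordinary ints:

```python
import math

d_date = 20160519                        # yyyymmdd-style integer
assert d_date // 100 == 201605           # floor division drops the day
assert d_date // 100 == math.floor(d_date / 100)

# // floors toward -inf; int() truncation would give -3 here.
assert -7 // 2 == -4
```

In PySpark at the time, where `//` was not defined on Column, a workaround along the lines of `pyspark.sql.functions.floor(df.d_date / 100)` would express the same intent (untested here, stated as an assumption).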
[jira] [Resolved] (SPARK-15251) Cannot apply PythonUDF to aggregated column
[ https://issues.apache.org/jira/browse/SPARK-15251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15251. -- Resolution: Cannot Reproduce {code} >>> def timesTwo(x): ... return x * 2 ... >>> sqlContext.udf.register("timesTwo", timesTwo) >>> data = [(1, 'a'), (2, 'b')] >>> rdd = sc.parallelize(data) >>> df = sqlContext.createDataFrame(rdd, ["x", "y"]) >>> >>> df.registerTempTable("my_data") >>> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() +---+ | t| +---+ | 6| +---+ {code} It seems this was fixed at some point, as I can't reproduce it in the current master. It'd be great if anyone could point out the PR so the fix can be backported. > Cannot apply PythonUDF to aggregated column > --- > > Key: SPARK-15251 > URL: https://issues.apache.org/jira/browse/SPARK-15251 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Matthew Livesey > > In scala it is possible to define a UDF and apply it to an aggregated value in > an expression, for example: > {code} > def timesTwo(x: Int): Int = x * 2 > sqlContext.udf.register("timesTwo", timesTwo _) > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > case class Data(x: Int, y: String) > val data = List(Data(1, "a"), Data(2, "b")) > val rdd = sc.parallelize(data) > val df = rdd.toDF > df.registerTempTable("my_data") > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > +---+ > | t| > +---+ > | 6| > +---+ > {code} > Performing the same computation in pyspark: > {code} > def timesTwo(x): > return x * 2 > sqlContext.udf.register("timesTwo", timesTwo) > data = [(1, 'a'), (2, 'b')] > rdd = sc.parallelize(data) > df = sqlContext.createDataFrame(rdd, ["x", "y"]) > df.registerTempTable("my_data") > sqlContext.sql("SELECT 
timesTwo(Sum(x)) t FROM my_data").show() > {code} > Gives the following: > {code} > AnalysisException: u"expression 'pythonUDF' is neither present in the group > by, nor is it an aggregate function. Add to group by or wrap in first() (or > first_value) if you don't care which value you get.;" > {code} > Using a lambda rather than a named function gives the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
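The expected semantics of `SELECT timesTwo(Sum(x))` — aggregate first, then apply the UDF once to the aggregate — can be stated in plain Python, independent of either engine (a sketch of the query's meaning, not of Spark's execution):

```python
def times_two(x):
    # Same logic as the JIRA's timesTwo UDF.
    return x * 2

rows = [(1, 'a'), (2, 'b')]  # the (x, y) data from the report

# SELECT timesTwo(Sum(x)) t FROM my_data:
# Sum(x) aggregates over all rows; the UDF is applied once to the result.
t = times_two(sum(x for x, _ in rows))
assert t == 6  # matches the "| 6|" output shown above
```

The reported AnalysisException arose because older PySpark lifted the Python UDF out of the aggregate expression, effectively inverting this order of evaluation.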
[jira] [Commented] (SPARK-14901) java exception when showing join
[ https://issues.apache.org/jira/browse/SPARK-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819920#comment-15819920 ] Hyukjin Kwon commented on SPARK-14901: -- Would it be possible to provide a self-contained reproducer? I can't reproduce this as below within Spark: {code} dispute_df = spark.range(10).toDF("COMMENTID") comments_df = spark.range(5).toDF("COMMENTID") dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first() {code} Do you mind if I ask where I can download Netezza? > java exception when showing join > > > Key: SPARK-14901 > URL: https://issues.apache.org/jira/browse/SPARK-14901 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Brent Elmer > > I am using pyspark with netezza. I am getting a java exception when trying > to show the first row of a join. I can show the first row of each of the two > dataframes separately but not the result of a join. I get the same error for > any action I take (first, collect, show). Am I doing something wrong? 
> from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > dispute_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table1', > driver='com.ibm.spark.netezza').load() > dispute_df.printSchema() > comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > comments_df.printSchema() > dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > root > |-- COMMENTID: string (nullable = true) > |-- EXPORTDATETIME: timestamp (nullable = true) > |-- ARTAGS: string (nullable = true) > |-- POTAGS: string (nullable = true) > |-- INVTAG: string (nullable = true) > |-- ACTIONTAG: string (nullable = true) > |-- DISPUTEFLAG: string (nullable = true) > |-- ACTIONFLAG: string (nullable = true) > |-- CUSTOMFLAG1: string (nullable = true) > |-- CUSTOMFLAG2: string (nullable = true) > root > |-- COUNTRY: string (nullable = true) > |-- CUSTOMER: string (nullable = true) > |-- INVNUMBER: string (nullable = true) > |-- INVSEQNUMBER: string (nullable = true) > |-- LEDGERCODE: string (nullable = true) > |-- COMMENTTEXT: string (nullable = true) > |-- COMMENTTIMESTAMP: timestamp (nullable = true) > |-- COMMENTLENGTH: long (nullable = true) > |-- FREEINDEX: long (nullable = true) > |-- COMPLETEDFLAG: long (nullable = true) > |-- ACTIONFLAG: long (nullable = true) > |-- FREETEXT: string (nullable = true) > |-- USERNAME: string (nullable = true) > |-- ACTION: string (nullable = true) > |-- COMMENTID: string (nullable = true) > --- > Py4JJavaError Traceback (most recent call last) > in () > 5 comments_df = > sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', > user='***', password='***', dbtable='table2', > driver='com.ibm.spark.netezza').load() > 6 comments_df.printSchema() > > 7 
dispute_df.join(comments_df, dispute_df.COMMENTID == > comments_df.COMMENTID).first() > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in first(self) > 802 Row(age=2, name=u'Alice') > 803 """ > --> 804 return self.head() > 805 > 806 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 790 """ > 791 if n is None: > --> 792 rs = self.head(1) > 793 return rs[0] if rs else None > 794 return self.take(n) > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in head(self, n) > 792 rs = self.head(1) > 793 return rs[0] if rs else None > --> 794 return self.take(n) > 795 > 796 @ignore_unicode_prefix > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc > in take(self, num) > 304 with SCCallSiteSync(self._sc) as css: > 305 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 306 self._jdf, num) > 307 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 308 > /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 811 answer = self.gateway_client.send_command(command) > 812 return_value = get_return_value( > --> 813 answer, self.gateway_client,
[jira] [Created] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
Kalvin Chau created SPARK-19185: --- Summary: ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing Key: SPARK-19185 URL: https://issues.apache.org/jira/browse/SPARK-19185 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.2 Environment: Spark 2.0.2 Spark Streaming Kafka 010 Mesos 0.28.0 - client mode spark.executor.cores 1 spark.mesos.extra.cores 1 Reporter: Kalvin Chau We've been running into ConcurrentModificationExceptions ("KafkaConsumer is not safe for multi-threaded access") with the CachedKafkaConsumer. I've been working through debugging this issue and, after looking through some of the Spark source code, I think this is a bug. Our setup is: Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using Spark-Streaming-Kafka-010; spark.executor.cores 1; spark.mesos.extra.cores 1; batch interval 10s, window interval 180s, and slide interval 30s. We would see the exception when, in one executor, two task worker threads were assigned the same Topic+Partition but a different set of offsets. They would both get the same CachedKafkaConsumer, and whichever task thread went first would seek and poll for all the records, while the second thread would try to seek to its offset but fail because it was unable to acquire the lock.
Time0 E0 Task0 - TopicPartition("abc", 0) X to Y Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z Time1 E0 Task0 - Seeks and starts to poll Time1 E0 Task1 - Attempts to seek, but fails Here are some relevant logs: {code} 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing topic test-topic, partition 2 offsets 4394204414 -> 4394238058 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing topic test-topic, partition 2 offsets 4394238058 -> 4394257712 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 4394204414 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 4394238058 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: Initial fetch for spark-executor-consumer test-topic 2 4394238058 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: Seeking to test-topic-2 4394238058 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting block rdd_199_2 failed due to an exception 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block rdd_199_2 could not be removed as it was not found on disk or in memory 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in task 49.0 in stage 45.0 (TID 3201) java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at
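The failure mode described above — two tasks on one executor sharing a cached consumer whose cache key contains no offset information — can be reproduced outside Spark. The class and function names below are hypothetical stand-ins for CachedKafkaConsumer and KafkaConsumer, not Spark's or Kafka's actual code; only the shape of the bug is preserved:

```python
import threading

class FakeConsumer:
    """Stand-in for KafkaConsumer's single-thread guard: a second thread
    touching the consumer while another holds it raises an error, mimicking
    ConcurrentModificationException."""
    def __init__(self):
        self._owner = None
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            me = threading.get_ident()
            if self._owner not in (None, me):
                raise RuntimeError(
                    "KafkaConsumer is not safe for multi-threaded access")
            self._owner = me

    def release(self):
        with self._lock:
            self._owner = None

# Cache keyed by (group, topic, partition) only -- offsets are not part of
# the key, so two tasks with different offset ranges share one consumer.
_cache = {}

def get_cached_consumer(group, topic, partition):
    key = (group, topic, partition)
    if key not in _cache:
        _cache[key] = FakeConsumer()
    return _cache[key]

# Task0 (offsets X..Y) and Task1 (offsets Y..Z) hit the same cache entry:
task0 = get_cached_consumer("g", "abc", 0)
task1 = get_cached_consumer("g", "abc", 0)
assert task0 is task1

task0.acquire()            # Task0 is mid-seek/poll...
errors = []
def task1_seek():
    try:
        task1.acquire()    # ...so Task1's seek fails
    except RuntimeError as e:
        errors.append(str(e))
t = threading.Thread(target=task1_seek)
t.start()
t.join()
print(errors[0])  # KafkaConsumer is not safe for multi-threaded access
```

Keying the cache (or the task assignment) so that concurrent offset ranges for the same partition never share one consumer is the direction the eventual fix takes; the sketch only shows why the current key collides.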
[jira] [Created] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.
Huamin Li created SPARK-19184: - Summary: Improve numerical stability for method tallSkinnyQR. Key: SPARK-19184 URL: https://issues.apache.org/jira/browse/SPARK-19184 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.2.0 Reporter: Huamin Li Priority: Minor In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]). When the upper triangular matrix R is ill-conditioned, computing the inverse of R can result in catastrophic cancellation. Instead, we should consider using a forward solve for solving Q such that Q * R = A. I first create a 4 by 4 RowMatrix A = (1,1,1,1; 0,1E-5,1,1; 0,0,1E-10,1; 0,0,0,1E-14), and then I apply method tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this case, A is ill-conditioned and so is R. See the code in the Spark shell: {code:none} import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} import org.apache.spark.mllib.linalg.distributed.RowMatrix // Create RowMatrix A. val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14)) val denseMat = new RowMatrix(sc.parallelize(mat, 2)) // Apply tallSkinnyQR to A. val result = denseMat.tallSkinnyQR(true) // Print the calculated Q and R. result.Q.rows.collect.foreach(println) result.R // Calculate Q*R. Ideally, this should be close to A. val reconstruct = result.Q.multiply(result.R) reconstruct.rows.collect.foreach(println) // Calculate Q'*Q. Ideally, this should be close to the identity matrix. 
result.Q.computeGramianMatrix() System.exit(0) {code} it will output the following results: {code:none} scala> result.Q.rows.collect.foreach(println) [1.0,0.0,0.0,1.5416524685312E13] [0.0,0.,0.0,8011776.0] [0.0,0.0,1.0,0.0] [0.0,0.0,0.0,1.0] scala> result.R 1.0 1.0 1.0 1.0 0.0 1.0E-5 1.0 1.0 0.0 0.0 1.0E-10 1.0 0.0 0.0 0.0 1.0E-14 scala> reconstruct.rows.collect.foreach(println) [1.0,1.0,1.0,1.15416524685312] [0.0,9.999E-6,0.,1.0008011776] [0.0,0.0,1.0E-10,1.0] [0.0,0.0,0.0,1.0E-14] scala> result.Q.computeGramianMatrix() 1.0 0.0 0.0 1.5416524685312E13 0.0 0.9998 0.0 8011775.9 0.0 0.0 1.0 0.0 1.5416524685312E13 8011775.9 0.0 2.3766923337289844E26 {code} With forward solve for solving Q such that Q * R = A rather than computing the inverse of R, it will output the following results instead: {code:none} scala> result.Q.rows.collect.foreach(println) [1.0,0.0,0.0,0.0] [0.0,1.0,0.0,0.0] [0.0,0.0,1.0,0.0] [0.0,0.0,0.0,1.0] scala> result.R 1.0 1.0 1.0 1.0 0.0 1.0E-5 1.0 1.0 0.0 0.0 1.0E-10 1.0 0.0 0.0 0.0 1.0E-14 scala> reconstruct.rows.collect.foreach(println) [1.0,1.0,1.0,1.0] [0.0,1.0E-5,1.0,1.0] [0.0,0.0,1.0E-10,1.0] [0.0,0.0,0.0,1.0E-14] scala> result.Q.computeGramianMatrix() 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
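The gap between Q = A * inv(R) and a triangular solve can be reproduced locally with NumPy on the same 4x4 example. This is a sketch of the numerics only, not of the distributed RowMatrix code; since this A is already upper triangular, the exact answer is Q = I, R = A:

```python
import numpy as np

# The ill-conditioned example from the report above.
A = np.array([[1.0, 1.0,  1.0,   1.0],
              [0.0, 1e-5, 1.0,   1.0],
              [0.0, 0.0,  1e-10, 1.0],
              [0.0, 0.0,  0.0,   1e-14]])
R = A.copy()

def solve_q(A, R):
    """Forward-solve each row q of Q from q @ R = a (i.e. R^T q = a)
    instead of forming inv(R) explicitly."""
    n = R.shape[1]
    Q = np.zeros_like(A)
    for i, a in enumerate(A):
        for j in range(n):
            Q[i, j] = (a[j] - Q[i, :j] @ R[:j, j]) / R[j, j]
    return Q

Q_inv = A @ np.linalg.inv(R)   # explicit inverse: catastrophic cancellation
Q_fwd = solve_q(A, R)          # triangular solve: stable on this example

err_inv = np.abs(Q_inv - np.eye(4)).max()
err_fwd = np.abs(Q_fwd - np.eye(4)).max()
print(err_fwd, err_inv)  # err_fwd is ~0; err_inv blows up (~1e13 in the
                         # JIRA report; the exact value varies by BLAS)
```

The forward solve never forms the huge entries of inv(R) (which reach ~1e28 here), so the cancellation that produces the 1.5e13-scale garbage in Q'Q never happens.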
[jira] [Commented] (SPARK-13303) Spark fails with pandas import error when pandas is not explicitly imported by user
[ https://issues.apache.org/jira/browse/SPARK-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819894#comment-15819894 ] Hyukjin Kwon commented on SPARK-13303: -- +1 > Spark fails with pandas import error when pandas is not explicitly imported > by user > --- > > Key: SPARK-13303 > URL: https://issues.apache.org/jira/browse/SPARK-13303 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: The python installation used by the driver (edge node) > has pandas installed on it, while the data nodes do not have pandas > installed in the python runtimes used. Pandas is never explicitly imported by > pi.py. >Reporter: Juliet Hougland > > Running `spark-submit pi.py` results in: > File > "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 98, in main > command = pickleSer._read_with_length(infile) > File > "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 164, in _read_with_length > return self.loads(obj) > File > "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 422, in loads > return pickle.loads(obj) > ImportError: No module named pandas.algos > at > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138) > at > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:179) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > This is unexpected, and it is hard for users to unravel why they see this error, > as they themselves have not explicitly done anything with pandas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
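A quick way to see which Python runtimes actually have pandas is to run a small probe on each; on a live cluster one would map it over an RDD (the Spark call is sketched in a comment, since it needs a running SparkContext), while locally it just probes the current interpreter:

```python
import importlib.util
import socket

def probe(_):
    """Probe one Python runtime: is pandas importable here, and on which
    host? Intended to be mapped over cluster partitions."""
    return socket.gethostname(), importlib.util.find_spec("pandas") is not None

# On a live cluster one would run something like (hypothetical usage):
#   sc.parallelize(range(100), 100).map(probe).distinct().collect()
# to list which worker hosts lack pandas. Locally it probes this runtime:
host, has_pandas = probe(0)
print(host, has_pandas)
```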
[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819883#comment-15819883 ] Hyukjin Kwon commented on SPARK-12717: -- It still happens in the current master. > pyspark broadcast fails when using multiple threads > --- > > Key: SPARK-12717 > URL: https://issues.apache.org/jira/browse/SPARK-12717 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: Linux, python 2.6 or python 2.7. >Reporter: Edward Walker >Priority: Critical > > The following multi-threaded program that uses broadcast variables > consistently throws exceptions like: *Exception("Broadcast variable '18' not > loaded!",)* --- even when run with "--master local[10]". > {code:title=bug_spark.py|borderStyle=solid} > try: > > import pyspark > > except: > > pass > > from optparse import OptionParser > > > > def my_option_parser(): > > op = OptionParser() > > op.add_option("--parallelism", dest="parallelism", type="int", > default=20) > return op > > > > def do_process(x, w): > > return x * w.value > > > > def func(name, rdd, conf): > > new_rdd = rdd.map(lambda x : do_process(x, conf)) > > total = new_rdd.reduce(lambda x, y : x + y) > > count = rdd.count() > > print name, 1.0 * total / count > > > > if __name__ == "__main__": > > import threading > > op = my_option_parser() > > options, args = op.parse_args() > > sc = pyspark.SparkContext(appName="Buggy") > > data_rdd = sc.parallelize(range(0,1000), 1) > > confs = [ sc.broadcast(i) for i in xrange(options.parallelism) ] > > threads = [ threading.Thread(target=func, args=["thread_" + str(i), > data_rdd, confs[i]]) for i in xrange(options.parallelism) ] > > for t in threads: > > t.start() > > for t in threads: > > t.join() > {code} > Abridged run output: > {code:title=abridge_run.txt|borderStyle=solid} > % spark-submit --master local[10] bug_spark.py --parallelism 20 > [snip] > 16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 
(TID 9) >
[jira] [Resolved] (SPARK-11428) Schema Merging Broken for Some Queries
[ https://issues.apache.org/jira/browse/SPARK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-11428. -- Resolution: Duplicate I am pretty sure that it is a duplicate of SPARK-11103. Please reopen this if anyone meets the same issue. > Schema Merging Broken for Some Queries > -- > > Key: SPARK-11428 > URL: https://issues.apache.org/jira/browse/SPARK-11428 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.5.1 > Environment: AWS, >Reporter: Brad Willard > Labels: dataframe, parquet, pyspark, schema, sparksql > > I have data being written into parquet format via spark streaming. The data > can change slightly so schema merging is required. I load a dataframe like > this > {code} > urls = [ > "/streaming/parquet/events/key=2015-10-30*", > "/streaming/parquet/events/key=2015-10-29*" > ] > sdf = sql_context.read.option("mergeSchema", "true").parquet(*urls) > sdf.registerTempTable('events') > {code} > If I print the schema you can see the contested column > {code} > sdf.printSchema() > root > |-- _id: string (nullable = true) > ... > |-- d__device_s: string (nullable = true) > |-- d__isActualPageLoad_s: string (nullable = true) > |-- d__landing_s: string (nullable = true) > |-- d__lang_s: string (nullable = true) > |-- d__os_s: string (nullable = true) > |-- d__performance_i: long (nullable = true) > |-- d__product_s: string (nullable = true) > |-- d__refer_s: string (nullable = true) > |-- d__rk_i: long (nullable = true) > |-- d__screen_s: string (nullable = true) > |-- d__submenuName_s: string (nullable = true) > {code} > The column that's in one but not the other file is d__product_s > So I'm able to run this query and it works fine. 
> {code} > sql_context.sql(''' > select > distinct(d__product_s) > from > events > where > n = 'view' > ''').collect() > [Row(d__product_s=u'website'), > Row(d__product_s=u'store'), > Row(d__product_s=None), > Row(d__product_s=u'page')] > {code} > However if I instead use that column in the where clause things break. > {code} > sql_context.sql(''' > select > * > from > events > where > n = 'view' and d__product_s = 'page' > ''').take(1) > --- > Py4JJavaError Traceback (most recent call last) > in () > 6 where > 7 n = 'frontsite_view' and d__product_s = 'page' > > 8 ''').take(1) > /root/spark/python/pyspark/sql/dataframe.pyc in take(self, num) > 303 with SCCallSiteSync(self._sc) as css: > 304 port = > self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( > --> 305 self._jdf, num) > 306 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 307 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /root/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) > 34 def deco(*a, **kw): > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > 38 s = e.java_exception.toString() > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 15.0 failed 30 times, most recent failure: Lost task 0.29 in stage > 15.0 (TID 6536, 10.X.X.X): java.lang.IllegalArgumentException: Column > [d__product_s] was not found in schema! > at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at >
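The failure pattern — a filter on a merged-schema column pushed down to a parquet file whose own footer schema lacks that column — can be illustrated with a small simulation. The function names here are hypothetical; in the real stack the check lives in parquet's SchemaCompatibilityValidator:

```python
# Each file contributes its own footer schema; the merged (logical) schema
# that printSchema() shows is their union.
file_schemas = [
    {"_id", "n"},                    # older file: no d__product_s column
    {"_id", "n", "d__product_s"},    # newer file
]
merged = set().union(*file_schemas)
assert "d__product_s" in merged      # the merged schema has the column...

def scan(schema, predicate_col):
    """Sketch of per-file predicate validation, loosely modeled on
    SchemaCompatibilityValidator.validateColumn."""
    if predicate_col not in schema:
        raise ValueError(
            "Column [%s] was not found in schema!" % predicate_col)
    return []

# ...but pushing the filter down to every file trips over the older one:
try:
    for s in file_schemas:
        scan(s, "d__product_s")
except ValueError as e:
    print(e)   # Column [d__product_s] was not found in schema!
```

This is why `SELECT DISTINCT` on the column works (no pushdown into the old file's footer check) while a WHERE clause on the same column aborts the job.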
[jira] [Resolved] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
[ https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8128. - Resolution: Duplicate I am pretty sure that it duplicates SPARK-11103. Please reopen this if anyone meets the same problem. > Schema Merging Broken: Dataframe Fails to Recognize Column in Schema > > > Key: SPARK-8128 > URL: https://issues.apache.org/jira/browse/SPARK-8128 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Brad Willard > > I'm loading a folder of about 600 parquet files into one dataframe, so > schema merging is involved. There is a bug in the schema merging: printing > the schema shows all the attributes, but running a query that filters on one > of those attributes errors saying it's not in the schema. The query is > incorrectly going to one of the parquet files that does not have that > attribute. > sdf = sql_context.parquet('/parquet/big_data_folder') > sdf.printSchema() > root > \|-- _id: string (nullable = true) > \|-- addedOn: string (nullable = true) > \|-- attachment: string (nullable = true) > ... > \|-- items: array (nullable = true) > \||-- element: struct (containsNull = true) > \|||-- _id: string (nullable = true) > \|||-- addedOn: string (nullable = true) > \|||-- authorId: string (nullable = true) > \|||-- mediaProcessingState: long (nullable = true) > \|-- mediaProcessingState: long (nullable = true) > \|-- title: string (nullable = true) > \|-- key: string (nullable = true) > sdf.filter(sdf.mediaProcessingState == 3).count() > causes this exception > Py4JJavaError: An error occurred while calling o67.count. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task > 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in > stage 4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: > Column [mediaProcessingState] was not found in schema! > at parquet.Preconditions.checkArgument(Preconditions.java:47) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) > at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) > at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) > at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) > at > parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) > at > parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) > at > parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) > at > parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819851#comment-15819851 ] Weichen Xu commented on SPARK-10078: [~debasish83] Can L-BFGS-B be computed in a distributed way, with high efficiency, when scaled to billions of features? If only the interface supports distributed vectors, while the internal computation still uses local vectors and/or local matrices, then it won't make much sense... Currently VF-LBFGS can turn the L-BFGS two-loop recursion into a distributed computation, but L-BFGS-B seems much more complex than L-BFGS; can it also be computed in parallel? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
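For context, the computation that VF-LBFGS distributes is the standard L-BFGS two-loop recursion. A plain single-machine NumPy sketch (not the VF-LBFGS code, which replaces these dot products with distributed ones) looks like this:

```python
import numpy as np

def two_loop(grad, pairs):
    """Classic L-BFGS two-loop recursion: approximate H^{-1} @ grad from
    curvature pairs (s_k, y_k), oldest pair first."""
    q = grad.astype(float).copy()
    rhos, alphas = [], []
    for s, y in reversed(pairs):           # newest pair first
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    s_n, y_n = pairs[-1]                   # gamma * I as the initial Hessian
    r = (np.dot(s_n, y_n) / np.dot(y_n, y_n)) * q
    for (s, y), rho, alpha in zip(pairs, reversed(rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

# Sanity check on a quadratic with Hessian 2I (so y = 2s): the recursion
# returns grad / 2, i.e. the exact Newton direction.
g = np.array([3.0, -1.0, 2.0])
s = np.array([1.0, 2.0, 0.5])
print(two_loop(g, [(s, 2 * s)]))   # ~ [1.5, -0.5, 1.0]
```

Everything in the recursion is dot products and axpy updates, which is exactly what makes it amenable to the map-reduce formulation in the linked paper; the box constraints of L-BFGS-B add an active-set subproblem on top, which is the part the commenter is questioning.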
[jira] [Commented] (SPARK-19164) Review of UserDefinedFunction._broadcast
[ https://issues.apache.org/jira/browse/SPARK-19164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819820#comment-15819820 ] Reynold Xin commented on SPARK-19164: - Which one should I review? I see that you opened a bunch of WIP PRs. > Review of UserDefinedFunction._broadcast > > > Key: SPARK-19164 > URL: https://issues.apache.org/jira/browse/SPARK-19164 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > It doesn't look like {{UserDefinedFunction._broadcast}} is used at all. If > this is a valid observation, it could be removed along with the corresponding > {{\_\_del\_\_}} method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl"
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819716#comment-15819716 ] Xiaochen Ouyang commented on SPARK-19115: - May I ask whether a later version of Spark supports the following command: create external table if not exists gen_tbl like src_tbl location '/warehouse/data/gen_tbl'? Do you have a plan to support this command in the future? > SparkSQL unsupports the command " create external table if not exist new_tbl > like old_tbl" > --- > > Key: SPARK-19115 > URL: https://issues.apache.org/jira/browse/SPARK-19115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: spark2.0.1 hive1.2.1 >Reporter: Xiaochen Ouyang > > Spark 2.0.1 does not support the command "create external table if not exist > new_tbl like old_tbl". We tried to modify the sqlbase.g4 file, changing > "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier > LIKE source=tableIdentifier > #createTableLike" > to > "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier > LIKE source=tableIdentifier > #createTableLike". > We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar"; > after that, we found we could run the command "create external table if not exist > new_tbl like old_tbl" successfully. Unfortunately, the generated > table's type is MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore > database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
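Until the LIKE grammar supports EXTERNAL, one workaround is to expand the LIKE into explicit DDL built from the source table's schema (e.g. from DESCRIBE old_tbl). This is a string-building sketch with hypothetical column input, not a full DDL generator; partitioning, formats, and SerDe clauses are omitted:

```python
def create_external_like(target, columns, location):
    """Build an explicit CREATE EXTERNAL TABLE statement instead of the
    unsupported CREATE EXTERNAL TABLE ... LIKE form.
    `columns` is a list of (name, type) pairs."""
    cols = ", ".join("%s %s" % (name, typ) for name, typ in columns)
    return ("CREATE EXTERNAL TABLE IF NOT EXISTS %s (%s) LOCATION '%s'"
            % (target, cols, location))

ddl = create_external_like(
    "gen_tbl", [("id", "INT"), ("name", "STRING")], "/warehouse/data/gen_tbl")
print(ddl)
```

The resulting statement would then be passed to `spark.sql(ddl)` (or the Hive CLI); because EXTERNAL appears explicitly, the metastore records the table as EXTERNAL_TABLE rather than MANAGED_TABLE.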
[jira] [Assigned] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19180: Assignee: Apache Spark > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai >Assignee: Apache Spark > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19180: Assignee: (was: Apache Spark) > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819634#comment-15819634 ] Apache Spark commented on SPARK-19180: -- User 'yucai' has created a pull request for this issue: https://github.com/apache/spark/pull/16555 > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
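The bug class is easy to see at the byte level: a short occupies 2 bytes, so the destination base offset for row `rowId` must be `rowId * 2`, not `rowId * 4`. A plain-Python illustration (not the actual Platform.copyMemory code in OffHeapColumnVector):

```python
import struct

def put_shorts(buf, row_id, values, offset_multiplier):
    """Sketch of putShorts: copy little-endian 16-bit values into a flat
    byte buffer starting at row_id * offset_multiplier. A short is 2 bytes,
    so the multiplier must be 2; the reported bug used 4."""
    base = row_id * offset_multiplier
    for i, v in enumerate(values):
        struct.pack_into('<h', buf, base + 2 * i, v)

def get_short(buf, row_id):
    # Readers (correctly) assume a 2-byte stride.
    return struct.unpack_from('<h', buf, row_id * 2)[0]

buf_ok = bytearray(16)
put_shorts(buf_ok, 2, [7, 8], offset_multiplier=2)   # correct
print(get_short(buf_ok, 2), get_short(buf_ok, 3))    # 7 8

buf_bad = bytearray(16)
put_shorts(buf_bad, 2, [7, 8], offset_multiplier=4)  # the bug: 4 * rowId
print(get_short(buf_bad, 2), get_short(buf_bad, 3))  # 0 0 -- the data
                                                     # landed at byte 8,
                                                     # i.e. at row 4
```

With the wrong multiplier the write starts two rows too far into the buffer, so reads at the intended rows see stale zeros while later rows get silently overwritten.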
[jira] [Commented] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-19183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819623#comment-15819623 ] Apache Spark commented on SPARK-19183: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/16554 > Add deleteWithJob hook to internal commit protocol API > -- > > Key: SPARK-19183 > URL: https://issues.apache.org/jira/browse/SPARK-19183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang > > Currently in SQL we implement overwrites by calling fs.delete() directly on > the original data. This is not ideal since the original files end up > deleted even if the job aborts. We should extend the commit protocol to allow > file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-19183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19183: Assignee: (was: Apache Spark) > Add deleteWithJob hook to internal commit protocol API > -- > > Key: SPARK-19183 > URL: https://issues.apache.org/jira/browse/SPARK-19183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang > > Currently in SQL we implement overwrites by calling fs.delete() directly on > the original data. This is not ideal since the original files end up > deleted even if the job aborts. We should extend the commit protocol to allow > file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-19183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19183: Assignee: Apache Spark > Add deleteWithJob hook to internal commit protocol API > -- > > Key: SPARK-19183 > URL: https://issues.apache.org/jira/browse/SPARK-19183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > > Currently in SQL we implement overwrites by calling fs.delete() directly on > the original data. This is not ideal since the original files end up > deleted even if the job aborts. We should extend the commit protocol to allow > file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
Eric Liang created SPARK-19183: -- Summary: Add deleteWithJob hook to internal commit protocol API Key: SPARK-19183 URL: https://issues.apache.org/jira/browse/SPARK-19183 Project: Spark Issue Type: Improvement Components: SQL Reporter: Eric Liang Currently in SQL we implement overwrites by calling fs.delete() directly on the original data. This is not ideal since the original files end up deleted even if the job aborts. We should extend the commit protocol to allow file overwrites to be managed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
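The proposed hook can be sketched as a protocol that records deletions instead of performing them eagerly, so an abort leaves the original files intact. The names below are hypothetical stand-ins (the real API is Spark's internal FileCommitProtocol), and a dict stands in for the filesystem:

```python
class CommitProtocol:
    """Deletes requested during the job are deferred and only applied on a
    successful commit; abort discards them, leaving the originals in place."""
    def __init__(self, fs):
        self.fs = fs                  # dict path -> bytes, a toy filesystem
        self._pending_deletes = []

    def delete_with_job(self, path):
        self._pending_deletes.append(path)   # record, don't delete yet

    def commit_job(self):
        for p in self._pending_deletes:      # apply only on success
            self.fs.pop(p, None)
        self._pending_deletes.clear()

    def abort_job(self):
        self._pending_deletes.clear()        # originals untouched

# Overwrite that aborts: the original data survives.
fs = {"part-00000": b"old data"}
proto = CommitProtocol(fs)
proto.delete_with_job("part-00000")
proto.abort_job()
assert fs == {"part-00000": b"old data"}

# Overwrite that commits: the original is removed only at commit time.
proto.delete_with_job("part-00000")
proto.commit_job()
assert fs == {}
```

Contrast this with calling `fs.delete()` directly at write time: there, the first assert would already fail, because the original files would be gone even though the job aborted.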
[jira] [Updated] (SPARK-19182) Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-19182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19182: - Summary: Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs (was: Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs) > Optimize the lock in StreamingJobProgressListener to not block UI when > generating Streaming jobs > > > Key: SPARK-19182 > URL: https://issues.apache.org/jira/browse/SPARK-19182 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Shixiong Zhu >Priority: Minor > > When DStreamGraph is generating a job, it will hold a lock and block other > APIs. Because StreamingJobProgressListener (numInactiveReceivers, > streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to > access some information, the UI may hang if generating a job is very slow > (e.g., talking to a slow Kafka cluster to fetch metadata). > It's better to optimize the locks in DStreamGraph and > StreamingJobProgressListener so that the UI is not blocked by job generation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19182) Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-19182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19182: - Description: When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener (numInactiveReceivers, streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to the slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener to make the UI not block by job generation. was: When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to the slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener to make the UI not block by job generation. > Optimize the lock in StreamingJobProgressListener to not block when > generating Streaming jobs > - > > Key: SPARK-19182 > URL: https://issues.apache.org/jira/browse/SPARK-19182 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Shixiong Zhu >Priority: Minor > > When DStreamGraph is generating a job, it will hold a lock and block other > APIs. Because StreamingJobProgressListener (numInactiveReceivers, > streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to > access some information, the UI may hang if generating a job is very slow > (e.g., talking to the slow Kafka cluster to fetch metadata). > It's better to optimize the locks in DStreamGraph and > StreamingJobProgressListener to make the UI not block by job generation. 
[jira] [Created] (SPARK-19182) Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs
Shixiong Zhu created SPARK-19182: Summary: Optimize the lock in StreamingJobProgressListener to not block when generating Streaming jobs Key: SPARK-19182 URL: https://issues.apache.org/jira/browse/SPARK-19182 Project: Spark Issue Type: Improvement Components: DStreams Reporter: Shixiong Zhu Priority: Minor When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to a slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener so that the UI is not blocked by job generation.
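[Editor's note] The issue describes a classic lock-contention pattern: readers (the UI) take the same lock the slow producer (job generation) holds. A common remedy is to publish an immutable snapshot that readers access lock-free. A minimal Java sketch, assuming a hypothetical `SnapshotListener` class — this is an illustration of the pattern, not Spark's actual fix:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch (not Spark's classes): rather than having UI readers
// block on the lock the job generator holds for a whole batch, the generator
// publishes an immutable snapshot that readers fetch with one volatile read.
public class SnapshotListener {
    // Readers always see the most recently published snapshot.
    private final AtomicReference<List<Integer>> streamIds =
        new AtomicReference<>(Collections.emptyList());

    // Called by the (possibly slow) generator thread once per batch;
    // the expensive work happens BEFORE this cheap publish step.
    public void publish(List<Integer> ids) {
        streamIds.set(List.copyOf(ids)); // immutable defensive copy
    }

    // Called by UI threads; never waits on the generator's lock.
    public List<Integer> currentStreamIds() {
        return streamIds.get();
    }

    public static void main(String[] args) {
        SnapshotListener listener = new SnapshotListener();
        listener.publish(List.of(1, 2, 3));
        System.out.println(listener.currentStreamIds()); // [1, 2, 3]
    }
}
```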
[jira] [Resolved] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-19132. - Resolution: Fixed Fix Version/s: 2.2.0 > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819446#comment-15819446 ] Shivaram Venkataraman commented on SPARK-18823: --- Yeah I think it makes sense to not handle the case where we take a local vector. However adding support for `[` and `[[` to support literals and existing columns would be good. This is the only item remaining from what is summarized as #1 above I think ? > Assignation by column name variable not available or bug? > - > > Key: SPARK-18823 > URL: https://issues.apache.org/jira/browse/SPARK-18823 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 > Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr > 4. Or databricks (community.cloud.databricks.com) . >Reporter: Vicente Masip > Original Estimate: 24h > Remaining Estimate: 24h > > I really don't know if this is a bug or can be done with some function: > Sometimes is very important to assign something to a column which name has to > be access trough a variable. Normally, I have always used it with doble > brackets likes this out of SparkR problems: > # df could be faithful normal data frame or data table. > # accesing by variable name: > myname = "waiting" > df[[myname]] <- c(1:nrow(df)) > # or even column number > df[[2]] <- df$eruptions > The error is not caused by the right side of the "<-" operator of assignment. > The problem is that I can't assign to a column name using a variable or > column number as I do in this examples out of spark. Doesn't matter if I am > modifying or creating column. Same problem. > I have also tried to use this with no results: > val df2 = withColumn(df,"tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819400#comment-15819400 ] yucai commented on SPARK-19180: --- Hi Owen, Thanks a lot for comments, it is using unsafe API for OffHeapColumn, which should have no align. See codes: {code} @Override public void putShorts(int rowId, int count, short value) { long offset = data + 2 * rowId; -for (int i = 0; i < count; ++i, offset += 4) { +for (int i = 0; i < count; ++i, offset += 2) { Platform.putShort(null, offset, value); } } {code} And also, my testing: {code} scala> val column = ColumnVector.allocate(1024, ShortType, MemoryMode.OFF_HEAP) column: org.apache.spark.sql.execution.vectorized.ColumnVector = org.apache.spark.sql.execution.vectorized.OffHeapColumnVector@56fc2cea scala> column.putShorts(0, 4, 8.toShort) scala> column.getShort(1) res5: Short = 18432 scala> scala> val column = ColumnVector.allocate(1024, ShortType, MemoryMode.ON_HEAP) column: org.apache.spark.sql.execution.vectorized.ColumnVector = org.apache.spark.sql.execution.vectorized.OnHeapColumnVector@7fb8d720 scala> column.putShorts(0, 4, 8.toShort) scala> column.getShort(1) res7: Short = 8 {code} > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
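[Editor's note] yucai's point can be checked without Spark at all: a short is 2 bytes, so element i of a packed short array lives at byte offset 2 * i. Writing with a 4-byte stride leaves every odd element untouched. A self-contained demonstration using plain `java.nio.ByteBuffer` (not Spark's `Platform` unsafe API, so the garbage value differs from the 18432 seen above, but the corruption is the same):

```java
import java.nio.ByteBuffer;

// Standalone illustration of the putShorts stride bug: shorts are 2 bytes,
// so a writer that advances 4 bytes per element skips half the slots.
public class ShortStride {
    // Correct layout: element i is at byte offset 2 * i.
    static short readAtIndex(ByteBuffer buf, int i) {
        return buf.getShort(2 * i);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        // Buggy writer: 4-byte stride, like the old "offset += 4" loop.
        for (int i = 0; i < 4; i++) {
            buf.putShort(4 * i, (short) 8);
        }
        // Element 1 (byte offset 2) was never written.
        System.out.println(readAtIndex(buf, 1)); // 0, not 8

        // Fixed writer: 2-byte stride fills every element.
        for (int i = 0; i < 4; i++) {
            buf.putShort(2 * i, (short) 8);
        }
        System.out.println(readAtIndex(buf, 1)); // 8
    }
}
```

This mirrors the OFF_HEAP vs ON_HEAP difference in the repro above: the on-heap path indexes a short[] directly, so only the raw-byte-offset path exhibits the bug.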
[jira] [Resolved] (SPARK-18801) Support resolve a nested view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18801. --- Resolution: Fixed Assignee: Jiang Xingbo > Support resolve a nested view > - > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, that > means: > 1. The new approach should be able to resolve the views that created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1. Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views, note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with CTE/Windows query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19130) SparkR should support setting and adding new column with singular value implicitly
[ https://issues.apache.org/jira/browse/SPARK-19130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-19130. --- Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 2.2.0 2.1.1 Resolved by https://github.com/apache/spark/pull/16510 > SparkR should support setting and adding new column with singular value > implicitly > -- > > Key: SPARK-19130 > URL: https://issues.apache.org/jira/browse/SPARK-19130 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > for parity with framework like dplyr -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19177) SparkR Data Frame operation between columns elements
[ https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819260#comment-15819260 ] Shivaram Venkataraman commented on SPARK-19177: --- Thanks [~masip85] - Can you include a small code snippet that shows the problem ? > SparkR Data Frame operation between columns elements > > > Key: SPARK-19177 > URL: https://issues.apache.org/jira/browse/SPARK-19177 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 >Reporter: Vicente Masip >Priority: Minor > Labels: schema, sparkR, struct > > I have commented this in other thread, but I think it can be important to > clarify that: > What happen when you are working with 50 columns and gapply? Do I rewrite 50 > columns scheme with it's new column from gapply operation? I think there is > no alternative because structFields cannot be appended to structType. Any > suggestions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
[ https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819186#comment-15819186 ] Jose Soltren commented on SPARK-19181: -- SPARK-2208 disabled a similar metric previously. > SparkListenerSuite.local metrics fails when average executorDeserializeTime > is too short. > - > > Key: SPARK-19181 > URL: https://issues.apache.org/jira/browse/SPARK-19181 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren >Priority: Minor > > https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249 > The "local metrics" test asserts that tasks should take more than 1ms on > average to complete, even though a code comment notes that this is a small > test and tasks may finish faster. I've been seeing some "failures" here on > fast systems that finish these tasks quite quickly. > There are a few ways forward here: > 1. Disable this test. > 2. Relax this check. > 3. Implement sub-millisecond granularity for task times throughout Spark. > 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task > reference a partition that implements a custom Externalizable.readExternal, > which always waits 1ms before returning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
Jose Soltren created SPARK-19181: Summary: SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short. Key: SPARK-19181 URL: https://issues.apache.org/jira/browse/SPARK-19181 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.0 Reporter: Jose Soltren Priority: Minor https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249 The "local metrics" test asserts that tasks should take more than 1ms on average to complete, even though a code comment notes that this is a small test and tasks may finish faster. I've been seeing some "failures" here on fast systems that finish these tasks quite quickly. There are a few ways forward here: 1. Disable this test. 2. Relax this check. 3. Implement sub-millisecond granularity for task times throughout Spark. 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task reference a partition that implements a custom Externalizable.readExternal, which always waits 1ms before returning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
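[Editor's note] Option 3 in the list above hinges on clock granularity: `System.currentTimeMillis()` routinely reports 0 elapsed for a sub-millisecond task, while `System.nanoTime()` still resolves it. A small sketch of the measurement difference (the loop is a stand-in for a fast task, not Spark's deserialization path):

```java
// Demonstrates why a millisecond-granularity timer can report 0 for a task
// that a nanosecond timer resolves, which is what trips the assertion in
// the "local metrics" test on fast machines.
public class TaskTiming {
    public static void main(String[] args) {
        long startMs = System.currentTimeMillis();
        long startNs = System.nanoTime();

        long acc = 0;
        for (int i = 0; i < 10_000; i++) {
            acc += i; // stand-in for a tiny task body
        }

        long elapsedMs = System.currentTimeMillis() - startMs;
        long elapsedNs = System.nanoTime() - startNs;

        // On a fast machine elapsedMs is frequently 0 even though elapsedNs
        // is positive; averaging such tasks yields < 1ms and fails the test.
        System.out.println("ms=" + elapsedMs + " ns=" + elapsedNs + " acc=" + acc);
    }
}
```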
[jira] [Assigned] (SPARK-9435) Java UDFs don't work with GROUP BY expressions
[ https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9435: --- Assignee: Apache Spark > Java UDFs don't work with GROUP BY expressions > -- > > Key: SPARK-9435 > URL: https://issues.apache.org/jira/browse/SPARK-9435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: All >Reporter: James Aley >Assignee: Apache Spark > Attachments: IncMain.java, points.txt > > > If you define a UDF in Java, for example by implementing the UDF1 interface, > then try to use that UDF on a column in both the SELECT and GROUP BY clauses > of a query, you'll get an error like this: > {code} > "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)" > org.apache.spark.sql.AnalysisException: expression 'y' is neither present in > the group by, nor is it an aggregate function. Add to group by or wrap in > first() if you don't care which value you get. > {code} > We put together a minimal reproduction in the attached Java file, which makes > use of the data in the text file attached. > I'm guessing there's some kind of issue with the equality implementation, so > Spark can't tell that those two expressions are the same maybe? If you do the > same thing from Scala, it works fine. > Note for context: we ran into this issue while working around SPARK-9338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9435) Java UDFs don't work with GROUP BY expressions
[ https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9435: --- Assignee: (was: Apache Spark) > Java UDFs don't work with GROUP BY expressions > -- > > Key: SPARK-9435 > URL: https://issues.apache.org/jira/browse/SPARK-9435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: All >Reporter: James Aley > Attachments: IncMain.java, points.txt > > > If you define a UDF in Java, for example by implementing the UDF1 interface, > then try to use that UDF on a column in both the SELECT and GROUP BY clauses > of a query, you'll get an error like this: > {code} > "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)" > org.apache.spark.sql.AnalysisException: expression 'y' is neither present in > the group by, nor is it an aggregate function. Add to group by or wrap in > first() if you don't care which value you get. > {code} > We put together a minimal reproduction in the attached Java file, which makes > use of the data in the text file attached. > I'm guessing there's some kind of issue with the equality implementation, so > Spark can't tell that those two expressions are the same maybe? If you do the > same thing from Scala, it works fine. > Note for context: we ran into this issue while working around SPARK-9338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9435) Java UDFs don't work with GROUP BY expressions
[ https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819117#comment-15819117 ] Apache Spark commented on SPARK-9435: - User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16553 > Java UDFs don't work with GROUP BY expressions > -- > > Key: SPARK-9435 > URL: https://issues.apache.org/jira/browse/SPARK-9435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: All >Reporter: James Aley > Attachments: IncMain.java, points.txt > > > If you define a UDF in Java, for example by implementing the UDF1 interface, > then try to use that UDF on a column in both the SELECT and GROUP BY clauses > of a query, you'll get an error like this: > {code} > "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)" > org.apache.spark.sql.AnalysisException: expression 'y' is neither present in > the group by, nor is it an aggregate function. Add to group by or wrap in > first() if you don't care which value you get. > {code} > We put together a minimal reproduction in the attached Java file, which makes > use of the data in the text file attached. > I'm guessing there's some kind of issue with the equality implementation, so > Spark can't tell that those two expressions are the same maybe? If you do the > same thing from Scala, it works fine. > Note for context: we ran into this issue while working around SPARK-9338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
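[Editor's note] The reporter's guess about "the equality implementation" can be illustrated without Spark: an analyzer can only match `inc(y)` in the SELECT clause against `inc(y)` in the GROUP BY clause if the two expression objects compare equal, and objects wrapping a function value fall back to identity equality by default. A hypothetical model of that failure mode (these are not Spark's expression classes):

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch of the suspected root cause: structural equals/hashCode
// lets two occurrences of the same UDF call be matched; wrapping a raw
// function object (which only has identity equality) breaks the match.
public class ExprEquality {
    static final class UdfExpr {
        final String name;    // e.g. "inc"
        final String column;  // e.g. "y"
        UdfExpr(String name, String column) {
            this.name = name;
            this.column = column;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof UdfExpr)) return false;
            UdfExpr e = (UdfExpr) o;
            return name.equals(e.name) && column.equals(e.column);
        }
        @Override public int hashCode() { return Objects.hash(name, column); }
    }

    public static void main(String[] args) {
        UdfExpr inSelect = new UdfExpr("inc", "y");
        UdfExpr inGroupBy = new UdfExpr("inc", "y");
        // Structural equality: the two occurrences are recognized as the same.
        System.out.println(inSelect.equals(inGroupBy)); // true

        // Separately-created lambdas use identity equality, so an expression
        // that compares the wrapped function objects would never match.
        Function<Integer, Integer> f1 = x -> x + 1;
        Function<Integer, Integer> f2 = x -> x + 1;
        System.out.println(f1.equals(f2)); // false
    }
}
```

This would also be consistent with the Scala path working: Scala case-class expressions get structural equality for free.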
[jira] [Commented] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819073#comment-15819073 ] Sean Owen commented on SPARK-19180: --- Are you sure? most stuff is int aligned in the JVM. You might be right but just making sure it is not merely because a short is 2 bytes > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819062#comment-15819062 ] Seth Hendrickson commented on SPARK-17136: -- I'm interested in working on this task including both driving the discussion and submitting an initial PR when it is time. I have the beginnings of a design document constructed [here|https://docs.google.com/document/d/1ynyTwlNw4b6DovG6m8okd3fD2PVZKCEq5rFfsg5Ba1k/edit?usp=sharing], and I'd like to open it up for community feedback and input. We do see requests from time to time for users to use their own optimizers in Spark ML algorithms and we have not supported it in Spark ML. With fairly minimal added code, we can make Spark ML optimizers pluggable which provides a tangible benefit to users. Potentially, we can design an API that has benefits beyond just that, and I'm interested to hear some of the other needs/wants people have. cc [~dbtsai] [~yanboliang] [~WeichenXu123] [~josephkb] [~srowen] > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
yucai created SPARK-19180: - Summary: the offset of short is 4 in OffHeapColumnVector's putShorts Key: SPARK-19180 URL: https://issues.apache.org/jira/browse/SPARK-19180 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: yucai Fix For: 2.2.0 The byte offset used for shorts in OffHeapColumnVector's putShorts is 4 per element; since a short is 2 bytes, it should be 2.
[jira] [Resolved] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts
[ https://issues.apache.org/jira/browse/SPARK-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17568. Resolution: Fixed Assignee: Bryan Cutler Fix Version/s: 2.2.0 > Add spark-submit option for user to override ivy settings used to resolve > packages/artifacts > > > Key: SPARK-17568 > URL: https://issues.apache.org/jira/browse/SPARK-17568 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.2.0 > > > The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven > coordinates to package jars. Currently, the IvySettings are hard-coded with > Maven Central as the last repository in the chain of resolvers. > At IBM, we have heard from several enterprise clients that are frustrated > with lack of control over their local Spark installations. These clients want > to ensure that certain artifacts can be excluded or patched due to security > or license issues. For example, a package may use a vulnerable SSL protocol; > or a package may link against an AGPL library written by a litigious > competitor. > While additional repositories and exclusions can be added on the spark-submit > command line, this falls short of what is needed. With Maven Central always > as a fall-back repository, it is difficult to ensure only approved artifacts > are used and it is often the exclusions that site admins are not aware of > that can cause problems. Also, known exclusions are better handled through a > centralized managed repository rather than as command line arguments. > To resolve these issues, we propose the following change: allow the user to > specify an Ivy Settings XML file to pass in as an optional argument to > {{spark-submit}} (or specify in a config file) to define alternate > repositories used to resolve artifacts instead of the hard-coded defaults. 
> The use case for this would be to define a managed repository (such as Nexus) > in the settings file so that all requests for artifacts go through one > location only. > Example usage: > {noformat} > $SPARK_HOME/bin/spark-submit --conf > spark.ivy.settings=/path/to/ivysettings.xml myapp.jar > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818849#comment-15818849 ] Dan commented on SPARK-18075: - If he is running into the same issue with spark-shell, which is one of the official ways to run a Spark application, then supposedly it is a real bug and fixing that would also fix the issue that occurs when running from an IDE :-) > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with
[jira] [Assigned] (SPARK-19152) DataFrameWriter.saveAsTable should work with hive format with append mode
[ https://issues.apache.org/jira/browse/SPARK-19152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19152: Assignee: Apache Spark > DataFrameWriter.saveAsTable should work with hive format with append mode > - > > Key: SPARK-19152 > URL: https://issues.apache.org/jira/browse/SPARK-19152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19152) DataFrameWriter.saveAsTable should work with hive format with append mode
[ https://issues.apache.org/jira/browse/SPARK-19152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818731#comment-15818731 ] Apache Spark commented on SPARK-19152: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16552 > DataFrameWriter.saveAsTable should work with hive format with append mode > - > > Key: SPARK-19152 > URL: https://issues.apache.org/jira/browse/SPARK-19152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19152) DataFrameWriter.saveAsTable should work with hive format with append mode
[ https://issues.apache.org/jira/browse/SPARK-19152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19152: Assignee: (was: Apache Spark) > DataFrameWriter.saveAsTable should work with hive format with append mode > - > > Key: SPARK-19152 > URL: https://issues.apache.org/jira/browse/SPARK-19152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818720#comment-15818720 ] Sean Owen commented on SPARK-18075: --- Yes, spark-shell is submitted the same way. If you wrote some code that did its work given an existing SparkContext/SparkSession and then invoked it in the shell, it should be fine. I think this was about launching a Spark job by running a class directly as if it were any other program. That also can work, but, may require additional work to accomplish comparable setup. > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at >
[jira] [Commented] (SPARK-19090) Dynamic Resource Allocation not respecting spark.executor.cores
[ https://issues.apache.org/jira/browse/SPARK-19090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818716#comment-15818716 ] nirav patel commented on SPARK-19090: - [~q79969786] As I mentioned in previous comment it does work for me when I set parameter on command line but it doesn't work when I set it via SparkConf in my application class. > Dynamic Resource Allocation not respecting spark.executor.cores > --- > > Key: SPARK-19090 > URL: https://issues.apache.org/jira/browse/SPARK-19090 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2, 1.6.1, 2.0.1 >Reporter: nirav patel > > When enabling dynamic scheduling with yarn I see that all executors are using > only 1 core even if I specify "spark.executor.cores" to 6. If dynamic > scheduling is disabled then each executors will have 6 cores. i.e. it > respects "spark.executor.cores". I have tested this against spark 1.5 . I > think it will be the same behavior with 2.x as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
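For readers following this thread: the reporter observes that {{spark.executor.cores}} takes effect when passed on the command line but is ignored when set through SparkConf in the application, with dynamic allocation enabled. A minimal sketch of the command-line form that reportedly works (the master, class, and jar names here are hypothetical placeholders):

```shell
# Hedged sketch: with dynamic allocation on YARN, pass executor cores on the
# spark-submit command line rather than via SparkConf, per the report above.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.cores=6 \
  --class com.example.MyApp \
  myapp.jar
```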
[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818648#comment-15818648 ] Shuai Lin commented on SPARK-17101: --- Seems this issue has already been resolved by https://github.com/apache/spark/pull/14680 ? cc [~rxin] > Provide consistent format identifiers for TextFileFormat and ParquetFileFormat > -- > > Key: SPARK-17101 > URL: https://issues.apache.org/jira/browse/SPARK-17101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Trivial > > Define the format identifier that is used in {{Optimized Logical Plan}} in > {{explain}} for {{text}} file format. > {code} > scala> spark.read.text("people.csv").cache.explain(extended = true) > ... > == Optimized Logical Plan == > InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) >+- *FileScan text [value#24] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > == Physical Plan == > InMemoryTableScan [value#24] >+- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *FileScan text [value#24] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} > When you {{explain}} csv format you can see {{Format: CSV}}. 
> {code} > scala> spark.read.csv("people.csv").cache.explain(extended = true) > == Parsed Logical Plan == > Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv > == Analyzed Logical Plan == > _c0: string, _c1: string, _c2: string, _c3: string > Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv > == Optimized Logical Plan == > InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, > StorageLevel(disk, memory, deserialized, 1 replicas) >+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: > CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<_c0:string,_c1:string,_c2:string,_c3:string> > == Physical Plan == > InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42] >+- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, > StorageLevel(disk, memory, deserialized, 1 replicas) > +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, > Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<_c0:string,_c1:string,_c2:string,_c3:string> > {code} > The custom format is defined for JSON, too. 
> {code} > scala> spark.read.json("people.csv").cache.explain(extended = true) > == Parsed Logical Plan == > Relation[_corrupt_record#93] json > == Analyzed Logical Plan == > _corrupt_record: string > Relation[_corrupt_record#93] json > == Optimized Logical Plan == > InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) >+- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_corrupt_record:string> > == Physical Plan == > InMemoryTableScan [_corrupt_record#93] >+- InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_corrupt_record:string> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
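The inconsistency above comes down to string rendering: a format that supplies a clean identifier prints as {{Format: CSV}} in {{explain}}, while one that falls back on the default Object representation prints as {{ClassName@hexhash}}. A self-contained sketch of that mechanism — the class names here are illustrative stand-ins, not Spark's actual {{FileFormat}} implementations:

```scala
// Illustrative only: mimics why explain() prints "Format: CSV" for one
// format but "Format: ...TextFileFormat@262e2c8c" for another.
trait FileFormatLike {
  def shortName: String
}

class TextLikeFormat extends FileFormatLike {
  def shortName: String = "text"
  // No toString override: stringifying this instance yields the default
  // "ClassName@hexhash" form, like the plan output quoted above.
}

class CsvLikeFormat extends FileFormatLike {
  def shortName: String = "csv"
  // Overriding toString gives the clean identifier the issue asks for.
  override def toString: String = shortName.toUpperCase
}

object FormatDemo {
  def render(f: FileFormatLike): String = s"Format: $f"

  def main(args: Array[String]): Unit = {
    println(render(new CsvLikeFormat))   // Format: CSV
    println(render(new TextLikeFormat))  // default ClassName@hash rendering
  }
}
```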
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818645#comment-15818645 ] Nick Orka commented on SPARK-18075: --- This is really cool conversation, but how about if I run it in spark-shell. Is it supposed to make same setup as spark-submit does? The cool thing is that this doesn't work in spark-shell as well. > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 > MB >
[jira] [Created] (SPARK-19179) spark.yarn.access.namenodes description is wrong
Thomas Graves created SPARK-19179: - Summary: spark.yarn.access.namenodes description is wrong Key: SPARK-19179 URL: https://issues.apache.org/jira/browse/SPARK-19179 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.2 Reporter: Thomas Graves Priority: Minor The description and name of spark.yarn.access.namenodes is off. It says this is for HDFS namenodes when really this is to specify any Hadoop filesystem. It gets the credentials for those filesystems. We should at least update the description on it to be more generic. We could change the name on it but we would have to deprecate it and keep around the current name as many people use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
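As the description notes, this setting accepts a comma-separated list of Hadoop-compatible filesystem URIs for which Spark should obtain delegation tokens, not just HDFS namenodes. A hedged configuration sketch (hostnames, ports, and names are hypothetical):

```shell
# spark.yarn.access.namenodes lists the secure filesystems Spark fetches
# tokens for -- hdfs://, webhdfs://, wasb://, etc., not only HDFS.
spark-submit \
  --master yarn \
  --conf spark.yarn.access.namenodes=hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070 \
  --class com.example.MyApp \
  myapp.jar
```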
[jira] [Commented] (SPARK-19169) columns changed orc table encounters 'IndexOutOfBoundsException' when reading the old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818635#comment-15818635 ] roncenzhao commented on SPARK-19169: I have the two doubts: 1. In the method `HiveTableScanExec.addColumnMetadataToConf(conf)`, we set the `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` into `hadoopConf`. 2. In the method `HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc)`, we set the table's properties which include `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` into jobConf. I think it's the two points that cause this problem. When I remove this two methods, the sql will run successfully. I don't know why we must set the `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` into jobConf. Thanks~ > columns changed orc table encouter 'IndexOutOfBoundsException' when read the > old schema files > - > > Key: SPARK-19169 > URL: https://issues.apache.org/jira/browse/SPARK-19169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: roncenzhao > > We hava an orc table called orc_test_tbl and hava inserted some data into it. > After that, we change the table schema by droping some columns. > When reading the old schema file, we get the follow exception. 
> ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:245) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ``` -- This message was
[jira] [Resolved] (SPARK-19021) Generalize HDFSCredentialProvider to support non-HDFS security FS
[ https://issues.apache.org/jira/browse/SPARK-19021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-19021. --- Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.2.0 > Generailize HDFSCredentialProvider to support non HDFS security FS > -- > > Key: SPARK-19021 > URL: https://issues.apache.org/jira/browse/SPARK-19021 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > Currently Spark can only get token renewal interval from security HDFS > (hdfs://), if Spark runs with other security file systems like webHDFS > (webhdfs://), wasb (wasb://), ADLS, it will ignore these tokens and not get > token renewal intervals from these tokens. These will make Spark unable to > work with these security clusters. So instead of only checking HDFS token, we > should generalize to support different {{DelegationTokenIdentifier}}. > This is a follow-up work of SPARK-18840. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19169) columns changed orc table encounters 'IndexOutOfBoundsException' when reading the old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818599#comment-15818599 ] Sean Owen commented on SPARK-19169: --- It sounds like you're saying you read the data with the wrong schema on purpose (?) -- how does this relate to what Spark does? Or, why not let the ORC reader get the schema? > columns changed orc table encouter 'IndexOutOfBoundsException' when read the > old schema files > - > > Key: SPARK-19169 > URL: https://issues.apache.org/jira/browse/SPARK-19169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: roncenzhao > > We hava an orc table called orc_test_tbl and hava inserted some data into it. > After that, we change the table schema by droping some columns. > When reading the old schema file, we get the follow exception. > ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:245) > at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ``` -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818594#comment-15818594 ] Sean Owen commented on SPARK-13198: --- I think that's up to you if you're interested in this? it's not clear what the issue is, but it's also not supported usage. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). 
Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818591#comment-15818591 ] Sean Owen commented on SPARK-18075: --- It's possible in many cases already and always has been. Obviously, the unit tests already work that way and are runnable from an IDE. There is no reason you can't, but also, no reason to expect that it all Just Works without doing some of the setup spark-submit may do for you. What if any of that is necessary depends on what one is running. > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at >
[jira] [Commented] (SPARK-19169) ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818552#comment-15818552 ] roncenzhao commented on SPARK-19169: I do not think this is a misuse of ORC. If we do not set the `serdeConstants.LIST_COLUMN_TYPES` and `serdeConstants.LIST_COLUMNS` of the table schema into `hadoopConf`, and instead let the ORC reader get the schema info by reading the ORC file, we can run the SQL successfully. > ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading > old schema files > - > > Key: SPARK-19169 > URL: https://issues.apache.org/jira/browse/SPARK-19169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: roncenzhao > > We have an ORC table called orc_test_tbl and have inserted some data into it. > After that, we change the table schema by dropping some columns. > When reading the old schema file, we get the following exception. > ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at 
org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at 
java.lang.Thread.run(Thread.java:745) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
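The stack trace above bottoms out in `java.util.ArrayList.subList` inside `RecordReaderFactory.getSchemaOnRead`: `subList` throws exactly when the requested column count (`toIndex = 65`, derived from the schema placed in `hadoopConf`) exceeds the number of columns the reader actually has, i.e. when the table schema and the file's stored schema disagree. A minimal JVM-level sketch of that failure mode, with hypothetical names (this is not Hive's actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubListDemo {
    // Hypothetical stand-in for RecordReaderFactory.getSchemaOnRead:
    // project the first `tableCols` columns of the schema read from the file.
    static List<String> projectSchema(List<String> fileSchema, int tableCols) {
        // ArrayList.subList throws IndexOutOfBoundsException when
        // toIndex exceeds the list's size.
        return fileSchema.subList(0, tableCols);
    }

    public static void main(String[] args) {
        List<String> fileSchema = new ArrayList<>(Arrays.asList("c1", "c2", "c3"));
        try {
            // Pretend the conf-side schema still claims 65 columns,
            // like "toIndex = 65" in the report above.
            projectSchema(fileSchema, 65);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException: " + e.getMessage());
        }
    }
}
```

Letting the ORC reader derive the schema from the file itself, as the comment suggests, removes the mismatched `toIndex` entirely.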
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local Spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818540#comment-15818540 ] Wenchen Fan commented on SPARK-18075: - Although it's not a bug, I think this could be a very cool feature: running Spark applications from an IDE. We should read the spark-submit script to see how a Spark application is launched and how we can do the same from an IDE. > UDF doesn't work on non-local Spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've marked all Spark dependencies with PROVIDED scope. I use exactly the same > version of Spark in the app as on the Spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see, all Spark dependencies have provided scope > And this is the code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore
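Beyond the classpath question, the repro's `seq()` UDF has a second cluster-mode pitfall worth noting: it closes over a driver-side mutable `Seq` instance, and Spark ships closures to executors via Java serialization, so each executor mutates its own deserialized copy rather than a shared global counter. A plain-JVM sketch of that copy semantics (no Spark required; the class names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CopyDemo {
    // Mirrors the reporter's stateful `Seq` class: a serializable counter.
    static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        int i = 0;
        int next() { return ++i; }
    }

    // Round-trip through Java serialization, the way a closure travels
    // from the Spark driver to an executor.
    static Counter roundTrip(Counter c) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Counter) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Counter driverSide = new Counter();
        driverSide.next(); // driver counter now at 1

        Counter executorCopy = roundTrip(driverSide);
        executorCopy.next();
        executorCopy.next();

        // The copies advance independently: incrementing the "executor" copy
        // never moves the driver-side counter.
        System.out.println(driverSide.i + " vs " + executorCopy.i); // 1 vs 3
    }
}
```

So even once the application jar reaches the executors, a UDF backed by driver-side mutable state will not produce one global sequence across the cluster.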
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local Spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818526#comment-15818526 ] Michael David Pedersen commented on SPARK-18075: I'm encountering this problem too, in the context of a custom RDD rather than UDFs, but similarly running the Spark driver as part of my application. This is actually a web application (to enable notebook-like workflows), so my motivation is not "just" to use a development setup. I'm pretty sure I've ruled out any obvious Spark version mismatches between the driver and the cluster. I would like to investigate this further. Any ideas about what the cause might be or where to start? > UDF doesn't work on non-local Spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've marked all Spark dependencies with PROVIDED scope. I use exactly the same > version of Spark in the app as on the Spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see, all Spark dependencies have provided scope > And this is the code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. 
> */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19
[jira] [Comment Edited] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818498#comment-15818498 ] Dmytro Bielievtsov edited comment on SPARK-13198 at 1/11/17 2:38 PM: - [~srowen] It looks like a growing number of people need this functionality. As someone who knows the codebase, can you give a rough estimate of the amount of work it might take to make Spark guarantee a proper cleanup, equivalent to a JVM shutdown? Or maybe one could hack around this by somehow restarting the corresponding JVM without exiting the current Python interpreter? If this is a reasonable amount of work, I might try to carve out some of our team's time to work on the corresponding pull request. Thanks! was (Author: belevtsoff): [~srowen] It looks like a growing number of people need this functionality. As someone who knows the codebase, can you give a rough estimate of the amount of work it might take to make Spark guarantee a proper cleanup, equivalent to a JVM shutdown? Or maybe one could hack around this by somehow restarting the corresponding JVM without exiting the current Python interpreter? If this is a reasonable amount of work, I might try to carve out some of our team's time to work on the corresponding pull request. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContexts in sequence, the driver eventually > stops working with an "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. 
> Reproduce by running the following code and loading in ~7MB of parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://localhost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in a heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point, as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally, I can see in the Spark web UI that each time a new context is > created, the number on the "SQL" tab increments by one (i.e. the last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818498#comment-15818498 ] Dmytro Bielievtsov commented on SPARK-13198: [~srowen] It looks like a growing number of people need this functionality. As someone who knows the codebase, can you give a rough estimate of the amount of work it might take to make Spark guarantee a proper cleanup, equivalent to a JVM shutdown? Or maybe one could hack around this by somehow restarting the corresponding JVM without exiting the current Python interpreter? If this is a reasonable amount of work, I might try to carve out some of our team's time to work on the corresponding pull request. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContexts in sequence, the driver eventually > stops working with an "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB of parquet data each > time. 
The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://localhost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in a heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point, as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally, I can see in the Spark web UI that each time a new context is > created, the number on the "SQL" tab increments by one (i.e. the last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
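The heap dump described above (one retained "CoarseMesosSchedulerBackend" per stopped context, dominated by `executorDataMap`) has the classic shape of a deregistration leak: `stop()` returns, but some long-lived structure still holds a strong reference, so the GC can never reclaim the context's data. A small stand-alone analogy of that pattern (hypothetical classes, not Spark's actual internals); the practical workaround in the meantime is to reuse one SparkContext rather than running a create/stop loop:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // Stand-in for a process-wide registry that outlives individual contexts.
    static final List<byte[]> LIVE_BACKENDS = new ArrayList<>();

    // Stand-in for a scheduler backend that registers itself on construction.
    static class Backend {
        final byte[] executorDataMap = new byte[1 << 20]; // ~1MB per "context"
        Backend() { LIVE_BACKENDS.add(executorDataMap); }
        void stop() {
            // A correct stop() would also deregister:
            //   LIVE_BACKENDS.remove(executorDataMap);
            // Without that, the buffer stays strongly reachable and is
            // never garbage collected, no matter how many times we "stop".
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) {
            Backend b = new Backend();
            b.stop(); // "stopped", but its 1MB buffer is still referenced
        }
        System.out.println("buffers still live: " + LIVE_BACKENDS.size());
    }
}
```

Twenty loop iterations leave twenty buffers pinned in memory, mirroring how the reporter's driver heap fills up within about 20 context create/stop cycles.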
[jira] [Commented] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818492#comment-15818492 ] Apache Spark commented on SPARK-19132: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/16551 > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Zhenhua Wang > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19132: Assignee: Zhenhua Wang (was: Apache Spark) > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Zhenhua Wang > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19178) convert string of large numbers to int should return null
[ https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19178: Assignee: Wenchen Fan (was: Apache Spark) > convert string of large numbers to int should return null > - > > Key: SPARK-19178 > URL: https://issues.apache.org/jira/browse/SPARK-19178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19132) Add test cases for row size estimation
[ https://issues.apache.org/jira/browse/SPARK-19132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19132: Assignee: Apache Spark (was: Zhenhua Wang) > Add test cases for row size estimation > -- > > Key: SPARK-19132 > URL: https://issues.apache.org/jira/browse/SPARK-19132 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > See https://github.com/apache/spark/pull/16430#discussion_r95040478 > getRowSize is mostly untested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19178) convert string of large numbers to int should return null
[ https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818477#comment-15818477 ] Apache Spark commented on SPARK-19178: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16550 > convert string of large numbers to int should return null > - > > Key: SPARK-19178 > URL: https://issues.apache.org/jira/browse/SPARK-19178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19178) convert string of large numbers to int should return null
[ https://issues.apache.org/jira/browse/SPARK-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19178: Assignee: Apache Spark (was: Wenchen Fan) > convert string of large numbers to int should return null > - > > Key: SPARK-19178 > URL: https://issues.apache.org/jira/browse/SPARK-19178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19178) convert string of large numbers to int should return null
Wenchen Fan created SPARK-19178: --- Summary: convert string of large numbers to int should return null Key: SPARK-19178 URL: https://issues.apache.org/jira/browse/SPARK-19178 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
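For context on the intended semantics of this ticket: casting a numeric string whose value does not fit in an int should yield null rather than a truncated or wrapped number. A hedged sketch of such a cast, parsing through long first and range-checking against the int bounds (illustrative only, not Spark's actual `Cast` implementation):

```java
public class SafeCast {
    // Sketch of "string -> int, null on overflow or malformed input".
    static Integer castToInt(String s) {
        final long value;
        try {
            value = Long.parseLong(s.trim());
        } catch (NumberFormatException e) {
            return null; // not a parsable integer
        }
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            return null; // out of int range: null instead of a wrapped value
        }
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(castToInt("123456"));     // 123456
        System.out.println(castToInt("2147483648")); // null (Integer.MAX_VALUE + 1)
        System.out.println(castToInt("abc"));        // null
    }
}
```

Strings beyond the long range also fail `Long.parseLong` and fall into the same null branch, so the sketch covers arbitrarily large numeric strings.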
[jira] [Commented] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818366#comment-15818366 ] Apache Spark commented on SPARK-19151: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16549 > DataFrameWriter.saveAsTable should work with hive format with overwrite mode > > > Key: SPARK-19151 > URL: https://issues.apache.org/jira/browse/SPARK-19151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19151: Assignee: Apache Spark > DataFrameWriter.saveAsTable should work with hive format with overwrite mode > > > Key: SPARK-19151 > URL: https://issues.apache.org/jira/browse/SPARK-19151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19151: Assignee: (was: Apache Spark) > DataFrameWriter.saveAsTable should work with hive format with overwrite mode > > > Key: SPARK-19151 > URL: https://issues.apache.org/jira/browse/SPARK-19151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19175) ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading old schema files
[ https://issues.apache.org/jira/browse/SPARK-19175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818344#comment-15818344 ] Sean Owen commented on SPARK-19175: --- No, continue on the JIRA I left open, SPARK-19169. This sounds like a misuse of ORC though. > ORC table with changed columns encounters 'IndexOutOfBoundsException' when reading > old schema files > - > > Key: SPARK-19175 > URL: https://issues.apache.org/jira/browse/SPARK-19175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: spark2.0.2 >Reporter: roncenzhao > > We have an ORC table called orc_test_tbl and have inserted some data into it. > After that, we change the table schema by dropping some columns. > When reading the old schema file, we get the following exception. > But Hive can read it successfully. > ``` > java.lang.IndexOutOfBoundsException: toIndex = 65 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:962) > at java.util.ArrayList.subList(ArrayList.java:954) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245) > at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > ``` -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19158) ml.R example fails in yarn-cluster mode due to lack of e1071 package
[ https://issues.apache.org/jira/browse/SPARK-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818300#comment-15818300 ] Apache Spark commented on SPARK-19158:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16548

> ml.R example fails in yarn-cluster mode due to lack of e1071 package
> --------------------------------------------------------------------
>
>                 Key: SPARK-19158
>                 URL: https://issues.apache.org/jira/browse/SPARK-19158
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>            Reporter: Yesha Vora
>
> The ml.R example fails on Spark 2 in yarn-cluster mode:
> {code}
> spark-submit --master yarn-cluster examples/src/main/r/ml/ml.R
> {code}
> {code:title=application log}
> 17/01/03 04:35:30 INFO MemoryStore: Block broadcast_88 stored as values in memory (estimated size 6.8 KB, free 407.6 MB)
> 17/01/03 04:35:30 INFO BufferedStreamThread: Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 17/01/03 04:35:30 ERROR Executor: Exception in task 0.0 in stage 65.0 (TID 65)
> org.apache.spark.SparkException: R computation failed with
>  Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> 	at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:50)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> 17/01/03 04:35:30 INFO CoarseGrainedExecutorBackend: Got assigned task 68
> 17/01/03 04:35:30 INFO Executor: Running task 3.0 in stage 65.0 (TID 68)
> 17/01/03 04:35:30 INFO BufferedStreamThread: Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 17/01/03 04:35:30 ERROR Executor: Exception in task 3.0 in stage 65.0 (TID 68)
> org.apache.spark.SparkException: R computation failed with
>  Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
>  Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> 	at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:50)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> 17/01/03 04:35:30 INFO CoarseGrainedExecutorBackend: Got assigned task 70
> {code}
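A workaround sketch for the failure above, assuming the cluster nodes have R installed and can reach CRAN (this is illustrative only, not necessarily the fix taken in the linked pull request): the executors evaluate `requireNamespace("e1071")` when running the R code, so e1071 must be present on every node that hosts an R worker.

```shell
# Hedged workaround sketch (assumption): install e1071 on each node before
# resubmitting the job. Guard for machines without R so the snippet is safe
# to run anywhere.
if command -v Rscript >/dev/null 2>&1; then
  # Install e1071 only if it is not already available.
  Rscript -e 'if (!requireNamespace("e1071", quietly = TRUE)) install.packages("e1071", repos = "https://cloud.r-project.org")'
else
  echo "Rscript not found on this node; install R first"
fi
```

Once the package is available on all nodes, the example can be resubmitted. Note that the `--master yarn-cluster` form used in the report is deprecated in Spark 2.x; the equivalent is `spark-submit --master yarn --deploy-mode cluster examples/src/main/r/ml/ml.R`.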
[jira] [Assigned] (SPARK-19158) ml.R example fails in yarn-cluster mode due to lack of e1071 package
[ https://issues.apache.org/jira/browse/SPARK-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19158:
------------------------------------

    Assignee:     (was: Apache Spark)

> ml.R example fails in yarn-cluster mode due to lack of e1071 package
> --------------------------------------------------------------------
>
>                 Key: SPARK-19158
>                 URL: https://issues.apache.org/jira/browse/SPARK-19158
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>            Reporter: Yesha Vora