[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207916#comment-15207916 ] Sun Rui commented on SPARK-14037: - Spark 1.6.1 release, standalone mode. bin/sparkR --master spark:// and run your code. Go to the Spark web UI; in the application page for sparkR, see the Executor Summary: ExecutorID Worker Cores Memory State Logs 0 worker-20160323135300-10.239.158.44-59572 12 1024 RUNNING stdout stderr. Click to see the stderr. > count(df) is very slow for dataframe constructed using SparkR::createDataFrame > -- > > Key: SPARK-14037 > URL: https://issues.apache.org/jira/browse/SPARK-14037 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 > Environment: Ubuntu 12.04 > RAM : 6 GB > Spark 1.6.1 Standalone >Reporter: Samuel Alexander > Labels: performance, sparkR > > Any operations on a dataframe created using SparkR::createDataFrame are very > slow. > I have a CSV of size ~ 6MB. Below is the sample content > 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter > 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter > 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter > 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter > 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter > 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter > 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter > 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter > 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter > 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter > I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, > sep=","). 
And then converted it into a Spark dataframe using sp_df <- > createDataFrame(sqlContext, r_df) > Now count(sp_df) took more than 30 seconds > When I load the same CSV using spark-csv like, direct_df <- > read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = > "com.databricks.spark.csv", inferSchema = "false", header="true") > count(direct_df) took under 1 sec. > I know performance has been improved in createDataFrame in Spark 1.6. But > other operations, like count(), are still very slow. > How can I get rid of this performance issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207915#comment-15207915 ] Wenchen Fan commented on SPARK-10925: - If you want to remove duplicated join keys, you can do `df1.join(df2, "key")`, and the result will only contain one key column. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at
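Wenchen Fan's suggestion above — the `df1.join(df2, "key")` variant — emits a single key column instead of two, which avoids the ambiguous-reference error later in this digest. A rough pure-Python sketch of that semantics (the `join_on_key` function and dict-based rows are illustrative stand-ins, not Spark's API):

```python
# Sketch: inner join on a shared column name that keeps ONE key column,
# mirroring what Spark's join-on-column-name variant does.
def join_on_key(left, right, key):
    """Inner-join two lists of dicts on `key`, keeping a single key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            merged = dict(lrow)  # left columns, including the one key column
            # right columns, minus its copy of the key
            merged.update({k: v for k, v in rrow.items() if k != key})
            joined.append(merged)
    return joined

df1 = [{"key": "A", "value1": 1}, {"key": "B", "value1": 2}]
df2 = [{"key": "A", "value2": 4}, {"key": "B", "value2": 5}]
result = join_on_key(df1, df2, "key")
```

Because the result carries only one `key` field per row, a later `select("key")` has nothing ambiguous to resolve.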
[jira] [Assigned] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14091: Assignee: Apache Spark > Consider improving performance of SparkContext.getCallSite() > > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan >Assignee: Apache Spark > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated eagerly, causing thread dumps to be computed anyway. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can be significant when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
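The optimization this ticket asks for — computing the expensive call site only when a local property is actually missing — can be sketched independently of Spark. All names below (`expensive_call_site`, `get_call_site`, the property keys) are illustrative stand-ins for `Utils.getCallSite` and `SparkContext.getCallSite`, not Spark's implementation:

```python
# Sketch: defer the costly fallback so it never runs when both
# local properties are already set.
calls = {"expensive": 0}

def expensive_call_site():
    calls["expensive"] += 1  # stands in for the costly thread dump
    return ("shortForm", "longForm")

def get_call_site(local_properties):
    cached = None
    def fallback(index):
        # compute at most once per invocation, and only if actually needed
        nonlocal cached
        if cached is None:
            cached = expensive_call_site()
        return cached[index]
    short = local_properties.get("short") or fallback(0)
    long_ = local_properties.get("long") or fallback(1)
    return short, long_

# When both properties are set, the expensive path never runs.
both_set = get_call_site({"short": "s", "long": "l"})
```

In Scala the same effect can be had with a `lazy val` for the fallback call site, which is the shape of change the ticket proposes.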
[jira] [Commented] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207914#comment-15207914 ] Apache Spark commented on SPARK-14091: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/11911 > Consider improving performance of SparkContext.getCallSite() > > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated eagerly, causing thread dumps to be computed anyway. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can be significant when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14091: Assignee: (was: Apache Spark) > Consider improving performance of SparkContext.getCallSite() > > > Key: SPARK-14091 > URL: https://issues.apache.org/jira/browse/SPARK-14091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Rajesh Balamohan > > Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). > {noformat} > private[spark] def getCallSite(): CallSite = { > val callSite = Utils.getCallSite() > CallSite( > > Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), > > Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) > ) > } > {noformat} > However, in some places Utils.withDummyCallSite(sc) is invoked to avoid > expensive thread dumps within getCallSite(). But Utils.getCallSite() is > evaluated eagerly, causing thread dumps to be computed anyway. This has an impact when > lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs > are present, which can be significant when the entire query runtime is > on the order of 10-20 seconds). > Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns
[ https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207877#comment-15207877 ] Wenchen Fan commented on SPARK-11231: - I'm not familiar with R or Spark R API, but for scala version, we can do `df1.join(df2, "key")`, and the result will only contain one key column. > join returns schema with duplicated and ambiguous join columns > -- > > Key: SPARK-11231 > URL: https://issues.apache.org/jira/browse/SPARK-11231 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 > Environment: R >Reporter: Matt Pollock > > In the case where the key column of two data frames are named the same thing, > join returns a data frame where that column is duplicated. Since the content > of the columns is guaranteed to be the same by row consolidating the > identical columns into a single column would replicate standard R behavior[1] > and help prevent ambiguous names. > Example: > {code} > > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3)) > > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6)) > > sdf1 <- createDataFrame(sqlContext, df1) > > sdf2 <- createDataFrame(sqlContext, df2) > > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner") > > schema(sjdf) > StructType > |-name = "key", type = "StringType", nullable = TRUE > |-name = "value1", type = "DoubleType", nullable = TRUE > |-name = "key", type = "StringType", nullable = TRUE > |-name = "value2", type = "DoubleType", nullable = TRUE > {code} > The duplicated key columns cause things like: > {code} > > library(magrittr) > > sjdf %>% select("key") > 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) 
: > org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could > be: key#125, key#127.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399) > at org.apache.spark.sql.catalyst.tree > {code} > [1] In base R there is no "join", but a similar function "merge" is provided > in which a "by" argument identifies the shared key column in the two data > frames. In the case where the key column names differ, "by.x" and "by.y" > arguments can be used. In the case of same-named key columns the > consolidation behavior requested above is observed. In the case of differing > names the "by.x" name is retained and consolidated with the "by.y" column, > which is dropped. 
> {code} > > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3)) > > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6)) > > merge(df1, df2, by="key") > key value1 value2 > 1 A 1 4 > 2 B 2 5 > 3 C 3 6 > df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3)) > > merge(df2, df3, by.x="key", by.y="akey") > key value2 value1 > 1 A 4 1 > 2 B 5 2 > 3 C 6 3 > > merge(df3, df2, by.x="akey", by.y="key") > akey value1 value2 > 1 A 1 4 > 2 B 2 5 > 3 C 3 6 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14074) Do not use install_github in SparkR build
[ https://issues.apache.org/jira/browse/SPARK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207866#comment-15207866 ] Shivaram Venkataraman commented on SPARK-14074: --- [~sunrui] Would you have a chance to check if the tag 0.3.1 is good enough for us? If so, we can switch to that. > Do not use install_github in SparkR build > - > > Key: SPARK-14074 > URL: https://issues.apache.org/jira/browse/SPARK-14074 > Project: Spark > Issue Type: Bug > Components: Build, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > In dev/lint-r.R, `install_github` makes our builds depend on an unstable > source. We should use official releases on CRAN instead, even if the released > version has fewer features. > cc: [~shivaram] [~sunrui] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()
Rajesh Balamohan created SPARK-14091: - Summary: Consider improving performance of SparkContext.getCallSite() Key: SPARK-14091 URL: https://issues.apache.org/jira/browse/SPARK-14091 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Rajesh Balamohan Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). {noformat} private[spark] def getCallSite(): CallSite = { val callSite = Utils.getCallSite() CallSite( Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) ) } {noformat} However, in some places Utils.withDummyCallSite(sc) is invoked to avoid expensive thread dumps within getCallSite(). But Utils.getCallSite() is evaluated eagerly, causing thread dumps to be computed anyway. This has an impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs are present, which can be significant when the entire query runtime is on the order of 10-20 seconds). Creating this JIRA to consider evaluating getCallSite only when needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14085) Star Expansion for Hash
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14085: Description: Support star expansion in hash and concat. For example {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) {code} was: Support star expansion in hash and concat. For example {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) structDf.select(concat($"*")) {code} > Star Expansion for Hash > --- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Support star expansion in hash and concat. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14085) Star Expansion for Hash
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14085: Description: Support star expansion in hash. For example {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) {code} was: Support star expansion in hash and concat. For example {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) {code} > Star Expansion for Hash > --- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Support star expansion in hash. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14085) Star Expansion for Hash
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14085: Summary: Star Expansion for Hash (was: Star Expansion for Hash and Concat) > Star Expansion for Hash > --- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Support star expansion in hash and concat. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > structDf.select(concat($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
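The behavior requested in SPARK-14085 — `hash($"*")` — hashes every column of each row once the star is expanded. A language-neutral sketch (Python's tuple hash stands in for Spark's Murmur3-based hash expression; `hash_all_columns` is illustrative, not Spark's API):

```python
# Sketch: "*" expands to the full column list, and the hash is computed
# over every column of the row.
def hash_all_columns(rows):
    """Hash each row across all columns, as "*" would expand to."""
    cols = sorted({c for r in rows for c in r})  # the expanded column list
    return [hash(tuple(r.get(c) for c in cols)) for r in rows]

rows = [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 1}]
hashed = hash_all_columns(rows)
```

Rows with identical column values hash identically; a change in any single column changes the hash input.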
[jira] [Resolved] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10146. - Resolution: Fixed Fix Version/s: 2.0.0 > Have an easy way to set data source reader/writer specific confs > > > Key: SPARK-10146 > URL: https://issues.apache.org/jira/browse/SPARK-10146 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 2.0.0 > > > Right now, it is hard to set data source reader/writer specific confs > correctly (e.g. parquet's row group size). Users need to set those confs in > the Hadoop conf before starting the application or through > {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be > great if we had an easy way to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207858#comment-15207858 ] Reynold Xin commented on SPARK-10146: - I think we are already doing this. I'm going to close the ticket. > Have an easy way to set data source reader/writer specific confs > > > Key: SPARK-10146 > URL: https://issues.apache.org/jira/browse/SPARK-10146 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 2.0.0 > > > Right now, it is hard to set data source reader/writer specific confs > correctly (e.g. parquet's row group size). Users need to set those confs in > the Hadoop conf before starting the application or through > {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be > great if we had an easy way to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
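The mechanism that makes this ticket moot is carrying per-read options with the reader (the `DataFrameReader.option` builder style) rather than mutating a global Hadoop conf. A minimal sketch of that pattern, assuming an illustrative `Reader` class that is not Spark's implementation:

```python
# Sketch of the builder-style, per-reader option mechanism: options are
# scoped to one read instead of living in a process-global configuration.
class Reader:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        self._options[key] = value
        return self  # chainable, mirroring the DataFrameReader style

    def load(self, path):
        # a real reader would hand these confs down to the data source
        return {"path": path, "conf": dict(self._options)}

result = Reader().option("parquet.block.size", "134217728").load("/data/t")
```

Each `Reader` instance owns its options, so two concurrent reads with different row-group sizes cannot clobber each other.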
[jira] [Closed] (SPARK-12769) Remove If expression
[ https://issues.apache.org/jira/browse/SPARK-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-12769. --- Resolution: Won't Fix Closing as won't fix for now since doing this change would make the explain plan more confusing (if -> case). > Remove If expression > > > Key: SPARK-12769 > URL: https://issues.apache.org/jira/browse/SPARK-12769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > If can be a simple factory method for CaseWhen, similar to CaseKeyWhen. > We can then simplify the optimizer rules we implement for conditional > expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
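The refactoring SPARK-12769 describes — `If` as a factory for `CaseWhen` — can be sketched with plain functions. `case_when` and `if_expr` below are illustrative stand-ins, not Catalyst's expression classes:

```python
# Sketch: "If" need not be its own expression type; it is just a
# single-branch CaseWhen.
def case_when(branches, else_value):
    """branches: list of (predicate, value) pairs, tried in order."""
    def evaluate(row):
        for cond, value in branches:
            if cond(row):
                return value
        return else_value
    return evaluate

def if_expr(cond, true_value, false_value):
    # If(c, a, b) == CaseWhen([(c, a)], else b)
    return case_when([(cond, true_value)], false_value)

sign = if_expr(lambda x: x > 0, "pos", "nonpos")
```

With one representation, optimizer rules for conditionals only need to handle `CaseWhen`; the trade-off Reynold cites is that explain plans would then show `case` where the user wrote `if`.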
[jira] [Resolved] (SPARK-12767) Improve conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12767. - Resolution: Fixed Assignee: Reynold Xin Fix Version/s: 2.0.0 > Improve conditional expressions > --- > > Key: SPARK-12767 > URL: https://issues.apache.org/jira/browse/SPARK-12767 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > There are a few improvements we can do to improve conditional expressions. > This ticket tracks them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12997) Use cast expression to perform type cast in csv
[ https://issues.apache.org/jira/browse/SPARK-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-12997. --- Resolution: Not A Problem > Use cast expression to perform type cast in csv > --- > > Key: SPARK-12997 > URL: https://issues.apache.org/jira/browse/SPARK-12997 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > CSVTypeCast.castTo should probably be removed, and just replace its usage > with a projection that uses a sequence of Cast expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13401) Fix SQL test warnings
[ https://issues.apache.org/jira/browse/SPARK-13401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13401. - Resolution: Fixed Assignee: Yong Tang Fix Version/s: 2.0.0 > Fix SQL test warnings > - > > Key: SPARK-13401 > URL: https://issues.apache.org/jira/browse/SPARK-13401 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Reporter: holdenk >Assignee: Yong Tang >Priority: Trivial > Fix For: 2.0.0 > > > SQL tests have a number of warnings about unreachable code, > non-exhaustive matches, and unchecked type casts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12855) Remove parser pluggability
[ https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207844#comment-15207844 ] Reynold Xin commented on SPARK-12855: - Got it - we can add this back, but we need to wait till we have the api changes in for Spark 2.0 in the next week or two. > Remove parser pluggability > -- > > Key: SPARK-12855 > URL: https://issues.apache.org/jira/browse/SPARK-12855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This pull request removes the public developer parser API for external > parsers. Given everything a parser depends on (e.g. logical plans and > expressions) are internal and not stable, external parsers will break with > every release of Spark. It is a bad idea to create the illusion that Spark > actually supports pluggable parsers. In addition, this also reduces > incentives for 3rd party projects to contribute parse improvements back to > Spark. > The number of applications that are using this feature is small (as far as I > know it came down from two to one as of Jan 2016, and will be 0 once we have > better ansi SQL support). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207843#comment-15207843 ] Reynold Xin commented on SPARK-14081: - Yes a pull request would be great. Probably a one-line change. Can you also add a test case for it? Thanks! > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful functions for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
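The fix the reporter proposes can be sketched independently of Spark: coerce the fill value to each column's declared type instead of widening the column. `fill_zero` and the schema dict below are illustrative Python stand-ins, not Spark's `DataFrameNaFunctions`:

```python
# Sketch: zero-fill nulls while preserving each column's own type,
# rather than promoting float columns to double.
def fill_zero(rows, schema):
    """Replace None with zero coerced to the column's declared type."""
    return [
        {col: (cast(0) if row[col] is None else row[col])
         for col, cast in schema.items()}
        for row in rows
    ]

schema = {"intField": int, "floatField": float, "doubleField": float}
rows = [
    {"intField": 1, "floatField": 1.5, "doubleField": 1.0},
    {"intField": None, "floatField": None, "doubleField": None},
]
filled = fill_zero(rows, schema)
```

The key property is that non-null values and column types come through unchanged; only the nulls are replaced, each with a zero of the column's own type.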
[jira] [Resolved] (SPARK-14072) Show JVM information when we run Benchmark
[ https://issues.apache.org/jira/browse/SPARK-14072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14072. - Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 2.0.0 > Show JVM information when we run Benchmark > -- > > Key: SPARK-14072 > URL: https://issues.apache.org/jira/browse/SPARK-14072 > Project: Spark > Issue Type: Improvement >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.0.0 > > > When we run a benchmark program, the result also shows processor information. > Since a version of JVM may also affect performance, it would be good to show > JVM version information. > Current: > {noformat} > model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > String Dictionary: Best/Avg Time(ms)Rate(M/s) Per > Row(ns) Relative > --- > SQL Parquet Vectorized693 / 740 15.1 > 66.1 1.0X > SQL Parquet MR 2501 / 2562 4.2 > 238.5 0.3X > {noformat} > Proposal: > {noformat} > model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > JVM information : IBM J9 VM, pxa6480sr2-20151023_01 (SR2) > String Dictionary: Best/Avg Time(ms)Rate(M/s) Per > Row(ns) Relative > --- > SQL Parquet Vectorized693 / 740 15.1 > 66.1 1.0X > SQL Parquet MR 2501 / 2562 4.2 > 238.5 0.3X > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
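An analogous sketch in Python of what the proposal above adds (Spark's Benchmark is Scala, where the VM line would presumably come from Java system properties such as java.vm.name and java.version): print runtime/VM details next to the CPU model so benchmark results are comparable across environments.

```python
# Sketch: emit an environment banner with both CPU and VM information,
# mirroring the "JVM information" line proposed for Benchmark output.
import platform

def benchmark_header():
    """Environment banner: CPU model plus the running VM's name/version."""
    cpu = platform.processor() or "unknown"
    vm = "%s %s" % (platform.python_implementation(), platform.python_version())
    return "model name\t: %s\nVM information\t: %s" % (cpu, vm)

header = benchmark_header()
```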
[jira] [Created] (SPARK-14090) The optimization method of convex function
chenalong created SPARK-14090: - Summary: The optimization method of convex function Key: SPARK-14090 URL: https://issues.apache.org/jira/browse/SPARK-14090 Project: Spark Issue Type: Task Components: MLlib, Optimizer Affects Versions: 2.1.0 Reporter: chenalong Priority: Critical >Currently, the optimization methods for convex functions in MLlib are limited. >The SGD and ALS methods are slow compared with the bundle method, so could we >implement the bundle method in Spark? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207823#comment-15207823 ] Travis Crawford commented on SPARK-14081: - Agreed all data types should allow filling without changing their data type. From what I have observed only FloatType changes. Here's an example using the more specific fill that allows users to provide a replacement value map per column. Notice how just {{floatField}} changes its data type. {code} scala> :paste // Entering paste mode (ctrl-D to finish) val zeroFilledMapDF = df.na.fill(Map( "intField" -> 0, "longField" -> 0L, "floatField" -> 0f, "doubleField" -> 0d )) // Exiting paste mode, now interpreting. zeroFilledMapDF: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, floatField: double, doubleField: double] {code} If what I'm proposing sounds like the correct behavior I'll put together a change and send a pull request. It looks relatively self-contained, perhaps with some overzealous casting in {{fill0}} or {{fillCol}}. > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful function for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. 
> {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
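The schema change reported above comes down to which side gets cast: today the float column is widened to match the double fill value, while the proposed fix would cast the fill value down to the column's type instead. A toy Python sketch of the two strategies (hypothetical function names, not Spark's actual code):

```python
# Toy model of the two fill strategies described in SPARK-14081.
# Names are illustrative only; Spark's real implementation lives in
# DataFrameNaFunctions (fillCol / fill0).

def fill_current(col_type: str) -> str:
    """Today's behavior: a double fill value widens FloatType columns."""
    return "double" if col_type == "float" else col_type

def fill_proposed(col_type: str) -> str:
    """Proposed behavior: cast the fill value, preserving the column type."""
    return col_type

# float columns are the only ones observed to change type
assert fill_current("float") == "double"
assert fill_current("int") == "int"
assert fill_proposed("float") == "float"
```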
[jira] [Commented] (SPARK-14074) Do not use install_github in SparkR build
[ https://issues.apache.org/jira/browse/SPARK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207780#comment-15207780 ] Sun Rui commented on SPARK-14074: - Yes, an unstable source may cause unexpected test failures. I agree we should use a specific GitHub tag. We could periodically check whether there is a new CRAN release or a new tag and update the source if needed. > Do not use install_github in SparkR build > - > > Key: SPARK-14074 > URL: https://issues.apache.org/jira/browse/SPARK-14074 > Project: Spark > Issue Type: Bug > Components: Build, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > In dev/lint-r.R, `install_github` makes our builds depend on an unstable > source. We should use official releases on CRAN instead, even if the released > version has fewer features. > cc: [~shivaram] [~sunrui] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14089) Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207773#comment-15207773 ] Apache Spark commented on SPARK-14089: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/11910 > Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x > > > Key: SPARK-14089 > URL: https://issues.apache.org/jira/browse/SPARK-14089 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14089) Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14089: Assignee: Apache Spark > Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x > > > Key: SPARK-14089 > URL: https://issues.apache.org/jira/browse/SPARK-14089 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14089) Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14089: Assignee: (was: Apache Spark) > Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x > > > Key: SPARK-14089 > URL: https://issues.apache.org/jira/browse/SPARK-14089 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14089) Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x
Liwei Lin created SPARK-14089: - Summary: Remove methods that have been deprecated since 1.1.x, 1.2.x and 1.3.x Key: SPARK-14089 URL: https://issues.apache.org/jira/browse/SPARK-14089 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core Affects Versions: 2.0.0 Reporter: Liwei Lin Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12855) Remove parser pluggability
[ https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207765#comment-15207765 ] Joseph Levin commented on SPARK-12855: -- Reynold - We would expect that, grudgingly. Of course we don't want it to break on each build, but some churn we could live with. Part of my pushback is that I believe this closes off one of the most powerful aspects of SQL on Spark. Writing an extensible parser is in itself a large undertaking. (I can only think of two others with similar flexibility: ANTLR, which can be made extensible but isn't fully so out of the box, and Microsoft's Roslyn.) Marrying an extensible parser to Spark's distributed, cross-platform functionality is, as far as I have been able to find, unique. For this project's initial work we didn't even need to be in the Hadoop/big-data space; our initial set of data sources all support JDBC. We did require a query engine that could handle a single request to multiple data sources and give us the ability to rewrite the request on the fly. Spark is the only toolset we found that met both needs. As a side note, it was the Databricks Deep Dive article on Catalyst that, I believe, you cowrote that led us to try Spark for this problem. > Remove parser pluggability > -- > > Key: SPARK-12855 > URL: https://issues.apache.org/jira/browse/SPARK-12855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This pull request removes the public developer parser API for external > parsers. Given everything a parser depends on (e.g. logical plans and > expressions) are internal and not stable, external parsers will break with > every release of Spark. It is a bad idea to create the illusion that Spark > actually supports pluggable parsers. In addition, this also reduces > incentives for 3rd party projects to contribute parse improvements back to > Spark. 
> The number of applications that are using this feature is small (as far as I > know it came down from two to one as of Jan 2016, and will be 0 once we have > better ansi SQL support). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12855) Remove parser pluggability
[ https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207729#comment-15207729 ] Reynold Xin commented on SPARK-12855: - Joseph - in the case of creating your own parser, you are essentially tying your implementation to the internals of Catalyst, and as a result might break with every release of Spark. Is that expected? > Remove parser pluggability > -- > > Key: SPARK-12855 > URL: https://issues.apache.org/jira/browse/SPARK-12855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This pull request removes the public developer parser API for external > parsers. Given everything a parser depends on (e.g. logical plans and > expressions) are internal and not stable, external parsers will break with > every release of Spark. It is a bad idea to create the illusion that Spark > actually supports pluggable parsers. In addition, this also reduces > incentives for 3rd party projects to contribute parse improvements back to > Spark. > The number of applications that are using this feature is small (as far as I > know it came down from two to one as of Jan 2016, and will be 0 once we have > better ansi SQL support). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12855) Remove parser pluggability
[ https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207725#comment-15207725 ] Joseph Levin commented on SPARK-12855: -- I have some concerns about this task. I am working on a project at my company to upgrade how we do data management; in particular, how we maintain, stage, and deliver data to our customers and internal users. To do this we are using Spark as the main component of our data-access tier, in large part because of the flexibility the customizable parser gives us. In our case the primary use for the parser is not to create completely new SQL-like syntaxes but to rewrite data requests on the fly, based on changes in the underlying store, using the transform methods on the query plans. Essentially it serves as a semantic layer for our data services. In the initial, and simplest, scenario we are using it to create queryable views of our backend data, but we are also looking at it to help address upcoming sharding needs and some more complex conditional views. Further down the road we want to evaluate it for constructing additional caching tiers and possibly for enforcing data-access policies. For the latter we were looking at some syntactic enhancements, but so far they have all been in the DDL, so they don't yet affect the syntax we would use in the data requests. > Remove parser pluggability > -- > > Key: SPARK-12855 > URL: https://issues.apache.org/jira/browse/SPARK-12855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This pull request removes the public developer parser API for external > parsers. Given everything a parser depends on (e.g. logical plans and > expressions) are internal and not stable, external parsers will break with > every release of Spark. It is a bad idea to create the illusion that Spark > actually supports pluggable parsers. 
In addition, this also reduces > incentives for 3rd party projects to contribute parse improvements back to > Spark. > The number of applications that are using this feature is small (as far as I > know it came down from two to one as of Jan 2016, and will be 0 once we have > better ansi SQL support). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
[ https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207692#comment-15207692 ] Reynold Xin commented on SPARK-14081: - This is actually somewhat tricky, because we will lose information when the missing value (what user specifies) is being converted from double to float. In the case of 0 this is obviously not a problem, but in other cases it might be. Also it might be weird to do this only for float but not ints. What do you think? > DataFrameNaFunctions fill should not convert float fields to double > --- > > Key: SPARK-14081 > URL: https://issues.apache.org/jira/browse/SPARK-14081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Travis Crawford > > [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] > provides useful function for dealing with null values in a DataFrame. > Currently it changes FloatType columns to DoubleType when zero filling. Spark > should preserve the column data type. > In the following example, notice how `zeroFilledDF` has its `floatField` > converted from float to double. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val schema = StructType(Seq( > StructField("intField", IntegerType), > StructField("longField", LongType), > StructField("floatField", FloatType), > StructField("doubleField", DoubleType))) > val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) > val df = sqlContext.createDataFrame(rdd, schema) > val zeroFilledDF = df.na.fill(0) > // Exiting paste mode, now interpreting. 
> import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(intField,IntegerType,true), > StructField(longField,LongType,true), StructField(floatField,FloatType,true), > StructField(doubleField,DoubleType,true)) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[2] at parallelize at :48 > df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, > floatField: float, doubleField: double] > zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: > bigint, floatField: double, doubleField: double] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
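The precision concern raised above, that narrowing the user-supplied double replacement value to a float can lose information, can be seen without Spark at all. A small Python check, using the struct module to emulate a 32-bit float:

```python
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 0 survives the narrowing exactly, so fill(0) is safe...
assert to_float32(0.0) == 0.0
# ...but an arbitrary replacement value such as 0.1 does not.
assert to_float32(0.1) != 0.1
```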
[jira] [Assigned] (SPARK-14088) Some Dataset API touch-up
[ https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14088: Assignee: Apache Spark (was: Reynold Xin) > Some Dataset API touch-up > - > > Key: SPARK-14088 > URL: https://issues.apache.org/jira/browse/SPARK-14088 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > > 1. Deprecated unionAll. It is pretty confusing to have both "union" and > "unionAll" when the two do the same thing in Spark but are different in SQL. > 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more > consistent with rest of the functions in KeyValueGroupedDataset. Also makes > it more obvious what "reduce" and "reduceGroups" mean. Previously it was > confusing because it could be reducing a Dataset, or just reducing groups. > 3. Added a "name" function, which is more natural to name columns than "as" > for non-SQL users. > 4. Remove "subtract" function since it is just an alias for "except". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14088) Some Dataset API touch-up
[ https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14088: Assignee: Reynold Xin (was: Apache Spark) > Some Dataset API touch-up > - > > Key: SPARK-14088 > URL: https://issues.apache.org/jira/browse/SPARK-14088 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > 1. Deprecated unionAll. It is pretty confusing to have both "union" and > "unionAll" when the two do the same thing in Spark but are different in SQL. > 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more > consistent with rest of the functions in KeyValueGroupedDataset. Also makes > it more obvious what "reduce" and "reduceGroups" mean. Previously it was > confusing because it could be reducing a Dataset, or just reducing groups. > 3. Added a "name" function, which is more natural to name columns than "as" > for non-SQL users. > 4. Remove "subtract" function since it is just an alias for "except". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14088) Some Dataset API touch-up
[ https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207674#comment-15207674 ] Apache Spark commented on SPARK-14088: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/11908 > Some Dataset API touch-up > - > > Key: SPARK-14088 > URL: https://issues.apache.org/jira/browse/SPARK-14088 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > 1. Deprecated unionAll. It is pretty confusing to have both "union" and > "unionAll" when the two do the same thing in Spark but are different in SQL. > 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more > consistent with rest of the functions in KeyValueGroupedDataset. Also makes > it more obvious what "reduce" and "reduceGroups" mean. Previously it was > confusing because it could be reducing a Dataset, or just reducing groups. > 3. Added a "name" function, which is more natural to name columns than "as" > for non-SQL users. > 4. Remove "subtract" function since it is just an alias for "except". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14088) Some Dataset API touch-up
Reynold Xin created SPARK-14088: --- Summary: Some Dataset API touch-up Key: SPARK-14088 URL: https://issues.apache.org/jira/browse/SPARK-14088 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin 1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL. 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups. 3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14088) Some Dataset API touch-up
[ https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14088: Description: 1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL. 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups. 3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users. 4. Remove "subtract" function since it is just an alias for "except". was: 1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL. 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups. 3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users. > Some Dataset API touch-up > - > > Key: SPARK-14088 > URL: https://issues.apache.org/jira/browse/SPARK-14088 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > 1. Deprecated unionAll. It is pretty confusing to have both "union" and > "unionAll" when the two do the same thing in Spark but are different in SQL. > 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more > consistent with rest of the functions in KeyValueGroupedDataset. Also makes > it more obvious what "reduce" and "reduceGroups" mean. 
Previously it was > confusing because it could be reducing a Dataset, or just reducing groups. > 3. Added a "name" function, which is more natural to name columns than "as" > for non-SQL users. > 4. Remove "subtract" function since it is just an alias for "except". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14066) Set "spark.sql.dialect=sql", there is a problem in running query "select percentile(d,array(0,0.2,0.3,1)) as a from t;"
[ https://issues.apache.org/jira/browse/SPARK-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-14066: -- Description: In spark 1.5.1, I run "sh bin/spark-sql --conf spark.sql.dialect=sql", and run query "select percentile(d,array(0,0.2,0.3,1)) as a from t". There is a problem as follows. {code} spark-sql> select percentile(d,array(0,0.2,0.3,1)) as a from t; 16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t 16/03/22 17:25:15 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t 16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select percentile(d,array(0,0.2,0.3,1)) as a from t] org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' due to data type mismatch: input to function array should all be the same type, but it's [int, decimal(1,1), decimal(1,1), int]; {code} was: In spark 1.5.1, I run "sh bin/spark-sql --conf spark.sql.dialect=sql", and run query "select percentile(d,array(0,0.2,0.3,1)) as a from t". There is a problem as follows. 
spark-sql> select percentile(d,array(0,0.2,0.3,1)) as a from t; 16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t 16/03/22 17:25:15 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t 16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select percentile(d,array(0,0.2,0.3,1)) as a from t] org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' due to data type mismatch: input to function array should all be the same type, but it's [int, decimal(1,1), decimal(1,1), int]; > Set "spark.sql.dialect=sql", there is a problem in running query "select > percentile(d,array(0,0.2,0.3,1)) as a from t;" > - > > Key: SPARK-14066 > URL: https://issues.apache.org/jira/browse/SPARK-14066 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: KaiXinXIaoLei > > In spark 1.5.1, I run "sh bin/spark-sql --conf spark.sql.dialect=sql", and > run query "select percentile(d,array(0,0.2,0.3,1)) as a from t". There is a > problem as follows. > {code} > spark-sql> select percentile(d,array(0,0.2,0.3,1)) as a from t; > 16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t > 16/03/22 17:25:15 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table > : db=default tbl=t > 16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select > percentile(d,array(0,0.2,0.3,1)) as a from t] > org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' > due to data type mismatch: input to function array should all be the same > type, but it's [int, decimal(1,1), decimal(1,1), int]; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14066) Set "spark.sql.dialect=sql", there is a problem in running query "select percentile(d,array(0,0.2,0.3,1)) as a from t;"
[ https://issues.apache.org/jira/browse/SPARK-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206280#comment-15206280 ] KaiXinXIaoLei edited comment on SPARK-14066 at 3/23/16 1:19 AM: In the org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion.FunctionArgumentConversion, I find in the value `findTightestCommonTypeOfTwo`: {code} case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) => Some(t2) case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) => Some(t1) {code} In `array(0,0.2,0.3,1)`, The type of `0` changes `DecimalType(10, 0)`, The type of `0.2` is `DecimalType(1, 1)`, so the value of `t2.isWiderThan(t1) ` is false. So the type of numbers will be [int, decimal(1,1), decimal(1,1), int]. And the query run failed. So I think the `TightestCommonTypeOfTwo` is not reasonable. Thanks. was (Author: kaixinxiaolei): In the org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion.FunctionArgumentConversion, I find in the value `findTightestCommonTypeOfTwo`: ``` case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) => Some(t2) case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) => Some(t1) ``` In `array(0,0.2,0.3,1)`, The type of `0` changes `DecimalType(10, 0)`, The type of `0.2` is `DecimalType(1, 1)`, so the value of `t2.isWiderThan(t1) ` is false. So the type of numbers will be [int, decimal(1,1), decimal(1,1), int]. And the query run failed. So I think the `TightestCommonTypeOfTwo` is not reasonable. Thanks. > Set "spark.sql.dialect=sql", there is a problen in running query "select > percentile(d,array(0,0.2,0.3,1)) as a from t;" > - > > Key: SPARK-14066 > URL: https://issues.apache.org/jira/browse/SPARK-14066 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: KaiXinXIaoLei > > In spark 1.5.1, I run "sh bin/spark-sql --conf spark.sql.dialect=sql", and > run query "select percentile(d,array(0,0.2,0.3,1)) as a from t". There is a > problem as follows. 
> {code} > spark-sql> select percentile(d,array(0,0.2,0.3,1)) as a from t; > 16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t > 16/03/22 17:25:15 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table > : db=default tbl=t > 16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select > percentile(d,array(0,0.2,0.3,1)) as a from t] > org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' > due to data type mismatch: input to function array should all be the same > type, but it's [int, decimal(1,1), decimal(1,1), int]; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
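The failure discussed in this thread hinges on the widening check the comment quotes: an integer literal needs up to 10 integral digits, while decimal(1,1) has zero integral digits and so cannot hold every int. A rough Python model of that rule (an approximation of what `isWiderThan` checks, not Spark's actual code):

```python
# Approximate model of DecimalType.isWiderThan(IntegralType) as described
# in the comment above; this is a sketch, not Spark's implementation.

INT_DIGITS = 10  # an IntegerType value can need up to 10 decimal digits

def decimal_is_wider_than_int(precision: int, scale: int) -> bool:
    """A decimal can hold any int only with >= 10 integral digits."""
    return precision - scale >= INT_DIGITS

# decimal(1,1), the type of 0.2, cannot hold ints -> no common type found,
# so array(0, 0.2, 0.3, 1) fails to resolve.
assert not decimal_is_wider_than_int(1, 1)
# a wider decimal such as decimal(11,1) could hold any int
assert decimal_is_wider_than_int(11, 1)
```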
[jira] [Updated] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14033: -- Summary: Merging Estimator & Model (was: Merging Estimator, Model, & Transformer) > Merging Estimator & Model > - > > Key: SPARK-14033 > URL: https://issues.apache.org/jira/browse/SPARK-14033 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Timothy Hunter > Attachments: StyleMutabilityMergingEstimatorandModel.pdf > > > This JIRA is for merging the spark.ml concepts of Estimator and Model. > Goal: Have clearer semantics which match existing libraries (such as > scikit-learn). > For details, please see the linked design doc. Comment on this JIRA to give > feedback on the proposed design. Once the proposal is discussed and this > work is confirmed as ready to proceed, this JIRA will serve as an umbrella > for the merge tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14087: Assignee: (was: Apache Spark) > PySpark ML JavaModel does not properly own params after being fit > - > > Key: SPARK-14087 > URL: https://issues.apache.org/jira/browse/SPARK-14087 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Attachments: feature.py > > > When a PySpark model is created after fitting data, its UID is initialized to > the parent estimator's value. Before this assignment, any params defined in > the model are copied from the object to the class in > {{Params._copy_params()}} and assigned a different parent UID. This causes > PySpark to think the params are not owned by the model and can lead to a > {{ValueError}} raised from {{Params._shouldOwn()}}, such as: > {noformat} > ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', > name='outputCol', doc='output column name.') does not belong to > CountVectorizer_4c8e9fd539542d783e66. > {noformat} > I encountered this problem while working on SPARK-13967 where I tried to add > the shared params {{HasInputCol}} and {{HasOutputCol}} to > {{CountVectorizerModel}}. See the attached file feature.py for the WIP. > Using the modified 'feature.py', this sample code shows the mixup in UIDs and > produces the error above. 
> {noformat} > sc = SparkContext(appName="count_vec_test") > sqlContext = SQLContext(sc) > df = sqlContext.createDataFrame( > [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", > "raw"]) > cv = CountVectorizer(inputCol="raw", outputCol="vectors") > model = cv.fit(df) > print(model.uid) > for p in model.params: > print(str(p)) > model.transform(df).show(truncate=False) > {noformat} > output (the UIDs should match): > {noformat} > CountVectorizer_4c8e9fd539542d783e66 > CountVectorizerModel_4336a81ba742b2593fef__binary > CountVectorizerModel_4336a81ba742b2593fef__inputCol > CountVectorizerModel_4336a81ba742b2593fef__outputCol > {noformat} > In the Scala implementation of this, the model overrides the UID value, which > the Params use when they are constructed, so they all end up with the parent > estimator UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
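For readers outside the Spark codebase, the ownership problem described above can be reproduced with a toy version of the Params machinery. Everything below (`ToyParam`, `ToyParams`, `ToyModel`, `fit`, `fit_fixed`) is illustrative, not PySpark's actual API; the sketch only shows why constructing params before the model's UID is overwritten leaves them pointing at the wrong parent, and how re-parenting them afterwards restores ownership.

```python
import uuid

class ToyParam:
    """Minimal stand-in for pyspark.ml.param.Param: a parent UID and a name."""
    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

class ToyParams:
    """Mimics Params ownership: a param is owned iff its parent matches our UID."""
    def __init__(self):
        self.uid = type(self).__name__ + "_" + uuid.uuid4().hex[:8]
        # Params are constructed with the *current* UID as parent ...
        self.outputCol = ToyParam(self.uid, "outputCol")

    def should_own(self, param):
        return param.parent == self.uid

class ToyModel(ToyParams):
    pass

def fit(parent_uid):
    model = ToyModel()
    # ... but, as in SPARK-14087, the model's UID is overwritten with the
    # parent estimator's UID only *after* the params were constructed.
    model.uid = parent_uid
    return model

broken = fit("CountVectorizer_1234")
assert not broken.should_own(broken.outputCol)  # param kept the old parent UID

def fit_fixed(parent_uid):
    model = ToyModel()
    model.uid = parent_uid
    model.outputCol.parent = parent_uid  # re-parent the param to the new UID
    return model

fixed = fit_fixed("CountVectorizer_1234")
assert fixed.should_own(fixed.outputCol)
```

The re-parenting step in `fit_fixed` is the shape of fix the Scala side gets for free, since there the UID is set before the Params are constructed.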
[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207604#comment-15207604 ] Apache Spark commented on SPARK-14087: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/11906 > PySpark ML JavaModel does not properly own params after being fit > - > > Key: SPARK-14087 > URL: https://issues.apache.org/jira/browse/SPARK-14087 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Attachments: feature.py > > > When a PySpark model is created after fitting data, its UID is initialized to > the parent estimator's value. Before this assignment, any params defined in > the model are copied from the object to the class in > {{Params._copy_params()}} and assigned a different parent UID. This causes > PySpark to think the params are not owned by the model and can lead to a > {{ValueError}} raised from {{Params._shouldOwn()}}, such as: > {noformat} > ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', > name='outputCol', doc='output column name.') does not belong to > CountVectorizer_4c8e9fd539542d783e66. > {noformat} > I encountered this problem while working on SPARK-13967 where I tried to add > the shared params {{HasInputCol}} and {{HasOutputCol}} to > {{CountVectorizerModel}}. See the attached file feature.py for the WIP. > Using the modified 'feature.py', this sample code shows the mixup in UIDs and > produces the error above. 
> {noformat} > sc = SparkContext(appName="count_vec_test") > sqlContext = SQLContext(sc) > df = sqlContext.createDataFrame( > [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", > "raw"]) > cv = CountVectorizer(inputCol="raw", outputCol="vectors") > model = cv.fit(df) > print(model.uid) > for p in model.params: > print(str(p)) > model.transform(df).show(truncate=False) > {noformat} > output (the UIDs should match): > {noformat} > CountVectorizer_4c8e9fd539542d783e66 > CountVectorizerModel_4336a81ba742b2593fef__binary > CountVectorizerModel_4336a81ba742b2593fef__inputCol > CountVectorizerModel_4336a81ba742b2593fef__outputCol > {noformat} > In the Scala implementation of this, the model overrides the UID value, which > the Params use when they are constructed, so they all end up with the parent > estimator UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14087: Assignee: Apache Spark > PySpark ML JavaModel does not properly own params after being fit > - > > Key: SPARK-14087 > URL: https://issues.apache.org/jira/browse/SPARK-14087 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > Attachments: feature.py > > > When a PySpark model is created after fitting data, its UID is initialized to > the parent estimator's value. Before this assignment, any params defined in > the model are copied from the object to the class in > {{Params._copy_params()}} and assigned a different parent UID. This causes > PySpark to think the params are not owned by the model and can lead to a > {{ValueError}} raised from {{Params._shouldOwn()}}, such as: > {noformat} > ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', > name='outputCol', doc='output column name.') does not belong to > CountVectorizer_4c8e9fd539542d783e66. > {noformat} > I encountered this problem while working on SPARK-13967 where I tried to add > the shared params {{HasInputCol}} and {{HasOutputCol}} to > {{CountVectorizerModel}}. See the attached file feature.py for the WIP. > Using the modified 'feature.py', this sample code shows the mixup in UIDs and > produces the error above. 
> {noformat} > sc = SparkContext(appName="count_vec_test") > sqlContext = SQLContext(sc) > df = sqlContext.createDataFrame( > [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", > "raw"]) > cv = CountVectorizer(inputCol="raw", outputCol="vectors") > model = cv.fit(df) > print(model.uid) > for p in model.params: > print(str(p)) > model.transform(df).show(truncate=False) > {noformat} > output (the UIDs should match): > {noformat} > CountVectorizer_4c8e9fd539542d783e66 > CountVectorizerModel_4336a81ba742b2593fef__binary > CountVectorizerModel_4336a81ba742b2593fef__inputCol > CountVectorizerModel_4336a81ba742b2593fef__outputCol > {noformat} > In the Scala implementation of this, the model overrides the UID value, which > the Params use when they are constructed, so they all end up with the parent > estimator UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13806) SQL round() produces incorrect results for negative values
[ https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13806. Resolution: Fixed Fix Version/s: 1.6.2 1.5.3 2.1.0 Issue resolved by pull request 11894 [https://github.com/apache/spark/pull/11894] > SQL round() produces incorrect results for negative values > -- > > Key: SPARK-13806 > URL: https://issues.apache.org/jira/browse/SPARK-13806 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Mark Hamstra >Assignee: Davies Liu > Fix For: 2.1.0, 1.5.3, 1.6.2 > > > Round in catalyst/expressions/mathExpressions.scala appears to be untested > with negative values, and it doesn't handle them correctly. > There are at least two issues here: > First, in the genCode for FloatType and DoubleType with _scale == 0, round() > will not produce the same results as for the BigDecimal.ROUND_HALF_UP > strategy used in all other cases. This is because Math.round is used for > these _scale == 0 cases. For example, Math.round(-3.5) is -3, while > BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. > Even after this bug is fixed with something like... > {code} > if (${ce.value} < 0) { > ${ev.value} = -1 * Math.round(-1 * ${ce.value}); > } else { > ${ev.value} = Math.round(${ce.value}); > } > {code} > ...which will allow an additional test like this to succeed in > MathFunctionsSuite.scala: > {code} > checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow) > {code} > ...there still appears to be a problem on at least the > checkEvalutionWithUnsafeProjection path, where failures like this are > produced: > {code} > Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: > [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
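The discrepancy the report describes is easy to see outside Spark: Java's `Math.round(x)` computes `floor(x + 0.5)`, while `BigDecimal.ROUND_HALF_UP` rounds ties away from zero. A small Python sketch of the two strategies (the helper names are ours, not Spark's):

```python
import math
from decimal import Decimal, ROUND_HALF_UP

def java_math_round(x):
    # Java's Math.round(double) is defined as floor(x + 0.5)
    return math.floor(x + 0.5)

def half_up(x):
    # BigDecimal.ROUND_HALF_UP: ties round away from zero
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

assert java_math_round(3.5) == 4 and half_up(3.5) == 4  # agree for positives
assert java_math_round(-3.5) == -3                      # floor(-3.0) == -3
assert half_up(-3.5) == -4                              # away from zero

# The fix quoted in the issue mirrors the sign before rounding:
def fixed_round(x):
    return -java_math_round(-x) if x < 0 else java_math_round(x)

assert fixed_round(-3.5) == -4
```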
[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207555#comment-15207555 ] Bryan Cutler commented on SPARK-14087: -- I can post a PR for this > PySpark ML JavaModel does not properly own params after being fit > - > > Key: SPARK-14087 > URL: https://issues.apache.org/jira/browse/SPARK-14087 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Attachments: feature.py > > > When a PySpark model is created after fitting data, its UID is initialized to > the parent estimator's value. Before this assignment, any params defined in > the model are copied from the object to the class in > {{Params._copy_params()}} and assigned a different parent UID. This causes > PySpark to think the params are not owned by the model and can lead to a > {{ValueError}} raised from {{Params._shouldOwn()}}, such as: > {noformat} > ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', > name='outputCol', doc='output column name.') does not belong to > CountVectorizer_4c8e9fd539542d783e66. > {noformat} > I encountered this problem while working on SPARK-13967 where I tried to add > the shared params {{HasInputCol}} and {{HasOutputCol}} to > {{CountVectorizerModel}}. See the attached file feature.py for the WIP. > Using the modified 'feature.py', this sample code shows the mixup in UIDs and > produces the error above. 
> {noformat} > sc = SparkContext(appName="count_vec_test") > sqlContext = SQLContext(sc) > df = sqlContext.createDataFrame( > [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", > "raw"]) > cv = CountVectorizer(inputCol="raw", outputCol="vectors") > model = cv.fit(df) > print(model.uid) > for p in model.params: > print(str(p)) > model.transform(df).show(truncate=False) > {noformat} > output (the UIDs should match): > {noformat} > CountVectorizer_4c8e9fd539542d783e66 > CountVectorizerModel_4336a81ba742b2593fef__binary > CountVectorizerModel_4336a81ba742b2593fef__inputCol > CountVectorizerModel_4336a81ba742b2593fef__outputCol > {noformat} > In the Scala implementation of this, the model overrides the UID value, which > the Params use when they are constructed, so they all end up with the parent > estimator UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-14087: - Attachment: feature.py > PySpark ML JavaModel does not properly own params after being fit > - > > Key: SPARK-14087 > URL: https://issues.apache.org/jira/browse/SPARK-14087 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Attachments: feature.py > > > When a PySpark model is created after fitting data, its UID is initialized to > the parent estimator's value. Before this assignment, any params defined in > the model are copied from the object to the class in > {{Params._copy_params()}} and assigned a different parent UID. This causes > PySpark to think the params are not owned by the model and can lead to a > {{ValueError}} raised from {{Params._shouldOwn()}}, such as: > {noformat} > ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', > name='outputCol', doc='output column name.') does not belong to > CountVectorizer_4c8e9fd539542d783e66. > {noformat} > I encountered this problem while working on SPARK-13967 where I tried to add > the shared params {{HasInputCol}} and {{HasOutputCol}} to > {{CountVectorizerModel}}. See the attached file feature.py for the WIP. > Using the modified 'feature.py', this sample code shows the mixup in UIDs and > produces the error above. 
> {noformat} > sc = SparkContext(appName="count_vec_test") > sqlContext = SQLContext(sc) > df = sqlContext.createDataFrame( > [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", > "raw"]) > cv = CountVectorizer(inputCol="raw", outputCol="vectors") > model = cv.fit(df) > print(model.uid) > for p in model.params: > print(str(p)) > model.transform(df).show(truncate=False) > {noformat} > output (the UIDs should match): > {noformat} > CountVectorizer_4c8e9fd539542d783e66 > CountVectorizerModel_4336a81ba742b2593fef__binary > CountVectorizerModel_4336a81ba742b2593fef__inputCol > CountVectorizerModel_4336a81ba742b2593fef__outputCol > {noformat} > In the Scala implementation of this, the model overrides the UID value, which > the Params use when they are constructed, so they all end up with the parent > estimator UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
Bryan Cutler created SPARK-14087:
------------------------------------

             Summary: PySpark ML JavaModel does not properly own params after being fit
                 Key: SPARK-14087
                 URL: https://issues.apache.org/jira/browse/SPARK-14087
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
            Reporter: Bryan Cutler
            Priority: Minor


When a PySpark model is created after fitting data, its UID is initialized to the parent estimator's value. Before this assignment, any params defined in the model are copied from the class to the object in {{Params._copy_params()}} and assigned a different parent UID. This causes PySpark to think the params are not owned by the model and can lead to a {{ValueError}} raised from {{Params._shouldOwn()}}, such as:

{noformat}
ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', name='outputCol', doc='output column name.') does not belong to CountVectorizer_4c8e9fd539542d783e66.
{noformat}

I encountered this problem while working on SPARK-13967 where I tried to add the shared params {{HasInputCol}} and {{HasOutputCol}} to {{CountVectorizerModel}}. See the attached file feature.py for the WIP.

Using the modified 'feature.py', this sample code shows the mixup in UIDs and produces the error above.

{noformat}
sc = SparkContext(appName="count_vec_test")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", "raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(df)
print(model.uid)
for p in model.params:
    print(str(p))
model.transform(df).show(truncate=False)
{noformat}

output (the UIDs should match):

{noformat}
CountVectorizer_4c8e9fd539542d783e66
CountVectorizerModel_4336a81ba742b2593fef__binary
CountVectorizerModel_4336a81ba742b2593fef__inputCol
CountVectorizerModel_4336a81ba742b2593fef__outputCol
{noformat}

In the Scala implementation of this, the model overrides the UID value, which the Params use when they are constructed, so they all end up with the parent estimator UID.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207537#comment-15207537 ] Joseph K. Bradley commented on SPARK-5991: -- Reopening since we'll need to add items once more Scala implementations are done > Python API for ML model import/export > - > > Key: SPARK-5991 > URL: https://issues.apache.org/jira/browse/SPARK-5991 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > Many ML models support save/load in Scala and Java. The Python API needs > this. It should mostly be a simple matter of calling the JVM methods for > save/load, except for models which are stored in Python (e.g., linear models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-5991: -- > Python API for ML model import/export > - > > Key: SPARK-5991 > URL: https://issues.apache.org/jira/browse/SPARK-5991 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > Many ML models support save/load in Scala and Java. The Python API needs > this. It should mostly be a simple matter of calling the JVM methods for > save/load, except for models which are stored in Python (e.g., linear models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5991: - Fix Version/s: (was: 2.0.0) > Python API for ML model import/export > - > > Key: SPARK-5991 > URL: https://issues.apache.org/jira/browse/SPARK-5991 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > Many ML models support save/load in Scala and Java. The Python API needs > this. It should mostly be a simple matter of calling the JVM methods for > save/load, except for models which are stored in Python (e.g., linear models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14086) Add DDL commands to ANTLR4 Parser
[ https://issues.apache.org/jira/browse/SPARK-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14086: Assignee: (was: Apache Spark) > Add DDL commands to ANTLR4 Parser > - > > Key: SPARK-14086 > URL: https://issues.apache.org/jira/browse/SPARK-14086 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14086) Add DDL commands to ANTLR4 Parser
[ https://issues.apache.org/jira/browse/SPARK-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207528#comment-15207528 ] Apache Spark commented on SPARK-14086: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/11905 > Add DDL commands to ANTLR4 Parser > - > > Key: SPARK-14086 > URL: https://issues.apache.org/jira/browse/SPARK-14086 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14086) Add DDL commands to ANTLR4 Parser
[ https://issues.apache.org/jira/browse/SPARK-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14086: Assignee: Apache Spark > Add DDL commands to ANTLR4 Parser > - > > Key: SPARK-14086 > URL: https://issues.apache.org/jira/browse/SPARK-14086 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Apache Spark > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-5991. -- Resolution: Fixed Fix Version/s: 2.0.0 > Python API for ML model import/export > - > > Key: SPARK-5991 > URL: https://issues.apache.org/jira/browse/SPARK-5991 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > Fix For: 2.0.0 > > > Many ML models support save/load in Scala and Java. The Python API needs > this. It should mostly be a simple matter of calling the JVM methods for > save/load, except for models which are stored in Python (e.g., linear models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14085) Star Expansion for Hash and Concat
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14085: Assignee: (was: Apache Spark) > Star Expansion for Hash and Concat > -- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Support star expansion in hash and concat. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > structDf.select(concat($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14086) Add DDL commands to ANTLR4 Parser
Herman van Hovell created SPARK-14086: - Summary: Add DDL commands to ANTLR4 Parser Key: SPARK-14086 URL: https://issues.apache.org/jira/browse/SPARK-14086 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Herman van Hovell -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14085) Star Expansion for Hash and Concat
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207515#comment-15207515 ] Apache Spark commented on SPARK-14085: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11904 > Star Expansion for Hash and Concat > -- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Support star expansion in hash and concat. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > structDf.select(concat($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14085) Star Expansion for Hash and Concat
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14085: Assignee: Apache Spark > Star Expansion for Hash and Concat > -- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Support star expansion in hash and concat. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > structDf.select(concat($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14085) Star Expansion for Hash and Concat
[ https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14085: Description: Support star expansion in hash and concat. For example {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) structDf.select(concat($"*")) {code} was: To support star expansion, we can do it like, {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) structDf.select(concat($"*")) {code} > Star Expansion for Hash and Concat > -- > > Key: SPARK-14085 > URL: https://issues.apache.org/jira/browse/SPARK-14085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Support star expansion in hash and concat. For example > {code} > val structDf = testData2.select("a", "b").as("record") > structDf.select(hash($"*")) > structDf.select(concat($"*")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14085) Star Expansion for Hash and Concat
Xiao Li created SPARK-14085: --- Summary: Star Expansion for Hash and Concat Key: SPARK-14085 URL: https://issues.apache.org/jira/browse/SPARK-14085 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li To support star expansion, we can do it like, {code} val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) structDf.select(concat($"*")) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
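Outside Catalyst, the requested analyzer behavior amounts to replacing a `*` argument with the full column list before applying the function. A toy sketch in plain Python (the function `expand_star` is ours, not part of Spark):

```python
def expand_star(args, columns):
    """Replace any '*' in an argument list with all column names, in order."""
    out = []
    for a in args:
        out.extend(columns if a == "*" else [a])
    return out

cols = ["a", "b"]
# hash($"*") would become hash(a, b); concat($"*") would become concat(a, b)
assert expand_star(["*"], cols) == ["a", "b"]
assert expand_star(["prefix", "*"], cols) == ["prefix", "a", "b"]
```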
[jira] [Created] (SPARK-14084) Parallel training jobs in model selection
Xiangrui Meng created SPARK-14084: - Summary: Parallel training jobs in model selection Key: SPARK-14084 URL: https://issues.apache.org/jira/browse/SPARK-14084 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.0 Reporter: Xiangrui Meng In CrossValidator and TrainValidationSplit, we run training jobs one by one. If users have a big cluster, they might see speed-ups if we parallelize the jobs. The trade-off is that we might need to make multiple copies of the training data, which could be expensive. It is worth testing and figuring out the best way to implement it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14084) Parallel training jobs in model selection
[ https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14084: -- Description: In CrossValidator and TrainValidationSplit, we run training jobs one by one. If users have a big cluster, they might see speed-ups if we parallelize the job submission on the driver. The trade-off is that we might need to make multiple copies of the training data, which could be expensive. It is worth testing and figure out the best way to implement it. (was: In CrossValidator and TrainValidationSplit, we run training jobs one by one. If users have a big cluster, they might see speed-ups if we parallelize the jobs. The trade-off is that we might need to make multiple copies of the training data, which could be expensive. It is worth testing and figure out the best way to implement it.) > Parallel training jobs in model selection > - > > Key: SPARK-14084 > URL: https://issues.apache.org/jira/browse/SPARK-14084 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > In CrossValidator and TrainValidationSplit, we run training jobs one by one. > If users have a big cluster, they might see speed-ups if we parallelize the > job submission on the driver. The trade-off is that we might need to make > multiple copies of the training data, which could be expensive. It is worth > testing and figure out the best way to implement it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
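The parallel job submission being proposed can be sketched with a driver-side thread pool. `fit_and_score` below is a hypothetical stand-in: a real CrossValidator would fit the estimator on the training split for each param map and evaluate it, with each call submitting Spark jobs.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_and_score(params):
    # Stand-in for training one model and evaluating it; a real
    # implementation would call estimator.fit(train, params) and
    # evaluator.evaluate(...) here.
    return params["reg"] * 2  # dummy "metric"

param_grid = [{"reg": r} for r in (0, 1, 2)]

# Submit all training jobs concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=3) as pool:
    metrics = list(pool.map(fit_and_score, param_grid))

best = param_grid[max(range(len(metrics)), key=metrics.__getitem__)]
assert best == {"reg": 2}
```

`pool.map` preserves input order, so metrics still line up with the param grid exactly as in the sequential loop; the open question from the issue (whether the training data must be duplicated per job) is untouched by this sketch.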
[jira] [Commented] (SPARK-6717) Clear shuffle files after checkpointing in ALS
[ https://issues.apache.org/jira/browse/SPARK-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207459#comment-15207459 ] holdenk commented on SPARK-6717: So, looking at the code a little, I think this probably needs to live in ALS rather than core. I don't think we can solve this for all checkpointing in general: when we checkpoint a ShuffledRDD directly, it's easy to register our shuffle files for cleanup, but in the more general case (like the one in ALS) where we checkpoint a subsequent RDD, we don't know whether it's safe to clean up the parent's shuffle files. We could expose something like `checkPointAndEagerlyCleanParents` in the Core API, but I think the chance of misuse is pretty high, and it might be better to implement this inside ML/ALS until there is a second request for it. > Clear shuffle files after checkpointing in ALS > -- > > Key: SPARK-6717 > URL: https://issues.apache.org/jira/browse/SPARK-6717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Labels: als > > In ALS iterations, we checkpoint RDDs to cut lineage and to reduce shuffle > files. However, whether to clean shuffle files depends on the system GC, > which may not be triggered in ALS iterations. So after checkpointing, before > we let the RDD object go out of scope, we should clean its shuffle > dependencies explicitly. This function could either stay inside ALS or go to > Core. > Without this feature, we can call System.gc() periodically to clean shuffle > files of RDDs that went out of scope. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
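The "general case" problem in the comment above can be sketched with toy dependency-graph classes (all names hypothetical): walking from a checkpointed RDD we can find the ancestor shuffles, but the graph alone cannot tell us whether some *other* live RDD still needs those shuffle files, which is why eager cleanup is hard to offer safely in Core.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a list of parent RDDs and a shuffle flag."""
    def __init__(self, parents=(), is_shuffle=False):
        self.parents = list(parents)
        self.is_shuffle = is_shuffle  # True if produced by a shuffle

def ancestor_shuffles(rdd):
    """Collect all ancestors of `rdd` that were produced by a shuffle."""
    found, stack = [], [rdd]
    while stack:
        node = stack.pop()
        if node.is_shuffle:
            found.append(node)
        stack.extend(node.parents)
    return found

raw = ToyRDD()
shuffled = ToyRDD([raw], is_shuffle=True)
mapped = ToyRDD([shuffled])  # the subsequent RDD that ALS would checkpoint

assert ancestor_shuffles(mapped) == [shuffled]
# Deleting shuffled's files is only safe if no *other* RDD still depends
# on them -- information this graph walk alone cannot provide, hence the
# suggestion to keep the eager-cleanup logic inside ALS.
```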
[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI
[ https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207455#comment-15207455 ] Shixiong Zhu commented on SPARK-14079: -- It's already there. See "spark.sql.ui.retainedExecutions" in SQLListener > Limit the number of queries on SQL UI > - > > Key: SPARK-14079 > URL: https://issues.apache.org/jira/browse/SPARK-14079 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The SQL UI become very very slow if there are hundreds of SQL queries on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
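For reference, the setting Shixiong Zhu points to is an ordinary Spark conf key, so it can be lowered without a code change. A hedged PySpark fragment (the value 50 is arbitrary, and the conf must be set before the context is created):

```python
# Keep at most 50 finished executions on the SQL tab to keep the UI fast.
from pyspark import SparkConf

conf = SparkConf().set("spark.sql.ui.retainedExecutions", "50")
# ... then build the SparkContext/SQLContext from this conf as usual.
```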
[jira] [Assigned] (SPARK-13952) spark.ml GBT algs need to use random seed
[ https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13952: Assignee: (was: Apache Spark) > spark.ml GBT algs need to use random seed > - > > Key: SPARK-13952 > URL: https://issues.apache.org/jira/browse/SPARK-13952 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml. > There was one bug I found: The random seed is not used. A reasonable fix > will be to use the original seed to generate a new seed for each tree trained. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13952) spark.ml GBT algs need to use random seed
[ https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207444#comment-15207444 ] Apache Spark commented on SPARK-13952: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/11903 > spark.ml GBT algs need to use random seed > - > > Key: SPARK-13952 > URL: https://issues.apache.org/jira/browse/SPARK-13952 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml. > There was one bug I found: The random seed is not used. A reasonable fix > will be to use the original seed to generate a new seed for each tree trained. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
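The fix described in the issue, generating a distinct but reproducible seed for each tree from the single user-supplied seed, can be sketched in plain Scala; the helper name is invented and the actual pull request may implement it differently:

```scala
import scala.util.Random

// Derive one seed per tree from the user-supplied base seed, so that runs
// with the same base seed train identical ensembles while individual trees
// still get different randomness.
def treeSeeds(baseSeed: Long, numTrees: Int): Array[Long] = {
  val rng = new Random(baseSeed)
  Array.fill(numTrees)(rng.nextLong())
}
```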
[jira] [Updated] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14083: Description: One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc.). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions. Some examples are: {code} df.map(_.name) // equivalent to expression col("name") ds.groupBy(_.gender) // equivalent to expression col("gender") df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18)) df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) {code} The goal of this ticket is to design a small framework for byte code analysis and use that to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst). Note the big emphasis on "small" and "easy to reason about". A patch should be rejected if it is too complicated or difficult to reason about. was: In the Dataset API, we are relying more on user-defined functions, which are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc.). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions. 
Some examples are: {code} df.map(_.name) // equivalent to expression col("name") ds.groupBy(_.gender) // equivalent to expression col("gender") df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18)) df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) {code} The goal of this ticket is to design a small framework for byte code analysis and use that to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst). Note the big emphasis on "small" and "easy to reason about". A patch should be rejected if it is too complicated or difficult to reason about. > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc.). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized executions. 
> Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18)) > df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. The framework should be easy to reason about (e.g. similar > to Catalyst). > Note the big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
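As a toy illustration of the ticket's idea (not the proposed framework), a tiny invented "closure IR", standing in for the result of bytecode analysis, can be pattern-matched into Catalyst-style expression strings; every type and function below is hypothetical:

```scala
// Invented closure IR: each case stands for a bytecode shape the analyzer
// might recognize in a Dataset lambda.
sealed trait ClosureOp
case class GetField(name: String) extends ClosureOp
case class GreaterThanConst(field: String, v: Int) extends ClosureOp
case class AddConst(field: String, v: Int) extends ClosureOp

// Translate a recognized closure shape into a Catalyst-style expression
// (rendered as a string here for illustration).
def toExpression(op: ClosureOp): String = op match {
  case GetField(n)            => s"""col("$n")"""
  case GreaterThanConst(f, v) => s"""GreaterThan(col("$f"), lit($v))"""
  case AddConst(f, v)         => s"""Add(col("$f"), lit($v))"""
}
```

The real framework would of course produce Catalyst `Expression` objects rather than strings, and would need a safe fallback to executing the original closure when no pattern matches.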
[jira] [Assigned] (SPARK-13952) spark.ml GBT algs need to use random seed
[ https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13952: Assignee: Apache Spark > spark.ml GBT algs need to use random seed > - > > Key: SPARK-13952 > URL: https://issues.apache.org/jira/browse/SPARK-13952 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml. > There was one bug I found: The random seed is not used. A reasonable fix > will be to use the original seed to generate a new seed for each tree trained. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
Reynold Xin created SPARK-14083: --- Summary: Analyze JVM bytecode and turn closures into Catalyst expressions Key: SPARK-14083 URL: https://issues.apache.org/jira/browse/SPARK-14083 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin In the Dataset API, we are relying more on user-defined functions, which are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc.). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions. Some examples are: {code} df.map(_.name) // equivalent to expression col("name") ds.groupBy(_.gender) // equivalent to expression col("gender") df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18)) df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) {code} The goal of this ticket is to design a small framework for byte code analysis and use that to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst). Note the big emphasis on "small" and "easy to reason about". A patch should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207417#comment-15207417 ] Xusen Yin commented on SPARK-14041: --- [~mengxr] Maybe there is no need to divide them into several JIRAs, since what we need to do is delete them. > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. > Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: Please go through the current example code and list possible duplicates. Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: Please go through the current example code and list possible duplicates. 
Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala *java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. 
> Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: Please go through the current example code and list possible duplicates. Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala *java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was:Please go through the current example code and list possible duplicates. > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. 
> Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > *java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13019: Assignee: Xin Ren (was: Apache Spark) > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207399#comment-15207399 ] Apache Spark commented on SPARK-13019: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/11901 > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14082) Add support for GPU resource when running on Mesos
Timothy Chen created SPARK-14082: Summary: Add support for GPU resource when running on Mesos Key: SPARK-14082 URL: https://issues.apache.org/jira/browse/SPARK-14082 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen As Mesos is integrating GPU as a first class resource, Spark can benefit by allowing frameworks to launch their jobs with GPU and using the GPU information provided by Mesos to discover/run their jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11666) Find the best `k` by cutting bisecting k-means cluster tree without recomputation
[ https://issues.apache.org/jira/browse/SPARK-11666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207390#comment-15207390 ] Burak KÖSE commented on SPARK-11666: Hi, can you share links to references about that? > Find the best `k` by cutting bisecting k-means cluster tree without > recomputation > - > > Key: SPARK-11666 > URL: https://issues.apache.org/jira/browse/SPARK-11666 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > For example, scikit-learn's hierarchical clustering supports a feature to > extract a partial tree from the result. We should support a feature like that > in order to reduce compute cost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
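The idea behind the ticket, choosing `k` by cutting the already-built bisecting tree rather than re-running the clustering, can be sketched with invented toy types (not the MLlib API):

```scala
// Toy sketch: a bisecting k-means result is a binary tree whose internal
// nodes record the cost of the cluster that was split. Picking k clusters
// amounts to greedily expanding the most costly node until k subtrees
// remain, with no recomputation of the clustering itself.
sealed trait ClusterTree { def cost: Double }
case class Leaf(cost: Double) extends ClusterTree
case class Split(cost: Double, left: ClusterTree, right: ClusterTree) extends ClusterTree

def cutAt(root: ClusterTree, k: Int): Vector[ClusterTree] = {
  var frontier = Vector[ClusterTree](root)
  while (frontier.size < k && frontier.exists(_.isInstanceOf[Split])) {
    val worst = frontier.collect { case s: Split => s }.maxBy(_.cost)
    frontier = frontier.filterNot(_ eq worst) :+ worst.left :+ worst.right
  }
  frontier
}
```

Evaluating several candidate values of `k` then only costs repeated tree cuts over the same fitted model.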
[jira] [Created] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double
Travis Crawford created SPARK-14081: --- Summary: DataFrameNaFunctions fill should not convert float fields to double Key: SPARK-14081 URL: https://issues.apache.org/jira/browse/SPARK-14081 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Travis Crawford [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala] provides useful functions for dealing with null values in a DataFrame. Currently it changes FloatType columns to DoubleType when zero filling. Spark should preserve the column data type. In the following example, notice how `zeroFilledDF` has its `floatField` converted from float to double. {code} scala> :paste // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql._ import org.apache.spark.sql.types._ val schema = StructType(Seq( StructField("intField", IntegerType), StructField("longField", LongType), StructField("floatField", FloatType), StructField("doubleField", DoubleType))) val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null))) val df = sqlContext.createDataFrame(rdd, schema) val zeroFilledDF = df.na.fill(0) // Exiting paste mode, now interpreting. 
import org.apache.spark.sql._ import org.apache.spark.sql.types._ schema: org.apache.spark.sql.types.StructType = StructType(StructField(intField,IntegerType,true), StructField(longField,LongType,true), StructField(floatField,FloatType,true), StructField(doubleField,DoubleType,true)) rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[2] at parallelize at :48 df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, floatField: float, doubleField: double] zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, floatField: double, doubleField: double] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14080) Improve the codegen for Filter
[ https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14080: Assignee: (was: Apache Spark) > Improve the codegen for Filter > -- > > Key: SPARK-14080 > URL: https://issues.apache.org/jira/browse/SPARK-14080 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Bo Meng >Priority: Minor > > Currently, the codegen of the null check for Filter sometimes generates code > like the following: > /* 072 */ if (!(!(filter_isNull2))) continue; > It would be better as: > /* 072 */ if (filter_isNull2) continue; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14080) Improve the codegen for Filter
[ https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207383#comment-15207383 ] Apache Spark commented on SPARK-14080: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/11900 > Improve the codegen for Filter > -- > > Key: SPARK-14080 > URL: https://issues.apache.org/jira/browse/SPARK-14080 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Bo Meng >Priority: Minor > > Currently, the codegen of the null check for Filter sometimes generates code > like the following: > /* 072 */ if (!(!(filter_isNull2))) continue; > It would be better as: > /* 072 */ if (filter_isNull2) continue; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14080) Improve the codegen for Filter
[ https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14080: Assignee: Apache Spark > Improve the codegen for Filter > -- > > Key: SPARK-14080 > URL: https://issues.apache.org/jira/browse/SPARK-14080 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Bo Meng >Assignee: Apache Spark >Priority: Minor > > Currently, the codegen of the null check for Filter sometimes generates code > like the following: > /* 072 */ if (!(!(filter_isNull2))) continue; > It would be better as: > /* 072 */ if (filter_isNull2) continue; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14080) Improve the codegen for Filter
Bo Meng created SPARK-14080: --- Summary: Improve the codegen for Filter Key: SPARK-14080 URL: https://issues.apache.org/jira/browse/SPARK-14080 Project: Spark Issue Type: Improvement Components: SQL Reporter: Bo Meng Priority: Minor Currently, the codegen of the null check for Filter sometimes generates code like the following: /* 072 */ if (!(!(filter_isNull2))) continue; It would be better as: /* 072 */ if (filter_isNull2) continue; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
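The proposed cleanup amounts to eliminating double negation from the generated condition before emitting it, so `if (!(!(filter_isNull2))) continue;` becomes `if (filter_isNull2) continue;`. A minimal sketch over an invented condition AST (not the actual Catalyst codegen types):

```scala
// Tiny condition AST standing in for the codegen's boolean expressions.
sealed trait Cond
case class Var(name: String) extends Cond
case class Not(c: Cond) extends Cond

// Remove double negations recursively: !!x simplifies to x.
def simplify(c: Cond): Cond = c match {
  case Not(Not(inner)) => simplify(inner)
  case Not(inner)      => Not(simplify(inner))
  case v: Var          => v
}

// Render a condition as Java-like source text.
def emit(c: Cond): String = c match {
  case Var(n) => n
  case Not(i) => s"!(${emit(i)})"
}
```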
[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI
[ https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207349#comment-15207349 ] Davies Liu commented on SPARK-14079: Yes, that's what I meant. > Limit the number of queries on SQL UI > - > > Key: SPARK-14079 > URL: https://issues.apache.org/jira/browse/SPARK-14079 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The SQL UI becomes very slow if there are hundreds of SQL queries on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren reopened SPARK-13019: - need to fix scala-2.10 compile > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13449) Naive Bayes wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13449. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11890 [https://github.com/apache/spark/pull/11890] > Naive Bayes wrapper in SparkR > - > > Key: SPARK-13449 > URL: https://issues.apache.org/jira/browse/SPARK-13449 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > Fix For: 2.0.0 > > > Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's > naive Bayes implementation is from package e1071 with signature: > {code} > ## S3 method for class 'formula' > naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass) > ## Default S3 method, which we don't want to support > # naiveBayes(x, y, laplace = 0, ...) > ## S3 method for class 'naiveBayes' > predict(object, newdata, > type = c("class", "raw"), threshold = 0.001, eps = 0, ...) > {code} > It should be easy for us to match the parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe
[ https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207294#comment-15207294 ] Sunitha Kambhampati commented on SPARK-14040: - I can reproduce this on my master ( v2.0 snapshot, synced to today). I tried the first scenario from the description. > Null-safe and equality join produces incorrect result with filtered dataframe > - > > Key: SPARK-14040 > URL: https://issues.apache.org/jira/browse/SPARK-14040 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Ubuntu Linux 15.10 >Reporter: Denton Cockburn > > Initial issue reported here: > http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results > val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c") > val a = b.where("c = 1").withColumnRenamed("a", > "filta").withColumnRenamed("b", "filtb") > a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> > b("c"), "left_outer").show > Produces 2 rows instead of the expected 1. > a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === > $"b" and $"newc" === b("c"), "left_outer").show > Also produces 2 rows instead of the expected 1. > The only one that seemed to work correctly was: > a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === > b("c"), "left_outer").show > But that produced a warning for : > WARN Column: Constructing trivially true equals predicate, 'c#18232 = > c#18232' > As pointed out by commenter zero323: > "The second behavior looks indeed like a bug related to the fact that you > still have a.c in your data. It looks like it is picked downstream before b.c > and the evaluated condition is actually a.newc = a.c" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
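The `<=>` operator in the report is Spark SQL's null-safe equality, which differs from `===` only when nulls are involved. Its truth table can be sketched as a plain function (a hypothetical helper for illustration, not Spark code):

```python
def null_safe_eq(a, b):
    """Mimic SQL's <=> : two NULLs compare equal, a NULL and a non-NULL
    compare unequal, and non-NULL operands use ordinary equality.
    Plain SQL `=` would instead return NULL whenever either side is NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```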
[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI
[ https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207275#comment-15207275 ] Andrew Or commented on SPARK-14079: --- We should add something like `maxRetainedQueries`, similar to what we already do for the All Jobs page. > Limit the number of queries on SQL UI > - > > Key: SPARK-14079 > URL: https://issues.apache.org/jira/browse/SPARK-14079 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The SQL UI becomes very slow if there are hundreds of SQL queries on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
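The eviction strategy behind a `maxRetainedQueries`-style limit (the setting name is only a suggestion in this thread, not an existing config) amounts to a bounded FIFO buffer; a minimal Python sketch:

```python
from collections import deque

class RetainedQueries:
    """Keep at most max_retained finished queries for UI display,
    silently evicting the oldest entry when the bound is reached."""

    def __init__(self, max_retained):
        # deque with maxlen drops items from the opposite end on append
        self._queries = deque(maxlen=max_retained)

    def add(self, query_id):
        self._queries.append(query_id)

    def snapshot(self):
        return list(self._queries)
```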
[jira] [Commented] (SPARK-14075) Refactor MemoryStore to be testable independent of BlockManager
[ https://issues.apache.org/jira/browse/SPARK-14075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207266#comment-15207266 ] Apache Spark commented on SPARK-14075: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11899 > Refactor MemoryStore to be testable independent of BlockManager > --- > > Key: SPARK-14075 > URL: https://issues.apache.org/jira/browse/SPARK-14075 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > It would be nice to refactor the MemoryStore so that it can be unit-tested > without constructing a full BlockManager or needing to mock tons of things. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13514) Spark Shuffle Service 1.6.0 issue in Yarn
[ https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207236#comment-15207236 ] Satish Kolli edited comment on SPARK-13514 at 3/22/16 8:37 PM: --- I just upgraded the shuffle service to 1.6.1 and the *YARN node managers* have the following. I used the same code I used in my original post to test it. {code} ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline java.lang.NullPointerException at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 2016-03-22 16:29:04,234 WARN io.netty.channel.ChannelInitializer: Failed 
to initialize a channel. Closing: [id: 0x80721ee3, /..:43869 => /..:7337] java.lang.NullPointerException at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 2016-03-22 16:29:05,023 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline java.lang.NullPointerException at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) at
[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207239#comment-15207239 ] JESSE CHEN commented on SPARK-13864: Tried on two recent builds having issues running to completion. Something is broken. Looking into why... > TPCDS query 74 returns wrong results compared to TPC official result set > - > > Key: SPARK-13864 > URL: https://issues.apache.org/jira/browse/SPARK-13864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 74 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Spark SQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > {noformat} > [BLEIBAAA,Paula,Wakefield] > [DFIEBAAA,John,Gray] > [OCLBBAAA,null,null] > [PKBCBAAA,Andrea,White] > [EJDL,Alice,Wright] > [FACE,Priscilla,Miller] > [LFKK,Ignacio,Miller] > [LJNCBAAA,George,Gamez] > [LIOP,Derek,Allen] > [EADJ,Ruth,Carroll] > [JGMM,Richard,Larson] > [PKIK,Wendy,Horvath] > [FJHF,Larissa,Roy] > [EPOG,Felisha,Mendes] > [EKJL,Aisha,Carlson] > [HNFH,Rebecca,Wilson] > [IBFCBAAA,Ruth,Grantham] > [OPDL,Ann,Pence] > [NIPL,Eric,Lawrence] > [OCIC,Zachary,Pennington] > [OFLC,James,Taylor] > [GEHI,Tyler,Miller] > [CADP,Cristobal,Thomas] > [JIAL,Santos,Gutierrez] > [PMMBBAAA,Paul,Jordan] > [DIIO,David,Carroll] > [DFKABAAA,Latoya,Craft] > [HMOI,Grace,Henderson] > [PPIBBAAA,Candice,Lee] > [JONHBAAA,Warren,Orozco] > [GNDA,Terry,Mcdowell] > [CIJM,Elizabeth,Thomas] > [DIJGBAAA,Ruth,Sanders] > [NFBDBAAA,Vernice,Fernandez] > [IDKF,Michael,Mack] > [IMHB,Kathy,Knowles] > [LHMC,Brooke,Nelson] > [CFCGBAAA,Marcus,Sanders] > [NJHCBAAA,Christopher,Schreiber] > [PDFB,Terrance,Banks] > [ANFA,Philip,Banks] > [IADEBAAA,Diane,Aldridge] > [ICHF,Linda,Mccoy] > [CFEN,Christopher,Dawson] > [KOJJ,Gracie,Mendoza] > [FOJA,Don,Castillo] > 
[FGPG,Albert,Wadsworth] > [KJBK,Georgia,Scott] > [EKFP,Annika,Chin] > [IBAEBAAA,Sandra,Wilson] > [MFFL,Margret,Gray] > [KNAK,Gladys,Banks] > [CJDI,James,Kerr] > [OBADBAAA,Elizabeth,Burnham] > [AMGD,Kenneth,Harlan] > [HJLA,Audrey,Beltran] > [AOPFBAAA,Jerry,Fields] > [CNAGBAAA,Virginia,May] > [HGOABAAA,Sonia,White] > [KBCABAAA,Debra,Bell] > [NJAG,Allen,Hood] > [MMOBBAAA,Margaret,Smith] > [NGDBBAAA,Carlos,Jewell] > [FOGI,Michelle,Greene] > [JEKFBAAA,Norma,Burkholder] > [OCAJ,Jenna,Staton] > [PFCL,Felicia,Neville] > [DLHBBAAA,Henry,Bertrand] > [DBEFBAAA,Bennie,Bowers] > [DCKO,Robert,Gonzalez] > [KKGE,Katie,Dunbar] > [GFMDBAAA,Kathleen,Gibson] > [IJEM,Charlie,Cummings] > [KJBL,Kerry,Davis] > [JKBN,Julie,Kern] > [MDCA,Louann,Hamel] > [EOAK,Molly,Benjamin] > [IBHH,Jennifer,Ballard] > [PJEN,Ashley,Norton] > [KLHHBAAA,Manuel,Castaneda] > [IMHHBAAA,Lillian,Davidson] > [GHPBBAAA,Nick,Mendez] > [BNBB,Irma,Smith] > [FBAH,Michael,Williams] > [PEHEBAAA,Edith,Molina] > [FMHI,Emilio,Darling] > [KAEC,Milton,Mackey] > [OCDJ,Nina,Sanchez] > [FGIG,Eduardo,Miller] > [FHACBAAA,null,null] > [HMJN,Ryan,Baptiste] > [HHCABAAA,William,Stewart] > {noformat} > Expected results: > {noformat} > +--+-++ > | CUSTOMER_ID | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME | > +--+-++ > | AMGD | Kenneth | Harlan | > | ANFA | Philip | Banks | > | AOPFBAAA | Jerry | Fields | > | BLEIBAAA | Paula | Wakefield | > | BNBB | Irma| Smith | > | CADP | Cristobal | Thomas | > | CFCGBAAA | Marcus | Sanders
[jira] [Commented] (SPARK-13514) Spark Shuffle Service 1.6.0 issue in Yarn
[ https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207236#comment-15207236 ] Satish Kolli commented on SPARK-13514: -- I just upgraded the shuffle service with 1.6.1 and the *YARN node managers* have the following. I used the same code I used in my original post {code} ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline java.lang.NullPointerException at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 2016-03-22 16:29:04,234 WARN io.netty.channel.ChannelInitializer: Failed to initialize a channel. 
Closing: [id: 0x80721ee3, /..:43869 => /..:7337] java.lang.NullPointerException at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 2016-03-22 16:29:05,023 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline java.lang.NullPointerException at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) at
[jira] [Updated] (SPARK-14079) Limit the number of queries on SQL UI
[ https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-14079: --- Description: The SQL UI becomes very slow if there are hundreds of SQL queries on it. > Limit the number of queries on SQL UI > - > > Key: SPARK-14079 > URL: https://issues.apache.org/jira/browse/SPARK-14079 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The SQL UI becomes very slow if there are hundreds of SQL queries on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI
[ https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207231#comment-15207231 ] Davies Liu commented on SPARK-14079: cc [~zsxwing] [~andrewor14] > Limit the number of queries on SQL UI > - > > Key: SPARK-14079 > URL: https://issues.apache.org/jira/browse/SPARK-14079 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > The SQL UI becomes very slow if there are hundreds of SQL queries on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13887) PyLint should fail fast to make errors easier to discover
[ https://issues.apache.org/jira/browse/SPARK-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207227#comment-15207227 ] Apache Spark commented on SPARK-13887: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/11898 > PyLint should fail fast to make errors easier to discover > - > > Key: SPARK-13887 > URL: https://issues.apache.org/jira/browse/SPARK-13887 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Reporter: holdenk >Priority: Minor > > Right now our PyLint script runs all of the checks and then returns the > output, which can make it difficult to find the part that errored and > complicates the script a bit. We can change our script to fail fast, which > will both simplify it and make the errors easier to discover. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
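Independent of the linked pull request, the fail-fast idea can be sketched as a runner that stops at the first check returning a nonzero exit code, so the offending check's output is the last thing printed (an illustrative sketch, not the actual lint script):

```python
import subprocess
import sys

def run_checks_fail_fast(commands):
    """Run each command (an argv list) in order; return 0 if all succeed,
    otherwise return the exit code of the first failing command without
    running the rest."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0
```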
[jira] [Created] (SPARK-14079) Limit the number of queries on SQL UI
Davies Liu created SPARK-14079: -- Summary: Limit the number of queries on SQL UI Key: SPARK-14079 URL: https://issues.apache.org/jira/browse/SPARK-14079 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13887) PyLint should fail fast to make errors easier to discover
[ https://issues.apache.org/jira/browse/SPARK-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13887: Assignee: (was: Apache Spark) > PyLint should fail fast to make errors easier to discover > - > > Key: SPARK-13887 > URL: https://issues.apache.org/jira/browse/SPARK-13887 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Reporter: holdenk >Priority: Minor > > Right now our PyLint script runs all of the checks and then returns the > output, which can make it difficult to find the part that errored and > complicates the script a bit. We can change our script to fail fast, which > will both simplify it and make the errors easier to discover. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13887) PyLint should fail fast to make errors easier to discover
[ https://issues.apache.org/jira/browse/SPARK-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13887: Assignee: Apache Spark > PyLint should fail fast to make errors easier to discover > - > > Key: SPARK-13887 > URL: https://issues.apache.org/jira/browse/SPARK-13887 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > Right now our PyLint script runs all of the checks and then returns the > output, which can make it difficult to find the part that errored and > complicates the script a bit. We can change our script to fail fast, which > will both simplify it and make the errors easier to discover. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13733) Support initial weight distribution in personalized PageRank
[ https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207219#comment-15207219 ] Gayathri Murali edited comment on SPARK-13733 at 3/22/16 8:28 PM: -- [~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - mentions one key difference between page rank and personalized page rank as: "In PageRank, every node has an initial score of 1, whereas for Personalized PageRank, only source node has a score of 1 and others have a score of 0 at the beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we set 1 only to the seed node. 1. For this JIRA, do we want to instead set an initial distribution of weights such that all nodes receive non-zero initial values? Could you clarify if this is the behavior that is intended? 2. What is the idea behind having an initial weight distribution for other vertices in personalized page rank? was (Author: gayathrimurali): [~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - mentions one key difference between page rank and personalized page rank as: "In PageRank, every node has an initial score of 1, whereas for Personalized PageRank, only source node has a score of 1 and others have a score of 0 at the beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we set 1 only to the seed node. For this JIRA, do we want to instead set an initial distribution of weights such that all nodes receive non-zero initial values? Could you clarify if this is the behavior that is intended? > Support initial weight distribution in personalized PageRank > > > Key: SPARK-13733 > URL: https://issues.apache.org/jira/browse/SPARK-13733 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Xiangrui Meng > > It would be nice to support personalized PageRank with an initial weight > distribution besides a single vertex. It should be easy to modify the current > implementation to add this support. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank
[ https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207219#comment-15207219 ] Gayathri Murali commented on SPARK-13733: - [~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - mentions one key difference between page rank and personalized page rank as: "In PageRank, every node has an initial score of 1, whereas for Personalized PageRank, only source node has a score of 1 and others have a score of 0 at the beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we set 1 only to the seed node. For this JIRA, do we want to instead set an initial distribution of weights such that all nodes receive non-zero initial values? Could you clarify if this is the behavior that is intended? > Support initial weight distribution in personalized PageRank > > > Key: SPARK-13733 > URL: https://issues.apache.org/jira/browse/SPARK-13733 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Xiangrui Meng > > It would be nice to support personalized PageRank with an initial weight > distribution besides a single vertex. It should be easy to modify the current > implementation to add this support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
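To make the proposal concrete, here is a small power-iteration sketch in which the teleport step follows an arbitrary preference distribution rather than a single seed vertex (a plain-Python illustration with assumed names, not the GraphX implementation):

```python
def personalized_pagerank(adj, preference, alpha=0.15, iters=100):
    """Power iteration for personalized PageRank. `adj` maps each vertex to
    its out-neighbors; `preference` is a distribution over vertices (summing
    to 1). The classic single-seed case is the special case where one vertex
    has preference 1.0 and all others 0.0. Assumes no dangling vertices."""
    nodes = list(adj)
    rank = {v: preference.get(v, 0.0) for v in nodes}
    for _ in range(iters):
        # teleport mass is distributed in proportion to the preference vector
        nxt = {v: alpha * preference.get(v, 0.0) for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = (1.0 - alpha) * rank[v] / len(out)
                for w in out:
                    nxt[w] += share
        rank = nxt
    return rank
```

With `preference = {seed: 1.0}` this reduces to the existing behavior described in SPARK-5854; any other distribution gives the generalization this ticket asks for.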
[jira] [Closed] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13858. -- Resolution: Not A Bug Schema updates generated correct results in both spark 1.6 and 2.0. Good to close. > TPCDS query 21 returns wrong results compared to TPC official result set > - > > Key: SPARK-13858 > URL: https://issues.apache.org/jira/browse/SPARK-13858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > 
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +----------------------+-----------+------------+-----------+
> | W_WAREHOUSE_NAME     | I_ITEM_ID | INV_BEFORE | INV_AFTER |
> +----------------------+-----------+------------+-----------+
> | Bad cards must make. | AACD      | 1889       | 2168      |
> | Bad cards must make.
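The mismatch above was found by comparing the actual output to the official reference rows (e.g. "grep for ABDA"). A minimal sketch of that kind of result-set diff, keyed on item id, is below; the data is a tiny hypothetical subset, and the ABDA inventory values are made up for illustration, not taken from the TPC reference:

```python
# Sketch: find items present in the reference (expected) result set but
# absent from the query's actual output. Keying on i_item_id alone, since
# the actual rows came back with a null warehouse name.

expected = {
    ("Bad cards must make.", "AACD"): (1889, 2168),
    ("Bad cards must make.", "ABDA"): (1671, 1759),  # hypothetical values
}
actual = {
    (None, "AACD"): (1889, 2168),
}

expected_items = {item for (_, item) in expected}
actual_items = {item for (_, item) in actual}

# Items the actual output is missing relative to the reference.
missing = sorted(expected_items - actual_items)
print(missing)  # -> ['ABDA']
```

The same set-difference check scales to the full validation run once both result sets are parsed into keyed collections.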
[jira] [Commented] (SPARK-13971) Implicit group by with distinct modifier on having raises an unexpected error
[ https://issues.apache.org/jira/browse/SPARK-13971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207168#comment-15207168 ]

Sunitha Kambhampati commented on SPARK-13971:
---------------------------------------------
FWIW, it is not the exact same environment, but I tried the same query against master (v2.0 snapshot) and it worked fine using the sqlContext.

> Implicit group by with distinct modifier on having raises an unexpected error
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13971
>                 URL: https://issues.apache.org/jira/browse/SPARK-13971
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: Spark standalone mode installed on CentOS 7
>            Reporter: Javier Pérez
>
> 1. Start the Thrift server
> 2. Connect with beeline
> 3. Perform the following query over a simple table:
> SELECT COUNT(DISTINCT field1) FROM test_table HAVING COUNT(DISTINCT field1) = 3
> TRACE:
> ERROR SparkExecuteStatementOperation: Error running hive query:
> org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#13616,field1#13617 missing from field1#13612,field2#13611,field2#13608,field3#13610,field4#13613,field5#13609 in operator !Expand [List(null, 0, if ((gid#13616 = 1)) field1#13617 else null),List(field2#13608, 1, null)], [field2#13619,gid#13618,if ((gid = 1)) field1 else null#13620];
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:246)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
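For reference, the semantics the failing query should have: with no GROUP BY clause, the whole table forms one implicit group, and HAVING then keeps or drops that single aggregated row. A plain-Python sketch of this behavior, using made-up sample data (the table and column names mirror the query but the rows are hypothetical):

```python
# Sketch of: SELECT COUNT(DISTINCT field1) FROM test_table
#            HAVING COUNT(DISTINCT field1) = 3
# No GROUP BY, so the whole table is one implicit group; HAVING filters
# that single aggregated row.

rows = [
    {"field1": "a"},
    {"field1": "b"},
    {"field1": "b"},  # duplicate, ignored by DISTINCT
    {"field1": "c"},
]

distinct_count = len({r["field1"] for r in rows})

# HAVING COUNT(DISTINCT field1) = 3: keep the row if the predicate holds,
# otherwise return an empty result set.
result = [(distinct_count,)] if distinct_count == 3 else []
print(result)  # -> [(3,)]
```

The bug reported here is that the analyzer raised a resolution error on this shape of query instead of producing the single-row (or empty) result.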