[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame

2016-03-22 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207916#comment-15207916
 ] 

Sun Rui commented on SPARK-14037:
-

Spark 1.6.1 release, standalone mode:

bin/sparkR --master spark://

Run your code, then go to the Spark web UI. In the application page for SparkR, the Executor Summary looks like:

ExecutorID  Worker                                     Cores  Memory  State    Logs
0           worker-20160323135300-10.239.158.44-59572  12     1024    RUNNING  stdout stderr

Click the stderr link to see the executor log.

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> --
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>Reporter: Samuel Alexander
>  Labels: performance, sparkR
>
> Any operation on a DataFrame created using SparkR::createDataFrame is very 
> slow.
> I have a CSV of size ~6 MB. Below is the sample content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, 
> sep=","), and then converted it into a Spark DataFrame using sp_df <- 
> createDataFrame(sqlContext, r_df).
> Now count(sp_df) takes more than 30 seconds.
> When I load the same CSV using spark-csv, i.e. direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true"),
> count(direct_df) takes below 1 second.
> I know createDataFrame performance has been improved in Spark 1.6, but other 
> operations, like count(), are still very slow.
> How can I get rid of this performance issue? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2016-03-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207915#comment-15207915
 ] 

Wenchen Fan commented on SPARK-10925:
-

If you want to remove the duplicated join key, you can do `df1.join(df2, "key")`, 
and the result will contain only one key column.
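
For reference, a minimal Scala sketch of the two join forms (assumes Spark 1.5.x/1.6.x and an existing SQLContext named sqlContext; the column names are made up for illustration):

{code}
// Minimal sketch; `sqlContext` is assumed to be an existing SQLContext.
val df1 = sqlContext.createDataFrame(Seq(("A", 1), ("B", 2))).toDF("key", "value1")
val df2 = sqlContext.createDataFrame(Seq(("A", 4), ("B", 5))).toDF("key", "value2")

// Joining on an expression keeps both `key` columns, so later references are ambiguous:
df1.join(df2, df1("key") === df2("key")).printSchema()   // key, value1, key, value2

// Joining on a using-column collapses the duplicated join key into a single column:
df1.join(df2, "key").printSchema()                       // key, value1, value2
{code}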

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Assigned] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14091:


Assignee: Apache Spark

> Consider improving performance of SparkContext.getCallSite()
> 
>
> Key: SPARK-14091
> URL: https://issues.apache.org/jira/browse/SPARK-14091
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>
> Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().
> {noformat}
>   private[spark] def getCallSite(): CallSite = {
> val callSite = Utils.getCallSite()
> CallSite(
>   
> Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
>   
> Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
> )
>   }
> {noformat}
> However, in some places utils.withDummyCallSite(sc) is invoked to avoid 
> expensive thread dumps within getCallSite(). But Utils.getCallSite() is 
> evaluated earlier, causing thread dumps to be computed anyway. This has an 
> impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 
> 1000+ RDDs are present, which can be significant when the entire query runtime 
> is on the order of 10-20 seconds).
> Creating this JIRA to consider evaluating getCallSite only when needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207914#comment-15207914
 ] 

Apache Spark commented on SPARK-14091:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11911

> Consider improving performance of SparkContext.getCallSite()
> 
>
> Key: SPARK-14091
> URL: https://issues.apache.org/jira/browse/SPARK-14091
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Rajesh Balamohan
>
> Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().
> {noformat}
>   private[spark] def getCallSite(): CallSite = {
> val callSite = Utils.getCallSite()
> CallSite(
>   
> Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
>   
> Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
> )
>   }
> {noformat}
> However, in some places utils.withDummyCallSite(sc) is invoked to avoid 
> expensive thread dumps within getCallSite(). But Utils.getCallSite() is 
> evaluated earlier, causing thread dumps to be computed anyway. This has an 
> impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 
> 1000+ RDDs are present, which can be significant when the entire query runtime 
> is on the order of 10-20 seconds).
> Creating this JIRA to consider evaluating getCallSite only when needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14091:


Assignee: (was: Apache Spark)

> Consider improving performance of SparkContext.getCallSite()
> 
>
> Key: SPARK-14091
> URL: https://issues.apache.org/jira/browse/SPARK-14091
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Rajesh Balamohan
>
> Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().
> {noformat}
>   private[spark] def getCallSite(): CallSite = {
> val callSite = Utils.getCallSite()
> CallSite(
>   
> Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
>   
> Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
> )
>   }
> {noformat}
> However, in some places utils.withDummyCallSite(sc) is invoked to avoid 
> expensive thread dumps within getCallSite(). But Utils.getCallSite() is 
> evaluated earlier, causing thread dumps to be computed anyway. This has an 
> impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 
> 1000+ RDDs are present, which can be significant when the entire query runtime 
> is on the order of 10-20 seconds).
> Creating this JIRA to consider evaluating getCallSite only when needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

2016-03-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207877#comment-15207877
 ] 

Wenchen Fan commented on SPARK-11231:
-

I'm not familiar with R or the SparkR API, but for the Scala version, we can do 
`df1.join(df2, "key")`, and the result will contain only one key column.

> join returns schema with duplicated and ambiguous join columns
> --
>
> Key: SPARK-11231
> URL: https://issues.apache.org/jira/browse/SPARK-11231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: R
>Reporter: Matt Pollock
>
> In the case where the key columns of two data frames are named the same thing, 
> join returns a data frame where that column is duplicated. Since the content 
> of the columns is guaranteed to be the same row by row, consolidating the 
> identical columns into a single column would replicate standard R behavior [1] 
> and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could 
> be: key#125, key#127.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
>   at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R there is no "join", but a similar function, "merge", is provided, 
> in which a "by" argument identifies the shared key column in the two data 
> frames. In the case where the key column names differ, "by.x" and "by.y" 
> arguments can be used. In the case of same-named key columns, the 
> consolidation behavior requested above is observed. In the case of differing 
> names, the "by.x" name is retained and consolidated with the "by.y" column, 
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A      1      4
> 2   B      2      5
> 3   C      3      6
> > df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A      4      1
> 2   B      5      2
> 3   C      6      3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14074) Do not use install_github in SparkR build

2016-03-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207866#comment-15207866
 ] 

Shivaram Venkataraman commented on SPARK-14074:
---

[~sunrui] Would you have a chance to check if the tag 0.3.1 is good enough for 
us? If so, we can switch to that.

> Do not use install_github in SparkR build
> -
>
> Key: SPARK-14074
> URL: https://issues.apache.org/jira/browse/SPARK-14074
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In dev/lint-r.R, `install_github` makes our builds depend on an unstable 
> source. We should use official releases on CRAN instead, even if the released 
> version has fewer features.
> cc: [~shivaram] [~sunrui]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()

2016-03-22 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14091:


 Summary: Consider improving performance of 
SparkContext.getCallSite()
 Key: SPARK-14091
 URL: https://issues.apache.org/jira/browse/SPARK-14091
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Rajesh Balamohan


Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().

{noformat}
  private[spark] def getCallSite(): CallSite = {
val callSite = Utils.getCallSite()
CallSite(
  
Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
  Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
)
  }
{noformat}

However, in some places utils.withDummyCallSite(sc) is invoked to avoid 
expensive thread dumps within getCallSite(). But Utils.getCallSite() is 
evaluated earlier, causing thread dumps to be computed anyway. This has an 
impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 
1000+ RDDs are present, which can be significant when the entire query runtime 
is on the order of 10-20 seconds).

Creating this JIRA to consider evaluating getCallSite only when needed.
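
One possible shape of the change (a sketch only, not necessarily what the eventual patch does): make the fallback call site lazy, so the thread dump is only taken when neither local property is set.

{code}
// Sketch only: defer Utils.getCallSite() so the expensive thread dump happens
// only when the short/long form local properties are not already set.
private[spark] def getCallSite(): CallSite = {
  lazy val callSite = Utils.getCallSite()
  CallSite(
    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
  )
}
{code}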



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14085) Star Expansion for Hash

2016-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14085:

Description: 
Support star expansion in hash and concat. For example
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
{code}

  was:
Support star expansion in hash and concat. For example
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
structDf.select(concat($"*"))
{code}


> Star Expansion for Hash
> ---
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Support star expansion in hash and concat. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14085) Star Expansion for Hash

2016-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14085:

Description: 
Support star expansion in hash. For example
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
{code}

  was:
Support star expansion in hash and concat. For example
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
{code}


> Star Expansion for Hash
> ---
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Support star expansion in hash. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14085) Star Expansion for Hash

2016-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14085:

Summary: Star Expansion for Hash  (was: Star Expansion for Hash and Concat)

> Star Expansion for Hash
> ---
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Support star expansion in hash and concat. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> structDf.select(concat($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10146.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Have an easy way to set data source reader/writer specific confs
> 
>
> Key: SPARK-10146
> URL: https://issues.apache.org/jira/browse/SPARK-10146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 2.0.0
>
>
> Right now, it is hard to set data source reader/writer specific confs 
> correctly (e.g. parquet's row group size). Users need to set those confs in 
> the hadoop conf before starting the application, or through 
> {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
> great if we had an easy way to set those confs.
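
For context, the runtime workaround mentioned above looks roughly like this (a sketch only; "parquet.block.size" is Parquet's row-group size property, and the value is just an example):

{code}
// Sketch of the workaround described above: mutate the Hadoop conf at runtime.
import org.apache.spark.deploy.SparkHadoopUtil

// Example only: set Parquet's row group size (in bytes) before writing.
SparkHadoopUtil.get.conf.set("parquet.block.size", (64 * 1024 * 1024).toString)
{code}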



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2016-03-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207858#comment-15207858
 ] 

Reynold Xin commented on SPARK-10146:
-

I think we are already doing this. I'm going to close the ticket.


> Have an easy way to set data source reader/writer specific confs
> 
>
> Key: SPARK-10146
> URL: https://issues.apache.org/jira/browse/SPARK-10146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 2.0.0
>
>
> Right now, it is hard to set data source reader/writer specific confs 
> correctly (e.g. parquet's row group size). Users need to set those confs in 
> the hadoop conf before starting the application, or through 
> {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
> great if we had an easy way to set those confs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12769) Remove If expression

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12769.
---
Resolution: Won't Fix

Closing as won't fix for now since doing this change would make the explain 
plan more confusing (if -> case).


> Remove If expression
> 
>
> Key: SPARK-12769
> URL: https://issues.apache.org/jira/browse/SPARK-12769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> If can be a simple factory method for CaseWhen, similar to CaseKeyWhen.
> We can then simplify the optimizer rules we implement for conditional 
> expressions.
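
For illustration, a minimal sketch of the factory-method idea using simplified stand-in types (the real Catalyst classes have different signatures):

{code}
// Stand-in types for illustration only; not the actual Catalyst expressions.
sealed trait Expr
case class Literal(value: Any) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

// `If` becomes a factory that builds a CaseWhen instead of a separate expression node,
// so optimizer rules only need to handle CaseWhen.
object If {
  def apply(predicate: Expr, trueValue: Expr, falseValue: Expr): Expr =
    CaseWhen(Seq(predicate -> trueValue), Some(falseValue))
}
{code}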



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12767) Improve conditional expressions

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12767.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

> Improve conditional expressions
> ---
>
> Key: SPARK-12767
> URL: https://issues.apache.org/jira/browse/SPARK-12767
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> There are a few improvements we can make to conditional expressions. 
> This ticket tracks them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12997) Use cast expression to perform type cast in csv

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12997.
---
Resolution: Not A Problem

> Use cast expression to perform type cast in csv
> ---
>
> Key: SPARK-12997
> URL: https://issues.apache.org/jira/browse/SPARK-12997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> CSVTypeCast.castTo should probably be removed, and just replace its usage 
> with a projection that uses a sequence of Cast expressions.
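
As a rough illustration of the idea at the DataFrame level (rawDf and the column names here are hypothetical):

{code}
// Sketch: instead of per-field custom casting, project the parsed string columns
// through Cast expressions via Column.cast.
import org.apache.spark.sql.types.{IntegerType, DoubleType}

// `rawDf` is assumed to be a DataFrame whose columns are all strings parsed from CSV.
val typedDf = rawDf.select(
  rawDf("id").cast(IntegerType).as("id"),
  rawDf("amount").cast(DoubleType).as("amount"))
{code}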



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13401) Fix SQL test warnings

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13401.
-
   Resolution: Fixed
 Assignee: Yong Tang
Fix Version/s: 2.0.0

> Fix SQL test warnings
> -
>
> Key: SPARK-13401
> URL: https://issues.apache.org/jira/browse/SPARK-13401
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Reporter: holdenk
>Assignee: Yong Tang
>Priority: Trivial
> Fix For: 2.0.0
>
>
> SQL tests produce a number of warnings about unreachable code, 
> non-exhaustive matches, and unchecked type casts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12855) Remove parser pluggability

2016-03-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207844#comment-15207844
 ] 

Reynold Xin commented on SPARK-12855:
-

Got it - we can add this back, but we need to wait until we have the API changes 
in for Spark 2.0, in the next week or two.


> Remove parser pluggability
> --
>
> Key: SPARK-12855
> URL: https://issues.apache.org/jira/browse/SPARK-12855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This pull request removes the public developer parser API for external 
> parsers. Given that everything a parser depends on (e.g. logical plans and 
> expressions) is internal and not stable, external parsers will break with 
> every release of Spark. It is a bad idea to create the illusion that Spark 
> actually supports pluggable parsers. In addition, this also reduces 
> incentives for 3rd party projects to contribute parser improvements back to 
> Spark.
> The number of applications that are using this feature is small (as far as I 
> know it came down from two to one as of Jan 2016, and will be 0 once we have 
> better ANSI SQL support).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double

2016-03-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207843#comment-15207843
 ] 

Reynold Xin commented on SPARK-14081:
-

Yes, a pull request would be great. Probably a one-line change. Can you also 
add a test case for it? Thanks!


> DataFrameNaFunctions fill should not convert float fields to double
> ---
>
> Key: SPARK-14081
> URL: https://issues.apache.org/jira/browse/SPARK-14081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Travis Crawford
>
> [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala]
>  provides useful functions for dealing with null values in a DataFrame. 
> Currently it changes FloatType columns to DoubleType when zero filling. Spark 
> should preserve the column data type.
> In the following example, notice how `zeroFilledDF` has its `floatField` 
> converted from float to double.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema = StructType(Seq(
>   StructField("intField", IntegerType),
>   StructField("longField", LongType),
>   StructField("floatField", FloatType),
>   StructField("doubleField", DoubleType)))
> val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null)))
> val df = sqlContext.createDataFrame(rdd, schema)
> val zeroFilledDF = df.na.fill(0)
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(intField,IntegerType,true), 
> StructField(longField,LongType,true), StructField(floatField,FloatType,true), 
> StructField(doubleField,DoubleType,true))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[2] at parallelize at <console>:48
> df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, 
> floatField: float, doubleField: double]
> zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: 
> bigint, floatField: double, doubleField: double]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14072) Show JVM information when we run Benchmark

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14072.
-
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.0.0

> Show JVM information when we run Benchmark
> --
>
> Key: SPARK-14072
> URL: https://issues.apache.org/jira/browse/SPARK-14072
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.0.0
>
>
> When we run a benchmark program, the result also shows processor information. 
> Since the JVM version may also affect performance, it would be good to show 
> JVM version information.
> Current:
> {noformat}
> model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> String Dictionary:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> -------------------------------------------------------------------------------------------
> SQL Parquet Vectorized                   693 /  740          15.1          66.1       1.0X
> SQL Parquet MR                          2501 / 2562           4.2         238.5       0.3X
> {noformat}
> Proposal:
> {noformat}
> model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> JVM information : IBM J9 VM, pxa6480sr2-20151023_01 (SR2)
> String Dictionary:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> -------------------------------------------------------------------------------------------
> SQL Parquet Vectorized                   693 /  740          15.1          66.1       1.0X
> SQL Parquet MR                          2501 / 2562           4.2         238.5       0.3X
> {noformat}
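
For reference, the proposed JVM line can be derived from standard JVM system properties, e.g. (a sketch only):

{code}
// Sketch: build the "JVM information" line from standard system properties.
val jvmInfo = s"${System.getProperty("java.vm.name")}, ${System.getProperty("java.vm.version")}"
println(s"JVM information : $jvmInfo")
{code}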



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14090) The optimization method of convex function

2016-03-22 Thread chenalong (JIRA)
chenalong created SPARK-14090:
-

 Summary: The optimization method of convex function
 Key: SPARK-14090
 URL: https://issues.apache.org/jira/browse/SPARK-14090
 Project: Spark
  Issue Type: Task
  Components: MLlib, Optimizer
Affects Versions: 2.1.0
Reporter: chenalong
Priority: Critical


Right now, the optimization methods for convex functions in MLlib are not sufficient. 
The SGD and ALS methods are slow compared with the bundle method, so can we 
implement this method in Spark?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double

2016-03-22 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207823#comment-15207823
 ] 

Travis Crawford commented on SPARK-14081:
-

Agreed, all data types should allow filling without changing their data type. 
From what I have observed, only FloatType changes. Here's an example using the 
more specific fill that allows users to provide a replacement-value map per 
column. Notice how just {{floatField}} changes its data type.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val zeroFilledMapDF = df.na.fill(Map(
  "intField" -> 0,
  "longField" -> 0L,
  "floatField" -> 0f,
  "doubleField" -> 0d
))

// Exiting paste mode, now interpreting.

zeroFilledMapDF: org.apache.spark.sql.DataFrame = [intField: int, longField: 
bigint, floatField: double, doubleField: double]
{code}

If what I'm proposing sounds like the correct behavior, I'll put together a 
change and send a pull request. It looks relatively self-contained, perhaps 
with some overzealous casting in {{fill0}} or {{fillCol}}.
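
In the meantime, a workaround sketch (reusing the `df` from the description) is to fill and then cast the affected column back:

{code}
// Workaround sketch: zero-fill, then cast the affected column back to FloatType
// so the original schema is preserved. Reuses `df` from the example above.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.FloatType

val zeroFilledPreserved = df.na.fill(0).withColumn("floatField", col("floatField").cast(FloatType))
{code}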

> DataFrameNaFunctions fill should not convert float fields to double
> ---
>
> Key: SPARK-14081
> URL: https://issues.apache.org/jira/browse/SPARK-14081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Travis Crawford
>
> [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala]
>  provides useful functions for dealing with null values in a DataFrame. 
> Currently it changes FloatType columns to DoubleType when zero filling. Spark 
> should preserve the column data type.
> In the following example, notice how `zeroFilledDF` has its `floatField` 
> converted from float to double.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema = StructType(Seq(
>   StructField("intField", IntegerType),
>   StructField("longField", LongType),
>   StructField("floatField", FloatType),
>   StructField("doubleField", DoubleType)))
> val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null)))
> val df = sqlContext.createDataFrame(rdd, schema)
> val zeroFilledDF = df.na.fill(0)
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(intField,IntegerType,true), 
> StructField(longField,LongType,true), StructField(floatField,FloatType,true), 
> StructField(doubleField,DoubleType,true))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[2] at parallelize at <console>:48
> df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, 
> floatField: float, doubleField: double]
> zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: 
> bigint, floatField: double, doubleField: double]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14074) Do not use install_github in SparkR build

2016-03-22 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207780#comment-15207780
 ] 

Sun Rui commented on SPARK-14074:
-

Yes, an unstable source may cause unexpected test failures. I agree we should use a 
specific GitHub tag. We may periodically check whether there is a new CRAN release 
or a new tag and update the source if needed.

> Do not use install_github in SparkR build
> -
>
> Key: SPARK-14074
> URL: https://issues.apache.org/jira/browse/SPARK-14074
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In dev/lint-r.R, `install_github` makes our builds depend on an unstable 
> source. We should use official releases on CRAN instead, even if the released 
> version has fewer features.
> cc: [~shivaram] [~sunrui]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14089) Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207773#comment-15207773
 ] 

Apache Spark commented on SPARK-14089:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11910

> Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x
> 
>
> Key: SPARK-14089
> URL: https://issues.apache.org/jira/browse/SPARK-14089
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14089) Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14089:


Assignee: Apache Spark

> Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x
> 
>
> Key: SPARK-14089
> URL: https://issues.apache.org/jira/browse/SPARK-14089
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14089) Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14089:


Assignee: (was: Apache Spark)

> Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x
> 
>
> Key: SPARK-14089
> URL: https://issues.apache.org/jira/browse/SPARK-14089
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14089) Remove methods that has been deprecated since 1.1.x, 1.2.x and 1.3.x

2016-03-22 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-14089:
-

 Summary: Remove methods that has been deprecated since 1.1.x, 
1.2.x and 1.3.x
 Key: SPARK-14089
 URL: https://issues.apache.org/jira/browse/SPARK-14089
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Spark Core
Affects Versions: 2.0.0
Reporter: Liwei Lin
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12855) Remove parser pluggability

2016-03-22 Thread Joseph Levin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207765#comment-15207765
 ] 

Joseph Levin commented on SPARK-12855:
--

Reynold - We would expect that, grudgingly. Of course we don't want it to break with 
every build, but some churn we could live with. I guess part of my pushback is that 
I believe this closes off one of the most powerful aspects of SQL on Spark. 
Writing an extensible parser is in itself a large undertaking. (I can only 
think of two others that have similar flexibility: ANTLR, which can be 
implemented to be extensible but isn't fully so out of the box, and MS's Roslyn.) 
Marrying an extensible parser to Spark's distributed, cross-platform 
functionality is, as far as I have been able to find, unique. For this project's 
initial work we didn't even need to be in the Hadoop/big data space; our 
initial set of data sources all support JDBC. We did require a query engine 
that could handle a single request to multiple data sources and give us the 
ability to rewrite the request on the fly. Spark is the only toolset we found 
that met both of those needs.

As a side note, it was the Databricks Deep Dive article on Catalyst that, I 
believe, you co-wrote that led us to try Spark for this problem.

> Remove parser pluggability
> --
>
> Key: SPARK-12855
> URL: https://issues.apache.org/jira/browse/SPARK-12855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This pull request removes the public developer parser API for external 
> parsers. Given that everything a parser depends on (e.g. logical plans and 
> expressions) is internal and not stable, external parsers will break with 
> every release of Spark. It is a bad idea to create the illusion that Spark 
> actually supports pluggable parsers. In addition, this also reduces 
> incentives for 3rd party projects to contribute parser improvements back to 
> Spark.
> The number of applications that are using this feature is small (as far as I 
> know it came down from two to one as of Jan 2016, and will be 0 once we have 
> better ANSI SQL support).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12855) Remove parser pluggability

2016-03-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207729#comment-15207729
 ] 

Reynold Xin commented on SPARK-12855:
-

Joseph - in the case of creating your own parser, you are essentially tying 
your implementation to the internals of Catalyst, and as a result might break 
with every release of Spark. Is that expected?


> Remove parser pluggability
> --
>
> Key: SPARK-12855
> URL: https://issues.apache.org/jira/browse/SPARK-12855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This pull request removes the public developer parser API for external 
> parsers. Given that everything a parser depends on (e.g. logical plans and 
> expressions) is internal and not stable, external parsers will break with 
> every release of Spark. It is a bad idea to create the illusion that Spark 
> actually supports pluggable parsers. In addition, this also reduces 
> incentives for 3rd party projects to contribute parser improvements back to 
> Spark.
> The number of applications that are using this feature is small (as far as I 
> know it came down from two to one as of Jan 2016, and will be 0 once we have 
> better ANSI SQL support).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12855) Remove parser pluggability

2016-03-22 Thread Joseph Levin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207725#comment-15207725
 ] 

Joseph Levin commented on SPARK-12855:
--

I have some concerns about this task. I am working on a project at my company 
to upgrade how we do data management; in particular, how we maintain, stage, and 
deliver data to our customers and internal users. To do this we are using 
Spark as the main component of our data access tier, in large part because of 
the flexibility that the customizable parser gives us. In our case the primary 
use for the parser is not to create a completely new SQL-like syntax but to 
rewrite data requests on the fly, based on changes in the underlying store, 
using the transform methods on the query plans. Essentially it serves as a 
semantic layer for our data services. In the initial, and simplest, scenario 
we are using it to create queryable views of our backend data, but we are also 
looking at it to help address upcoming sharding needs and some more complex 
conditional views. Further down the road we want to evaluate it for constructing 
additional caching tiers and possibly for enforcing data access policies. For 
the latter we were looking at some syntactic enhancements, but so far they have 
all been in the DDL, so they aren't yet impacting the syntax we would use in the 
data requests.

> Remove parser pluggability
> --
>
> Key: SPARK-12855
> URL: https://issues.apache.org/jira/browse/SPARK-12855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This pull request removes the public developer parser API for external 
> parsers. Given that everything a parser depends on (e.g. logical plans and 
> expressions) is internal and not stable, external parsers will break with 
> every release of Spark. It is a bad idea to create the illusion that Spark 
> actually supports pluggable parsers. In addition, this also reduces 
> incentives for 3rd party projects to contribute parser improvements back to 
> Spark.
> The number of applications that are using this feature is small (as far as I 
> know it came down from two to one as of Jan 2016, and will be 0 once we have 
> better ANSI SQL support).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double

2016-03-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207692#comment-15207692
 ] 

Reynold Xin commented on SPARK-14081:
-

This is actually somewhat tricky, because we will lose information when the 
missing value (what the user specifies) is converted from double to float. In 
the case of 0 this is obviously not a problem, but in other cases it might be. 
Also, it might be weird to do this only for float but not for ints. What do you 
think?
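
A quick Scala illustration of the precision concern (the value is chosen just for the example):

{code}
// Narrowing a user-supplied Double replacement value to Float can silently change it.
val replacement: Double = 0.1
val narrowed: Float = replacement.toFloat
println(narrowed.toDouble == replacement)   // false: the Float round-trip does not preserve the original Double
{code}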


> DataFrameNaFunctions fill should not convert float fields to double
> ---
>
> Key: SPARK-14081
> URL: https://issues.apache.org/jira/browse/SPARK-14081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Travis Crawford
>
> [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala]
>  provides useful functions for dealing with null values in a DataFrame. 
> Currently it changes FloatType columns to DoubleType when zero filling. Spark 
> should preserve the column data type.
> In the following example, notice how `zeroFilledDF` has its `floatField` 
> converted from float to double.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema = StructType(Seq(
>   StructField("intField", IntegerType),
>   StructField("longField", LongType),
>   StructField("floatField", FloatType),
>   StructField("doubleField", DoubleType)))
> val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null)))
> val df = sqlContext.createDataFrame(rdd, schema)
> val zeroFilledDF = df.na.fill(0)
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(intField,IntegerType,true), 
> StructField(longField,LongType,true), StructField(floatField,FloatType,true), 
> StructField(doubleField,DoubleType,true))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[2] at parallelize at <console>:48
> df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, 
> floatField: float, doubleField: double]
> zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: 
> bigint, floatField: double, doubleField: double]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14088) Some Dataset API touch-up

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14088:


Assignee: Apache Spark  (was: Reynold Xin)

> Some Dataset API touch-up
> -
>
> Key: SPARK-14088
> URL: https://issues.apache.org/jira/browse/SPARK-14088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> 1. Deprecated unionAll. It is pretty confusing to have both "union" and 
> "unionAll" when the two do the same thing in Spark but are different in SQL.
> 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more 
> consistent with rest of the functions in KeyValueGroupedDataset. Also makes 
> it more obvious what "reduce" and "reduceGroups" mean. Previously it was 
> confusing because it could be reducing a Dataset, or just reducing groups.
> 3. Added a "name" function, which is more natural to name columns than "as" 
> for non-SQL users.
> 4. Remove "subtract" function since it is just an alias for "except".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14088) Some Dataset API touch-up

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14088:


Assignee: Reynold Xin  (was: Apache Spark)

> Some Dataset API touch-up
> -
>
> Key: SPARK-14088
> URL: https://issues.apache.org/jira/browse/SPARK-14088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> 1. Deprecated unionAll. It is pretty confusing to have both "union" and 
> "unionAll" when the two do the same thing in Spark but are different in SQL.
> 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more 
> consistent with rest of the functions in KeyValueGroupedDataset. Also makes 
> it more obvious what "reduce" and "reduceGroups" mean. Previously it was 
> confusing because it could be reducing a Dataset, or just reducing groups.
> 3. Added a "name" function, which is more natural to name columns than "as" 
> for non-SQL users.
> 4. Remove "subtract" function since it is just an alias for "except".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14088) Some Dataset API touch-up

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207674#comment-15207674
 ] 

Apache Spark commented on SPARK-14088:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11908

> Some Dataset API touch-up
> -
>
> Key: SPARK-14088
> URL: https://issues.apache.org/jira/browse/SPARK-14088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> 1. Deprecated unionAll. It is pretty confusing to have both "union" and 
> "unionAll" when the two do the same thing in Spark but are different in SQL.
> 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more 
> consistent with rest of the functions in KeyValueGroupedDataset. Also makes 
> it more obvious what "reduce" and "reduceGroups" mean. Previously it was 
> confusing because it could be reducing a Dataset, or just reducing groups.
> 3. Added a "name" function, which is more natural to name columns than "as" 
> for non-SQL users.
> 4. Remove "subtract" function since it is just an alias for "except".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14088) Some Dataset API touch-up

2016-03-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14088:
---

 Summary: Some Dataset API touch-up
 Key: SPARK-14088
 URL: https://issues.apache.org/jira/browse/SPARK-14088
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


1. Deprecated unionAll. It is pretty confusing to have both "union" and 
"unionAll" when the two do the same thing in Spark but are different in SQL.

2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more 
consistent with rest of the functions in KeyValueGroupedDataset. Also makes it 
more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing 
because it could be reducing a Dataset, or just reducing groups.

3. Added a "name" function, which is more natural to name columns than "as" for 
non-SQL users.
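
A short sketch of the renamed/added pieces against the Spark 2.0 Dataset API (the SparkSession setup and the Rec case class here are just for illustration):

{code}
// Sketch of the API changes described above, using the Spark 2.0 Dataset API.
case class Rec(key: String, value: Int)

val spark = org.apache.spark.sql.SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Rec("a", 1), Rec("a", 2), Rec("b", 3)).toDS()

// "reduceGroups" makes it explicit that the reduction is per group:
val perGroup = ds.groupByKey(_.key).reduceGroups((x, y) => Rec(x.key, x.value + y.value))

// "name" reads more naturally than "as" for naming a column:
val renamed = ds.select($"value".name("amount"))
{code}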







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14088) Some Dataset API touch-up

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14088:

Description: 
1. Deprecated unionAll. It is pretty confusing to have both "union" and 
"unionAll" when the two do the same thing in Spark but are different in SQL.

2. Renamed reduce in KeyValueGroupedDataset to reduceGroups so it is more 
consistent with the rest of the functions in KeyValueGroupedDataset. This also 
makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was 
confusing because it could be reducing a Dataset, or just reducing groups.

3. Added a "name" function, which is a more natural way to name columns than 
"as" for non-SQL users.

4. Removed the "subtract" function since it is just an alias for "except".





  was:
1. Deprecated unionAll. It is pretty confusing to have both "union" and 
"unionAll" when the two do the same thing in Spark but are different in SQL.

2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more 
consistent with rest of the functions in KeyValueGroupedDataset. Also makes it 
more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing 
because it could be reducing a Dataset, or just reducing groups.

3. Added a "name" function, which is more natural to name columns than "as" for 
non-SQL users.






> Some Dataset API touch-up
> -
>
> Key: SPARK-14088
> URL: https://issues.apache.org/jira/browse/SPARK-14088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> 1. Deprecated unionAll. It is pretty confusing to have both "union" and 
> "unionAll" when the two do the same thing in Spark but are different in SQL.
> 2. Renamed reduce in KeyValueGroupedDataset to reduceGroups so it is more 
> consistent with the rest of the functions in KeyValueGroupedDataset. This also 
> makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was 
> confusing because it could be reducing a Dataset, or just reducing groups.
> 3. Added a "name" function, which is a more natural way to name columns than 
> "as" for non-SQL users.
> 4. Removed the "subtract" function since it is just an alias for "except".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14066) Set "spark.sql.dialect=sql", there is a problem in running query "select percentile(d,array(0,0.2,0.3,1)) as a from t;"

2016-03-22 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-14066:
--
Description: 
In Spark 1.5.1, I run "sh bin/spark-sql  --conf spark.sql.dialect=sql" and then 
run the query "select percentile(d,array(0,0.2,0.3,1))  as  a from t". The 
following problem occurs.
{code}
spark-sql> select percentile(d,array(0,0.2,0.3,1))  as  a from t;
16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t
16/03/22 17:25:15 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=t
16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select 
percentile(d,array(0,0.2,0.3,1))  as  a from t]
org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' due 
to data type mismatch: input to function array should all be the same type, but 
it's [int, decimal(1,1), decimal(1,1), int];
{code}

  was:
In spark 1.5.1, I run "sh bin/spark-sql  --conf spark.sql.dialect=sql", and run 
query "select percentile(d,array(0,0.2,0.3,1))  as  a from t". There is a 
problem as follows.

spark-sql> select percentile(d,array(0,0.2,0.3,1))  as  a from t;
16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t
16/03/22 17:25:15 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=t
16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select 
percentile(d,array(0,0.2,0.3,1))  as  a from t]
org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' due 
to data type mismatch: input to function array should all be the same type, but 
it's [int, decimal(1,1), decimal(1,1), int];


> Set "spark.sql.dialect=sql", there is a problen in running query "select 
> percentile(d,array(0,0.2,0.3,1))  as  a from t;"
> -
>
> Key: SPARK-14066
> URL: https://issues.apache.org/jira/browse/SPARK-14066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: KaiXinXIaoLei
>
> In spark 1.5.1, I run "sh bin/spark-sql  --conf spark.sql.dialect=sql", and 
> run query "select percentile(d,array(0,0.2,0.3,1))  as  a from t". There is a 
> problem as follows.
> {code}
> spark-sql> select percentile(d,array(0,0.2,0.3,1))  as  a from t;
> 16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t
> 16/03/22 17:25:15 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table 
> : db=default tbl=t
> 16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select 
> percentile(d,array(0,0.2,0.3,1))  as  a from t]
> org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' 
> due to data type mismatch: input to function array should all be the same 
> type, but it's [int, decimal(1,1), decimal(1,1), int];
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14066) Set "spark.sql.dialect=sql", there is a problem in running query "select percentile(d,array(0,0.2,0.3,1)) as a from t;"

2016-03-22 Thread KaiXinXIaoLei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206280#comment-15206280
 ] 

KaiXinXIaoLei edited comment on SPARK-14066 at 3/23/16 1:19 AM:


In 
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion.FunctionArgumentConversion,
 I found the following in `findTightestCommonTypeOfTwo`: 
{code}
case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
  Some(t2)
case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
  Some(t1)
{code}

In `array(0,0.2,0.3,1)`, the integer literal `0` corresponds to `DecimalType(10, 0)` 
for the widening check, while the type of `0.2` is `DecimalType(1, 1)`, so 
`t2.isWiderThan(t1)` is false. As a result the element types stay 
[int, decimal(1,1), decimal(1,1), int] and the query fails. 
 
So I think `findTightestCommonTypeOfTwo` is not reasonable here. Thanks.
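
A possible workaround sketch (my assumption, not a confirmed fix): cast the 
literals explicitly so that every array element already has the same type 
before FunctionArgumentConversion runs. This assumes the usual sqlContext in 
spark-shell and the table t from above.
{code}
// every element is cast to double up front, so no common-type search is needed
val fixed = sqlContext.sql(
  "select percentile(d, array(cast(0 as double), cast(0.2 as double), " +
  "cast(0.3 as double), cast(1 as double))) as a from t")
fixed.show()
{code}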


was (Author: kaixinxiaolei):
In the 
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion.FunctionArgumentConversion,
 I find in the value `findTightestCommonTypeOfTwo`: 
```
case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
  Some(t2)
case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
  Some(t1)
```

In `array(0,0.2,0.3,1)`, The type of `0` changes `DecimalType(10, 0)`, The type 
of `0.2` is `DecimalType(1, 1)`, so the value of  `t2.isWiderThan(t1) ` is 
false. So the type of  numbers will be [int, decimal(1,1), decimal(1,1), int]. 
And the query run failed. 
 
So I think the `TightestCommonTypeOfTwo` is not reasonable. Thanks.

> Set "spark.sql.dialect=sql", there is a problen in running query "select 
> percentile(d,array(0,0.2,0.3,1))  as  a from t;"
> -
>
> Key: SPARK-14066
> URL: https://issues.apache.org/jira/browse/SPARK-14066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: KaiXinXIaoLei
>
> In spark 1.5.1, I run "sh bin/spark-sql  --conf spark.sql.dialect=sql", and 
> run query "select percentile(d,array(0,0.2,0.3,1))  as  a from t". There is a 
> problem as follows.
> {code}
> spark-sql> select percentile(d,array(0,0.2,0.3,1))  as  a from t;
> 16/03/22 17:25:15 INFO HiveMetaStore: 0: get_table : db=default tbl=t
> 16/03/22 17:25:15 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table 
> : db=default tbl=t
> 16/03/22 17:25:16 ERROR SparkSQLDriver: Failed in [select 
> percentile(d,array(0,0.2,0.3,1))  as  a from t]
> org.apache.spark.sql.AnalysisException: cannot resolve 'array(0,0.2,0.3,1)' 
> due to data type mismatch: input to function array should all be the same 
> type, but it's [int, decimal(1,1), decimal(1,1), int];
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14033) Merging Estimator & Model

2016-03-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14033:
--
Summary: Merging Estimator & Model  (was: Merging Estimator, Model, & 
Transformer)

> Merging Estimator & Model
> -
>
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as 
> scikit-learn).
> For details, please see the linked design doc.  Comment on this JIRA to give 
> feedback on the proposed design.  Once the proposal is discussed and this 
> work is confirmed as ready to proceed, this JIRA will serve as an umbrella 
> for the merge tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14087:


Assignee: (was: Apache Spark)

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207604#comment-15207604
 ] 

Apache Spark commented on SPARK-14087:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11906

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14087:


Assignee: Apache Spark

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13806.

   Resolution: Fixed
Fix Version/s: 1.6.2
   1.5.3
   2.1.0

Issue resolved by pull request 11894
[https://github.com/apache/spark/pull/11894]

> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>Assignee: Davies Liu
> Fix For: 2.1.0, 1.5.3, 1.6.2
>
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 
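
A quick, hedged illustration of the rounding discrepancy described above, using 
plain JVM APIs rather than Spark internals: Math.round rounds ties toward 
positive infinity, while BigDecimal.ROUND_HALF_UP rounds ties away from zero.
{code}
import java.math.{BigDecimal => JBigDecimal}

println(Math.round(-3.5d))                                               // -3
println(new JBigDecimal("-3.5").setScale(0, JBigDecimal.ROUND_HALF_UP))  // -4
{code}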



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-22 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207555#comment-15207555
 ] 

Bryan Cutler commented on SPARK-14087:
--

I can post a PR for this

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-22 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-14087:
-
Attachment: feature.py

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-22 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-14087:


 Summary: PySpark ML JavaModel does not properly own params after 
being fit
 Key: SPARK-14087
 URL: https://issues.apache.org/jira/browse/SPARK-14087
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Bryan Cutler
Priority: Minor


When a PySpark model is created after fitting data, its UID is initialized to 
the parent estimator's value.  Before this assignment, any params defined in 
the model are copied from the object to the class in {{Params._copy_params()}} 
and assigned a different parent UID.  This causes PySpark to think the params 
are not owned by the model and can lead to a {{ValueError}} raised from 
{{Params._shouldOwn()}}, such as:

{noformat}
ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
name='outputCol', doc='output column name.') does not belong to 
CountVectorizer_4c8e9fd539542d783e66.
{noformat}

I encountered this problem while working on SPARK-13967 where I tried to add 
the shared params {{HasInputCol}} and {{HasOutputCol}} to 
{{CountVectorizerModel}}.  See the attached file feature.py for the WIP.

Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
produces the error above.

{noformat}
sc = SparkContext(appName="count_vec_test")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
[(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
"raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(df)
print(model.uid)
for p in model.params:
  print(str(p))
model.transform(df).show(truncate=False)
{noformat}

output (the UIDs should match):
{noformat}
CountVectorizer_4c8e9fd539542d783e66
CountVectorizerModel_4336a81ba742b2593fef__binary
CountVectorizerModel_4336a81ba742b2593fef__inputCol
CountVectorizerModel_4336a81ba742b2593fef__outputCol
{noformat}

In the Scala implementation of this, the model overrides the UID value, which 
the Params use when they are constructed, so they all end up with the parent 
estimator UID.  
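
For reference, here is a hedged sketch of the Scala pattern described in the 
last paragraph (heavily simplified; real spark.ml classes carry much more 
machinery, and MyEstimator/MyModel are made-up names used only for 
illustration):
{code}
import org.apache.spark.ml.util.Identifiable

// The model is constructed with the estimator's uid, so any Param created in
// the model would report the estimator's uid as its parent.
class MyModel(override val uid: String) extends Identifiable

class MyEstimator(override val uid: String) extends Identifiable {
  def this() = this(Identifiable.randomUID("myEstimator"))
  def fit(): MyModel = new MyModel(uid)
}
{code}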




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5991) Python API for ML model import/export

2016-03-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207537#comment-15207537
 ] 

Joseph K. Bradley commented on SPARK-5991:
--

Reopening since we'll need to add items once more Scala implementations are done

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5991) Python API for ML model import/export

2016-03-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-5991:
--

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5991) Python API for ML model import/export

2016-03-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5991:
-
Fix Version/s: (was: 2.0.0)

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14086) Add DDL commands to ANTLR4 Parser

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14086:


Assignee: (was: Apache Spark)

> Add DDL commands to ANTLR4 Parser
> -
>
> Key: SPARK-14086
> URL: https://issues.apache.org/jira/browse/SPARK-14086
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14086) Add DDL commands to ANTLR4 Parser

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207528#comment-15207528
 ] 

Apache Spark commented on SPARK-14086:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/11905

> Add DDL commands to ANTLR4 Parser
> -
>
> Key: SPARK-14086
> URL: https://issues.apache.org/jira/browse/SPARK-14086
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14086) Add DDL commands to ANTLR4 Parser

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14086:


Assignee: Apache Spark

> Add DDL commands to ANTLR4 Parser
> -
>
> Key: SPARK-14086
> URL: https://issues.apache.org/jira/browse/SPARK-14086
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5991) Python API for ML model import/export

2016-03-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5991.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.0.0
>
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14085) Star Expansion for Hash and Concat

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14085:


Assignee: (was: Apache Spark)

> Star Expansion for Hash and Concat
> --
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Support star expansion in hash and concat. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> structDf.select(concat($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14086) Add DDL commands to ANTLR4 Parser

2016-03-22 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-14086:
-

 Summary: Add DDL commands to ANTLR4 Parser
 Key: SPARK-14086
 URL: https://issues.apache.org/jira/browse/SPARK-14086
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14085) Star Expansion for Hash and Concat

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207515#comment-15207515
 ] 

Apache Spark commented on SPARK-14085:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11904

> Star Expansion for Hash and Concat
> --
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Support star expansion in hash and concat. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> structDf.select(concat($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14085) Star Expansion for Hash and Concat

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14085:


Assignee: Apache Spark

> Star Expansion for Hash and Concat
> --
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Support star expansion in hash and concat. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> structDf.select(concat($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14085) Star Expansion for Hash and Concat

2016-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14085:

Description: 
Support star expansion in hash and concat. For example
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
structDf.select(concat($"*"))
{code}

  was:
To support star expansion, we can do it like,
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
structDf.select(concat($"*"))
{code}


> Star Expansion for Hash and Concat
> --
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Support star expansion in hash and concat. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*")
> structDf.select(concat($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14085) Star Expansion for Hash and Concat

2016-03-22 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14085:
---

 Summary: Star Expansion for Hash and Concat
 Key: SPARK-14085
 URL: https://issues.apache.org/jira/browse/SPARK-14085
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


To support star expansion, we could do something like this:
{code}
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
structDf.select(concat($"*"))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14084) Parallel training jobs in model selection

2016-03-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14084:
-

 Summary: Parallel training jobs in model selection
 Key: SPARK-14084
 URL: https://issues.apache.org/jira/browse/SPARK-14084
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


In CrossValidator and TrainValidationSplit, we run training jobs one by one. If 
users have a big cluster, they might see speed-ups if we parallelize the jobs. 
The trade-off is that we might need to make multiple copies of the training 
data, which could be expensive. It is worth testing and figuring out the best 
way to implement it.
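
A rough sketch of the parallel job-submission idea (an assumption about one 
possible approach, not a committed design; trainAndEvaluate and numParamMaps 
below are placeholders):
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val numParamMaps = 4  // placeholder for the size of the parameter grid

// placeholder for fitting the estimator with paramMaps(i) and scoring it on
// the held-out split
def trainAndEvaluate(i: Int): Double = 0.0

// submit all training jobs from the driver at once so the scheduler can
// overlap them, then collect the metrics when every job has finished
val futures = (0 until numParamMaps).map(i => Future(trainAndEvaluate(i)))
val metrics: Seq[Double] = Await.result(Future.sequence(futures), Duration.Inf)
{code}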



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14084) Parallel training jobs in model selection

2016-03-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14084:
--
Description: In CrossValidator and TrainValidationSplit, we run training 
jobs one by one. If users have a big cluster, they might see speed-ups if we 
parallelize the job submission on the driver. The trade-off is that we might 
need to make multiple copies of the training data, which could be expensive. It 
is worth testing and figuring out the best way to implement it.  (was: In 
CrossValidator and TrainValidationSplit, we run training jobs one by one. If 
users have a big cluster, they might see speed-ups if we parallelize the jobs. 
The trade-off is that we might need to make multiple copies of the training 
data, which could be expensive. It is worth testing and figure out the best way 
to implement it.)

> Parallel training jobs in model selection
> -
>
> Key: SPARK-14084
> URL: https://issues.apache.org/jira/browse/SPARK-14084
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In CrossValidator and TrainValidationSplit, we run training jobs one by one. 
> If users have a big cluster, they might see speed-ups if we parallelize the 
> job submission on the driver. The trade-off is that we might need to make 
> multiple copies of the training data, which could be expensive. It is worth 
> testing and figuring out the best way to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6717) Clear shuffle files after checkpointing in ALS

2016-03-22 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207459#comment-15207459
 ] 

holdenk commented on SPARK-6717:


Looking at the code a little bit, I think this probably needs to live in ALS 
rather than in Core.

I don't think we can solve this for all checkpointing in general: when we 
checkpoint a ShuffledRDD directly, it's easy to register its shuffle files for 
cleanup, but in the more general case (like the one in ALS), where we are 
checkpointing a subsequent RDD, we don't know whether it is safe to clean up 
the parents' shuffle files.

We could expose something like `checkPointAndEagerlyCleanParents` in the Core 
API, but I think the chance of misuse is pretty high, and it might be better to 
implement this inside of ML/ALS until there is a second request for it.
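
A minimal sketch of the ALS-local version of this idea (assumptions: the RDD 
being checkpointed is one whose parent shuffle files are safe to drop once the 
checkpoint has materialized, and the helper name below is made up rather than 
existing Spark API):
{code}
import org.apache.spark.rdd.RDD

def checkpointAndHintCleanup[T](rdd: RDD[T]): RDD[T] = {
  rdd.checkpoint()
  rdd.count()   // force the checkpoint to materialize before anything else runs
  System.gc()   // nudge the ContextCleaner to collect now-unreferenced shuffle files
  rdd
}
{code}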

> Clear shuffle files after checkpointing in ALS
> --
>
> Key: SPARK-6717
> URL: https://issues.apache.org/jira/browse/SPARK-6717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>  Labels: als
>
> In ALS iterations, we checkpoint RDDs to cut lineage and to reduce shuffle 
> files. However, whether to clean shuffle files depends on the system GC, 
> which may not be triggered in ALS iterations. So after checkpointing, before 
> we let the RDD object go out of scope, we should clean its shuffle 
> dependencies explicitly. This function could either stay inside ALS or go to 
> Core.
> Without this feature, we can call System.gc() periodically to clean shuffle 
> files of RDDs that went out of scope.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI

2016-03-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207455#comment-15207455
 ] 

Shixiong Zhu commented on SPARK-14079:
--

It's already there. See "spark.sql.ui.retainedExecutions" in SQLListener
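
For example, a hedged snippet showing how that existing knob can be set (the 
value 50 is an arbitrary illustration):
{code}
import org.apache.spark.SparkConf

// keep only the 50 most recent query executions on the SQL tab
val conf = new SparkConf().set("spark.sql.ui.retainedExecutions", "50")
{code}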

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow if there are hundreds of SQL queries on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13952) spark.ml GBT algs need to use random seed

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13952:


Assignee: (was: Apache Spark)

> spark.ml GBT algs need to use random seed
> -
>
> Key: SPARK-13952
> URL: https://issues.apache.org/jira/browse/SPARK-13952
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml.  
> There was one bug I found: The random seed is not used.  A reasonable fix 
> will be to use the original seed to generate a new seed for each tree trained.
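
A hedged sketch of that fix (not the actual patch; userSeed and numTrees are 
placeholders):
{code}
import scala.util.Random

val userSeed = 42L  // the seed Param supplied by the user
val numTrees = 10   // placeholder for the number of boosting iterations

// derive a deterministic, distinct seed for each tree from the original seed
val perTreeSeeds: Array[Long] = {
  val rng = new Random(userSeed)
  Array.fill(numTrees)(rng.nextLong())
}
{code}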



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13952) spark.ml GBT algs need to use random seed

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207444#comment-15207444
 ] 

Apache Spark commented on SPARK-13952:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/11903

> spark.ml GBT algs need to use random seed
> -
>
> Key: SPARK-13952
> URL: https://issues.apache.org/jira/browse/SPARK-13952
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml.  
> There was one bug I found: The random seed is not used.  A reasonable fix 
> will be to use the original seed to generate a new seed for each tree trained.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-03-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14083:

Description: 
One big advantage of the Dataset API is the type safety, at the cost of 
performance due to heavy reliance on user-defined closures/lambdas. These 
closures are typically slower than expressions because we have more flexibility 
to optimize expressions (known data types, no virtual function calls, etc). In 
many cases, it's actually not going to be very difficult to look into the byte 
code of these closures and figure out what they are trying to do. If we can 
understand them, then we can turn them directly into Catalyst expressions for 
more optimized executions.

Some examples are:

{code}
df.map(_.name)  // equivalent to expression col("name")

ds.groupBy(_.gender)  // equivalent to expression col("gender")

df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
lit(18))

df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
{code}

The goal of this ticket is to design a small framework for byte code analysis 
and use that to convert closures/lambdas into Catalyst expressions in order to 
speed up Dataset execution. It is a little bit futuristic, but I believe it is 
very doable. The framework should be easy to reason about (e.g. similar to 
Catalyst).

Note that there is a big emphasis on "small" and "easy to reason about". A 
patch should be rejected if it is too complicated or difficult to reason about.




  was:
In the Dataset API, we are relying more on user-defined functions, which are 
typically slower than expressions because we can more flexibility to optimize 
expressions (known data types, no virtual function calls, etc).

In many cases, it's actually not going to be very difficult to look into the 
byte code of these closures and figure out what they are trying to do. If we 
can understand them, then we can turn them directly into Catalyst expressions 
for more optimized executions.

Some examples are:

{code}
df.map(_.name)  // equivalent to expression col("name")

ds.groupBy(_.gender)  // equivalent to expression col("gender")

df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
lit(18)

df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
{code}

The goal of this ticket is to design a small framework for byte code analysis 
and use that to convert closures/lambdas into Catalyst expressions in order to 
speed up Dataset execution. It is a little bit futuristic, but I believe it is 
very doable. The framework should be easy to reason about (e.g. similar to 
Catalyst).

Note that a big emphasis on "small" and "easy to reason about". A patch should 
be rejected if it is too complicated or difficult to reason about.





> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A 
> patch should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13952) spark.ml GBT algs need to use random seed

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13952:


Assignee: Apache Spark

> spark.ml GBT algs need to use random seed
> -
>
> Key: SPARK-13952
> URL: https://issues.apache.org/jira/browse/SPARK-13952
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml.  
> There was one bug I found: The random seed is not used.  A reasonable fix 
> will be to use the original seed to generate a new seed for each tree trained.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-03-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14083:
---

 Summary: Analyze JVM bytecode and turn closures into Catalyst 
expressions
 Key: SPARK-14083
 URL: https://issues.apache.org/jira/browse/SPARK-14083
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin


In the Dataset API, we are relying more on user-defined functions, which are 
typically slower than expressions because we have more flexibility to optimize 
expressions (known data types, no virtual function calls, etc).

In many cases, it's actually not going to be very difficult to look into the 
byte code of these closures and figure out what they are trying to do. If we 
can understand them, then we can turn them directly into Catalyst expressions 
for more optimized executions.

Some examples are:

{code}
df.map(_.name)  // equivalent to expression col("name")

ds.groupBy(_.gender)  // equivalent to expression col("gender")

df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
lit(18))

df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
{code}

The goal of this ticket is to design a small framework for byte code analysis 
and use that to convert closures/lambdas into Catalyst expressions in order to 
speed up Dataset execution. It is a little bit futuristic, but I believe it is 
very doable. The framework should be easy to reason about (e.g. similar to 
Catalyst).

Note that there is a big emphasis on "small" and "easy to reason about". A patch 
should be rejected if it is too complicated or difficult to reason about.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-22 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207417#comment-15207417
 ] 

Xusen Yin commented on SPARK-14041:
---

[~mengxr] Maybe there is no need to divide them into several JIRAs, since all 
we need to do is delete them.

> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-22 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

*java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-22 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

*java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:Please go through the current example code and list possible duplicates.


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> *java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13019:


Assignee: Xin Ren  (was: Apache Spark)

> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> Goal is to move actual example code to spark/examples and test compilation in 
> Jenkins builds. Then in the markdown, we can reference part of the code to 
> show in the user guide. This requires adding a Jekyll tag that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`
>  and pick code blocks marked "example" and replace code block in 
> {code}{% highlight %}{code}
>  in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337
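
To make the mechanism above concrete, here is a sketch of what a marked example file might look like. The marker comments and the use of a spark-shell {{sc}} are assumptions of this sketch, not taken from the ticket.

{code}
// examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala
// (marker syntax below is only a guess for illustration)
// $example on$
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Compute column summary statistics over a small RDD of vectors.
val observations = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val summary = Statistics.colStats(observations)
println(summary.mean)
// $example off$
{code}

The Jekyll tag would then pull everything between the markers into the user guide in place of the current inline snippet.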



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207399#comment-15207399
 ] 

Apache Spark commented on SPARK-13019:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/11901

> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> Goal is to move actual example code to spark/examples and test compilation in 
> Jenkins builds. Then in the markdown, we can reference part of the code to 
> show in the user guide. This requires adding a Jekyll tag that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`
>  and pick code blocks marked "example" and replace code block in 
> {code}{% highlight %}{code}
>  in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14082) Add support for GPU resource when running on Mesos

2016-03-22 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-14082:


 Summary: Add support for GPU resource when running on Mesos
 Key: SPARK-14082
 URL: https://issues.apache.org/jira/browse/SPARK-14082
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


As Mesos is integrating GPUs as a first-class resource, Spark can benefit by allowing 
frameworks to launch their jobs with GPU resources and by using the GPU information 
provided by Mesos to discover and run those jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11666) Find the best `k` by cutting bisecting k-means cluster tree without recomputation

2016-03-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207390#comment-15207390
 ] 

Burak KĂ–SE commented on SPARK-11666:


Hi, can you share links for references about that?

> Find the best `k` by cutting bisecting k-means cluster tree without 
> recomputation
> -
>
> Key: SPARK-11666
> URL: https://issues.apache.org/jira/browse/SPARK-11666
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> For example, scikit-learn's hierarchical clustering support a feature to 
> extract partial tree from the result. We should support a feature like that 
> in order to reduce compute cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double

2016-03-22 Thread Travis Crawford (JIRA)
Travis Crawford created SPARK-14081:
---

 Summary: DataFrameNaFunctions fill should not convert float fields 
to double
 Key: SPARK-14081
 URL: https://issues.apache.org/jira/browse/SPARK-14081
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Travis Crawford


[DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala]
 provides useful functions for dealing with null values in a DataFrame. 
Currently it changes FloatType columns to DoubleType when zero-filling. Spark 
should preserve the column data type.

In the following example, notice how `zeroFilledDF` has its `floatField` 
converted from float to double.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("intField", IntegerType),
  StructField("longField", LongType),
  StructField("floatField", FloatType),
  StructField("doubleField", DoubleType)))

val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null)))

val df = sqlContext.createDataFrame(rdd, schema)

val zeroFilledDF = df.na.fill(0)

// Exiting paste mode, now interpreting.

import org.apache.spark.sql._
import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(intField,IntegerType,true), 
StructField(longField,LongType,true), StructField(floatField,FloatType,true), 
StructField(doubleField,DoubleType,true))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[2] at parallelize at :48
df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, 
floatField: float, doubleField: double]
zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: 
bigint, floatField: double, doubleField: double]
{code}
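
Until the fill behavior preserves the original type, one possible (untested) workaround is to cast the filled column back to FloatType. This sketch reuses the {{zeroFilledDF}} and column names from the example above; the cast itself is my assumption, not part of the report.

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.FloatType

// Possible (untested) workaround: zero-fill, then cast the column back to its
// original FloatType so the schema matches the input DataFrame again.
val restoredDF = zeroFilledDF.withColumn("floatField", col("floatField").cast(FloatType))
// restoredDF: [intField: int, longField: bigint, floatField: float, doubleField: double]
{code}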



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14080) Improve the codegen for Filter

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14080:


Assignee: (was: Apache Spark)

> Improve the codegen for Filter
> --
>
> Key: SPARK-14080
> URL: https://issues.apache.org/jira/browse/SPARK-14080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Bo Meng
>Priority: Minor
>
> Currently, the codegen of the null check for Filter sometimes generates code like 
> the following:
> /* 072 */   if (!(!(filter_isNull2))) continue;
> It would be better as:
> /* 072 */   if (filter_isNull2) continue;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14080) Improve the codegen for Filter

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207383#comment-15207383
 ] 

Apache Spark commented on SPARK-14080:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11900

> Improve the codegen for Filter
> --
>
> Key: SPARK-14080
> URL: https://issues.apache.org/jira/browse/SPARK-14080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Bo Meng
>Priority: Minor
>
> Currently, the codegen of the null check for Filter sometimes generates code like 
> the following:
> /* 072 */   if (!(!(filter_isNull2))) continue;
> It would be better as:
> /* 072 */   if (filter_isNull2) continue;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14080) Improve the codegen for Filter

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14080:


Assignee: Apache Spark

> Improve the codegen for Filter
> --
>
> Key: SPARK-14080
> URL: https://issues.apache.org/jira/browse/SPARK-14080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Bo Meng
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the codegen of the null check for Filter sometimes generates code like 
> the following:
> /* 072 */   if (!(!(filter_isNull2))) continue;
> It would be better as:
> /* 072 */   if (filter_isNull2) continue;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14080) Improve the codegen for Filter

2016-03-22 Thread Bo Meng (JIRA)
Bo Meng created SPARK-14080:
---

 Summary: Improve the codegen for Filter
 Key: SPARK-14080
 URL: https://issues.apache.org/jira/browse/SPARK-14080
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Bo Meng
Priority: Minor


Currently, the codegen of the null check for Filter sometimes generates code like 
the following:
/* 072 */   if (!(!(filter_isNull2))) continue;

It would be better as:
/* 072 */   if (filter_isNull2) continue;
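
The change amounts to eliminating a double negation before the Java source is emitted. As a minimal sketch of that idea on Catalyst {{Not}} expression nodes (not the actual codegen path, which works on the generated source strings):

{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, Not}

// Sketch only: collapse Not(Not(e)) into e before code generation, so the
// generated null check becomes `if (filter_isNull2) continue;` directly
// instead of `if (!(!(filter_isNull2))) continue;`.
def simplifyNot(e: Expression): Expression = e match {
  case Not(Not(child)) => simplifyNot(child)
  case other           => other
}
{code}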



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI

2016-03-22 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207349#comment-15207349
 ] 

Davies Liu commented on SPARK-14079:


Yes, that's what I meant.

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow when there are hundreds of SQL queries on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-03-22 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren reopened SPARK-13019:
-

Need to fix the Scala 2.10 compile.

> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> Goal is to move actual example code to spark/examples and test compilation in 
> Jenkins builds. Then in the markdown, we can reference part of the code to 
> show in the user guide. This requires adding a Jekyll tag that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`
>  and pick code blocks marked "example" and replace code block in 
> {code}{% highlight %}{code}
>  in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13449) Naive Bayes wrapper in SparkR

2016-03-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13449.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11890
[https://github.com/apache/spark/pull/11890]

> Naive Bayes wrapper in SparkR
> -
>
> Key: SPARK-13449
> URL: https://issues.apache.org/jira/browse/SPARK-13449
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 2.0.0
>
>
> Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's 
> naive Bayes implementation is from package e1071 with signature:
> {code}
> ## S3 method for class 'formula'
> naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
> ## Default S3 method, which we don't want to support
> # naiveBayes(x, y, laplace = 0, ...)
> ## S3 method for class 'naiveBayes'
> predict(object, newdata,
>   type = c("class", "raw"), threshold = 0.001, eps = 0, ...)
> {code}
> It should be easy for us to match the parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-03-22 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207294#comment-15207294
 ] 

Sunitha Kambhampati commented on SPARK-14040:
-

I can reproduce this on my master (v2.0 snapshot, synced today). I tried 
the first scenario from the description. 

> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced a warning for :  
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI

2016-03-22 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207275#comment-15207275
 ] 

Andrew Or commented on SPARK-14079:
---

we should do `maxRetainedQueries` or something, similar to what we already do 
for the All Jobs page.
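
For reference, the jobs UI is already bounded by existing configs, and a similar setting for SQL executions could be set the same way; {{spark.sql.ui.retainedExecutions}} below is only a placeholder name for the proposed limit, not an existing option.

{code}
import org.apache.spark.SparkConf

// spark.ui.retainedJobs / spark.ui.retainedStages are existing limits for the jobs UI;
// "spark.sql.ui.retainedExecutions" is a hypothetical name for the limit proposed here.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
  .set("spark.sql.ui.retainedExecutions", "200")
{code}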

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow when there are hundreds of SQL queries on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14075) Refactor MemoryStore to be testable independent of BlockManager

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207266#comment-15207266
 ] 

Apache Spark commented on SPARK-14075:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11899

> Refactor MemoryStore to be testable independent of BlockManager
> ---
>
> Key: SPARK-14075
> URL: https://issues.apache.org/jira/browse/SPARK-14075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> It would be nice to refactor the MemoryStore so that it can be unit-tested 
> without constructing a full BlockManager or needing to mock tons of things.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13514) Spark Shuffle Service 1.6.0 issue in Yarn

2016-03-22 Thread Satish Kolli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207236#comment-15207236
 ] 

Satish Kolli edited comment on SPARK-13514 at 3/22/16 8:37 PM:
---

I just upgraded the shuffle service to 1.6.1 and the *YARN node managers* have 
the following. I used the same code I used in my original post to test it.

{code}
ERROR org.apache.spark.network.TransportContext: Error while initializing Netty 
pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
2016-03-22 16:29:04,234 WARN io.netty.channel.ChannelInitializer: Failed to 
initialize a channel. Closing: [id: 0x80721ee3, /..:43869 => 
/..:7337]
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
2016-03-22 16:29:05,023 ERROR org.apache.spark.network.TransportContext: Error 
while initializing Netty pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 

[jira] [Comment Edited] (SPARK-13514) Spark Shuffle Service 1.6.0 issue in Yarn

2016-03-22 Thread Satish Kolli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207236#comment-15207236
 ] 

Satish Kolli edited comment on SPARK-13514 at 3/22/16 8:34 PM:
---

I just upgraded the shuffle service to 1.6.1 and the *YARN node managers* have 
the following. I used the same code I used in my original post to test it.

{code}
ERROR org.apache.spark.network.TransportContext: Error while initializing Netty 
pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
2016-03-22 16:29:04,234 WARN io.netty.channel.ChannelInitializer: Failed to 
initialize a channel. Closing: [id: 0x80721ee3, /..:43869 => 
/..:7337]
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
2016-03-22 16:29:05,023 ERROR org.apache.spark.network.TransportContext: Error 
while initializing Netty pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 

[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set

2016-03-22 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207239#comment-15207239
 ] 

JESSE CHEN commented on SPARK-13864:


Tried on two recent builds; both have issues running to completion. Something is 
broken. Looking into why...

> TPCDS query 74 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13864
> URL: https://issues.apache.org/jira/browse/SPARK-13864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 74 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Spark SQL has the right answer but in the wrong order (and there is an 'order by' in 
> the query).
> Actual results:
> {noformat}
> [BLEIBAAA,Paula,Wakefield]
> [DFIEBAAA,John,Gray]
> [OCLBBAAA,null,null]
> [PKBCBAAA,Andrea,White]
> [EJDL,Alice,Wright]
> [FACE,Priscilla,Miller]
> [LFKK,Ignacio,Miller]
> [LJNCBAAA,George,Gamez]
> [LIOP,Derek,Allen]
> [EADJ,Ruth,Carroll]
> [JGMM,Richard,Larson]
> [PKIK,Wendy,Horvath]
> [FJHF,Larissa,Roy]
> [EPOG,Felisha,Mendes]
> [EKJL,Aisha,Carlson]
> [HNFH,Rebecca,Wilson]
> [IBFCBAAA,Ruth,Grantham]
> [OPDL,Ann,Pence]
> [NIPL,Eric,Lawrence]
> [OCIC,Zachary,Pennington]
> [OFLC,James,Taylor]
> [GEHI,Tyler,Miller]
> [CADP,Cristobal,Thomas]
> [JIAL,Santos,Gutierrez]
> [PMMBBAAA,Paul,Jordan]
> [DIIO,David,Carroll]
> [DFKABAAA,Latoya,Craft]
> [HMOI,Grace,Henderson]
> [PPIBBAAA,Candice,Lee]
> [JONHBAAA,Warren,Orozco]
> [GNDA,Terry,Mcdowell]
> [CIJM,Elizabeth,Thomas]
> [DIJGBAAA,Ruth,Sanders]
> [NFBDBAAA,Vernice,Fernandez]
> [IDKF,Michael,Mack]
> [IMHB,Kathy,Knowles]
> [LHMC,Brooke,Nelson]
> [CFCGBAAA,Marcus,Sanders]
> [NJHCBAAA,Christopher,Schreiber]
> [PDFB,Terrance,Banks]
> [ANFA,Philip,Banks]
> [IADEBAAA,Diane,Aldridge]
> [ICHF,Linda,Mccoy]
> [CFEN,Christopher,Dawson]
> [KOJJ,Gracie,Mendoza]
> [FOJA,Don,Castillo]
> [FGPG,Albert,Wadsworth]
> [KJBK,Georgia,Scott]
> [EKFP,Annika,Chin]
> [IBAEBAAA,Sandra,Wilson]
> [MFFL,Margret,Gray]
> [KNAK,Gladys,Banks]
> [CJDI,James,Kerr]
> [OBADBAAA,Elizabeth,Burnham]
> [AMGD,Kenneth,Harlan]
> [HJLA,Audrey,Beltran]
> [AOPFBAAA,Jerry,Fields]
> [CNAGBAAA,Virginia,May]
> [HGOABAAA,Sonia,White]
> [KBCABAAA,Debra,Bell]
> [NJAG,Allen,Hood]
> [MMOBBAAA,Margaret,Smith]
> [NGDBBAAA,Carlos,Jewell]
> [FOGI,Michelle,Greene]
> [JEKFBAAA,Norma,Burkholder]
> [OCAJ,Jenna,Staton]
> [PFCL,Felicia,Neville]
> [DLHBBAAA,Henry,Bertrand]
> [DBEFBAAA,Bennie,Bowers]
> [DCKO,Robert,Gonzalez]
> [KKGE,Katie,Dunbar]
> [GFMDBAAA,Kathleen,Gibson]
> [IJEM,Charlie,Cummings]
> [KJBL,Kerry,Davis]
> [JKBN,Julie,Kern]
> [MDCA,Louann,Hamel]
> [EOAK,Molly,Benjamin]
> [IBHH,Jennifer,Ballard]
> [PJEN,Ashley,Norton]
> [KLHHBAAA,Manuel,Castaneda]
> [IMHHBAAA,Lillian,Davidson]
> [GHPBBAAA,Nick,Mendez]
> [BNBB,Irma,Smith]
> [FBAH,Michael,Williams]
> [PEHEBAAA,Edith,Molina]
> [FMHI,Emilio,Darling]
> [KAEC,Milton,Mackey]
> [OCDJ,Nina,Sanchez]
> [FGIG,Eduardo,Miller]
> [FHACBAAA,null,null]
> [HMJN,Ryan,Baptiste]
> [HHCABAAA,William,Stewart]
> {noformat}
> Expected results:
> {noformat}
> +--+-++
> | CUSTOMER_ID  | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME |
> +--+-++
> | AMGD | Kenneth | Harlan |
> | ANFA | Philip  | Banks  |
> | AOPFBAAA | Jerry   | Fields |
> | BLEIBAAA | Paula   | Wakefield  |
> | BNBB | Irma| Smith  |
> | CADP | Cristobal   | Thomas |
> | CFCGBAAA | Marcus  | Sanders

[jira] [Commented] (SPARK-13514) Spark Shuffle Service 1.6.0 issue in Yarn

2016-03-22 Thread Satish Kolli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207236#comment-15207236
 ] 

Satish Kolli commented on SPARK-13514:
--

I just upgraded the shuffle service with 1.6.1 and the *YARN node managers* 
have the following. I used the same code I used in my original post

{code}
ERROR org.apache.spark.network.TransportContext: Error while initializing Netty 
pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
2016-03-22 16:29:04,234 WARN io.netty.channel.ChannelInitializer: Failed to 
initialize a channel. Closing: [id: 0x80721ee3, /..:43869 => 
/..:7337]
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
2016-03-22 16:29:05,023 ERROR org.apache.spark.network.TransportContext: Error 
while initializing Netty pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
at 

[jira] [Updated] (SPARK-14079) Limit the number of queries on SQL UI

2016-03-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14079:
---
Description: The SQL UI becomes very, very slow when there are hundreds of SQL 
queries on it.

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow when there are hundreds of SQL queries on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI

2016-03-22 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207231#comment-15207231
 ] 

Davies Liu commented on SPARK-14079:


cc [~zsxwing] [~andrewor14]

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow when there are hundreds of SQL queries on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13887) PyLint should fail fast to make errors easier to discover

2016-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207227#comment-15207227
 ] 

Apache Spark commented on SPARK-13887:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/11898

> PyLint should fail fast to make errors easier to discover
> -
>
> Key: SPARK-13887
> URL: https://issues.apache.org/jira/browse/SPARK-13887
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Right now our PyLint script runs all of the checks and then returns the output, 
> which can make it difficult to find the part that errored and complicates the 
> script a bit. We can simplify our script to fail fast, which will both simplify 
> the script and make it easier to discover the errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14079) Limit the number of queries on SQL UI

2016-03-22 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14079:
--

 Summary: Limit the number of queries on SQL UI
 Key: SPARK-14079
 URL: https://issues.apache.org/jira/browse/SPARK-14079
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13887) PyLint should fail fast to make errors easier to discover

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13887:


Assignee: (was: Apache Spark)

> PyLint should fail fast to make errors easier to discover
> -
>
> Key: SPARK-13887
> URL: https://issues.apache.org/jira/browse/SPARK-13887
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Right now our PyLint script runs all of the checks and then returns the output, 
> which can make it difficult to find the part that errored and complicates the 
> script a bit. We can simplify our script to fail fast, which will both simplify 
> the script and make it easier to discover the errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13887) PyLint should fail fast to make errors easier to discover

2016-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13887:


Assignee: Apache Spark

> PyLint should fail fast to make errors easier to discover
> -
>
> Key: SPARK-13887
> URL: https://issues.apache.org/jira/browse/SPARK-13887
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> Right now our PyLint script runs all of the checks and then returns the output, 
> which can make it difficult to find the part that errored and complicates the 
> script a bit. We can simplify our script to fail fast, which will both simplify 
> the script and make it easier to discover the errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-22 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207219#comment-15207219
 ] 

Gayathri Murali edited comment on SPARK-13733 at 3/22/16 8:28 PM:
--

[~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - 
mentions one key difference between page rank and personalized page rank as:
"In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we 
set 1 only to the seed node. 

1. For this JIRA, do we want to instead set an initial distribution of weights 
such that all nodes receive non-zero initial values?  Could you clarify if this 
is the behavior that is intended?
2. What is the idea behind having an initial weight distribution for other 
vertices in personalized page rank? 





was (Author: gayathrimurali):
[~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - 
mentions one key difference between page rank and personalized page rank as:
"In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we 
set 1 only to the seed node. 

For this JIRA, do we want to instead set an initial distribution of weights 
such that all nodes receive non-zero initial values?  Could you clarify if this 
is the behavior that is intended?




> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-22 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207219#comment-15207219
 ] 

Gayathri Murali commented on SPARK-13733:
-

[~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - 
mentions one key difference between page rank and personalized page rank as:
"In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we 
set 1 only to the seed node. 

For this JIRA, do we want to instead set an initial distribution of weights 
such that all nodes receive non-zero initial values?  Could you clarify if this 
is the behavior that is intended?




> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-22 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13858.
--
Resolution: Not A Bug

Schema updates generated correct results in both Spark 1.6 and 2.0. Good to 
close. 

> TPCDS query 21 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13858
> URL: https://issues.apache.org/jira/browse/SPARK-13858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABDA); I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. 

[jira] [Commented] (SPARK-13971) Implicit group by with distinct modifier on having raises an unexpected error

2016-03-22 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207168#comment-15207168
 ] 

Sunitha Kambhampati commented on SPARK-13971:
-

FWIW, it is not the exact same environment, but I tried the same query against 
master (v2.0 snapshot) and it worked fine using the sqlContext.
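
For anyone hitting this on 1.6, one possible (untested) rewrite that avoids the implicit-group-by-plus-HAVING shape is to filter over an explicit subquery. The sketch below uses the table and column names from the report and assumes a {{sqlContext}} is in scope; whether it sidesteps the analyzer bug has not been verified.

{code}
// Untested sketch of a possible rewrite for Spark 1.6: push the distinct count
// into a subquery and filter on the aliased result instead of using HAVING.
val result = sqlContext.sql("""
  SELECT cnt
  FROM (SELECT COUNT(DISTINCT field1) AS cnt FROM test_table) t
  WHERE cnt = 3
""")
result.show()
{code}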

> Implicit group by with distinct modifier on having raises an unexpected error
> -
>
> Key: SPARK-13971
> URL: https://issues.apache.org/jira/browse/SPARK-13971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: spark standalone mode installed on Centos7
>Reporter: Javier PĂ©rez
>
> 1. Start the thriftserver
> 2. Connect with beeline
> 3. Perform the following query over a simple table:
> SELECT COUNT(DISTINCT field1) FROM test_table HAVING COUNT(DISTINCT field1) = 
> 3
> TRACE:
> ERROR SparkExecuteStatementOperation: Error running hive query: 
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> gid#13616,field1#13617 missing from 
> field1#13612,field2#13611,field2#13608,field3#13610,field4#13613,field5#13609 
> in operator !Expand [List(null, 0, if ((gid#13616 = 1)) field1#13617 else 
> null),List(field2#13608, 1, null)], [field2#13619,gid#13618,if ((gid = 1)) 
> field1 else null#13620];
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:246)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


