[jira] [Assigned] (SPARK-22934) Make optional clauses order insensitive for CREATE TABLE SQL statement
[ https://issues.apache.org/jira/browse/SPARK-22934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22934: Assignee: Apache Spark (was: Xiao Li) > Make optional clauses order insensitive for CREATE TABLE SQL statement > -- > > Key: SPARK-22934 > URL: https://issues.apache.org/jira/browse/SPARK-22934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Apache Spark > > Each time I write a complex CREATE TABLE statement, I have to open the > .g4 file to find the EXACT order of clauses in the CREATE TABLE statement. When > the order is not right, I get a strange, confusing error message > generated by ANTLR4. > {noformat} > CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name > [(col_name1 col_type1 [COMMENT col_comment1], ...)] > USING datasource > [OPTIONS (key1=val1, key2=val2, ...)] > [PARTITIONED BY (col_name1, col_name2, ...)] > [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] > [LOCATION path] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > [AS select_statement] > {noformat} > The proposal is to make the following clauses order insensitive. > {noformat} > [OPTIONS (key1=val1, key2=val2, ...)] > [PARTITIONED BY (col_name1, col_name2, ...)] > [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] > [LOCATION path] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {noformat} > The same idea is also applicable to CREATE TABLE for Hive tables. > {noformat} > CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name > [(col_name1[:] col_type1 [COMMENT col_comment1], ...)] > [COMMENT table_comment] > [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] > [ROW FORMAT row_format] > [STORED AS file_format] > [LOCATION path] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > [AS select_statement] > {noformat} > The proposal is to make the following clauses order insensitive. > {noformat} > [COMMENT table_comment] > [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] > [ROW FORMAT row_format] > [STORED AS file_format] > [LOCATION path] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22934) Make optional clauses order insensitive for CREATE TABLE SQL statement
[ https://issues.apache.org/jira/browse/SPARK-22934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307380#comment-16307380 ] Apache Spark commented on SPARK-22934: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20133 > Make optional clauses order insensitive for CREATE TABLE SQL statement > -- > > Key: SPARK-22934 > URL: https://issues.apache.org/jira/browse/SPARK-22934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > > Each time I write a complex CREATE TABLE statement, I have to open the > .g4 file to find the EXACT order of clauses in the CREATE TABLE statement. When > the order is not right, I get a strange, confusing error message > generated by ANTLR4. > {noformat} > CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name > [(col_name1 col_type1 [COMMENT col_comment1], ...)] > USING datasource > [OPTIONS (key1=val1, key2=val2, ...)] > [PARTITIONED BY (col_name1, col_name2, ...)] > [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] > [LOCATION path] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > [AS select_statement] > {noformat} > The proposal is to make the following clauses order insensitive. > {noformat} > [OPTIONS (key1=val1, key2=val2, ...)] > [PARTITIONED BY (col_name1, col_name2, ...)] > [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] > [LOCATION path] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {noformat} > The same idea is also applicable to CREATE TABLE for Hive tables. > {noformat} > CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name > [(col_name1[:] col_type1 [COMMENT col_comment1], ...)] > [COMMENT table_comment] > [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] > [ROW FORMAT row_format] > [STORED AS file_format] > [LOCATION path] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > [AS select_statement] > {noformat} > The proposal is to make the following clauses order insensitive. > {noformat} > [COMMENT table_comment] > [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] > [ROW FORMAT row_format] > [STORED AS file_format] > [LOCATION path] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22934) Make optional clauses order insensitive for CREATE TABLE SQL statement
[ https://issues.apache.org/jira/browse/SPARK-22934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22934: Assignee: Xiao Li (was: Apache Spark) > Make optional clauses order insensitive for CREATE TABLE SQL statement > -- > > Key: SPARK-22934 > URL: https://issues.apache.org/jira/browse/SPARK-22934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > > Each time I write a complex CREATE TABLE statement, I have to open the > .g4 file to find the EXACT order of clauses in the CREATE TABLE statement. When > the order is not right, I get a strange, confusing error message > generated by ANTLR4. > {noformat} > CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name > [(col_name1 col_type1 [COMMENT col_comment1], ...)] > USING datasource > [OPTIONS (key1=val1, key2=val2, ...)] > [PARTITIONED BY (col_name1, col_name2, ...)] > [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] > [LOCATION path] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > [AS select_statement] > {noformat} > The proposal is to make the following clauses order insensitive. > {noformat} > [OPTIONS (key1=val1, key2=val2, ...)] > [PARTITIONED BY (col_name1, col_name2, ...)] > [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] > [LOCATION path] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {noformat} > The same idea is also applicable to CREATE TABLE for Hive tables. > {noformat} > CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name > [(col_name1[:] col_type1 [COMMENT col_comment1], ...)] > [COMMENT table_comment] > [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] > [ROW FORMAT row_format] > [STORED AS file_format] > [LOCATION path] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > [AS select_statement] > {noformat} > The proposal is to make the following clauses order insensitive. > {noformat} > [COMMENT table_comment] > [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] > [ROW FORMAT row_format] > [STORED AS file_format] > [LOCATION path] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22934) Make optional clauses order insensitive for CREATE TABLE SQL statement
Xiao Li created SPARK-22934: --- Summary: Make optional clauses order insensitive for CREATE TABLE SQL statement Key: SPARK-22934 URL: https://issues.apache.org/jira/browse/SPARK-22934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Xiao Li Assignee: Xiao Li Each time I write a complex CREATE TABLE statement, I have to open the .g4 file to find the EXACT order of clauses in the CREATE TABLE statement. When the order is not right, I get a strange, confusing error message generated by ANTLR4. {noformat} CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name1 col_type1 [COMMENT col_comment1], ...)] USING datasource [OPTIONS (key1=val1, key2=val2, ...)] [PARTITIONED BY (col_name1, col_name2, ...)] [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (key1=val1, key2=val2, ...)] [AS select_statement] {noformat} The proposal is to make the following clauses order insensitive. {noformat} [OPTIONS (key1=val1, key2=val2, ...)] [PARTITIONED BY (col_name1, col_name2, ...)] [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (key1=val1, key2=val2, ...)] {noformat} The same idea is also applicable to CREATE TABLE for Hive tables. {noformat} CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name1[:] col_type1 [COMMENT col_comment1], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION path] [TBLPROPERTIES (key1=val1, key2=val2, ...)] [AS select_statement] {noformat} The proposal is to make the following clauses order insensitive. {noformat} [COMMENT table_comment] [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION path] [TBLPROPERTIES (key1=val1, key2=val2, ...)] {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
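To make the proposal concrete, the sketch below (table and column names are made up) shows a pair of statements that would become equivalent once the optional clauses are order insensitive; with the grammar quoted above, only the first ordering parses today.
{code}
// Sketch only: hypothetical table and columns, run against an existing SparkSession `spark`.

// Clause order exactly as the current grammar requires:
spark.sql("""
  CREATE TABLE IF NOT EXISTS db1.events (id INT, name STRING)
  USING parquet
  OPTIONS ('compression'='snappy')
  PARTITIONED BY (id)
  COMMENT 'events table'
  TBLPROPERTIES ('owner'='etl')
""")

// The same statement with COMMENT and TBLPROPERTIES moved ahead of OPTIONS and
// PARTITIONED BY -- rejected by the current grammar, accepted once the optional
// clauses become order insensitive:
spark.sql("""
  CREATE TABLE IF NOT EXISTS db1.events2 (id INT, name STRING)
  USING parquet
  COMMENT 'events table'
  TBLPROPERTIES ('owner'='etl')
  OPTIONS ('compression'='snappy')
  PARTITIONED BY (id)
""")
{code}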
[jira] [Assigned] (SPARK-22315) Check for version match between R package and JVM
[ https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22315: Assignee: Shivaram Venkataraman > Check for version match between R package and JVM > - > > Key: SPARK-22315 > URL: https://issues.apache.org/jira/browse/SPARK-22315 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 2.2.1, 2.3.0 > > > With the release of SparkR on CRAN we could have scenarios where users have a > newer version of the package than the Spark cluster they are > connecting to. > We should print appropriate warnings on either (a) connecting to a different > version of the R backend, or (b) connecting to a Spark master running a different > version of Spark (this should ideally happen inside Scala?). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307333#comment-16307333 ] Apache Spark commented on SPARK-13030: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/20132 > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-13030. --- Resolution: Fixed Fix Version/s: 2.3.0 Resolved by https://github.com/apache/spark/pull/19527 > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-13030: - Assignee: Liang-Chi Hsieh > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
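For context, a hedged sketch of why the estimator form matters: fitting records how many categories were seen during training, so a test set that happens to contain fewer categories is still encoded into vectors of the same size. The class and setter names below follow the OneHotEncoderEstimator introduced by the linked pull request; the data is made up.
{code}
import org.apache.spark.ml.feature.OneHotEncoderEstimator
import spark.implicits._   // assumes an existing SparkSession named `spark`

// Training data contains categories 0, 1 and 2; the test data only 0 and 1.
val train = Seq(0.0, 1.0, 2.0, 1.0).toDF("category")
val test  = Seq(0.0, 1.0).toDF("category")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("category"))
  .setOutputCols(Array("categoryVec"))

// fit() learns the number of categories from the training data, so the vectors
// produced for `test` have the same size as those produced for `train`.
val model = encoder.fit(train)
model.transform(test).show(false)
{code}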
[jira] [Resolved] (SPARK-22857) Optimize code by inspecting code
[ https://issues.apache.org/jira/browse/SPARK-22857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22857. --- Resolution: Won't Fix > Optimize code by inspecting code > - > > Key: SPARK-22857 > URL: https://issues.apache.org/jira/browse/SPARK-22857 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.1 >Reporter: xubo245 >Priority: Minor > > Optimize code by inspecting code -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21822. --- Resolution: Not A Problem > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When an insert into a Hive table is finished, it is better to clean out the tmpLocation > dir (the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old Spark version). > Otherwise, when lots of Spark jobs are executed, millions of temporary > directories are left in HDFS, and these temporary directories can only be > deleted by the maintainer through a shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22725) df.select on a Stream is broken, vs a List
[ https://issues.apache.org/jira/browse/SPARK-22725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22725. --- Resolution: Won't Fix No follow up and not clear this is expected to work > df.select on a Stream is broken, vs a List > -- > > Key: SPARK-22725 > URL: https://issues.apache.org/jira/browse/SPARK-22725 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Andrew Ash >Priority: Minor > > See failing test at https://github.com/apache/spark/pull/19917 > Failing: > {noformat} > test("SPARK-ABC123: support select with a splatted stream") { > val df = spark.createDataFrame(sparkContext.emptyRDD[Row], > StructType(List("bar", "foo").map { > StructField(_, StringType, false) > })) > val allColumns = Stream(df.col("bar"), col("foo")) > val result = df.select(allColumns : _*) > } > {noformat} > Succeeds: > {noformat} > test("SPARK-ABC123: support select with a splatted stream") { > val df = spark.createDataFrame(sparkContext.emptyRDD[Row], > StructType(List("bar", "foo").map { > StructField(_, StringType, false) > })) > val allColumns = Seq(df.col("bar"), col("foo")) > val result = df.select(allColumns : _*) > } > {noformat} > After stepping through in a debugger, the difference manifests at > https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120 > Changing {{seq.map}} to {{seq.toList.map}} causes the test to pass. > I think there's a very subtle bug here where the {{Seq}} of column names > passed into {{select}} is expected to eagerly evaluate when {{.map}} is > called on it, even though that's not part of the {{Seq}} contract. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
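The subtlety is easy to reproduce without Spark; a minimal sketch of the laziness difference described above (plain Scala 2.x collections):
{code}
// Stream.map is lazy: the function is applied only as elements are forced,
// while List.map applies it to every element immediately.
var applied = 0

val s = Stream(1, 2, 3).map { x => applied += 1; x + 1 }
println(applied)   // 1 -- at most the head has been computed so far
s.toList           // forcing the Stream evaluates the remaining elements
println(applied)   // 3

val l = List(1, 2, 3).map { x => applied += 1; x + 1 }
println(applied)   // 6 -- List.map ran eagerly over all three elements
{code}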
[jira] [Commented] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy
[ https://issues.apache.org/jira/browse/SPARK-22933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307294#comment-16307294 ] Apache Spark commented on SPARK-22933: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/20129 > R Structured Streaming API for withWatermark, trigger, partitionBy > -- > > Key: SPARK-22933 > URL: https://issues.apache.org/jira/browse/SPARK-22933 > Project: Spark > Issue Type: Bug > Components: SparkR, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy
[ https://issues.apache.org/jira/browse/SPARK-22933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22933: Assignee: Felix Cheung (was: Apache Spark) > R Structured Streaming API for withWatermark, trigger, partitionBy > -- > > Key: SPARK-22933 > URL: https://issues.apache.org/jira/browse/SPARK-22933 > Project: Spark > Issue Type: Bug > Components: SparkR, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy
[ https://issues.apache.org/jira/browse/SPARK-22933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22933: Assignee: Apache Spark (was: Felix Cheung) > R Structured Streaming API for withWatermark, trigger, partitionBy > -- > > Key: SPARK-22933 > URL: https://issues.apache.org/jira/browse/SPARK-22933 > Project: Spark > Issue Type: Bug > Components: SparkR, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy
Felix Cheung created SPARK-22933: Summary: R Structured Streaming API for withWatermark, trigger, partitionBy Key: SPARK-22933 URL: https://issues.apache.org/jira/browse/SPARK-22933 Project: Spark Issue Type: Bug Components: SparkR, Structured Streaming Affects Versions: 2.3.0 Reporter: Felix Cheung Assignee: Felix Cheung -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307277#comment-16307277 ] Neil Alexander McQuarrie commented on SPARK-21727: -- [~felixcheung] Okay thanks -- sorry, need a few more days > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil Alexander McQuarrie >Assignee: Neil Alexander McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22867) Add Isolation Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22867. --- Resolution: Won't Fix > Add Isolation Forest algorithm to MLlib > --- > > Key: SPARK-22867 > URL: https://issues.apache.org/jira/browse/SPARK-22867 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Fangzhou Yang > > Isolation Forest (iForest) is an effective model that focuses on anomaly > isolation. > iForest uses a tree structure to model the data; an iTree isolates anomalies > closer to the root of the tree than normal points. > An anomaly score is calculated by the iForest model to measure the abnormality of > the data instances: the lower, the more abnormal. > More details about iForest can be found in the following papers: > Isolation Forest [1] (https://dl.acm.org/citation.cfm?id=1511387) > and Isolation-Based > Anomaly Detection [2] (https://dl.acm.org/citation.cfm?id=2133363). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages
[ https://issues.apache.org/jira/browse/SPARK-22929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22929. --- Resolution: Not A Problem Pardon if I misunderstood, and reopen it, but if it's just a typo I assume this isn't a bug. > Short name for "kafka" doesn't work in pyspark with packages > > > Key: SPARK-22929 > URL: https://issues.apache.org/jira/browse/SPARK-22929 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Critical > > When I start pyspark using the following command: > {code} > bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 > {code} > The following throws an error: > {code} > spark.read.format("kakfa")... > py4j.protocol.Py4JJavaError: An error occurred while calling o35.load. > : java.lang.ClassNotFoundException: Failed to find data source: kakfa. Please > find packages at http://spark.apache.org/third-party-projects.html > {code} > The following does work: > {code} > spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
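For reference, the short name does resolve when spelled correctly; a hedged sketch of the batch read path (broker address and topic are made up, and the spark-sql-kafka-0-10 package is assumed to be on the classpath):
{code}
// Works in a shell started with
//   --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
val df = spark.read
  .format("kafka")                                       // short name, spelled correctly
  .option("kafka.bootstrap.servers", "localhost:9092")   // made-up broker
  .option("subscribe", "events")                         // made-up topic
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(false)
{code}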
[jira] [Resolved] (SPARK-22871) Add GBT+LR Algorithm in MLlib
[ https://issues.apache.org/jira/browse/SPARK-22871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22871. --- Resolution: Won't Fix > Add GBT+LR Algorithm in MLlib > - > > Key: SPARK-22871 > URL: https://issues.apache.org/jira/browse/SPARK-22871 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Fangzhou Yang > > GBTLRClassifier is a hybrid model of Gradient Boosting Trees and Logistic > Regression. > It is quite practical and popular in many data mining competitions. In this > hybrid model, input features are transformed by means of boosted decision > trees. The output of each individual tree is treated as a categorical input > feature to a sparse linear classifier. Boosted decision trees prove to be very > powerful feature transforms. > Model details about GBTLR can be found in the following paper: > Practical Lessons from > Predicting Clicks on Ads at Facebook (https://dl.acm.org/citation.cfm?id=2648589). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22142) Move Flume support behind a profile
[ https://issues.apache.org/jira/browse/SPARK-22142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307225#comment-16307225 ] Apache Spark commented on SPARK-22142: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20128 > Move Flume support behind a profile > --- > > Key: SPARK-22142 > URL: https://issues.apache.org/jira/browse/SPARK-22142 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Fix For: 2.3.0 > > > Kafka 0.8 support was recently put behind a profile. YARN, Mesos, Kinesis, > Docker-related integration are behind profiles. Flume support seems like it > could as well, to make it opt-in for builds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21893) Put Kafka 0.8 behind a profile
[ https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307224#comment-16307224 ] Apache Spark commented on SPARK-21893: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20128 > Put Kafka 0.8 behind a profile > -- > > Key: SPARK-21893 > URL: https://issues.apache.org/jira/browse/SPARK-21893 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Fix For: 2.3.0 > > > Kafka does not support 0.8.x for Scala 2.12. This code will have to, at > least, be optionally enabled by a profile, which could be enabled by default > for 2.11. Or outright removed. > Update: it'll also require removing 0.8.x examples, because otherwise the > example module has to be split. > While not necessarily connected, it's probably a decent point to declare 0.8 > deprecated. And that means declaring 0.10 (the other API left) as stable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22932) Refactor AnalysisContext
[ https://issues.apache.org/jira/browse/SPARK-22932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22932: Assignee: Apache Spark (was: Xiao Li) > Refactor AnalysisContext > > > Key: SPARK-22932 > URL: https://issues.apache.org/jira/browse/SPARK-22932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22932) Refactor AnalysisContext
[ https://issues.apache.org/jira/browse/SPARK-22932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22932: Assignee: Xiao Li (was: Apache Spark) > Refactor AnalysisContext > > > Key: SPARK-22932 > URL: https://issues.apache.org/jira/browse/SPARK-22932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22932) Refactor AnalysisContext
[ https://issues.apache.org/jira/browse/SPARK-22932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307212#comment-16307212 ] Apache Spark commented on SPARK-22932: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20127 > Refactor AnalysisContext > > > Key: SPARK-22932 > URL: https://issues.apache.org/jira/browse/SPARK-22932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22932) Refactor AnalysisContext
Xiao Li created SPARK-22932: --- Summary: Refactor AnalysisContext Key: SPARK-22932 URL: https://issues.apache.org/jira/browse/SPARK-22932 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22871) Add GBT+LR Algorithm in MLlib
[ https://issues.apache.org/jira/browse/SPARK-22871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307210#comment-16307210 ] Nick Pentreath commented on SPARK-22871: Tree-based feature transformation is covered in SPARK-13677. I think this duplicates that ticket. I also think it is best to leave the functionality separate rather than create a new estimator in Spark; i.e., we could add the leaf-based feature transformation to the tree models, and leave it up to the user to combine that with LR etc. I think this separation of concerns and modularity is better. Finally, as [~srowen] mentions in SPARK-22867, I think this particular model is best kept as a separate Spark package. > Add GBT+LR Algorithm in MLlib > - > > Key: SPARK-22871 > URL: https://issues.apache.org/jira/browse/SPARK-22871 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Fangzhou Yang > > GBTLRClassifier is a hybrid model of Gradient Boosting Trees and Logistic > Regression. > It is quite practical and popular in many data mining competitions. In this > hybrid model, input features are transformed by means of boosted decision > trees. The output of each individual tree is treated as a categorical input > feature to a sparse linear classifier. Boosted decision trees prove to be very > powerful feature transforms. > Model details about GBTLR can be found in the following paper: > Practical Lessons from > Predicting Clicks on Ads at Facebook (https://dl.acm.org/citation.cfm?id=2648589). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical
[ https://issues.apache.org/jira/browse/SPARK-22801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22801. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19991 [https://github.com/apache/spark/pull/19991] > Allow FeatureHasher to specify numeric columns to treat as categorical > -- > > Key: SPARK-22801 > URL: https://issues.apache.org/jira/browse/SPARK-22801 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath > Fix For: 2.3.0 > > > {{FeatureHasher}} added in SPARK-13964 always treats numeric type columns as > numbers and never as categorical features. It is quite common to have > categorical features represented as numbers or codes (often say {{Int}}) in > data sources. > In order to hash these features as categorical, users must first explicitly > convert them to strings which is cumbersome. > Add a new param {{categoricalCols}} which specifies the numeric columns that > should be treated as categorical features. > *Note* while the reverse case is certainly possible (i.e. numeric features > that are encoded as strings and a user would like to treat them as numeric), > this is probably less likely and this case won't be supported at this time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
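A hedged sketch of the new param in use (column names and data are made up; the setter name follows the pull request above):
{code}
import org.apache.spark.ml.feature.FeatureHasher
import spark.implicits._   // assumes an existing SparkSession named `spark`

// `plan_code` is an integer code that should be hashed as a category, not a number.
val df = Seq((1.2, 10, "US"), (3.4, 20, "DE")).toDF("amount", "plan_code", "country")

val hasher = new FeatureHasher()
  .setInputCols(Array("amount", "plan_code", "country"))
  .setCategoricalCols(Array("plan_code"))   // no manual cast to string needed
  .setOutputCol("features")

hasher.transform(df).select("features").show(false)
{code}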
[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-22397: -- Assignee: Huaxin Gao > Add multiple column support to QuantileDiscretizer > -- > > Key: SPARK-22397 > URL: https://issues.apache.org/jira/browse/SPARK-22397 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Huaxin Gao > Fix For: 2.3.0 > > > Once SPARK-20542 adds multi column support to {{Bucketizer}}, we can add > multi column support to the {{QuantileDiscretizer}} too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22397) Add multiple column support to QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22397. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19715 [https://github.com/apache/spark/pull/19715] > Add multiple column support to QuantileDiscretizer > -- > > Key: SPARK-22397 > URL: https://issues.apache.org/jira/browse/SPARK-22397 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > Fix For: 2.3.0 > > > Once SPARK-20542 adds multi column support to {{Bucketizer}}, we can add > multi column support to the {{QuantileDiscretizer}} too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
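A hedged sketch of the multi-column API (data and bucket counts are made up; setter names follow the Bucketizer-style multi-column params referenced above):
{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import spark.implicits._   // assumes an existing SparkSession named `spark`

val df = Seq((1.0, 10.0), (2.0, 30.0), (3.0, 20.0), (4.0, 40.0), (5.0, 50.0))
  .toDF("hour", "amount")

// One discretizer handles both columns, with a per-column number of buckets.
val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("hour", "amount"))
  .setOutputCols(Array("hourBucket", "amountBucket"))
  .setNumBucketsArray(Array(2, 3))

discretizer.fit(df).transform(df).show()
{code}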
[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources
[ https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307187#comment-16307187 ] Apache Spark commented on SPARK-17967: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20125 > Support for list or other types as an option for datasources > > > Key: SPARK-17967 > URL: https://issues.apache.org/jira/browse/SPARK-17967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Hyukjin Kwon > > This was discussed in SPARK-17878 > For other datasources, it seems okay with string/long/boolean/double value as > an option but it seems it is not enough for the datasource such as CSV. As it > is an interface for other external datasources, I guess it'd affect several > ones out there. > I took a look a first but it seems it'd be difficult to support this (need to > change a lot). > One suggestion is support this as a JSON array. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages
[ https://issues.apache.org/jira/browse/SPARK-22929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307149#comment-16307149 ] Jacek Laskowski commented on SPARK-22929: - When I saw the issue I was so surprised, as that's perhaps one of the most often used data sources on...StackOverflow :) But then I'm not using pyspark (and so `spark-submit` may have got hosed for why it deals with python). > Short name for "kafka" doesn't work in pyspark with packages > > > Key: SPARK-22929 > URL: https://issues.apache.org/jira/browse/SPARK-22929 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Critical > > When I start pyspark using the following command: > {code} > bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 > {code} > The following throws an error: > {code} > spark.read.format("kakfa")... > py4j.protocol.Py4JJavaError: An error occurred while calling o35.load. > : java.lang.ClassNotFoundException: Failed to find data source: kakfa. Please > find packages at http://spark.apache.org/third-party-projects.html > {code} > The following does work: > {code} > spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307147#comment-16307147 ] Apache Spark commented on SPARK-22126: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/20124 > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). 
> and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... > } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calulate a metric and allow > model to be cleaned up > val foldMetricMapFutures = modelMapFutures.map { modelMapFuture => > modelMapFuture.map { modelMap => > modelMap.map { case (index: Int, model: Model[_]) => > val metric = eval.evaluate(model.transform(validationDataset, > paramMaps(index))) > (index, me
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307145#comment-16307145 ] Bryan Cutler commented on SPARK-22126: -- Hi All, I've been following the discussions here and the proposed solution seems pretty flexible to be able to do all that is required. If it's ok with you all, I'd still like to submit a PR with an alternate implementation which I brought up way back when this issue came up in SPARK-19357. It is a bit more simple and only adds a basic method to the Estimator API, but still brings back support for model-specific optimization to where it was before any of the parallelism was introduced. Apologies if I am missing something from all the previous discussion that requires a more involved API changes, but I just thought I would bring this up since it is a little more simple and seems to meet our needs, from what I can tell. > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). 
> and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... > } > modelMap > } (executionContext) > } > // Un
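A hedged sketch of what a default implementation of the proposed signature could look like inside an `Estimator[M]` (names beyond the quoted proposal are assumptions; a real estimator with model-specific optimizations would override this to share work across related param maps, e.g. across maxIter values):
{code}
import scala.collection.JavaConverters._
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Inside Estimator[M]: fall back to fitting each ParamMap independently and
// pair every fitted model with the index of the ParamMap that produced it.
def fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap])
    : java.util.Iterator[(java.lang.Integer, M)] = {
  paramMaps.iterator.zipWithIndex.map { case (paramMap, index) =>
    (java.lang.Integer.valueOf(index), fit(dataset, paramMap))
  }.asJava
}
{code}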
[jira] [Commented] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages
[ https://issues.apache.org/jira/browse/SPARK-22929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307144#comment-16307144 ] Michael Armbrust commented on SPARK-22929: -- Haha, thanks [~sowen], you are right. Kafka is a hard word I guess :) > Short name for "kafka" doesn't work in pyspark with packages > > > Key: SPARK-22929 > URL: https://issues.apache.org/jira/browse/SPARK-22929 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Critical > > When I start pyspark using the following command: > {code} > bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 > {code} > The following throws an error: > {code} > spark.read.format("kakfa")... > py4j.protocol.Py4JJavaError: An error occurred while calling o35.load. > : java.lang.ClassNotFoundException: Failed to find data source: kakfa. Please > find packages at http://spark.apache.org/third-party-projects.html > {code} > The following does work: > {code} > spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org