[jira] [Updated] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-13821:
----------------------------------
    Description: 
TPC-DS Query 20 fails to compile with the following error message:
{noformat}
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
{noformat}

  was:
TPC-DS Query 20 fails to compile with the following error message:
{format}
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
{format}

> TPC-DS Query 20 fails to compile
> --------------------------------
>
>                 Key: SPARK-13821
>                 URL: https://issues.apache.org/jira/browse/SPARK-13821
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>        Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Roy Cecil
>
> TPC-DS Query 20 fails to compile with the following error message:
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
>     at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
>     at
[jira] [Updated] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-13821:
----------------------------------
    Description: 
TPC-DS Query 20 fails to compile with the following error message:
{format}
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
{format}

  was:
TPC-DS Query 20 fails to compile with the following error message:
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
    at org.antlr.runtime.DFA.predict(DFA.java:80)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
    at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)

> TPC-DS Query 20 fails to compile
> --------------------------------
>
>                 Key: SPARK-13821
>                 URL: https://issues.apache.org/jira/browse/SPARK-13821
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>        Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Roy Cecil
>
> TPC-DS Query 20 fails to compile with the following error message:
> {format}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
>     at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
>     at
[jira] [Resolved] (SPARK-13794) Rename DataFrameWriter.stream to DataFrameWriter.startStream
[ https://issues.apache.org/jira/browse/SPARK-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha resolved SPARK-13794.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Rename DataFrameWriter.stream to DataFrameWriter.startStream
> -------------------------------------------------------------
>
>                 Key: SPARK-13794
>                 URL: https://issues.apache.org/jira/browse/SPARK-13794
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.0.0
>
> This makes it more obvious with the verb "start" that we are actually starting some execution.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13795) ClassCastException while attempting to show() a DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-13795:
----------------------------------
    Description: 
The DataFrame schema (from printSchema()) is as follows:

allDataJoined.printSchema()
{noformat}
|-- eventType: string (nullable = true)
|-- itemId: string (nullable = true)
|-- productId: string (nullable = true)
|-- productVersion: string (nullable = true)
|-- servicedBy: string (nullable = true)
|-- ACCOUNT_NAME: string (nullable = true)
|-- CONTENTGROUPID: string (nullable = true)
|-- PRODUCT_ID: string (nullable = true)
|-- PROFILE_ID: string (nullable = true)
|-- SALESADVISEREMAIL: string (nullable = true)
|-- businessName: string (nullable = true)
|-- contentGroupId: string (nullable = true)
|-- salesAdviserName: string (nullable = true)
|-- salesAdviserPhone: string (nullable = true)
{noformat}
There is NO column with a datatype other than String. There was previously an inferred column of type long, which was dropped:
{code}
DataFrame allDataJoined = whiteEventJoinedWithReference.
    drop(rliDataFrame.col("occurredAtDate"));
allDataJoined.printSchema(); // output above ^^
// Now:
allDataJoined.show();
{code}
throws the following exception:
{noformat}
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
    at scala.math.Ordering$Int$.compare(Ordering.scala:256)
    at scala.math.Ordering$class.gt(Ordering.scala:97)
    at scala.math.Ordering$Int$.gt(Ordering.scala:256)
    at org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
    at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
    at org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
    at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
    at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
    at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
    at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
{noformat}
[jira] [Updated] (SPARK-13795) ClassCastException while attempting to show() a DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-13795:
----------------------------------
    Description: 
The DataFrame schema (from printSchema()) is as follows:

allDataJoined.printSchema()
{noformat}
|-- eventType: string (nullable = true)
|-- itemId: string (nullable = true)
|-- productId: string (nullable = true)
|-- productVersion: string (nullable = true)
|-- servicedBy: string (nullable = true)
|-- ACCOUNT_NAME: string (nullable = true)
|-- CONTENTGROUPID: string (nullable = true)
|-- PRODUCT_ID: string (nullable = true)
|-- PROFILE_ID: string (nullable = true)
|-- SALESADVISEREMAIL: string (nullable = true)
|-- businessName: string (nullable = true)
|-- contentGroupId: string (nullable = true)
|-- salesAdviserName: string (nullable = true)
|-- salesAdviserPhone: string (nullable = true)
{noformat}
There is NO column with a datatype other than String. There was previously an inferred column of type long, which was dropped:

DataFrame allDataJoined = whiteEventJoinedWithReference.
    drop(rliDataFrame.col("occurredAtDate"));

allDataJoined.printSchema() : output above ^^

Now allDataJoined.show() throws the following exception:

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
    at scala.math.Ordering$Int$.compare(Ordering.scala:256)
    at scala.math.Ordering$class.gt(Ordering.scala:97)
    at scala.math.Ordering$Int$.gt(Ordering.scala:256)
    at org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
    at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
    at org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
    at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
    at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
    at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
    at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
[jira] [Created] (SPARK-10623) turning on predicate pushdown throws NoSuchElementException when RDD is empty
Ram Sriharsha created SPARK-10623:
----------------------------------

             Summary: turning on predicate pushdown throws NoSuchElementException when RDD is empty
                 Key: SPARK-10623
                 URL: https://issues.apache.org/jira/browse/SPARK-10623
             Project: Spark
          Issue Type: Bug
            Reporter: Ram Sriharsha

Turning on predicate pushdown for ORC datasources results in a NoSuchElementException:

scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
df: org.apache.spark.sql.DataFrame = [name: string]

scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

scala> df.explain
== Physical Plan ==
java.util.NoSuchElementException

Disabling the pushdown makes things work again:

scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")

scala> df.explain
== Physical Plan ==
Project [name#6]
 Filter (age#7 < 15)
  Scan OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10623) turning on predicate pushdown throws NoSuchElementException when RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-10623:
----------------------------------
    Assignee: Zhan Zhang

> turning on predicate pushdown throws NoSuchElementException when RDD is empty
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-10623
>                 URL: https://issues.apache.org/jira/browse/SPARK-10623
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Ram Sriharsha
>            Assignee: Zhan Zhang
>
> Turning on predicate pushdown for ORC datasources results in a NoSuchElementException:
> scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
> df: org.apache.spark.sql.DataFrame = [name: string]
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> scala> df.explain
> == Physical Plan ==
> java.util.NoSuchElementException
> Disabling the pushdown makes things work again:
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")
> scala> df.explain
> == Physical Plan ==
> Project [name#6]
>  Filter (age#7 < 15)
>   Scan OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
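For readers following along, here is a self-contained sketch of the repro above. The write step and the /tmp/people path are assumptions added so the snippet runs end to end; the report itself starts from an existing "people" ORC table:
{code}
// Minimal sketch of SPARK-10623, assuming a Spark 1.5 shell with Hive support.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

// Assumed setup: write a tiny ORC table so the filter below matches no rows.
Seq(("alice", 20), ("bob", 30)).toDF("name", "age")
  .write.format("orc").save("/tmp/people")
sqlContext.read.format("orc").load("/tmp/people").registerTempTable("people")

val df = sqlContext.sql("SELECT name FROM people WHERE age < 15") // empty result

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
df.explain() // java.util.NoSuchElementException, as reported

sqlContext.setConf("spark.sql.orc.filterPushdown", "false") // the default
df.explain() // prints the Project / Filter / Scan plan
{code}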
[jira] [Commented] (SPARK-9670) Examples: Check for new APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741196#comment-14741196 ] Ram Sriharsha commented on SPARK-9670: -- Hi Joseph, I don't have any more items to add to this list. But SPARK-7546 is still open. I am not happy with the example I have there, will work on it a bit and close it out for the next release. For now can we keep this open since it has SPARK-7546 as a dependency? > Examples: Check for new APIs requiring example code > --- > > Key: SPARK-9670 > URL: https://issues.apache.org/jira/browse/SPARK-9670 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Ram Sriharsha >Priority: Minor > > Audit list of new features added to MLlib, and see which major items are > missing example code (in the examples folder). We do not need examples for > everything, only for major items such as new ML algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10251) Some internal spark classes are not registered with kryo
[ https://issues.apache.org/jira/browse/SPARK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha reassigned SPARK-10251:
-------------------------------------
    Assignee: Ram Sriharsha

> Some internal spark classes are not registered with kryo
> ---------------------------------------------------------
>
>                 Key: SPARK-10251
>                 URL: https://issues.apache.org/jira/browse/SPARK-10251
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>            Reporter: Soren Macbeth
>            Assignee: Ram Sriharsha
>
> When running a job using kryo serialization and setting `spark.kryo.registrationRequired=true` some internal classes are not registered, causing the job to die. This is still a problem when this setting is false (which is the default) because it makes the space required to store serialized objects in memory or disk much much more expensive in terms of runtime and storage space.
> {code}
> 15/08/25 20:28:21 WARN spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, a.b.c.d): java.lang.IllegalArgumentException: Class is not registered: scala.Tuple2[]
> Note: To register this class use: kryo.register(scala.Tuple2[].class);
>     at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
>     at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>     at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
>     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
>     at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10251) Some internal spark classes are not registered with kryo
[ https://issues.apache.org/jira/browse/SPARK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14714161#comment-14714161 ] Ram Sriharsha commented on SPARK-10251: --- as far as I can see, this happens from spark 1.2 onward. haven't gone back yet to see if this was present before spark 1.2 a temporary workaround is to register the necessary classes manually by setting the following conf property: --conf spark.kryo.classesToRegister = [Lscala.Tuple2; I'm looking into a better solution now Some internal spark classes are not registered with kryo Key: SPARK-10251 URL: https://issues.apache.org/jira/browse/SPARK-10251 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Soren Macbeth Assignee: Ram Sriharsha When running a job using kryo serialization and setting `spark.kryo.registrationRequired=true` some internal classes are not registered, causing the job to die. This is still a problem when this setting is false (which is the default) because it makes the space required to store serialized objects in memory or disk much much more expensive in terms of runtime and storage space. {code} 15/08/25 20:28:21 WARN spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, a.b.c.d): java.lang.IllegalArgumentException: Class is not registered: scala.Tuple2[] Note: To register this class use: kryo.register(scala.Tuple2[].class); at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10251) Some internal spark classes are not registered with kryo
[ https://issues.apache.org/jira/browse/SPARK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14714161#comment-14714161 ] Ram Sriharsha edited comment on SPARK-10251 at 8/26/15 3:36 PM: as far as I can see, this happens from spark 1.2 onward. haven't gone back yet to see if this was present before spark 1.2 a temporary workaround is to register the necessary classes manually by setting the following conf property: --conf spark.kryo.classesToRegister = [Lscala.Tuple2; I'm looking into a better solution now: one option is to automatically register such classes (i.e. arrays of tuples, lists of tuples, None, etc) was (Author: rams): as far as I can see, this happens from spark 1.2 onward. haven't gone back yet to see if this was present before spark 1.2 a temporary workaround is to register the necessary classes manually by setting the following conf property: --conf spark.kryo.classesToRegister = [Lscala.Tuple2; I'm looking into a better solution now Some internal spark classes are not registered with kryo Key: SPARK-10251 URL: https://issues.apache.org/jira/browse/SPARK-10251 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Soren Macbeth Assignee: Ram Sriharsha When running a job using kryo serialization and setting `spark.kryo.registrationRequired=true` some internal classes are not registered, causing the job to die. This is still a problem when this setting is false (which is the default) because it makes the space required to store serialized objects in memory or disk much much more expensive in terms of runtime and storage space. {code} 15/08/25 20:28:21 WARN spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, a.b.c.d): java.lang.IllegalArgumentException: Class is not registered: scala.Tuple2[] Note: To register this class use: kryo.register(scala.Tuple2[].class); at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
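A minimal sketch of the workaround described in the comments above, assuming Spark 1.4.x. Registering the array-of-Tuple2 class through SparkConf is the programmatic equivalent of passing --conf spark.kryo.classesToRegister=[Lscala.Tuple2; on the command line:
{code}
// Sketch only; the app name is an illustrative placeholder.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-registration-workaround")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // classOf[Array[(Any, Any)]] resolves to the [Lscala.Tuple2; class
  // named in the exception message.
  .registerKryoClasses(Array(classOf[Array[(Any, Any)]]))

val sc = new SparkContext(conf)
{code}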
[jira] [Commented] (SPARK-9670) ML 1.5 QA: Examples: Check for new APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694756#comment-14694756 ] Ram Sriharsha commented on SPARK-9670: -- Hey Yuhao There is already a JIRA to add a complex pipeline example. You can use the same JIRA. https://github.com/apache/spark/pull/6654 What do you have in mind? ML 1.5 QA: Examples: Check for new APIs requiring example code -- Key: SPARK-9670 URL: https://issues.apache.org/jira/browse/SPARK-9670 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Joseph K. Bradley Assignee: Ram Sriharsha Priority: Minor Audit list of new features added to MLlib, and see which major items are missing example code (in the examples folder). We do not need examples for everything, only for major items such as new ML algorithms. For any such items: * Create a JIRA for that feature, and assign it to the author of the feature (or yourself if interested). * Link it to (a) the original JIRA which introduced that feature (related to) and (b) to this JIRA (requires). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
[ https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha updated SPARK-7690: - Assignee: Eron Wright (was: Ram Sriharsha) MulticlassClassificationEvaluator for tuning Multiclass Classifiers --- Key: SPARK-7690 URL: https://issues.apache.org/jira/browse/SPARK-7690 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Eron Wright Provide a MulticlassClassificationEvaluator with weighted F1-score to tune multiclass classifiers using Pipeline API. MLLib already provides a MulticlassMetrics functionality which can be wrapped around a MulticlassClassificationEvaluator to expose weighted F1-score as metric. The functionality could be similar to scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) in that we can support micro, macro and weighted versions of the F1-score (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
[ https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha updated SPARK-7690: - Shepherd: Ram Sriharsha MulticlassClassificationEvaluator for tuning Multiclass Classifiers --- Key: SPARK-7690 URL: https://issues.apache.org/jira/browse/SPARK-7690 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Eron Wright Provide a MulticlassClassificationEvaluator with weighted F1-score to tune multiclass classifiers using Pipeline API. MLLib already provides a MulticlassMetrics functionality which can be wrapped around a MulticlassClassificationEvaluator to expose weighted F1-score as metric. The functionality could be similar to scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) in that we can support micro, macro and weighted versions of the F1-score (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7546) Example code for ML Pipelines feature transformations
[ https://issues.apache.org/jira/browse/SPARK-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha updated SPARK-7546: - Target Version/s: 1.5.0 (was: 1.4.0) Example code for ML Pipelines feature transformations - Key: SPARK-7546 URL: https://issues.apache.org/jira/browse/SPARK-7546 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Ram Sriharsha This should be added for Scala, Java, and Python. It should cover ML Pipelines using a complex series of feature transformations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8092) OneVsRest doesn't allow flexibility in label/feature column renaming
[ https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-8092:
---------------------------------
    Fix Version/s: 1.4.1

> OneVsRest doesn't allow flexibility in label/feature column renaming
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8092
>                 URL: https://issues.apache.org/jira/browse/SPARK-8092
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Ram Sriharsha
>            Assignee: Ram Sriharsha
>             Fix For: 1.4.1

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8092) OneVsRest doesn't allow flexibility in label/feature column renaming
[ https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Sriharsha updated SPARK-8092:
---------------------------------
    Component/s: ML

> OneVsRest doesn't allow flexibility in label/feature column renaming
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8092
>                 URL: https://issues.apache.org/jira/browse/SPARK-8092
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Ram Sriharsha
>            Assignee: Ram Sriharsha

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7546) Example code for ML Pipelines feature transformations
[ https://issues.apache.org/jira/browse/SPARK-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-7546: Assignee: Ram Sriharsha Example code for ML Pipelines feature transformations - Key: SPARK-7546 URL: https://issues.apache.org/jira/browse/SPARK-7546 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Ram Sriharsha This should be added for Scala, Java, and Python. It should cover ML Pipelines using a complex series of feature transformations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6013) Add more Python ML examples for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565819#comment-14565819 ] Ram Sriharsha commented on SPARK-6013: -- Cross Validator Example is covered as part of this PR: https://issues.apache.org/jira/browse/SPARK-7387 Add more Python ML examples for spark.ml Key: SPARK-6013 URL: https://issues.apache.org/jira/browse/SPARK-6013 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Ram Sriharsha Now that the spark.ml Pipelines API is supported within Python, we should duplicate the remaining Scala/Java spark.ml examples within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7882) HBase Input Format Example does not allow passing ZK parent node
Ram Sriharsha created SPARK-7882: Summary: HBase Input Format Example does not allow passing ZK parent node Key: SPARK-7882 URL: https://issues.apache.org/jira/browse/SPARK-7882 Project: Spark Issue Type: Bug Reporter: Ram Sriharsha Assignee: Ram Sriharsha Priority: Minor HBase Input Format example here: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L52 precludes passing a fourth parameter (zk.node.parent) even though down the line there is code checking for a possible fourth parameter and interpreting it as zk.node.parent here : https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L71 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7861) Python wrapper for OneVsRest
Ram Sriharsha created SPARK-7861: Summary: Python wrapper for OneVsRest Key: SPARK-7861 URL: https://issues.apache.org/jira/browse/SPARK-7861 Project: Spark Issue Type: Improvement Reporter: Ram Sriharsha Assignee: Ram Sriharsha -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7833) Add python wrapper for RegressionEvaluator
Ram Sriharsha created SPARK-7833: Summary: Add python wrapper for RegressionEvaluator Key: SPARK-7833 URL: https://issues.apache.org/jira/browse/SPARK-7833 Project: Spark Issue Type: Improvement Reporter: Ram Sriharsha Assignee: Ram Sriharsha Add a python wrapper for RegressionEvaluator in the ML Pipeline -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6013) Add more Python ML examples for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-6013: Assignee: Ram Sriharsha Add more Python ML examples for spark.ml Key: SPARK-6013 URL: https://issues.apache.org/jira/browse/SPARK-6013 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Ram Sriharsha Now that the spark.ml Pipelines API is supported within Python, we should duplicate the remaining Scala/Java spark.ml examples within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555326#comment-14555326 ] Ram Sriharsha commented on SPARK-7404: -- ah perfect, didn't notice RegressionMetrics in codebase. that is great! Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Ram Sriharsha This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-7404: Assignee: Ram Sriharsha Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Ram Sriharsha This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555285#comment-14555285 ] Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:35 PM: scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. was (Author: rams): sickout learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Ram Sriharsha This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555285#comment-14555285 ] Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:36 PM: scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural first metrics to make available via the Evaluator. was (Author: rams): scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Ram Sriharsha This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555285#comment-14555285 ] Ram Sriharsha commented on SPARK-7404: -- sickout learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Ram Sriharsha This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
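A sketch of the two metrics named in the comment above, using the existing MLlib RegressionMetrics that the proposed RegressionEvaluator could wrap; predictionAndObservations and its sample values are illustrative assumptions:
{code}
// RegressionMetrics takes an RDD[(Double, Double)] of (prediction, observation) pairs.
import org.apache.spark.mllib.evaluation.RegressionMetrics

val predictionAndObservations = sc.parallelize(Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0)))
val metrics = new RegressionMetrics(predictionAndObservations)
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R2   = ${metrics.r2}")
{code}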
[jira] [Assigned] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
[ https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-7690: Assignee: Ram Sriharsha MulticlassClassificationEvaluator for tuning Multiclass Classifiers --- Key: SPARK-7690 URL: https://issues.apache.org/jira/browse/SPARK-7690 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Provide a MulticlassClassificationEvaluator with weighted F1-score to tune multiclass classifiers using Pipeline API. MLLib already provides a MulticlassMetrics functionality which can be wrapped around a MulticlassClassificationEvaluator to expose weighted F1-score as metric. The functionality could be similar to scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) in that we can support micro, macro and weighted versions of the F1-score (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
Ram Sriharsha created SPARK-7690: Summary: MulticlassClassificationEvaluator for tuning Multiclass Classifiers Key: SPARK-7690 URL: https://issues.apache.org/jira/browse/SPARK-7690 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Provide a MulticlassClassificationEvaluator with weighted F1-score to tune multiclass classifiers using Pipeline API. MLLib already provides a MulticlassMetrics functionality which can be wrapped around a MulticlassClassificationEvaluator to expose weighted F1-score as metric. The functionality could be similar to scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) in that we can support micro, macro and weighted versions of the F1-score (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
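A sketch of what the proposed evaluator could wrap, not a final API: MLlib's existing MulticlassMetrics already exposes the weighted F1-score mentioned in the description. predictionAndLabels and its sample values are illustrative assumptions:
{code}
// MulticlassMetrics takes an RDD[(Double, Double)] of (prediction, label) pairs
// produced by any multiclass classifier.
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val predictionAndLabels = sc.parallelize(Seq((0.0, 0.0), (1.0, 1.0), (2.0, 1.0)))
val metrics = new MulticlassMetrics(predictionAndLabels)
println(s"weighted F1 = ${metrics.weightedFMeasure}") // the proposed default
println(s"F1 for class 1.0 = ${metrics.fMeasure(1.0)}") // per-class variant
{code}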
[jira] [Created] (SPARK-7460) Provide DataFrame.zip (analog of RDD.zip) to merge two data frames
Ram Sriharsha created SPARK-7460: Summary: Provide DataFrame.zip (analog of RDD.zip) to merge two data frames Key: SPARK-7460 URL: https://issues.apache.org/jira/browse/SPARK-7460 Project: Spark Issue Type: Sub-task Reporter: Ram Sriharsha Assignee: Ram Sriharsha Priority: Minor an analog of RDD1.zip(RDD2) for data frames allows us to merge two data frames without stepping down to the RDD layer and back. (syntactic sugar) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
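A sketch of the RDD-layer detour that the proposed DataFrame.zip would hide, assuming df1, df2, and sqlContext are in scope and the two frames satisfy RDD.zip's contract (same number of partitions and same number of rows per partition):
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val zippedRows = df1.rdd.zip(df2.rdd).map { case (left, right) =>
  Row.merge(left, right) // concatenate the fields of the two rows
}
val mergedSchema = StructType(df1.schema.fields ++ df2.schema.fields)
val merged = sqlContext.createDataFrame(zippedRows, mergedSchema)
{code}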
[jira] [Commented] (SPARK-5866) pyspark read from s3
[ https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528686#comment-14528686 ]

Ram Sriharsha commented on SPARK-5866:
--------------------------------------

I'm not sure how the scala version is working; the exception suggests it's looking for a path with protocol s3, when it should be s3n:// (The NativeS3FileSystem scheme is s3n)

> pyspark read from s3
> --------------------
>
>                 Key: SPARK-5866
>                 URL: https://issues.apache.org/jira/browse/SPARK-5866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>        Environment: mac OSx and ec2 ubuntu
>            Reporter: venu k tangirala
>
> I am trying to read data from s3 via pyspark. I gave the credentials with:
> sc = SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
> I also tried setting the credentials in core-site.xml, placed in the conf/ dir.
> Interestingly, the same works with the scala version of spark, both by setting the s3 key and secret key in scala code and also by setting it in core-site.xml.
> The pySpark error is as follows:
>   File "/Users/myname/path/./spark_json.py", line 55, in <module>
>     vals_table = sqlContext.inferSchema(values)
>   File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
>     first = rdd.first()
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
>     rs = self.take(1)
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
>     totalParts = self._jrdd.partitions().size()
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
>     self.target_id, self.name)
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
>     format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
> : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/_1417479684
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
>     at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
>     at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>     at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>     at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
>     at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:724)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
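A sketch of the scheme mismatch described in the comment above, assuming key and secretKey are placeholder credential strings: the credentials are set for NativeS3FileSystem, whose registered scheme is s3n, so the path should start with s3n:// rather than s3://.
{code}
// Credentials registered under the s3n scheme (NativeS3FileSystem).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", key)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)

// s3://bucketName/... resolves against a different filesystem implementation;
// with the configuration above, the path from the report should instead be:
val files = sc.wholeTextFiles("s3n://bucketName/pathS3/_1417479684")
{code}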
[jira] [Comment Edited] (SPARK-5866) pyspark read from s3
[ https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528686#comment-14528686 ]

Ram Sriharsha edited comment on SPARK-5866 at 5/5/15 3:44 PM:
--------------------------------------------------------------

I'm not sure how the scala version is working; the exception suggests it's looking for a path with scheme s3, when it should be s3n (The NativeS3FileSystem scheme is s3n)


was (Author: rams):
I'm not sure how the scala version is working; the exception suggests it's looking for a path with protocol s3, when it should be s3n:// (The NativeS3FileSystem scheme is s3n)

> pyspark read from s3
> --------------------
>
>                 Key: SPARK-5866
>                 URL: https://issues.apache.org/jira/browse/SPARK-5866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>        Environment: mac OSx and ec2 ubuntu
>            Reporter: venu k tangirala
>
> I am trying to read data from s3 via pyspark. I gave the credentials with:
> sc = SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
> I also tried setting the credentials in core-site.xml, placed in the conf/ dir.
> Interestingly, the same works with the scala version of spark, both by setting the s3 key and secret key in scala code and also by setting it in core-site.xml.
> The pySpark error is as follows:
>   File "/Users/myname/path/./spark_json.py", line 55, in <module>
>     vals_table = sqlContext.inferSchema(values)
>   File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
>     first = rdd.first()
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
>     rs = self.take(1)
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
>     totalParts = self._jrdd.partitions().size()
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
>     self.target_id, self.name)
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
>     format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
> : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/_1417479684
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
>     at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
>     at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>     at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>     at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
>     at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:724)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5866) pyspark read from s3
[ https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528686#comment-14528686 ]

Ram Sriharsha edited comment on SPARK-5866 at 5/5/15 3:45 PM:
--------------------------------------------------------------

I'm not sure how the Scala version is working; the exception suggests it is looking for a path with scheme s3, when it should be s3n (the NativeS3FileSystem scheme is s3n). As far as I can see, you have a typo in your path, and this is not a bug.

was (Author: rams):
I'm not sure how the Scala version is working; the exception suggests it is looking for a path with scheme s3, when it should be s3n (the NativeS3FileSystem scheme is s3n).

> pyspark read from s3
> --------------------
>
>                 Key: SPARK-5866
>                 URL: https://issues.apache.org/jira/browse/SPARK-5866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>         Environment: Mac OS X and EC2 Ubuntu
>            Reporter: venu k tangirala
>
> I am trying to read data from S3 via PySpark. I set the credentials with:
> sc = SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
> I also tried setting the credentials in core-site.xml, placed in the conf/ dir.
> Interestingly, the same works with the Scala version of Spark, both by setting the S3 key and secret key in Scala code and by setting them in core-site.xml.
> The PySpark error is as follows:
> File "/Users/myname/path/./spark_json.py", line 55, in <module>
>     vals_table = sqlContext.inferSchema(values)
> File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
>     first = rdd.first()
> File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
>     rs = self.take(1)
> File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
>     totalParts = self._jrdd.partitions().size()
> File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
>     self.target_id, self.name)
> File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
>     format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
> : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/_1417479684
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
>         at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
>         at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>         at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>         at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
>         at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:724)
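
The fix Ram is pointing at is a one-scheme change in the path: the credentials above are registered under the fs.s3n.* keys, so the path must use the matching s3n:// scheme. A minimal PySpark sketch of the corrected read, assuming the Spark 1.2-era API (the bucket and path are placeholders taken from the report, the key values are placeholders, and wholeTextFiles is inferred from the WholeTextFileInputFormat in the stack trace):

{code}
from pyspark import SparkContext

sc = SparkContext()

# Placeholder credentials; registered for the "s3n" scheme specifically.
key = "YOUR_ACCESS_KEY"
secret_key = "YOUR_SECRET_KEY"
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)

# The failing call used s3://..., a scheme no credentials were set for;
# s3n:// matches the NativeS3FileSystem that the fs.s3n.* keys configure.
rdd = sc.wholeTextFiles("s3n://bucketName/pathS3/_1417479684")
print(rdd.take(1))
{code}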
[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504197#comment-14504197 ]

Ram Sriharsha commented on SPARK-7015:
--------------------------------------

Sounds good. Let me know what reference you had in mind. I am familiar with Beygelzimer and Langford's error-correcting tournaments (http://hunch.net/~beygel/tournament.pdf), but if you have a better reference in mind, let me know and I can use that as the starting point.

> Multiclass to Binary Reduction
> ------------------------------
>
>                 Key: SPARK-7015
>                 URL: https://issues.apache.org/jira/browse/SPARK-7015
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Ram Sriharsha
>            Assignee: Ram Sriharsha
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta-algorithms. GBDT and SVM today are binary classifiers, and we can implement multiclass classification as One vs. All or All vs. All (or an even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multiclass classification.
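
To make the reduction concrete, here is a minimal one-vs-all sketch in plain Python. The learner interface is hypothetical and purely illustrative; it is not a Spark API, and it is the simplest reduction the JIRA mentions rather than the tournament scheme from the paper above:

{code}
class OneVsAll(object):
    """Multiclass classifier built from binary classifiers (one-vs-all reduction).

    make_learner is a hypothetical factory returning a fresh binary learner
    with fit(X, y01) and score(x) -> float; any such learner would do.
    """

    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.models = {}

    def fit(self, X, y):
        # One binary problem per class: "this class" (1) vs. "everything else" (0).
        for cls in set(y):
            y01 = [1 if label == cls else 0 for label in y]
            model = self.make_learner()
            model.fit(X, y01)
            self.models[cls] = model
        return self

    def predict(self, x):
        # Predict the class whose binary model is most confident.
        return max(self.models, key=lambda cls: self.models[cls].score(x))
{code}

For k classes this trains k binary models; All vs. All would instead train one model per pair of classes and vote, and error-correcting tournaments reduce the number of comparisons further while adding robustness.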
[jira] [Created] (SPARK-7014) Support loading VW formatted data.
Ram Sriharsha created SPARK-7014:
------------------------------------

             Summary: Support loading VW formatted data.
                 Key: SPARK-7014
                 URL: https://issues.apache.org/jira/browse/SPARK-7014
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Ram Sriharsha

Support loading data in VW format. The VW format is fairly widely used and supports namespaces, importance weighting, and multi-label and cost-sensitive multiclass classification formats. We can support it just as we support Avro and CSV formats today. It probably belongs in a new package, say spark-vw, but this JIRA is simply to track and discuss supporting the VW format.
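
For readers unfamiliar with the format: a VW line looks like `1 2.0 'tag |doc the:2 cat |meta length:3`, that is, a label, an optional importance weight and tag, then pipe-separated namespaces of features. A rough Python sketch of a line parser for the common single-label case (the field names and simplifications are mine, not part of any proposed spark-vw API):

{code}
def parse_vw_line(line):
    """Parse one VW-formatted line into (label, importance, tag, features).

    Simplified sketch for the single-label form:
        label [importance] ['tag] |ns feat[:val] ... |ns2 ...
    """
    head, _, rest = line.partition("|")
    head_tokens = head.split()
    label = float(head_tokens[0]) if head_tokens else None
    importance, tag = 1.0, None
    for tok in head_tokens[1:]:
        if tok.startswith("'"):
            tag = tok[1:]            # example tag
        else:
            importance = float(tok)  # importance weight
    features = {}
    for chunk in rest.split("|"):
        tokens = chunk.split()
        if not tokens:
            continue
        # The namespace name, when present, follows '|' with no space.
        has_ns = not chunk.startswith(" ")
        ns = tokens[0] if has_ns else ""
        for feat in (tokens[1:] if has_ns else tokens):
            name, _, val = feat.partition(":")
            features[(ns, name)] = float(val) if val else 1.0
    return label, importance, tag, features
{code}

For example, parse_vw_line("1 2.0 'ex |doc the:2 cat") yields (1.0, 2.0, 'ex', {('doc', 'the'): 2.0, ('doc', 'cat'): 1.0}); the cost-sensitive multiclass label forms mentioned in the description would need a richer label parser than this sketch.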
[jira] [Created] (SPARK-7015) Multiclass to Binary Reduction
Ram Sriharsha created SPARK-7015:
------------------------------------

             Summary: Multiclass to Binary Reduction
                 Key: SPARK-7015
                 URL: https://issues.apache.org/jira/browse/SPARK-7015
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Ram Sriharsha

With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta-algorithms. GBDT and SVM today are binary classifiers, and we can implement multiclass classification as One vs. All or All vs. All (or an even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multiclass classification.
[jira] [Commented] (SPARK-7014) Support loading VW formatted data.
[ https://issues.apache.org/jira/browse/SPARK-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503476#comment-14503476 ]

Ram Sriharsha commented on SPARK-7014:
--------------------------------------

Good point, the priority is minor. The plan is to have it as a separate library, like spark-afro for example. It would be good to know who else is using VW formatted data; maybe there isn't enough usage to warrant a parser.

> Support loading VW formatted data.
> ----------------------------------
>
>                 Key: SPARK-7014
>                 URL: https://issues.apache.org/jira/browse/SPARK-7014
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Ram Sriharsha
>            Assignee: Ram Sriharsha
>            Priority: Minor
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support loading data in VW format. The VW format is fairly widely used and supports namespaces, importance weighting, and multi-label and cost-sensitive multiclass classification formats. We can support it just as we support Avro and CSV formats today. It probably belongs in a new package, say spark-vw, but this JIRA is simply to track and discuss supporting the VW format.
[jira] [Comment Edited] (SPARK-7014) Support loading VW formatted data.
[ https://issues.apache.org/jira/browse/SPARK-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503476#comment-14503476 ]

Ram Sriharsha edited comment on SPARK-7014 at 4/20/15 7:36 PM:
---------------------------------------------------------------

Good point, the priority is minor. The plan is to have it as a separate library, like spark-avro for example. It would be good to know who else is using VW formatted data; maybe there isn't enough usage to warrant a parser.

was (Author: rams):
Good point, the priority is minor. The plan is to have it as a separate library, like spark-afro for example. It would be good to know who else is using VW formatted data; maybe there isn't enough usage to warrant a parser.

> Support loading VW formatted data.
> ----------------------------------
>
>                 Key: SPARK-7014
>                 URL: https://issues.apache.org/jira/browse/SPARK-7014
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Ram Sriharsha
>            Assignee: Ram Sriharsha
>            Priority: Minor
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support loading data in VW format. The VW format is fairly widely used and supports namespaces, importance weighting, and multi-label and cost-sensitive multiclass classification formats. We can support it just as we support Avro and CSV formats today. It probably belongs in a new package, say spark-vw, but this JIRA is simply to track and discuss supporting the VW format.
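
If this shipped in the spark-avro mold, usage might look roughly like the sketch below. Everything here is hypothetical: the package name, the format short name, and the output schema are illustrative only, and it assumes the Spark 1.4+ DataFrameReader API rather than anything this JIRA actually delivered.

{code}
# Hypothetical: a spark-vw package exposing VW files through the data source API.
df = sqlContext.read.format("com.example.spark.vw").load("data/examples.vw")

# An illustrative schema such a source might expose:
#  |-- label: double
#  |-- importance: double
#  |-- tag: string
#  |-- features: map of (namespace, feature) -> value
df.printSchema()
{code}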