[jira] [Commented] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-04-21 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253400#comment-15253400
 ] 

Bo Meng commented on SPARK-14840:
-

I am testing against master:
1. I do not think your test is valid; at the very least it should be:
sqlContext.sql("drop table tmp.order")
2. It works fine if you just wrap {{order}} in backticks ({{`}}); without them, it throws an 
exception:
sqlContext.sql("drop table `order`")
I have ignored {{tmp}} here.
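
For anyone hitting the same parser error, a minimal sketch of the backtick workaround described above; the database-qualified form is an assumption here, since only the unqualified name was verified in this comment:

{code}
// The reserved word has to be quoted with backticks so the 1.6-era parser
// reads it as an identifier rather than stopping at the ORDER keyword.
sqlContext.sql("drop table `order`")

// Presumably the same applies to the database-qualified name (untested above):
sqlContext.sql("drop table tmp.`order`")
{code}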

> Cannot drop a table which has the name starting with 'or'
> -
>
> Key: SPARK-14840
> URL: https://issues.apache.org/jira/browse/SPARK-14840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Kwangwoo Kim
>
> sqlContext("drop table tmp.order")  
> The above code produces the following error: 
> 6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
> 16/04/22 14:27:19 INFO ParseDriver: Parse Completed
> 16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected
> tmp.order
> ^
> java.lang.RuntimeException: [1.5] failure: identifier expected
> tmp.order
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at 
> $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line15.$read$$iwC$$iwC$$iwC.(:39)
>   at $line15.$read$$iwC$$iwC.(:41)
>   at $line15.$read$$iwC.(:43)
>   at $line15.$read.(:45)
>   at $line15.$read$.(:49)
>   at $line15.$read$.()
>   at $line15.$eval$.(:7)
>   at $line15.$eval$.()
>   at $line15.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> 

[jira] [Commented] (SPARK-12660) Rewrite except using anti-join

2016-04-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253396#comment-15253396
 ] 

Reynold Xin commented on SPARK-12660:
-

Sure, if it can be done now.


> Rewrite except using anti-join
> --
>
> Key: SPARK-12660
> URL: https://issues.apache.org/jira/browse/SPARK-12660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Similar to SPARK-12656, we can rewrite EXCEPT at the logical level using an 
> anti-join. This way, we can take advantage of all the benefits of the join 
> implementations (e.g. managed memory, code generation, broadcast joins).
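
For readers unfamiliar with the rewrite, a minimal sketch of the equivalence the description relies on, assuming a Spark 2.0-era build where the {{leftanti}} join type exists; {{left}}, {{right}} and the {{id}} column are made-up placeholders, not the actual rewrite rule:

{code}
// Two small DataFrames used only for illustration.
val left  = sqlContext.range(0, 10).toDF("id")
val right = sqlContext.range(5, 15).toDF("id")

// EXCEPT keeps the distinct rows of `left` that have no match in `right`.
val viaExcept = left.except(right)

// The same result expressed as a left anti join, which lets the planner reuse
// the join machinery (managed memory, code generation, broadcast joins).
val viaAntiJoin =
  left.join(right, left("id") === right("id"), "leftanti").distinct()
{code}

A faithful rewrite also has to handle NULL comparison semantics, which is one reason this is only a sketch.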



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12660) Rewrite except using anti-join

2016-04-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253393#comment-15253393
 ] 

Xiao Li commented on SPARK-12660:
-

If nobody has started this, can I take it?

> Rewrite except using anti-join
> --
>
> Key: SPARK-12660
> URL: https://issues.apache.org/jira/browse/SPARK-12660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Similar to SPARK-12656, we can rewrite EXCEPT at the logical level using an 
> anti-join. This way, we can take advantage of all the benefits of the join 
> implementations (e.g. managed memory, code generation, broadcast joins).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12660) Rewrite except using anti-join

2016-04-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253389#comment-15253389
 ] 

Reynold Xin commented on SPARK-12660:
-

cc [~hvanhovell] can we do this now?


> Rewrite except using anti-join
> --
>
> Key: SPARK-12660
> URL: https://issues.apache.org/jira/browse/SPARK-12660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Similar to SPARK-12656, we can rewrite EXCEPT at the logical level using an 
> anti-join. This way, we can take advantage of all the benefits of the join 
> implementations (e.g. managed memory, code generation, broadcast joins).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-04-21 Thread Kwangwoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253384#comment-15253384
 ] 

Kwangwoo Kim commented on SPARK-14840:
--

However, the SQL works successfully in Hive, and there was no problem in 1.4.1. 


> Cannot drop a table which has the name starting with 'or'
> -
>
> Key: SPARK-14840
> URL: https://issues.apache.org/jira/browse/SPARK-14840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Kwangwoo Kim
>
> sqlContext("drop table tmp.order")  
> The above code produces the following error: 
> 6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
> 16/04/22 14:27:19 INFO ParseDriver: Parse Completed
> 16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected
> tmp.order
> ^
> java.lang.RuntimeException: [1.5] failure: identifier expected
> tmp.order
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at 
> $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line15.$read$$iwC$$iwC$$iwC.(:39)
>   at $line15.$read$$iwC$$iwC.(:41)
>   at $line15.$read$$iwC.(:43)
>   at $line15.$read.(:45)
>   at $line15.$read$.(:49)
>   at $line15.$read$.()
>   at $line15.$eval$.(:7)
>   at $line15.$eval$.()
>   at $line15.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>

[jira] [Commented] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-04-21 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253376#comment-15253376
 ] 

Bo Meng commented on SPARK-14840:
-

I think this is because {{order}} is a keyword; please try not to use it.

> Cannot drop a table which has the name starting with 'or'
> -
>
> Key: SPARK-14840
> URL: https://issues.apache.org/jira/browse/SPARK-14840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Kwangwoo Kim
>
> sqlContext("drop table tmp.order")  
> The above code produces the following error: 
> 6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
> 16/04/22 14:27:19 INFO ParseDriver: Parse Completed
> 16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected
> tmp.order
> ^
> java.lang.RuntimeException: [1.5] failure: identifier expected
> tmp.order
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at 
> $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line15.$read$$iwC$$iwC$$iwC.(:39)
>   at $line15.$read$$iwC$$iwC.(:41)
>   at $line15.$read$$iwC.(:43)
>   at $line15.$read.(:45)
>   at $line15.$read$.(:49)
>   at $line15.$read$.()
>   at $line15.$eval$.(:7)
>   at $line15.$eval$.()
>   at $line15.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>   at 

[jira] [Comment Edited] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-04-21 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253376#comment-15253376
 ] 

Bo Meng edited comment on SPARK-14840 at 4/22/16 5:34 AM:
--

I think this is because {{order}} is a keyword; please try not to use it as a table name.


was (Author: bomeng):
I think because {{order}} is a keyword, please try not to use it.

> Cannot drop a table which has the name starting with 'or'
> -
>
> Key: SPARK-14840
> URL: https://issues.apache.org/jira/browse/SPARK-14840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Kwangwoo Kim
>
> sqlContext("drop table tmp.order")  
> The above code produces the following error: 
> 6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
> 16/04/22 14:27:19 INFO ParseDriver: Parse Completed
> 16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected
> tmp.order
> ^
> java.lang.RuntimeException: [1.5] failure: identifier expected
> tmp.order
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at 
> $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line15.$read$$iwC$$iwC$$iwC.(:39)
>   at $line15.$read$$iwC$$iwC.(:41)
>   at $line15.$read$$iwC.(:43)
>   at $line15.$read.(:45)
>   at $line15.$read$.(:49)
>   at $line15.$read$.()
>   at $line15.$eval$.(:7)
>   at $line15.$eval$.()
>   at $line15.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> 

[jira] [Updated] (SPARK-14839) Support for other types as option in OPTIONS clause

2016-04-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14839:
-
Description: 
This was found in https://github.com/apache/spark/pull/12494.

Currently, Spark SQL does not support other types or {{null}} as the value of an 
option. 

For example, 

{code}
CREATE ...
USING csv
OPTIONS (path "your-path", quote null)
{code}

throws the exception below:

{code}
Unsupported SQL statement
== SQL ==
 CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, modelName 
string, comments string, grp string) USING csv OPTIONS (path "your-path", quote 
null)   
org.apache.spark.sql.catalyst.parser.ParseException: 
Unsupported SQL statement
== SQL ==
 CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, modelName 
string, comments string, grp string) USING csv OPTIONS (path "your-path", quote 
null)   
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
...
{code}

Currently, the Scala API accepts options of the types {{String}}, {{Long}}, 
{{Double}} and {{Boolean}}, and the Python API also supports other types. 
Supporting them here as well would let us handle data sources in a consistent way.

It looks like it is okay to provide other types as arguments, just as 
[Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx] does, because the 
[SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] standard 
mentions options as below:

{quote}
An implementation remains conforming even if it provides user op-
tions to process nonconforming SQL language or to process conform-
ing SQL language in a nonconforming manner.
{quote}


  was:
This was found in https://github.com/apache/spark/pull/12494.

Currently, Spark SQL does not support other types and {{null}} as a value of an 
options. 

For example, 

{code}
...
CREATE ...
USING csv
OPTIONS (path "your-path", quote null)
{code}

throws an exception below

{code}
Unsupported SQL statement
== SQL ==
 CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, modelName 
string, comments string, grp string) USING csv OPTIONS (path "your-path", quote 
null)   
org.apache.spark.sql.catalyst.parser.ParseException: 
Unsupported SQL statement
== SQL ==
 CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, modelName 
string, comments string, grp string) USING csv OPTIONS (path "your-path", quote 
null)   
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
...
{code}

Currently, Scala API supports to take options with the types, {{String}}, 
{{Long}}, {{Double}} and {{Boolean}} and Python API also supports other types. 
I think in this way we can support data sources in a consistent way.

It looks it is okay to  to provide other types as arguments just like 
[Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx] because 
[SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] standard 
mentions options as below:

{quote}
An implementation remains conforming even if it provides user op-
tions to process nonconforming SQL language or to process conform-
ing SQL language in a nonconforming manner.
{quote}



> Support for other types as option in OPTIONS clause
> ---
>
> Key: SPARK-14839
> URL: https://issues.apache.org/jira/browse/SPARK-14839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, Spark SQL does not support other types and {{null}} as a 

[jira] [Updated] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-04-21 Thread Kwangwoo Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kwangwoo Kim updated SPARK-14840:
-
Description: 
sqlContext("drop table tmp.order")  

The above code produces the following error: 
6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
16/04/22 14:27:19 INFO ParseDriver: Parse Completed
16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected

tmp.order
^
java.lang.RuntimeException: [1.5] failure: identifier expected

tmp.order
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at 
$line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
at $line15.$read$$iwC$$iwC$$iwC.(:39)
at $line15.$read$$iwC$$iwC.(:41)
at $line15.$read$$iwC.(:43)
at $line15.$read.(:45)
at $line15.$read$.(:49)
at $line15.$read$.()
at $line15.$eval$.(:7)
at $line15.$eval$.()
at $line15.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 

[jira] [Created] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-04-21 Thread Kwangwoo Kim (JIRA)
Kwangwoo Kim created SPARK-14840:


 Summary: Cannot drop a table which has the name starting with 'or'
 Key: SPARK-14840
 URL: https://issues.apache.org/jira/browse/SPARK-14840
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Kwangwoo Kim



sqlContext("drop table tmp.order")  

The above code produces the following error: 
6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
16/04/22 14:27:19 INFO ParseDriver: Parse Completed
16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected

tmp.order
^
java.lang.RuntimeException: [1.5] failure: identifier expected

tmp.order
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at 
$line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
at $line15.$read$$iwC$$iwC$$iwC.(:39)
at $line15.$read$$iwC$$iwC.(:41)
at $line15.$read$$iwC.(:43)
at $line15.$read.(:45)
at $line15.$read$.(:49)
at $line15.$read$.()
at $line15.$eval$.(:7)
at $line15.$eval$.()
at $line15.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

[jira] [Issue Comment Deleted] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2

2016-04-21 Thread Bo Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Meng updated SPARK-14541:

Comment: was deleted

(was: I will try to do it one by one. )

> SQL function: IFNULL, NULLIF, NVL and NVL2
> --
>
> Key: SPARK-14541
> URL: https://issues.apache.org/jira/browse/SPARK-14541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> It would be great to have these SQL functions:
> IFNULL, NULLIF, NVL, NVL2
> The meaning of these functions can be found in the Oracle docs.
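
For reference, a rough sketch of the intended semantics using constructs Spark SQL already has ({{coalesce}} and {{CASE WHEN}}); the table {{t}} and the columns {{a}}, {{b}}, {{c}} are made up, and this is not the proposed implementation:

{code}
// IFNULL(a, b) and NVL(a, b): return b when a is NULL, otherwise a.
// NULLIF(a, b):               return NULL when a = b, otherwise a.
// NVL2(a, b, c):              return b when a is NOT NULL, otherwise c.
sqlContext.sql("""
  SELECT
    coalesce(a, b)                            AS ifnull_like,
    CASE WHEN a = b THEN NULL ELSE a END      AS nullif_like,
    CASE WHEN a IS NOT NULL THEN b ELSE c END AS nvl2_like
  FROM t
""")
{code}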



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14839) Support for other types as option in OPTIONS clause

2016-04-21 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-14839:


 Summary: Support for other types as option in OPTIONS clause
 Key: SPARK-14839
 URL: https://issues.apache.org/jira/browse/SPARK-14839
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


This was found in https://github.com/apache/spark/pull/12494.

Currently, Spark SQL does not support other types or {{null}} as the value of an 
option. 

For example, 

{code}
...
CREATE ...
USING csv
OPTIONS (path "your-path", quote null)
{code}

throws the exception below:

{code}
Unsupported SQL statement
== SQL ==
 CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, modelName 
string, comments string, grp string) USING csv OPTIONS (path "your-path", quote 
null)   
org.apache.spark.sql.catalyst.parser.ParseException: 
Unsupported SQL statement
== SQL ==
 CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, modelName 
string, comments string, grp string) USING csv OPTIONS (path "your-path", quote 
null)   
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
...
{code}

Currently, the Scala API accepts options of the types {{String}}, {{Long}}, 
{{Double}} and {{Boolean}}, and the Python API also supports other types. 
Supporting them here as well would let us handle data sources in a consistent way.

It looks like it is okay to provide other types as arguments, just as 
[Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx] does, because the 
[SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] standard 
mentions options as below:

{quote}
An implementation remains conforming even if it provides user op-
tions to process nonconforming SQL language or to process conform-
ing SQL language in a nonconforming manner.
{quote}
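
For comparison, a small sketch of the typed options the Scala API already accepts, as mentioned above; {{sqlContext}} is assumed to be a Spark 2.0-era context and the option names are only illustrative:

{code}
// DataFrameReader has String, Boolean, Long and Double overloads of option(),
// so non-string values can be passed directly from Scala:
val df = sqlContext.read
  .format("csv")
  .option("header", true)          // Boolean
  .option("maxColumns", 20480L)    // Long
  .option("samplingRatio", 1.0)    // Double (made-up option name)
  .load("your-path")

// This issue asks for the SQL OPTIONS clause to accept the same kinds of
// literals (and null), e.g.:
//   ... USING csv OPTIONS (path "your-path", header true, quote null)
{code}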




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10001:
---
Assignee: Jakob Odersky

> Allow Ctrl-C in spark-shell to kill running job
> ---
>
> Key: SPARK-10001
> URL: https://issues.apache.org/jira/browse/SPARK-10001
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 1.4.1
>Reporter: Cheolsoo Park
>Assignee: Jakob Odersky
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running 
> job and starts a new input line on the prompt. It would be nice if 
> spark-shell also can do that. Otherwise, in case a user submits a job, say he 
> made a mistake, and wants to cancel it, he needs to exit the shell and 
> re-login to continue his work. Re-login can be a pain especially in Spark on 
> yarn, since it takes a while to allocate AM container and initial executors.
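
For readers wondering what the change amounts to, a rough sketch of the idea (this is not the code from the actual patch): install a SIGINT handler in the shell that cancels running jobs instead of exiting. Here {{sc}} is assumed to be the shell's SparkContext, and {{sun.misc.Signal}} is JVM-internal API:

{code}
import sun.misc.{Signal, SignalHandler}

// Route Ctrl-C to job cancellation; a real implementation would fall back to
// the default behaviour (exiting) when no job is running.
Signal.handle(new Signal("INT"), new SignalHandler {
  override def handle(sig: Signal): Unit = sc.cancelAllJobs()
})
{code}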



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10001.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12557
[https://github.com/apache/spark/pull/12557]

> Allow Ctrl-C in spark-shell to kill running job
> ---
>
> Key: SPARK-10001
> URL: https://issues.apache.org/jira/browse/SPARK-10001
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 1.4.1
>Reporter: Cheolsoo Park
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running 
> job and starts a new input line on the prompt. It would be nice if 
> spark-shell also can do that. Otherwise, in case a user submits a job, say he 
> made a mistake, and wants to cancel it, he needs to exit the shell and 
> re-login to continue his work. Re-login can be a pain especially in Spark on 
> yarn, since it takes a while to allocate AM container and initial executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14791) TPCDS Q23B generate different result each time

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14791:


Assignee: Apache Spark  (was: Davies Liu)

> TPCDS Q23B generate different result each time
> --
>
> Key: SPARK-14791
> URL: https://issues.apache.org/jira/browse/SPARK-14791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Blocker
>
> Sometimes the number of rows of some operators will become zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14791) TPCDS Q23B generate different result each time

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253346#comment-15253346
 ] 

Apache Spark commented on SPARK-14791:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12600

> TPCDS Q23B generate different result each time
> --
>
> Key: SPARK-14791
> URL: https://issues.apache.org/jira/browse/SPARK-14791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> Sometimes the number of rows of some operators will become zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14791) TPCDS Q23B generate different result each time

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14791:


Assignee: Davies Liu  (was: Apache Spark)

> TPCDS Q23B generate different result each time
> --
>
> Key: SPARK-14791
> URL: https://issues.apache.org/jira/browse/SPARK-14791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> Sometimes the number of rows of some operators will become zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14791) TPCDS Q23B generate different result each time

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-14791:
--

Assignee: Davies Liu

> TPCDS Q23B generate different result each time
> --
>
> Key: SPARK-14791
> URL: https://issues.apache.org/jira/browse/SPARK-14791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> Sometimes the number of rows of some operators will become zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14835) Remove MetastoreRelation dependency from SQLBuilder

2016-04-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14835.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove MetastoreRelation dependency from SQLBuilder
> ---
>
> Key: SPARK-14835
> URL: https://issues.apache.org/jira/browse/SPARK-14835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14369) Implement preferredLocations() for FileScanRDD

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14369.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12527
[https://github.com/apache/spark/pull/12527]

> Implement preferredLocations() for FileScanRDD
> --
>
> Key: SPARK-14369
> URL: https://issues.apache.org/jira/browse/SPARK-14369
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> Implement {{FileScanRDD.preferredLocations()}} to add locality support for 
> {{HadoopFsRelation}} based data sources.
> We should avoid extra block location related RPC costs for S3, which doesn't 
> provide valid locality information.
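
For background, locality is exposed to the scheduler through the RDD {{getPreferredLocations}} hook; below is a generic sketch of that hook (not the FileScanRDD code), where {{hostsFor}} is a hypothetical helper:

{code}
import scala.reflect.ClassTag
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// The scheduler asks each partition where its data lives and tries to place
// the task there. Returning Nil (e.g. for S3) skips locality entirely and
// avoids block-location RPCs that carry no useful information.
abstract class LocalityAwareRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](parent) {
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    hostsFor(split)

  // Hypothetical helper: resolve the hosts holding this partition's data.
  protected def hostsFor(split: Partition): Seq[String]
}
{code}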



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14838) Skip automatically broadcast a plan when it contains ObjectProducer

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253334#comment-15253334
 ] 

Apache Spark commented on SPARK-14838:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/12599

> Skip automatically broadcast a plan when it contains ObjectProducer
> ---
>
> Key: SPARK-14838
> URL: https://issues.apache.org/jira/browse/SPARK-14838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Spark determines a plan's size to decide whether to broadcast it automatically 
> when doing a join. Since it cannot estimate the size of object types, this 
> mechanism fails, as shown in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull.
>  We should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14838) Skip automatically broadcast a plan when it contains ObjectProducer

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14838:


Assignee: (was: Apache Spark)

> Skip automatically broadcast a plan when it contains ObjectProducer
> ---
>
> Key: SPARK-14838
> URL: https://issues.apache.org/jira/browse/SPARK-14838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Spark determines a plan's size to decide whether to broadcast it automatically 
> when doing a join. Since it cannot estimate the size of object types, this 
> mechanism fails, as shown in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull.
>  We should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14838) Skip automatically broadcast a plan when it contains ObjectProducer

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14838:


Assignee: Apache Spark

> Skip automatically broadcast a plan when it contains ObjectProducer
> ---
>
> Key: SPARK-14838
> URL: https://issues.apache.org/jira/browse/SPARK-14838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Spark determines a plan's size to decide whether to broadcast it automatically 
> when doing a join. Since it cannot estimate the size of object types, this 
> mechanism fails, as shown in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull.
>  We should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14838) Skip automatically broadcast a plan when it contains ObjectProducer

2016-04-21 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-14838:
---

 Summary: Skip automatically broadcast a plan when it contains 
ObjectProducer
 Key: SPARK-14838
 URL: https://issues.apache.org/jira/browse/SPARK-14838
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


Spark determines a plan's size to decide whether to broadcast it automatically 
when doing a join. Since it cannot estimate the size of object types, this 
mechanism fails, as shown in 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull.
We should fix it.
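
Until this is fixed, one possible workaround (only an assumption, not part of this issue's resolution) is to disable size-based auto-broadcast so the planner never tries to estimate such a plan:

{code}
// -1 disables automatic broadcast joins; explicit broadcast() hints still apply.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}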



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14680) Support all datatypes to use VectorizedHashmap in TungstenAggregate

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14680:
---
Assignee: Sameer Agarwal

> Support all datatypes to use VectorizedHashmap in TungstenAggregate
> ---
>
> Key: SPARK-14680
> URL: https://issues.apache.org/jira/browse/SPARK-14680
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14680) Support all datatypes to use VectorizedHashmap in TungstenAggregate

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14680.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12440
[https://github.com/apache/spark/pull/12440]

> Support all datatypes to use VectorizedHashmap in TungstenAggregate
> ---
>
> Key: SPARK-14680
> URL: https://issues.apache.org/jira/browse/SPARK-14680
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14793) Code generation for large complex type exceeds JVM size limit.

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14793:
---
Assignee: Takuya Ueshin

> Code generation for large complex type exceeds JVM size limit.
> --
>
> Key: SPARK-14793
> URL: https://issues.apache.org/jira/browse/SPARK-14793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.0.0
>
>
> Code generation for the complex-type expressions {{CreateArray}}, {{CreateMap}}, 
> {{CreateStruct}} and {{CreateNamedStruct}} exceeds the JVM method size limit for a 
> large number of elements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14793) Code generation for large complex type exceeds JVM size limit.

2016-04-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14793.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12559
[https://github.com/apache/spark/pull/12559]

> Code generation for large complex type exceeds JVM size limit.
> --
>
> Key: SPARK-14793
> URL: https://issues.apache.org/jira/browse/SPARK-14793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
> Fix For: 2.0.0
>
>
> Code generation for the complex-type expressions {{CreateArray}}, {{CreateMap}}, 
> {{CreateStruct}} and {{CreateNamedStruct}} exceeds the JVM method size limit for a 
> large number of elements.
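
To make "a large number of elements" concrete, a hypothetical repro sketch; the width and column names are invented, and the exact threshold depends on the generated code:

{code}
import org.apache.spark.sql.functions.{lit, struct}

// A single CreateNamedStruct with thousands of fields can push the generated
// projection method past the JVM's 64KB bytecode limit on affected versions.
val fields = (1 to 3000).map(i => lit(i).as(s"c$i"))
val df = sqlContext.range(1).select(struct(fields: _*).as("big_struct"))
df.collect()
{code}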



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253285#comment-15253285
 ] 

Felix Cheung commented on SPARK-14831:
--

I'd argue it is more important that they are like the existing R functions. 
Granted, they are not consistent and they don't always match what Spark supports, 
but I think we are expecting a large number of long-time R users, who are very 
familiar with how to call kmeans, to try to use Spark.

However, taking kmeans as an example, these are S4 methods, so it should be possible 
to define them in such a way that they look like R's kmeans by default, for 
example:
{code}
setMethod("kmeans", signature(x = "DataFrame"),
          function(x, centers, iter.max = 10, algorithm = c("random", "k-means||"))
{code}

could be changed, as you later suggested, to (DataFrame followed by formula):
{code}
setMethod("kmeans", signature(data = "DataFrame"),
          function(data, formula = NULL, centers, iter.max = 10,
                   algorithm = c("random", "k-means||"))
{code}


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signature. We can use `ml.kmeans`, 'ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-21 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253281#comment-15253281
 ] 

Jeff Zhang edited comment on SPARK-14834 at 4/22/16 3:43 AM:
-

No, just intend to do it in python annotation decorator. See context here 
https://github.com/apache/spark/pull/10242/files


was (Author: zjffdu):
No, just intend to do it in python side. See context here 
https://github.com/apache/spark/pull/10242/files

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-21 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253281#comment-15253281
 ] 

Jeff Zhang commented on SPARK-14834:


No, just intend to do it in python side. 
https://github.com/apache/spark/pull/10242/files

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-21 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253281#comment-15253281
 ] 

Jeff Zhang edited comment on SPARK-14834 at 4/22/16 3:42 AM:
-

No, just intend to do it in python side. See context here 
https://github.com/apache/spark/pull/10242/files


was (Author: zjffdu):
No, just intend to do it in python side. 
https://github.com/apache/spark/pull/10242/files

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-21 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-14834:
---
Priority: Minor  (was: Major)

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-21 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253278#comment-15253278
 ] 

holdenk commented on SPARK-14834:
-

Just to be clear, this is adding a linter rule, yes?

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14837) Add support in file stream source for reading new files added to subdirs

2016-04-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-14837:
-

 Summary: Add support in file stream source for reading new files 
added to subdirs
 Key: SPARK-14837
 URL: https://issues.apache.org/jira/browse/SPARK-14837
 Project: Spark
  Issue Type: Sub-task
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253264#comment-15253264
 ] 

Apache Spark commented on SPARK-14521:
--

User 'yzhou2001' has created a pull request for this issue:
https://github.com/apache/spark/pull/12598

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details: Spark build from the master branch (Apr 10)
> Dataset: TPC-DS at 200 GB scale, in Parquet format, stored in Hive
> Client: $SPARK_HOME/bin/beeline 
> Query: TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14836) Zip local jars before uploading to distributed cache

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253254#comment-15253254
 ] 

Apache Spark commented on SPARK-14836:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/12597

> Zip local jars before uploading to distributed cache
> 
>
> Key: SPARK-14836
> URL: https://issues.apache.org/jira/browse/SPARK-14836
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently, if neither {{spark.yarn.jars}} nor {{spark.yarn.archive}} is set 
> (the default), the Spark-on-YARN code uploads all the jars in the folder into the 
> distributed cache one by one, which is quite time consuming and very verbose. 
> Instead of uploading the jars separately, this change zips all the jars first 
> and then puts the single archive into the distributed cache.
> This will significantly improve startup speed: on my local machine it saves 
> around 5 seconds during the startup period, let alone on a real cluster. 
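
For illustration only, a minimal sketch of the zipping step described above (not the code from the pull request; the object name and paths are made up). It bundles every local jar into one archive, which could then be uploaded to the distributed cache in a single shot, replacing many small uploads with one:

{code}
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

object ZipLocalJars {
  // Zip every *.jar under jarDir into a single archive file.
  def zipJars(jarDir: File, archive: File): Unit = {
    val out = new ZipOutputStream(new FileOutputStream(archive))
    try {
      val jars = Option(jarDir.listFiles()).getOrElse(Array.empty[File]).filter(_.getName.endsWith(".jar"))
      for (jar <- jars) {
        out.putNextEntry(new ZipEntry(jar.getName))
        val in = new FileInputStream(jar)
        try {
          val buf = new Array[Byte](8192)
          Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
        } finally in.close()
        out.closeEntry()
      }
    } finally out.close()
  }

  def main(args: Array[String]): Unit = {
    zipJars(new File("/opt/spark/jars"), new File("/tmp/spark-libs.zip"))
  }
}
{code}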



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14836) Zip local jars before uploading to distributed cache

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14836:


Assignee: (was: Apache Spark)

> Zip local jars before uploading to distributed cache
> 
>
> Key: SPARK-14836
> URL: https://issues.apache.org/jira/browse/SPARK-14836
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently, if neither {{spark.yarn.jars}} nor {{spark.yarn.archive}} is set 
> (the default), the Spark-on-YARN code uploads all the jars in the folder into the 
> distributed cache one by one, which is quite time consuming and very verbose. 
> Instead of uploading the jars separately, this change zips all the jars first 
> and then puts the single archive into the distributed cache.
> This will significantly improve startup speed: on my local machine it saves 
> around 5 seconds during the startup period, let alone on a real cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14836) Zip local jars before uploading to distributed cache

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14836:


Assignee: Apache Spark

> Zip local jars before uploading to distributed cache
> 
>
> Key: SPARK-14836
> URL: https://issues.apache.org/jira/browse/SPARK-14836
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, if neither {{spark.yarn.jars}} nor {{spark.yarn.archive}} is set 
> (the default), the Spark-on-YARN code uploads all the jars in the folder into the 
> distributed cache one by one, which is quite time consuming and very verbose. 
> Instead of uploading the jars separately, this change zips all the jars first 
> and then puts the single archive into the distributed cache.
> This will significantly improve startup speed: on my local machine it saves 
> around 5 seconds during the startup period, let alone on a real cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14829) Deprecate GLM APIs using SGD

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253251#comment-15253251
 ] 

Apache Spark commented on SPARK-14829:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12596

> Deprecate GLM APIs using SGD
> 
>
> Key: SPARK-14829
> URL: https://issues.apache.org/jira/browse/SPARK-14829
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> I don't know how many times I have heard someone run into issues with 
> LinearRegression or LogisticRegression, only to find that it is because they 
> are using the SGD implementations in spark.mllib.  We should deprecate these 
> SGD APIs in 2.0 to encourage users to use LBFGS and the spark.ml 
> implementations, which are significantly better.
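
As a small illustration of the mechanism only (the object and method names below are made up, not the actual spark.mllib signatures), deprecation in Scala is just the {{@deprecated}} annotation on the SGD entry points, so callers get a compile-time warning pointing them at the preferred implementation:

{code}
object SgdDeprecationSketch {
  @deprecated("Use the LBFGS-based or spark.ml implementation instead", "2.0.0")
  def trainWithSGD(numIterations: Int): Unit =
    println(s"training with SGD for $numIterations iterations (deprecated path)")

  def main(args: Array[String]): Unit = {
    trainWithSGD(100) // compiling this call emits a deprecation warning
  }
}
{code}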



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14829) Deprecate GLM APIs using SGD

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14829:


Assignee: (was: Apache Spark)

> Deprecate GLM APIs using SGD
> 
>
> Key: SPARK-14829
> URL: https://issues.apache.org/jira/browse/SPARK-14829
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> I don't know how many times I have heard someone run into issues with 
> LinearRegression or LogisticRegression, only to find that it is because they 
> are using the SGD implementations in spark.mllib.  We should deprecate these 
> SGD APIs in 2.0 to encourage users to use LBFGS and the spark.ml 
> implementations, which are significantly better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1

2016-04-21 Thread valgrind_girl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253250#comment-15253250
 ] 

valgrind_girl commented on SPARK-11227:
---

We run into the same problem on Spark 1.6.1 (we are using sparkContext.textFile). 
It only occurs with spark-submit, while the same code works fine in spark-shell.
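
For anyone hitting this, one client-side workaround sketch (a guess based on the spark-submit vs. spark-shell difference, not a confirmed fix for this ticket) is to pass the HDFS HA client settings through {{spark.hadoop.*}} so executors can resolve the nameservice even when hdfs-site.xml is not on their classpath. The namenode hosts and paths below are hypothetical:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object HaNameserviceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ha-nameservice-sketch")
      // Standard HDFS HA client settings, forwarded to the Hadoop Configuration
      // via the spark.hadoop.* prefix.
      .set("spark.hadoop.dfs.nameservices", "nameservice1")
      .set("spark.hadoop.dfs.ha.namenodes.nameservice1", "nn1,nn2")
      .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn1", "namenode1.example.com:8020")
      .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn2", "namenode2.example.com:8020")
      .set("spark.hadoop.dfs.client.failover.proxy.provider.nameservice1",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

    val sc = new SparkContext(conf)
    println(sc.textFile("hdfs://nameservice1/tmp/input.txt").count())
    sc.stop()
  }
}
{code}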

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
> 
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: OS: CentOS 6.6
> Memory: 28G
> CPU: 8
> Mesos: 0.22.0
> HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager)
>Reporter: Yuri Saito
>
> When running a jar containing a Spark job on an HDFS HA cluster with Mesos and 
> Spark 1.5.1, the job throws an exception, "java.net.UnknownHostException: 
> nameservice1", and fails.
> I run the following in a terminal:
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> The job then throws the following:
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
> (TID 0, spark003.example.com): java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: nameservice1
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at 
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.UnknownHostException: nameservice1
> ... 41 more
> {code}
> But, I changed from Spark Cluster 1.5.1 to Spark Cluster 1.4.0, then run the 
> 

[jira] [Assigned] (SPARK-14829) Deprecate GLM APIs using SGD

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14829:


Assignee: Apache Spark

> Deprecate GLM APIs using SGD
> 
>
> Key: SPARK-14829
> URL: https://issues.apache.org/jira/browse/SPARK-14829
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> I don't know how many times I have heard someone run into issues with 
> LinearRegression or LogisticRegression, only to find that it is because they 
> are using the SGD implementations in spark.mllib.  We should deprecate these 
> SGD APIs in 2.0 to encourage users to use LBFGS and the spark.ml 
> implementations, which are significantly better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14829) Deprecate GLM APIs using SGD

2016-04-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253245#comment-15253245
 ] 

zhengruifeng commented on SPARK-14829:
--

[~josephkb]  I am working on this.

> Deprecate GLM APIs using SGD
> 
>
> Key: SPARK-14829
> URL: https://issues.apache.org/jira/browse/SPARK-14829
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> I don't know how many times I have heard someone run into issues with 
> LinearRegression or LogisticRegression, only to find that it is because they 
> are using the SGD implementations in spark.mllib.  We should deprecate these 
> SGD APIs in 2.0 to encourage users to use LBFGS and the spark.ml 
> implementations, which are significantly better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14750) Make historyServer refer application log in hdfs

2016-04-21 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253237#comment-15253237
 ] 

SuYan commented on SPARK-14750:
---

 
 
yarn.log-aggregation-enable = true
    // if true, application logs in the container directories are recycled into the aggregated HDFS folder

yarn.log-aggregation.retain-seconds = 259200
    // how long the application logs are retained in the aggregated HDFS folder

yarn.nodemanager.delete.debug-delay-sec = 600
    // for debugging: how long application logs are kept in the NodeManager folder even after they have been recycled into the aggregated folder

yarn.nodemanager.log.retain-seconds = 86400
    // only applies when log aggregation is disabled: how long logs are kept in the NodeManager folder

Eh... I found some improvement points for that PR: I also need to consider the case where log aggregation is disabled while yarn.nodemanager.log.retain-seconds is set to a large value.




> Make historyServer refer application log in hdfs
> 
>
> Key: SPARK-14750
> URL: https://issues.apache.org/jira/browse/SPARK-14750
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: SuYan
>
> Make the history server reference the application logs in HDFS, just like the MR history server does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14804:


Assignee: (was: Apache Spark)

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionRDD, i.e. 
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. When we call vertexRDD.checkpoint, its partitionRDD is checkpointed, so 
> VertexRDD.isCheckpointedAndMaterialized becomes true.
> 2. When we then call vertexRDD.iterator, because isCheckpointed is true it calls 
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD).
>  
> So the returned iterator is an Iterator[ShippableVertexPartition], not the expected 
> Iterator[(VertexId, VD)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253230#comment-15253230
 ] 

Apache Spark commented on SPARK-14804:
--

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/12576

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionRDD, i.e. 
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. When we call vertexRDD.checkpoint, its partitionRDD is checkpointed, so 
> VertexRDD.isCheckpointedAndMaterialized becomes true.
> 2. When we then call vertexRDD.iterator, because isCheckpointed is true it calls 
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD).
>  
> So the returned iterator is an Iterator[ShippableVertexPartition], not the expected 
> Iterator[(VertexId, VD)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14804:


Assignee: Apache Spark

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionRDD, i.e. 
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. When we call vertexRDD.checkpoint, its partitionRDD is checkpointed, so 
> VertexRDD.isCheckpointedAndMaterialized becomes true.
> 2. When we then call vertexRDD.iterator, because isCheckpointed is true it calls 
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD).
>  
> So the returned iterator is an Iterator[ShippableVertexPartition], not the expected 
> Iterator[(VertexId, VD)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253229#comment-15253229
 ] 

Felix Cheung commented on SPARK-14594:
--

Not sure if this was specific to the data types of "the_table", but yes, it 
works when I try:

{code}
> df <- createDataFrame(sqlContext, iris)
> rdd<-SparkR:::toRDD(df)
> gb<-SparkR:::groupByKey(rdd, 1000)
> first(gb)

[[1]]
[1] 4.3

[[2]]
[[2]][[1]]
[1] 3
{code}

Perhaps try:
{code}
t <- table(sqlContext, 'the_table')
printSchema(t)
{code}

and see what it looks like? Also, is "the_table" from the Hive context?


> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-04-21 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253227#comment-15253227
 ] 

Gayathri Murali commented on SPARK-14314:
-

[~mengxr] Yes.

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14159) StringIndexerModel sets output column metadata incorrectly

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253215#comment-15253215
 ] 

Apache Spark commented on SPARK-14159:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12595

> StringIndexerModel sets output column metadata incorrectly
> --
>
> Key: SPARK-14159
> URL: https://issues.apache.org/jira/browse/SPARK-14159
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> StringIndexerModel.transform sets the output column metadata to use the name of 
> inputCol.  It should not.  Fixing this causes a problem with the metadata 
> produced by RFormula.
> Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and 
> I modified VectorAttributeRewriter to find and replace all "prefixes" since 
> attributes collect multiple prefixes from StringIndexer + Interaction.
> Note that "prefixes" is no longer accurate since internal strings may be 
> replaced.
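
For readers unfamiliar with the metadata in question, a small standalone sketch (column names made up) showing where StringIndexerModel attaches its nominal attribute metadata; the attribute name stored there is what this fix corrects:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.StringIndexer

object StringIndexerMetadataSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("indexer-metadata"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq("a", "b", "a", "c")).toDF("category")
    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // The attribute recorded in the *output* column's metadata is what downstream
    // stages such as RFormula read.
    println(Attribute.fromStructField(indexed.schema("categoryIndex")))
    sc.stop()
  }
}
{code}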



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14753) remove internal flag in Accumulable

2016-04-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-14753:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-14626

> remove internal flag in Accumulable
> ---
>
> Key: SPARK-14753
> URL: https://issues.apache.org/jira/browse/SPARK-14753
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14836) Zip local jars before uploading to distributed cache

2016-04-21 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-14836:
---

 Summary: Zip local jars before uploading to distributed cache
 Key: SPARK-14836
 URL: https://issues.apache.org/jira/browse/SPARK-14836
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.0.0
Reporter: Saisai Shao
Priority: Minor


Currently, if neither {{spark.yarn.jars}} nor {{spark.yarn.archive}} is set (the 
default), the Spark-on-YARN code uploads all the jars in the folder into the 
distributed cache one by one, which is quite time consuming and very verbose. 
Instead of uploading the jars separately, this change zips all the jars first and 
then puts the single archive into the distributed cache.

This will significantly improve startup speed: on my local machine it saves around 
5 seconds during the startup period, let alone on a real cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14340) Add Scala Example and Description for ml.BisectingKMeans

2016-04-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253189#comment-15253189
 ] 

zhengruifeng commented on SPARK-14340:
--

Document for BisectingKMeans

> Add Scala Example and Description for ml.BisectingKMeans
> 
>
> Key: SPARK-14340
> URL: https://issues.apache.org/jira/browse/SPARK-14340
> Project: Spark
>  Issue Type: Improvement
>Reporter: zhengruifeng
>Priority: Minor
>
> 1, add BisectingKMeans to ml-clustering.md
> 2, add the missing Scala BisectingKMeansExample



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9612) Add instance weight support for GBTs

2016-04-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253185#comment-15253185
 ] 

Joseph K. Bradley commented on SPARK-9612:
--

Removing target version, but please update as needed [~dbtsai]

> Add instance weight support for GBTs
> 
>
> Key: SPARK-9612
> URL: https://issues.apache.org/jira/browse/SPARK-9612
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>Priority: Minor
>
> GBT support for instance weights could be handled by:
> * sampling data before passing it to trees (see the sketch after this quote)
> * passing weights to trees (requiring weight support for trees first, but 
> probably better in the end)
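
A toy sketch of the first option, emulating instance weights by resampling before training an unweighted learner. This is not GBT code; the data, seeding, and rounding scheme are only for illustration:

{code}
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object WeightedResamplingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("weighted-resampling"))

    // (instance, non-negative weight) pairs: replicate each instance with an
    // expected multiplicity equal to its weight before handing the data to an
    // unweighted tree learner.
    val weighted = sc.parallelize(Seq(("a", 0.5), ("b", 2.3), ("c", 1.0)))
    val resampled = weighted.flatMap { case (instance, weight) =>
      val rng = new Random(instance.hashCode)
      val whole = weight.toInt
      val copies = whole + (if (rng.nextDouble() < weight - whole) 1 else 0)
      Seq.fill(copies)(instance)
    }

    resampled.collect().foreach(println)
    sc.stop()
  }
}
{code}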



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9612) Add instance weight support for GBTs

2016-04-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9612:
-
Target Version/s:   (was: 2.0.0)

> Add instance weight support for GBTs
> 
>
> Key: SPARK-9612
> URL: https://issues.apache.org/jira/browse/SPARK-9612
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>Priority: Minor
>
> GBT support for instance weights could be handled by:
> * sampling data before passing it to trees
> * passing weights to trees (requiring weight support for trees first, but 
> probably better in the end)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2016-04-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253182#comment-15253182
 ] 

Joseph K. Bradley commented on SPARK-10408:
---

I'm going to remove the target version since this won't make 2.0.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2016-04-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10408:
--
Target Version/s:   (was: 2.0.0)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14835) Remove MetastoreRelation dependency from SQLBuilder

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14835:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove MetastoreRelation dependency from SQLBuilder
> ---
>
> Key: SPARK-14835
> URL: https://issues.apache.org/jira/browse/SPARK-14835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14835) Remove MetastoreRelation dependency from SQLBuilder

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253174#comment-15253174
 ] 

Apache Spark commented on SPARK-14835:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12594

> Remove MetastoreRelation dependency from SQLBuilder
> ---
>
> Key: SPARK-14835
> URL: https://issues.apache.org/jira/browse/SPARK-14835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14835) Remove MetastoreRelation dependency from SQLBuilder

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14835:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove MetastoreRelation dependency from SQLBuilder
> ---
>
> Key: SPARK-14835
> URL: https://issues.apache.org/jira/browse/SPARK-14835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14835) Remove MetastoreRelation dependency from SQLBuilder

2016-04-21 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14835:
---

 Summary: Remove MetastoreRelation dependency from SQLBuilder
 Key: SPARK-14835
 URL: https://issues.apache.org/jira/browse/SPARK-14835
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14732) spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253168#comment-15253168
 ] 

Apache Spark commented on SPARK-14732:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12593

> spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian
> 
>
> Key: SPARK-14732
> URL: https://issues.apache.org/jira/browse/SPARK-14732
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> {{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} currently 
> returns the {{MultivariateGaussian}} type from spark.mllib.  We should copy 
> the MultivariateGaussian class into spark.ml to avoid referencing spark.mllib 
> types publicly.
> I'll put it in mllib-local under 
> {{spark.ml.stat.distribution.MultivariateGaussian}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14732) spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14732:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian
> 
>
> Key: SPARK-14732
> URL: https://issues.apache.org/jira/browse/SPARK-14732
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> {{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} currently 
> returns the {{MultivariateGaussian}} type from spark.mllib.  We should copy 
> the MultivariateGaussian class into spark.ml to avoid referencing spark.mllib 
> types publicly.
> I'll put it in mllib-local under 
> {{spark.ml.stat.distribution.MultivariateGaussian}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14732) spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14732:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> spark.ml GaussianMixture should not use spark.mllib MultivariateGaussian
> 
>
> Key: SPARK-14732
> URL: https://issues.apache.org/jira/browse/SPARK-14732
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> {{org.apache.spark.ml.clustering.GaussianMixtureModel.gaussians}} currently 
> returns the {{MultivariateGaussian}} type from spark.mllib.  We should copy 
> the MultivariateGaussian class into spark.ml to avoid referencing spark.mllib 
> types publicly.
> I'll put it in mllib-local under 
> {{spark.ml.stat.distribution.MultivariateGaussian}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-21 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-14834:
---
Summary: Force adding doc for new api in pyspark with @since annotation  
(was: Require doc for new api in pyspark with @since annotation)

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14766) Attribute reference mismatch with Dataset filter + mapPartitions

2016-04-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253165#comment-15253165
 ] 

Wenchen Fan commented on SPARK-14766:
-

Hi [~brkyvz], can you verify if this bug still exists? I can't reproduce it on 
master.

> Attribute reference mismatch with Dataset filter + mapPartitions
> 
>
> Key: SPARK-14766
> URL: https://issues.apache.org/jira/browse/SPARK-14766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> After a filter, the Dataset references seem not to be copied properly, leading 
> to an exception. To reproduce, you may use the following code:
> {code}
> Seq((1, 1)).toDS().filter(_._1 != 0).mapPartitions { iter => iter }.count()
> {code}
> Using explain shows the problem:
> {code}
> == Physical Plan ==
> !MapPartitions , newInstance(class scala.Tuple2), [input[0, 
> scala.Tuple2]._1 AS _1#38521,input[0, scala.Tuple2]._2 AS _2#38522]
> +- WholeStageCodegen
>:  +- Filter .apply
>: +- INPUT
>+- LocalTableScan [_1#38512,_2#38513], [[0,1,1]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2016-04-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9961:
-
Target Version/s:   (was: 2.0.0)

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to use other evaluators or 
> evaluator options.
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11559) Make `runs` no effect in k-means

2016-04-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253162#comment-15253162
 ] 

Joseph K. Bradley commented on SPARK-11559:
---

[~yanboliang] Let's separate out this issue from your current large PR.  Could 
you please send a separate PR to disable "runs?"  Thanks!

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 2.0, we can either remove 
> `runs` or make it have no effect (with warning messages), so we can simplify the 
> implementation. I prefer the latter for better binary compatibility.
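
To illustrate the "no effect" option, a minimal sketch; KMeansSketch is a hypothetical stand-in, not the actual spark.mllib KMeans source. The setter is kept for binary compatibility but ignores the value and warns the caller:

{code}
class KMeansSketch {
  @deprecated("'runs' has no effect since Spark 2.0.0 and will be removed.", "2.0.0")
  def setRuns(runs: Int): this.type = {
    if (runs != 1) {
      // A real implementation would go through Spark's logger instead of stderr.
      Console.err.println("WARN: setRuns() is deprecated and has no effect.")
    }
    this
  }
}
{code}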



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0

2016-04-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13448:
--
Description: 
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: Intercept will not be regularized if users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, because 
it calls the ML LogisticRegression implementation. Meanwhile, if users train without 
regularization, training with or without feature scaling will return the same 
solution at the same convergence rate (because they run the same code path); this 
behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
spark.mllib
* SPARK-14768: Remove expectedType arg for PySpark Param

  was:
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: Intercept will not be regularized if users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, because 
it calls the ML LogisticRegression implementation. Meanwhile, if users train without 
regularization, training with or without feature scaling will return the same 
solution at the same convergence rate (because they run the same code path); this 
behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
spark.mllib
* SPARK-14768: Remove expectedType arg for PySpark Param
** (*pending further discussion*)


> Document MLlib behavior changes in Spark 2.0
> 
>
> Key: SPARK-13448
> URL: https://issues.apache.org/jira/browse/SPARK-13448
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
> remember to add them to the migration guide / release notes.
> * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
> to 1e-6.
> * SPARK-7780: Intercept will not be regularized if users train a binary 
> classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
> because it calls the ML LogisticRegression implementation. Meanwhile, if users 
> train without regularization, training with or without feature scaling will return 
> the same solution at the same convergence rate (because they run the same code 
> path); this behavior is different from the old API.
> * SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
> results
> * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
> default, if checkpointing is being used.
> * SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
> not handle them correctly.
> * SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
> spark.mllib
> * SPARK-14768: Remove expectedType arg for PySpark Param
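
To make the first item in the list concrete, a spark-shell style sketch (not taken from the JIRA) of how a user of the 1.6-era spark.mllib API could pin the old tolerance explicitly if they depend on it; this assumes the public optimizer field on LogisticRegressionWithLBFGS:

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val lr = new LogisticRegressionWithLBFGS()
// Spark 2.0 changes the default convergence tolerance to 1e-6 (SPARK-13429);
// setting 1e-4 explicitly restores the 1.6 behavior.
lr.optimizer.setConvergenceTol(1e-4)
{code}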



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14834) Require doc for new api in pyspark with @since annotation

2016-04-21 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-14834:
--

 Summary: Require doc for new api in pyspark with @since annotation
 Key: SPARK-14834
 URL: https://issues.apache.org/jira/browse/SPARK-14834
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jeff Zhang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-04-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253140#comment-15253140
 ] 

Xiangrui Meng commented on SPARK-14314:
---

Please hold until the naive Bayes one gets merged.

On Thu, Apr 21, 2016, 10:19 AM Gayathri Murali (JIRA) 



> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13330) PYTHONHASHSEED is not propagated to python worker

2016-04-21 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-13330:
---
Summary: PYTHONHASHSEED is not propagated to python worker  (was: 
PYTHONHASHSEED is not propagated to executor)

> PYTHONHASHSEED is not propagated to python worker
> 
>
> Key: SPARK-13330
> URL: https://issues.apache.org/jira/browse/SPARK-13330
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>
> When using Python 3.3, PYTHONHASHSEED is only set in the driver but not 
> propagated to the executors, which causes the following error.
> {noformat}
>   File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
> portable_hash
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {noformat}
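
A common workaround (not taken from this ticket) is to propagate a fixed seed to the Python workers through the executor environment; sketched here with SparkConf, and equivalent to passing --conf spark.executorEnv.PYTHONHASHSEED=0 to spark-submit:

{code}
import org.apache.spark.SparkConf

// Exporting the variable through spark.executorEnv.* makes every Python worker
// see the same PYTHONHASHSEED value as the driver.
val conf = new SparkConf()
  .setAppName("pythonhashseed-workaround")
  .setExecutorEnv("PYTHONHASHSEED", "0")
{code}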



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14833) Refactor StreamTests to test for source fault-tolerance correctly.

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253120#comment-15253120
 ] 

Apache Spark commented on SPARK-14833:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/12592

> Refactor StreamTests to test for source fault-tolerance correctly.
> --
>
> Key: SPARK-14833
> URL: https://issues.apache.org/jira/browse/SPARK-14833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, StreamTest allows testing of a streaming Dataset generated by 
> explicitly wrapping a source. This is different from the actual production code path, where 
> the source object is dynamically created through a DataSource object every 
> time a query is started. So all the fault-tolerance testing in 
> FileSourceSuite and FileSourceStressSuite is not really testing the actual 
> code path as they are just reusing the FileStreamSource object. 
> Instead of maintaining a mapping of source --> expected offset in StreamTest 
> (which requires reuse of source object), it should maintain a mapping of 
> source index --> offset, so that it is independent of the source object. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14833) Refactor StreamTests to test for source fault-tolerance correctly.

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14833:


Assignee: Apache Spark  (was: Tathagata Das)

> Refactor StreamTests to test for source fault-tolerance correctly.
> --
>
> Key: SPARK-14833
> URL: https://issues.apache.org/jira/browse/SPARK-14833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Currently, StreamTest allows testing of a streaming Dataset generated by 
> explicitly wrapping a source. This is different from the actual production code path, where 
> the source object is dynamically created through a DataSource object every 
> time a query is started. So all the fault-tolerance testing in 
> FileSourceSuite and FileSourceStressSuite is not really testing the actual 
> code path as they are just reusing the FileStreamSource object. 
> Instead of maintaining a mapping of source --> expected offset in StreamTest 
> (which requires reuse of source object), it should maintain a mapping of 
> source index --> offset, so that it is independent of the source object. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14833) Refactor StreamTests to test for source fault-tolerance correctly.

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14833:


Assignee: Tathagata Das  (was: Apache Spark)

> Refactor StreamTests to test for source fault-tolerance correctly.
> --
>
> Key: SPARK-14833
> URL: https://issues.apache.org/jira/browse/SPARK-14833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, StreamTest allows testing of a streaming Dataset generated by 
> explicitly wrapping a source. This is different from the actual production code path, where 
> the source object is dynamically created through a DataSource object every 
> time a query is started. So all the fault-tolerance testing in 
> FileSourceSuite and FileSourceStressSuite is not really testing the actual 
> code path as they are just reusing the FileStreamSource object. 
> Instead of maintaining a mapping of source --> expected offset in StreamTest 
> (which requires reuse of source object), it should maintain a mapping of 
> source index --> offset, so that it is independent of the source object. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14833) Refactor StreamTests to test for source fault-tolerance correctly.

2016-04-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-14833:
-

 Summary: Refactor StreamTests to test for source fault-tolerance 
correctly.
 Key: SPARK-14833
 URL: https://issues.apache.org/jira/browse/SPARK-14833
 Project: Spark
  Issue Type: Sub-task
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, StreamTest allows testing of a streaming Dataset generated by 
explicitly wrapping a source. This is different from the actual production code path, where 
the source object is dynamically created through a DataSource object every time 
a query is started. So all the fault-tolerance testing in FileSourceSuite and 
FileSourceStressSuite is not really testing the actual code path as they are 
just reusing the FileStreamSource object. 

Instead of maintaining a mapping of source --> expected offset in StreamTest 
(which requires reuse of source object), it should maintain a mapping of source 
index --> offset, so that it is independent of the source object. 
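
A minimal sketch of the proposed bookkeeping change; the names are hypothetical, and Source and Offset below are simple placeholders for Spark's internal streaming types:

{code}
object StreamTestMappingSketch {
  type Source = AnyRef  // placeholder for the internal streaming Source trait
  type Offset = Long    // placeholder for the internal streaming Offset type

  // Before: keyed by the Source instance, so tests must reuse that exact object.
  val expectedOffsetsBySource = scala.collection.mutable.Map.empty[Source, Offset]

  // After: keyed by the source's index in the query, independent of whichever
  // Source object the DataSource constructs when the query is started.
  val expectedOffsetsByIndex = scala.collection.mutable.Map.empty[Int, Offset]
}
{code}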



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14824) Rename object HiveContext to something else

2016-04-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14824.
-
   Resolution: Fixed
Fix Version/s: 2/

> Rename object HiveContext to something else
> ---
>
> Key: SPARK-14824
> URL: https://issues.apache.org/jira/browse/SPARK-14824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2/
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14824) Rename object HiveContext to something else

2016-04-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14824:

Fix Version/s: (was: 2/)
   2.0.0

> Rename object HiveContext to something else
> ---
>
> Key: SPARK-14824
> URL: https://issues.apache.org/jira/browse/SPARK-14824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14832:


Assignee: Apache Spark  (was: Tathagata Das)

> Refactor DataSource to ensure schema is inferred only once when creating a 
> file stream
> --
>
> Key: SPARK-14832
> URL: https://issues.apache.org/jira/browse/SPARK-14832
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> When creating a file stream using sqlContext.write.stream(), existing files 
> are scanned twice for finding the schema 
> - Once, when creating a DataSource + StreamingRelation in the 
> DataFrameReader.stream()
> - Again, when creating streaming Source from the DataSource, in 
> DataSource.createSource()
> Instead, the schema should be generated only once, at the time of creating 
> the dataframe, and when the streaming source is created, it should just reuse 
> that schema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14832:


Assignee: Tathagata Das  (was: Apache Spark)

> Refactor DataSource to ensure schema is inferred only once when creating a 
> file stream
> --
>
> Key: SPARK-14832
> URL: https://issues.apache.org/jira/browse/SPARK-14832
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When creating a file stream using sqlContext.write.stream(), existing files 
> are scanned twice for finding the schema 
> - Once, when creating a DataSource + StreamingRelation in the 
> DataFrameReader.stream()
> - Again, when creating streaming Source from the DataSource, in 
> DataSource.createSource()
> Instead, the schema should be generated only once, at the time of creating 
> the dataframe, and when the streaming source is created, it should just reuse 
> that schema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253107#comment-15253107
 ] 

Apache Spark commented on SPARK-14832:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/12591

> Refactor DataSource to ensure schema is inferred only once when creating a 
> file stream
> --
>
> Key: SPARK-14832
> URL: https://issues.apache.org/jira/browse/SPARK-14832
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When creating a file stream using sqlContext.write.stream(), existing files 
> are scanned twice for finding the schema 
> - Once, when creating a DataSource + StreamingRelation in the 
> DataFrameReader.stream()
> - Again, when creating streaming Source from the DataSource, in 
> DataSource.createSource()
> Instead, the schema should be generated only once, at the time of creating 
> the dataframe, and when the streaming source is created, it should just reuse 
> that schema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

2016-04-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-14832:
--
Description: 
When creating a file stream using sqlContext.write.stream(), existing files are 
scanned twice for finding the schema 
- Once, when creating a DataSource + StreamingRelation in the 
DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in 
DataSource.createSource()

Instead, the schema should be generated only once, at the time of creating the 
dataframe, and when the streaming source is created, it should just reuse that 
schema

  was:
When creating a file stream using sqlContext.write.stream(), existing files are 
scanned twice for finding the schema 
- Once, when creating a DataSource + StreamingRelation in the 
DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in 
DataSource.createSource()

Instead, the schema should be generated only once, at the time of creating the 
dataframe, and when the streaming source is created, it should just reuse that 
schame


> Refactor DataSource to ensure schema is inferred only once when creating a 
> file stream
> --
>
> Key: SPARK-14832
> URL: https://issues.apache.org/jira/browse/SPARK-14832
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When creating a file stream using sqlContext.write.stream(), existing files 
> are scanned twice for finding the schema 
> - Once, when creating a DataSource + StreamingRelation in the 
> DataFrameReader.stream()
> - Again, when creating streaming Source from the DataSource, in 
> DataSource.createSource()
> Instead, the schema should be generated only once, at the time of creating 
> the dataframe, and when the streaming source is created, it should just reuse 
> that schema
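
A minimal sketch of the idea (DataSourceSketch is a hypothetical name, not the real DataSource class): infer the schema lazily exactly once and let both the relation and the streaming source reuse the cached value.

{code}
import org.apache.spark.sql.types.StructType

class DataSourceSketch(inferSchema: () => StructType) {
  // The expensive file scan runs at most once, on first access.
  lazy val sourceSchema: StructType = inferSchema()

  def relationSchema: StructType = sourceSchema         // used when building the DataFrame
  def streamingSourceSchema: StructType = sourceSchema  // reused when the Source is created
}
{code}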



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14831:
--
Description: 
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(df, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute. But I think it would be 
better to have consistent signatures in SparkR.

cc: [~shivaram] [~josephkb] [~yanboliang]

  was:
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute. But I think it would be 
better to have consistent signatures in SparkR.

cc: [~shivaram] [~josephkb] [~yanboliang]


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

2016-04-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-14832:
-

 Summary: Refactor DataSource to ensure schema is inferred only 
once when creating a file stream
 Key: SPARK-14832
 URL: https://issues.apache.org/jira/browse/SPARK-14832
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


When creating a file stream using sqlContext.write.stream(), existing files are 
scanned twice for finding the schema 
- Once, when creating a DataSource + StreamingRelation in the 
DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in 
DataSource.createSource()

Instead, the schema should be generated only once, at the time of creating the 
dataframe, and when the streaming source is created, it should just reuse that 
schame



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14831:
--
Description: 
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute. But I think it would be 
better to have consistent signatures in SparkR.

cc: [~shivaram] [~josephkb] [~yanboliang]

  was:
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute. But I think it would be 
better to have consistent signatures in SparkR.


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(data, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14716) Add partitioned parquet support file stream sink

2016-04-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-14716:
-

Assignee: Tathagata Das

> Add partitioned parquet support file stream sink
> 
>
> Key: SPARK-14716
> URL: https://issues.apache.org/jira/browse/SPARK-14716
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14555) Python API for methods introduced for Structured Streaming

2016-04-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-14555:
-

Assignee: Burak Yavuz

> Python API for methods introduced for Structured Streaming
> --
>
> Key: SPARK-14555
> URL: https://issues.apache.org/jira/browse/SPARK-14555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Streaming
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.0
>
>
> Methods added for Structured Streaming don't have a Python API yet.
> We need to provide APIs for the new methods in:
>  - DataFrameReader
>  - DataFrameWriter
>  - ContinuousQuery
>  - Trigger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14831:
-

 Summary: Make ML APIs in SparkR consistent
 Key: SPARK-14831
 URL: https://issues.apache.org/jira/browse/SPARK-14831
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute. But I think it would be 
better to have consistent signatures in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2016-04-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253095#comment-15253095
 ] 

Joseph K. Bradley commented on SPARK-12488:
---

Ping [~ilganeli] I hope this is fixed now!

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14802) Disable Passing to Hive the queries that can't be parsed

2016-04-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253093#comment-15253093
 ] 

Reynold Xin commented on SPARK-14802:
-

+1

> Disable Passing to Hive the queries that can't be parsed
> 
>
> Key: SPARK-14802
> URL: https://issues.apache.org/jira/browse/SPARK-14802
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When we hit a query that can't be parsed, we pass it to Hive, and thus we get 
> some strange error messages from Hive. We should disable this after we have 
> integrated the SparkSqlParser & HiveSqlParser.
> For example,  
> {code}
> NoViableAltException(302@[192:1: tableName : (db= identifier DOT tab= 
> identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME 
> $tab) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:116)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45920)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14777) Finally, merge HiveSqlAstBuilder and SparkSqlAstBuilder

2016-04-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14777.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

> Finally, merge HiveSqlAstBuilder and SparkSqlAstBuilder
> ---
>
> Key: SPARK-14777
> URL: https://issues.apache.org/jira/browse/SPARK-14777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14821) Move analyze table parsing into SparkSqlAstBuilder

2016-04-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14821.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Move analyze table parsing into SparkSqlAstBuilder
> --
>
> Key: SPARK-14821
> URL: https://issues.apache.org/jira/browse/SPARK-14821
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14479) GLM supports output link prediction

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14479.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12287
[https://github.com/apache/spark/pull/12287]

> GLM supports output link prediction
> ---
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> In R glm and glmnet, the default type of predict is "link", which is the 
> linear predictor; users can specify "type = response" to output the response 
> prediction. Currently the ML glm predict will output the "response" prediction by 
> default, which I think is more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.
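
A usage sketch of how a link-prediction output could be requested on spark.ml's GeneralizedLinearRegression; the setLinkPredictionCol setter name is an assumption, not confirmed by this ticket:

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setLink("logit")
  .setLinkPredictionCol("linkPrediction")  // assumed setter for the eta = X * beta output

// glr.fit(trainingDf).transform(df) would then carry both "prediction"
// (response scale) and "linkPrediction" (linear-predictor scale) columns.
{code}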



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14479) GLM supports output link prediction

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14479:
--
Assignee: Yanbo Liang

> GLM supports output link prediction
> ---
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> In R glm and glmnet, the default type of predict is "link", which is the 
> linear predictor; users can specify "type = response" to output the response 
> prediction. Currently the ML glm predict will output the "response" prediction by 
> default, which I think is more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14830) Add RemoveRepetitionFromGroupExpressions optimizer

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14830:


Assignee: Apache Spark

> Add RemoveRepetitionFromGroupExpressions optimizer
> --
>
> Key: SPARK-14830
> URL: https://issues.apache.org/jira/browse/SPARK-14830
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to optimize GroupExpressions by removing repeating 
> expressions. 
> **Before**
> {code}
> scala> sql("select a from (select explode(array(1,2)) a) T group by a, a, 
> a").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[a#5,a#5,a#5], functions=[], output=[a#5])
> : +- INPUT
> +- Exchange hashpartitioning(a#5, a#5, a#5, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[a#5,a#5,a#5], functions=[], 
> output=[a#5,a#5,a#5])
>   : +- INPUT
>   +- Generate explode([1,2]), false, false, [a#5]
>  +- Scan OneRowRelation[]
> {code}
> **After**
> {code}
> scala> sql("select a from (select explode(array(1,2)) a) T group by a, a, 
> a").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[a#5], functions=[], output=[a#5])
> : +- INPUT
> +- Exchange hashpartitioning(a#5, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[a#5], functions=[], output=[a#5])
>   : +- INPUT
>   +- Generate explode([1,2]), false, false, [a#5]
>  +- Scan OneRowRelation[]
> {code}
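
A sketch of how such a rule could be written against Catalyst (simplified, and assuming the ExpressionSet utility; the merged rule may differ in details):

{code}
import org.apache.spark.sql.catalyst.expressions.ExpressionSet
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object RemoveRepetitionFromGroupExpressionsSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case a @ Aggregate(grouping, _, _) =>
      // Keep a single copy of each semantically equal grouping expression.
      a.copy(groupingExpressions = ExpressionSet(grouping).toSeq)
  }
}
{code}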



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14830) Add RemoveRepetitionFromGroupExpressions optimizer

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253047#comment-15253047
 ] 

Apache Spark commented on SPARK-14830:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12590

> Add RemoveRepetitionFromGroupExpressions optimizer
> --
>
> Key: SPARK-14830
> URL: https://issues.apache.org/jira/browse/SPARK-14830
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Reporter: Dongjoon Hyun
>
> This issue aims to optimize GroupExpressions by removing repeating 
> expressions. 
> **Before**
> {code}
> scala> sql("select a from (select explode(array(1,2)) a) T group by a, a, 
> a").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[a#5,a#5,a#5], functions=[], output=[a#5])
> : +- INPUT
> +- Exchange hashpartitioning(a#5, a#5, a#5, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[a#5,a#5,a#5], functions=[], 
> output=[a#5,a#5,a#5])
>   : +- INPUT
>   +- Generate explode([1,2]), false, false, [a#5]
>  +- Scan OneRowRelation[]
> {code}
> **After**
> {code}
> scala> sql("select a from (select explode(array(1,2)) a) T group by a, a, 
> a").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[a#5], functions=[], output=[a#5])
> : +- INPUT
> +- Exchange hashpartitioning(a#5, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[a#5], functions=[], output=[a#5])
>   : +- INPUT
>   +- Generate explode([1,2]), false, false, [a#5]
>  +- Scan OneRowRelation[]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14830) Add RemoveRepetitionFromGroupExpressions optimizer

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14830:


Assignee: (was: Apache Spark)

> Add RemoveRepetitionFromGroupExpressions optimizer
> --
>
> Key: SPARK-14830
> URL: https://issues.apache.org/jira/browse/SPARK-14830
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Reporter: Dongjoon Hyun
>
> This issue aims to optimize GroupExpressions by removing repeating 
> expressions. 
> **Before**
> {code}
> scala> sql("select a from (select explode(array(1,2)) a) T group by a, a, 
> a").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[a#5,a#5,a#5], functions=[], output=[a#5])
> : +- INPUT
> +- Exchange hashpartitioning(a#5, a#5, a#5, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[a#5,a#5,a#5], functions=[], 
> output=[a#5,a#5,a#5])
>   : +- INPUT
>   +- Generate explode([1,2]), false, false, [a#5]
>  +- Scan OneRowRelation[]
> {code}
> **After**
> {code}
> scala> sql("select a from (select explode(array(1,2)) a) T group by a, a, 
> a").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[a#5], functions=[], output=[a#5])
> : +- INPUT
> +- Exchange hashpartitioning(a#5, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[a#5], functions=[], output=[a#5])
>   : +- INPUT
>   +- Generate explode([1,2]), false, false, [a#5]
>  +- Scan OneRowRelation[]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14828) Start SparkSession in REPL instead of SQLContext

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14828:


Assignee: Apache Spark  (was: Andrew Or)

> Start SparkSession in REPL instead of SQLContext
> 
>
> Key: SPARK-14828
> URL: https://issues.apache.org/jira/browse/SPARK-14828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14828) Start SparkSession in REPL instead of SQLContext

2016-04-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14828:


Assignee: Andrew Or  (was: Apache Spark)

> Start SparkSession in REPL instead of SQLContext
> 
>
> Key: SPARK-14828
> URL: https://issues.apache.org/jira/browse/SPARK-14828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


