[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-10-15 Thread shijinkui (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649972#comment-16649972
 ] 

shijinkui commented on SPARK-24630:
---

I prefer it without a stream keyword, because the batch and streaming APIs must 
eventually be unified.

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge of streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 
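For orientation only, here is a minimal sketch of what SQL over a stream already looks like with today's Structured Streaming API: a streaming DataFrame is registered as a temp view and queried with spark.sql. The Kafka broker, topic, columns, and checkpoint path are placeholders, and this is not the syntax proposed in the SPIP.

{code}
import org.apache.spark.sql.SparkSession

object StreamingSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sqlstreaming-sketch").getOrCreate()

    // Read a Kafka topic as a streaming DataFrame (broker and topic are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Register the streaming DataFrame so it can be queried with plain SQL.
    events.createOrReplaceTempView("events")

    // An ordinary SQL statement over the streaming view; the proposed SQLStreaming
    // would let users express the whole pipeline in SQL without this Scala glue.
    val counts = spark.sql("SELECT value, count(*) AS cnt FROM events GROUP BY value")

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", "/tmp/sqlstreaming-checkpoint")
      .start()
      .awaitTermination()
  }
}
{code}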






[jira] [Closed] (SPARK-16977) start spark-shell with yarn mode error

2016-08-09 Thread shijinkui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shijinkui closed SPARK-16977.
-
Resolution: Won't Fix

> start spark-shell with yarn mode error
> --
>
> Key: SPARK-16977
> URL: https://issues.apache.org/jira/browse/SPARK-16977
> Project: Spark
>  Issue Type: Temp
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: shijinkui
>
> spark 2.0.0
> yarn 1.7.2
> bin/spark-shell --master yarn
> error:
> 16/08/09 20:27:53 ERROR YarnClientSchedulerBackend: Yarn application has 
> already exited with state FINISHED!
> 16/08/09 20:27:53 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalStateException: Spark context stopped while waiting for 
> backend
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:581)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
> at $line3.$read$$iw$$iw.<init>(<console>:15)
> at $line3.$read$$iw.<init>(<console>:31)
> at $line3.$read.<init>(<console>:33)
> at $line3.$read$.<init>(<console>:37)
> at $line3.$read$.<clinit>(<console>)
> at $line3.$eval$.$print$lzycompute(<console>:7)
> at $line3.$eval$.$print(<console>:6)
> at $line3.$eval.$print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
> at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
> at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
> at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
> at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
> at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
> at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> 

[jira] [Updated] (SPARK-16977) start spark-shell with yarn mode error

2016-08-09 Thread shijinkui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shijinkui updated SPARK-16977:
--
Issue Type: Temp  (was: Bug)

> start spark-shell with yarn mode error
> --
>
> Key: SPARK-16977
> URL: https://issues.apache.org/jira/browse/SPARK-16977
> Project: Spark
>  Issue Type: Temp
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: shijinkui
>
> spark 2.0.0
> yarn 1.7.2
> bin/spark-shell --master yarn
> error:
> 16/08/09 20:27:53 ERROR YarnClientSchedulerBackend: Yarn application has 
> already exited with state FINISHED!
> 16/08/09 20:27:53 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalStateException: Spark context stopped while waiting for 
> backend
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:581)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
> at $line3.$read$$iw$$iw.<init>(<console>:15)
> at $line3.$read$$iw.<init>(<console>:31)
> at $line3.$read.<init>(<console>:33)
> at $line3.$read$.<init>(<console>:37)
> at $line3.$read$.<clinit>(<console>)
> at $line3.$eval$.$print$lzycompute(<console>:7)
> at $line3.$eval$.$print(<console>:6)
> at $line3.$eval.$print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
> at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
> at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
> at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
> at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
> at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
> at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> 

[jira] [Created] (SPARK-16977) start spark-shell with yarn mode error

2016-08-09 Thread shijinkui (JIRA)
shijinkui created SPARK-16977:
-

 Summary: start spark-shell with yarn mode error
 Key: SPARK-16977
 URL: https://issues.apache.org/jira/browse/SPARK-16977
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.0.0
Reporter: shijinkui


spark 2.0.0
yarn 1.7.2
bin/spark-shell --master yarn

error:

16/08/09 20:27:53 ERROR YarnClientSchedulerBackend: Yarn application has 
already exited with state FINISHED!
16/08/09 20:27:53 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
at 
org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:581)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
at $line3.$read$$iw$$iw.<init>(<console>:15)
at $line3.$read$$iw.<init>(<console>:31)
at $line3.$read.<init>(<console>:33)
at $line3.$read$.<init>(<console>:37)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<console>:7)
at $line3.$eval$.$print(<console>:6)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at 
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at 
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at 
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at 
scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
at org.apache.spark.repl.Main$.doMain(Main.scala:68)
at org.apache.spark.repl.Main$.main(Main.scala:51)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at 

[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message

2016-08-09 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413408#comment-15413408
 ] 

shijinkui commented on SPARK-14559:
---

spark: 2.0.0
yarn: 1.7.2

yarn log error:
2016-08-09 19:31:22,061 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
CONTAINER_ALLOCATED at LAUNCHED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:795)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:776)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2016-08-09 19:31:22,654 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1470732309211_0016_02_02 Container Transitioned from ACQUIRED to 
RUNNING

2016-08-09 19:31:28,533 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Null container completed...
2016-08-09 19:31:29,534 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Null container completed...
2016-08-09 19:31:29,545 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Null container completed...





> Netty RPC didn't check channel is active before sending message
> ---
>
> Key: SPARK-14559
> URL: https://issues.apache.org/jira/browse/SPARK-14559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65
>Reporter: cen yuhai
>
> I have a long-running service. After running for several hours, it threw 
> these exceptions. I found that before sending an RPC request by calling the sendRpc 
> method in TransportClient, there is no check of whether the channel is 
> still open or active.
> java.nio.channels.ClosedChannelException
>  4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 5635696155204230556 to 
> bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio.
>   channels.ClosedChannelException
>  4866 java.nio.channels.ClosedChannelException
>  4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 7319486003318455703 to 
> bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio.
>   channels.ClosedChannelException
>  4868 java.nio.channels.ClosedChannelException
>  4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9041854451893215954 to 
> bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio.
>   channels.ClosedChannelException
>  4870 java.nio.channels.ClosedChannelException
>  4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 6046473497871624501 to 
> bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio.  
>   channels.ClosedChannelException
>  4872 java.nio.channels.ClosedChannelException
>  4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9085605650438705047 to 
> bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio.
>   channels.ClosedChannelException
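For illustration, a hedged sketch of the kind of guard the reporter is asking for, assuming access to the connection's underlying Netty io.netty.channel.Channel. The wrapper name and the way the channel is obtained are hypothetical; this is not Spark's actual TransportClient code.

{code}
import io.netty.channel.Channel

// Hypothetical guard: fail fast with a clear error instead of letting the
// write land on a channel that has already been closed.
class CheckedRpcSender(channel: Channel) {
  def sendIfActive(send: () => Unit): Unit = {
    if (channel.isOpen && channel.isActive) {
      send()
    } else {
      // Surface a clear error up front rather than a late ClosedChannelException.
      throw new IllegalStateException(s"Channel to ${channel.remoteAddress()} is closed")
    }
  }
}
{code}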






[jira] [Created] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"

2016-01-21 Thread shijinkui (JIRA)
shijinkui created SPARK-12953:
-

 Summary: RDDRelation write set mode will be better to avoid error 
"pair.parquet already exists"
 Key: SPARK-12953
 URL: https://issues.apache.org/jira/browse/SPARK-12953
 Project: Spark
  Issue Type: Wish
  Components: SQL
Reporter: shijinkui
 Fix For: 1.6.1


An error is thrown if the write mode is not set when running the test case 
`RDDRelation.main()`:

Exception in thread "main" org.apache.spark.sql.AnalysisException: path 
file:/Users/sjk/pair.parquet already exists.;
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329)
at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65)
at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
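
For reference, setting an explicit save mode on the writer avoids this error. Below is a minimal sketch against the Spark 1.x DataFrameWriter API; the output path and the generated data are placeholders, not the example's actual code.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object WriteModeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-mode-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder data, analogous to the (key, value) pairs in RDDRelation.
    val df = sc.parallelize(1 to 100).map(i => (i, s"val_$i")).toDF("key", "value")

    // mode(SaveMode.Overwrite) replaces any existing output instead of failing
    // with "path ... already exists" on a second run.
    df.write.mode(SaveMode.Overwrite).parquet("pair.parquet")

    sc.stop()
  }
}
{code}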








[jira] [Updated] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"

2016-01-21 Thread shijinkui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shijinkui updated SPARK-12953:
--
Component/s: (was: SQL)
 Examples

> RDDRelation write set mode will be better to avoid error "pair.parquet 
> already exists"
> --
>
> Key: SPARK-12953
> URL: https://issues.apache.org/jira/browse/SPARK-12953
> Project: Spark
>  Issue Type: Wish
>  Components: Examples
>Reporter: shijinkui
> Fix For: 1.6.1
>
>
> An error is thrown if the write mode is not set when running the test case 
> `RDDRelation.main()`:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path 
> file:/Users/sjk/pair.parquet already exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329)
>   at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65)
>   at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)






[jira] [Commented] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"

2016-01-21 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111750#comment-15111750
 ] 

shijinkui commented on SPARK-12953:
---

OK, get it.

> RDDRelation write set mode will be better to avoid error "pair.parquet 
> already exists"
> --
>
> Key: SPARK-12953
> URL: https://issues.apache.org/jira/browse/SPARK-12953
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: shijinkui
>Priority: Minor
>
> An error is thrown if the write mode is not set when running the test case 
> `RDDRelation.main()`:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path 
> file:/Users/sjk/pair.parquet already exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329)
>   at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65)
>   at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)






[jira] [Commented] (SPARK-7893) Complex Operators between Graphs

2015-06-10 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581199#comment-14581199
 ] 

shijinkui commented on SPARK-7893:
--

OK

 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Umbrella
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, but few of them operate on 
 two graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs or merging a small graph into a huge 
 graph*_, higher-level operators on graphs can help users focus and think 
 in terms of graphs. Performance optimization can be done internally and stay transparent 
 to them.
 The list of complex graph operators is 
 here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]
 * Union of Graphs ( G ∪ H )
 * Intersection of Graphs ( G ∩ H )
 * Graph Join
 * Difference of Graphs ( G – H )
 * Graph Complement
 * Line Graph ( L(G) )
 This issue will be an index of all these operators.
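For orientation, a hedged sketch of what signatures for a few of these inter-graph operators might look like on top of GraphX. The trait and method names are hypothetical, not an existing API.

{code}
import org.apache.spark.graphx.Graph

// Hypothetical signatures only: each operator takes another graph and yields a new graph,
// with merge functions deciding how overlapping vertex/edge attributes are combined.
trait GraphSetOps[VD, ED] {
  def union(other: Graph[VD, ED],
            mergeV: (VD, VD) => VD,
            mergeE: (ED, ED) => ED): Graph[VD, ED]
  def intersection(other: Graph[VD, ED],
                   mergeV: (VD, VD) => VD,
                   mergeE: (ED, ED) => ED): Graph[VD, ED]
  def difference(other: Graph[VD, ED]): Graph[VD, ED]
  def complement(defaultEdgeAttr: ED): Graph[VD, ED]
}
{code}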






[jira] [Commented] (SPARK-7894) Graph Union Operator

2015-06-10 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581204#comment-14581204
 ] 

shijinkui commented on SPARK-7894:
--

Hey [~ilganeli],
there is no union on EdgeRDD or VertexRDD directly, and the `union` on `RDD` 
can't merge identical elements in one operation.
In a union of graphs, we need to merge the attributes of matching vertices and matching 
edges.

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Sub-task
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The 
 union of two graphs is the union of their vertex sets and their edge 
 families. Vertices and edges which are included in either graph will be part 
 of the new graph.
 bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=600px,align=center!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, vertices and edges will inevitably overlap along the 
 borders of the graphs. For vertices, it's quite natural to just take the union and 
 remove the duplicates. But for edges, a mergeEdges function seems 
 more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]
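As a complement to the interface above, here is a minimal sketch of how such a union could be assembled from existing GraphX building blocks, assuming both a mergeVertices and a mergeEdges function are supplied. It is illustrative only and not the implementation in the attachment.

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

// Union of two graphs: duplicate vertices and duplicate (src, dst) edges are
// collapsed with the supplied merge functions.
def unionGraphs[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeVertices: (VD, VD) => VD,
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  val vertices = g.vertices.union(h.vertices).reduceByKey(mergeVertices)
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}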






[jira] [Updated] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function

2015-05-27 Thread shijinkui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shijinkui updated SPARK-5062:
-
Fix Version/s: 1.3.2

 Pregel use aggregateMessage instead of mapReduceTriplets function
 -

 Key: SPARK-5062
 URL: https://issues.apache.org/jira/browse/SPARK-5062
 Project: Spark
  Issue Type: Wish
  Components: GraphX
Reporter: shijinkui
 Fix For: 1.3.2

 Attachments: graphx_aggreate_msg.jpg


 Since Spark 1.2 introduced aggregateMessages to replace mapReduceTriplets, it 
 has indeed improved performance.
 It's time to replace mapReduceTriplets with aggregateMessages in Pregel.
 We can discuss it.
 I have drawn a diagram of aggregateMessages to show why it can improve 
 performance.
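For context, the call in question is GraphX's aggregateMessages. Below is a small, hedged example of using it to sum each vertex's in-neighbor attributes; the generated graph is just a placeholder.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators

// Sum each vertex's in-neighbor attributes with aggregateMessages,
// the primitive that replaced mapReduceTriplets in Spark 1.2.
def neighborSums(sc: SparkContext): VertexRDD[Double] = {
  // Placeholder graph: a small random graph with Double vertex attributes.
  val graph: Graph[Double, Int] =
    GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices((_, _) => 1.0)

  graph.aggregateMessages[Double](
    sendMsg = ctx => ctx.sendToDst(ctx.srcAttr), // push the source attribute along each edge
    mergeMsg = _ + _                             // sum the messages arriving at each vertex
  )
}
{code}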






[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2015-04-21 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14506383#comment-14506383
 ] 

shijinkui commented on SPARK-6932:
--

Hi @Xiangrui Meng, I have an idea: the training-data tasks could keep a persistent 
connection with the parameter server, using Akka Streams or Netty directly. What do 
you think about this?

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, Spark Core
Reporter: Qiping Li

  h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and
 the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we propose a prototype of a Parameter Server on Spark (PS-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   Careful investigation was done of most existing Parameter Server 
 systems (including [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], and 
 [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
 designed by absorbing the essence of all these systems. 
 * *Prototype of distributed array*
 IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for a distributed array, because in most cases the number of 
 key updates per second is not very high. 
 So we implement a distributed HashMap to store the parameters, which can 
 be easily extended for better performance.
 
 * *Minimal code change*
   Quite a lot of effort is spent on avoiding code changes in Spark core. Tasks 
 which need the parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with the parameter server through a client object, 
 over *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. Finally, 
 each worker pushes the updated parameters to the parameter server. A worker communicates with 
 the parameter server through a parameter server client, which is initialized in 
 the `TaskContext` of that worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We refer to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add `delta` to the parameter indexed by `key`;
 // if there are multiple `delta`s to apply to the same parameter,
 // use `reduceFunc` to reduce these `delta`s first.
 def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
 // update multiple parameters at the same time, using the same `reduceFunc`.
 def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => 
 T): Unit
 
 // advance clock to indicate that current iteration is finished.
 def clock(): Unit
  
 // block until all workers have reached this line of code.
 def sync(): Unit
 {code}
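As an aside, here is a hedged sketch of how a worker partition might drive this interface inside runWithPS; the "weights" key, the LabeledPoint element type, and the computeGradient helper are hypothetical and not part of the proposal above.

{code}
// Hypothetical worker step against the PSClient interface sketched above.
def workerStep(points: Array[LabeledPoint], ps: PSClient): Unit = {
  // pull the current weights from the parameter server
  val weights = ps.get[Array[Double]]("weights")

  // computeGradient is a placeholder for the per-partition gradient computation
  val delta: Array[Double] = computeGradient(points, weights)

  // push the delta; concurrent deltas on the same key are summed element-wise
  ps.update[Array[Double]]("weights", delta, (a, b) => a.zip(b).map { case (x, y) => x + y })

  ps.clock() // this iteration is finished
  ps.sync()  // wait for the other workers before the next pull
}
{code}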
 *`PSContext` provides following functions to use on driver:*
 {code}
 // load parameters from existing rdd.
 def loadPSModel[T](model: RDD[String, T]) 
 // fetch parameters from parameter server to construct model.
 def fetchPSModel[T](keys: Array[String]): Array[T]
 {code} 
 
 *A new function has been added to `RDD` to run parameter server tasks:*
 {code}
 // run the provided `func` on each partition of this RDD. 
 // This function can use data of this partition(the first argument) 
 // and a parameter server client(the second argument). 
 // See the following Logistic Regression for an example.
 def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]

 {code}
 h2. Example
 Here is an example of using our prototype to implement logistic regression:
 {code:title=LogisticRegression.scala|borderStyle=solid}
 def train(
 sc: SparkContext,
 input: RDD[LabeledPoint],
 numIterations: Int,
 stepSize: Double,
 

[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2015-04-15 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497333#comment-14497333
 ] 

shijinkui commented on SPARK-6932:
--

Hi, [~srowen].
This issue is one implementation of SPARK-4590; there may be better 
implementations. It has been pushed to a forked branch. We wish to get some feedback 
from the community about whether this implementation is appropriate and how to 
implement it better.
As ShuffleMapTask and ResultTask can't support iterative and update operations in 
one job, we have to modify core when implementing PS on Spark.

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Qiping Li

 h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and
 the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we propose a prototype of a Parameter Server on Spark (PS-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   Careful investigation was done of most existing Parameter Server 
 systems (including [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], and 
 [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
 designed by absorbing the essence of all these systems. 
 * *Prototype of distributed array*
 IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for a distributed array, because in most cases the number of 
 key updates per second is not very high. 
 So we implement a distributed HashMap to store the parameters, which can 
 be easily extended for better performance.
 
 * *Minimal code change*
   Quite a lot of effort is spent on avoiding code changes in Spark core. Tasks 
 which need the parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with the parameter server through a client object, 
 over *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. Finally, 
 each worker pushes the updated parameters to the parameter server. A worker communicates with 
 the parameter server through a parameter server client, which is initialized in 
 the `TaskContext` of that worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We refer to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add `delta` to the parameter indexed by `key`;
 // if there are multiple `delta`s to apply to the same parameter,
 // use `reduceFunc` to reduce these `delta`s first.
 def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
 // update multiple parameters at the same time, using the same `reduceFunc`.
 def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => 
 T): Unit
 
 // advance clock to indicate that current iteration is finished.
 def clock(): Unit
  
 // block until all workers have reached this line of code.
 def sync(): Unit
 {code}
 *`PSContext` provides following functions to use on driver:*
 {code}
 // load parameters from existing rdd.
 def loadPSModel[T](model: RDD[String, T]) 
 // fetch parameters from parameter server to construct model.
 def fetchPSModel[T](keys: Array[String]): Array[T]
 {code} 
 
 *A new function has been added to `RDD` to run parameter server tasks:*
 {code}
 // run the provided `func` on each partition of this RDD. 
 // This function can use data of this partition(the first argument) 
 // and a parameter server client(the second argument). 
 // See the following Logistic Regression for an example.
 def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]

 {code}
 h2. Example
 Here is an example of using our prototype