[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649972#comment-16649972 ] shijinkui commented on SPARK-24630: --- I prefer it without the stream keyword, because in the future the batch and stream APIs must unify. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream, and StormSQL all provide a stream-type SQL interface, with which users > with little knowledge about streaming can easily develop a stream > processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL streaming, there are two key points: > 1. The parser should be able to parse streaming-type SQL. > 2. The analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
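A minimal sketch of key point 1 above, assuming a leading `STREAM` keyword marks a streaming query (the keyword, the `Mode` type, and the `route` helper are all hypothetical, not Spark APIs; note the commenter would prefer that no extra keyword be needed at all):

```scala
// Hypothetical router for stream-type SQL: detect a leading STREAM keyword
// and hand the remaining query text to the ordinary analyzer, so batch and
// stream queries share one parsing path. All names here are illustrative.
sealed trait Mode
case object Batch extends Mode
case object Streaming extends Mode

def route(sql: String): (Mode, String) = {
  val trimmed = sql.trim
  if (trimmed.toUpperCase.startsWith("STREAM "))
    (Streaming, trimmed.drop("STREAM ".length).trim)
  else
    (Batch, trimmed)
}
```

Under this sketch, `route("STREAM SELECT * FROM t")` yields a streaming plan for the plain query `SELECT * FROM t`, while any other string falls through to batch.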
[jira] [Closed] (SPARK-16977) start spark-shell with yarn mode error
[ https://issues.apache.org/jira/browse/SPARK-16977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui closed SPARK-16977. - Resolution: Won't Fix > start spark-shell with yarn mode error > -- > > Key: SPARK-16977 > URL: https://issues.apache.org/jira/browse/SPARK-16977 > Project: Spark > Issue Type: Temp > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: shijinkui > > spark 2.0.0 > yarn 1.7.2 > bin/spark-shell --master yarn > error: > 16/08/09 20:27:53 ERROR YarnClientSchedulerBackend: Yarn application has > already exited with state FINISHED! > 16/08/09 20:27:53 ERROR SparkContext: Error initializing SparkContext. > java.lang.IllegalStateException: Spark context stopped while waiting for > backend > at > org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:581) > at > org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162) > at org.apache.spark.SparkContext.(SparkContext.scala:549) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) > at > scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807) > at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) > at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) > at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214) > at > org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37) > at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) > at > scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97) > at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909) > at org.apache.spark.repl.Main$.doMain(Main.scala:68) > at 
org.apache.spark.repl.Main$.main(Main.scala:51) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
[jira] [Updated] (SPARK-16977) start spark-shell with yarn mode error
[ https://issues.apache.org/jira/browse/SPARK-16977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui updated SPARK-16977: -- Issue Type: Temp (was: Bug) > start spark-shell with yarn mode error > -- > > Key: SPARK-16977 > URL: https://issues.apache.org/jira/browse/SPARK-16977 > Project: Spark > Issue Type: Temp > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: shijinkui > > spark 2.0.0 > yarn 1.7.2 > bin/spark-shell --master yarn > error: > 16/08/09 20:27:53 ERROR YarnClientSchedulerBackend: Yarn application has > already exited with state FINISHED! > 16/08/09 20:27:53 ERROR SparkContext: Error initializing SparkContext. > java.lang.IllegalStateException: Spark context stopped while waiting for > backend > at > org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:581) > at > org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162) > at org.apache.spark.SparkContext.(SparkContext.scala:549) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) > at > scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807) > at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) > at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) > at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214) > at > org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37) > at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) > at > scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97) > at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909) > at org.apache.spark.repl.Main$.doMain(Main.scala:68) > at 
org.apache.spark.repl.Main$.main(Main.scala:51) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
[jira] [Created] (SPARK-16977) start spark-shell with yarn mode error
shijinkui created SPARK-16977: - Summary: start spark-shell with yarn mode error Key: SPARK-16977 URL: https://issues.apache.org/jira/browse/SPARK-16977 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.0.0 Reporter: shijinkui spark 2.0.0 yarn 1.7.2 bin/spark-shell --master yarn error: 16/08/09 20:27:53 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! 16/08/09 20:27:53 ERROR SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: Spark context stopped while waiting for backend at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:581) at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:162) at org.apache.spark.SparkContext.(SparkContext.scala:549) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) at $line3.$read$$iw$$iw.(:15) at $line3.$read$$iw.(:31) at $line3.$read.(:33) at $line3.$read$.(:37) at $line3.$read$.() at $line3.$eval$.$print$lzycompute(:7) at $line3.$eval$.$print(:6) at $line3.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807) at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37) at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97) at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909) at org.apache.spark.repl.Main$.doMain(Main.scala:68) at org.apache.spark.repl.Main$.main(Main.scala:51) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) at
[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message
[ https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413408#comment-15413408 ] shijinkui commented on SPARK-14559: --- spark: 2.0.0 yarn: 1.7.2 yarn log error: 2016-08-09 19:31:22,061 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:795) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:776) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) at java.lang.Thread.run(Thread.java:745) 2016-08-09 19:31:22,654 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1470732309211_0016_02_02 Container Transitioned from ACQUIRED to RUNNING 2016-08-09 19:31:28,533 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed... 2016-08-09 19:31:29,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed... 
2016-08-09 19:31:29,545 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed... > Netty RPC didn't check channel is active before sending message > --- > > Key: SPARK-14559 > URL: https://issues.apache.org/jira/browse/SPARK-14559 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65 >Reporter: cen yuhai > > I have a long-running service. After running for several hours, it threw > these exceptions. I found that before sending an RPC request by calling the sendRpc > method in TransportClient, there is no check of whether the channel is > still open or active. > java.nio.channels.ClosedChannelException > 4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 5635696155204230556 to > bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio. > channels.ClosedChannelException > 4866 java.nio.channels.ClosedChannelException > 4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 7319486003318455703 to > bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio. > channels.ClosedChannelException > 4868 java.nio.channels.ClosedChannelException > 4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9041854451893215954 to > bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio. > channels.ClosedChannelException > 4870 java.nio.channels.ClosedChannelException > 4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 6046473497871624501 to > bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio. > channels.ClosedChannelException > 4872 java.nio.channels.ClosedChannelException > 4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9085605650438705047 to > bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio. 
> channels.ClosedChannelException
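A minimal sketch of the guard the reporter is asking for: check that the channel is still active up front and fail fast, instead of letting the write die later with ClosedChannelException. `Chan` below is a stand-in for Netty's `io.netty.channel.Channel`, not the real class.

```scala
// Stand-in for Netty's Channel, reduced to the two members the sketch needs.
trait Chan {
  def isActive: Boolean
  def write(msg: String): Unit
}

// Guarded send: surface a clear error for an inactive channel rather than
// discovering the closed channel inside the write path.
def sendRpcGuarded(ch: Chan, msg: String): Either[String, Unit] =
  if (!ch.isActive) Left(s"channel inactive, dropping RPC: $msg")
  else { ch.write(msg); Right(()) }
```

The design point is simply that the caller gets a synchronous, inspectable failure (the `Left`) it can retry or log, instead of an asynchronous exception from the transport layer.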
[jira] [Created] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
shijinkui created SPARK-12953: - Summary: RDDRelation write set mode will be better to avoid error "pair.parquet already exists" Key: SPARK-12953 URL: https://issues.apache.org/jira/browse/SPARK-12953 Project: Spark Issue Type: Wish Components: SQL Reporter: shijinkui Fix For: 1.6.1 It will be error if not set Write Mode when execute test case `RDDRelation.main()` Exception in thread "main" org.apache.spark.sql.AnalysisException: path file:/Users/sjk/pair.parquet already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
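In the real API the fix the title asks for is a one-liner, `df.write.mode(SaveMode.Overwrite).parquet(path)`. The semantics can be sketched in plain Scala, with a mutable map standing in for the filesystem (the `WMode`/`save` names are illustrative, not Spark's):

```scala
import scala.collection.mutable

// Plain-Scala analog of Spark's SaveMode behavior: ErrorIfExists (the
// default, which triggers the "already exists" error above) rejects a
// pre-existing path, while Overwrite replaces its contents.
sealed trait WMode
case object ErrorIfExists extends WMode
case object Overwrite extends WMode

def save(fs: mutable.Map[String, String], path: String,
         data: String, mode: WMode): Either[String, Unit] =
  mode match {
    case ErrorIfExists if fs.contains(path) =>
      Left(s"path $path already exists")
    case _ =>
      fs(path) = data
      Right(())
  }
```

Rerunning the example with `ErrorIfExists` reproduces the reported failure on the second write; with `Overwrite` the rerun simply replaces `pair.parquet`.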
[jira] [Updated] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui updated SPARK-12953: -- Component/s: (was: SQL) Examples > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at 
net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Commented] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111750#comment-15111750 ] shijinkui commented on SPARK-12953: --- OK, get it. > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Priority: Minor > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at 
net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Commented] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581199#comment-14581199 ] shijinkui commented on SPARK-7893: -- OK Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and return a new graph. In many complex case,such as _*streaming graph, small graph merge into huge graph*_, higher level operators of graphs can help users to focus and think in graph. Performance optimization can be done internally and be transparent to them. Complex graph operator list is here:[complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs( G ∩ H) * Graph Join * Difference of Graphs(G – H) * Graph Complement * Line Graph ( L(G) ) This issue will be index of all these operators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581204#comment-14581204 ] shijinkui commented on SPARK-7894: -- hey [~ilganeli] there is no union in EdgeRDD and VertexRDD directly. The `union` in `RDD` can't merge identical elements in one operation. In the union of graphs, we need to merge identical vertex attributes and identical edge attributes. Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Sub-task Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The below image shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping of vertices and edges will inevitably happen between the borders of graphs. For vertices, it's quite natural to just make a union and remove the duplicate ones. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]
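The proposed union semantics can be sketched on plain collections, assuming vertices are keyed by id and edges by (src, dst); attributes present in both graphs are combined with user-supplied merge functions, as the proposed mergeEdges parameter suggests. This is a stand-in sketch, not GraphX code:

```scala
// Combine two attribute maps; keys present in both sides are merged with f.
def mergeMaps[K, A](a: Map[K, A], b: Map[K, A], f: (A, A) => A): Map[K, A] =
  (a.keySet ++ b.keySet).iterator.map { k =>
    k -> ((a.get(k), b.get(k)) match {
      case (Some(x), Some(y)) => f(x, y)
      case (Some(x), None)    => x
      case (None, Some(y))    => y
      case (None, None)       => sys.error("unreachable")
    })
  }.toMap

// Graph union as in the ticket: union of vertex sets and edge families,
// with duplicates merged instead of silently kept twice.
def unionGraph[V, E](
    g: (Map[Long, V], Map[(Long, Long), E]),
    h: (Map[Long, V], Map[(Long, Long), E]),
    mergeV: (V, V) => V,
    mergeE: (E, E) => E): (Map[Long, V], Map[(Long, Long), E]) =
  (mergeMaps(g._1, h._1, mergeV), mergeMaps(g._2, h._2, mergeE))
```

This mirrors the point made in the comment: a plain `RDD.union` would keep both copies of an overlapping vertex or edge, whereas graph union needs the merge step.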
[jira] [Updated] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function
[ https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui updated SPARK-5062: - Fix Version/s: 1.3.2 Pregel use aggregateMessage instead of mapReduceTriplets function - Key: SPARK-5062 URL: https://issues.apache.org/jira/browse/SPARK-5062 Project: Spark Issue Type: Wish Components: GraphX Reporter: shijinkui Fix For: 1.3.2 Attachments: graphx_aggreate_msg.jpg Since Spark 1.2 introduced aggregateMessages in place of mapReduceTriplets, performance has indeed improved. It's time to replace mapReduceTriplets with aggregateMessages in Pregel; we can discuss it. I have drawn a diagram of aggregateMessages to show why it can improve performance.
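The aggregateMessages pattern the ticket wants Pregel to adopt can be sketched on plain collections (stand-in types, not the GraphX API): a sendMsg function emits (vertexId, message) pairs per edge, and mergeMsg combines all messages bound for the same vertex.

```scala
// Stand-in edge type; GraphX's Edge/EdgeContext are not used here.
final case class Edg[A](src: Long, dst: Long, attr: A)

// Core shape of aggregateMessages: flatMap over edges to produce messages,
// then reduce the messages per destination vertex.
def aggregate[A, M](edges: Seq[Edg[A]],
                    sendMsg: Edg[A] => Seq[(Long, M)],
                    mergeMsg: (M, M) => M): Map[Long, M] =
  edges.flatMap(sendMsg)
    .groupBy(_._1)
    .map { case (vid, ms) => vid -> ms.map(_._2).reduceLeft(mergeMsg) }
```

For example, emitting `(e.dst, 1)` per edge with `+` as mergeMsg computes in-degrees, which is the classic aggregateMessages demonstration.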
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14506383#comment-14506383 ] shijinkui commented on SPARK-6932: -- Hi, @Xiangrui Meng. I have an idea: the training-data tasks could keep a persistent connection with the parameter server, using Akka Streams or Netty directly. What do you think about this? A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib, Spark Core Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially for those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing], and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we propose a prototype of Parameter Server on Spark (PS-on-Spark), with several key design concerns: * *User friendly interface* Careful investigation was done of most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], [paracel|https://github.com/douban/paracel]), and a user-friendly interface was designed by absorbing the essence of all these systems. * *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for a distributed array, because in most cases the number of key updates per second will not be very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort is done to avoid code changes to Spark core. 
Tasks which need parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with parameter server with a client object, through *akka* or *netty*. With all these concerns we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in RDD and is partitioned across workers. During each iteration, each worker gets parameters from parameter server then computes new parameters based on old parameters and data in the partition. Finally each worker updates parameters to parameter server.Worker communicates with parameter server through a parameter server client,which is initialized in `TaskContext` of this worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplanted it to other modes. h3. Interface We refer to existing parameter server systems(petuum, parameter server, paracel) when design the interface of parameter server. *`PSClient` provides the following interface for workers to use:* {code} // get parameter indexed by key from parameter server def get[T](key: String): T // get multiple parameters from parameter server def multiGet[T](keys: Array[String]): Array[T] // add parameter indexed by `key` by `delta`, // if multiple `delta` to update on the same parameter, // use `reduceFunc` to reduce these `delta`s frist. def update[T](key: String, delta: T, reduceFunc: (T, T) = T): Unit // update multiple parameters at the same time, use the same `reduceFunc`. def multiUpdate(keys: Array[String], delta: Array[T], reduceFunc: (T, T) = T: Unit // advance clock to indicate that current iteration is finished. def clock(): Unit // block until all workers have reached this line of code. def sync(): Unit {code} *`PSContext` provides following functions to use on driver:* {code} // load parameters from existing rdd. 
> def loadPSModel[T](model: RDD[(String, T)])
>
> // fetch parameters from the parameter server to construct a model.
> def fetchPSModel[T](keys: Array[String]): Array[T]
> {code}
> *A new function has been added to `RDD` to run parameter server tasks:*
> {code}
> // run the provided `func` on each partition of this RDD.
> // `func` can use the data of this partition (the first argument)
> // and a parameter server client (the second argument).
> // See the following Logistic Regression for an example.
> def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]
> {code}
> h2. Example
> Here is an example of using our prototype to implement logistic regression:
> {code:title=LogisticRegression.scala|borderStyle=solid}
> def train(
>     sc: SparkContext,
>     input: RDD[LabeledPoint],
>     numIterations: Int,
>     stepSize: Double,
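The `update`/`reduceFunc` contract in the `PSClient` interface above can be illustrated with a small single-process mock. `MockPSClient` and its in-memory HashMap are purely illustrative assumptions for this sketch; the real prototype would talk to a remote parameter server:

```scala
import scala.collection.mutable

// Illustrative stand-in for the proposed PSClient: a local HashMap
// replaces the remote parameter server, keeping only the get/update
// semantics described in the interface above.
class MockPSClient {
  private val store = mutable.HashMap.empty[String, Any]

  // get the parameter indexed by `key`
  def get[T](key: String): T = store(key).asInstanceOf[T]

  // add `delta` to the parameter at `key`; if a value already exists,
  // the old value and `delta` are combined with `reduceFunc`,
  // otherwise `delta` becomes the initial value.
  def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit = {
    store(key) = store.get(key) match {
      case Some(old) => reduceFunc(old.asInstanceOf[T], delta)
      case None      => delta
    }
  }
}

val ps = new MockPSClient
// two "workers" push gradient deltas for the same weight;
// the deltas are folded together with the reduceFunc (here: addition)
ps.update[Double]("w0", 0.5, _ + _)
ps.update[Double]("w0", 0.25, _ + _)
println(ps.get[Double]("w0")) // prints 0.75
```

In the real design each task's deltas would be reduced the same way before (or while) being applied on the server side, which is what lets many workers update one parameter concurrently.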
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497333#comment-14497333 ] shijinkui commented on SPARK-6932:
----------------------------------

Hi, [~srowen]. This issue is one implementation of SPARK-4590; there may well be better implementations. It has been pushed to a forked branch. We wish to get feedback from the community about whether this implementation is appropriate and how to implement it better. As ShuffleMapTask and ResultTask can't support iterative and update operations within one job, we had to modify core when implementing PS on Spark.