Regarding master node failure
Hi,

What happens if a master node fails in the case of Spark Streaming? Would the data be lost in that case?

Thanks,
Swetha
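For context: recovery from a driver/master failure in Spark Streaming generally relies on checkpointing, so that the context can be rebuilt after a restart. Below is a minimal sketch, assuming a checkpoint directory on HDFS and a socket source; the path, batch interval, and source are placeholders, not part of the original question.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedApp {
      // Hypothetical HDFS path used for checkpoints.
      val checkpointDir = "hdfs:///user/spark/checkpoints/my-app"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpointed-streaming-app")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        // Placeholder source; define the full DStream graph inside this function
        // so it can be reconstructed from the checkpoint on recovery.
        ssc.socketTextStream("localhost", 9999).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On a clean start this builds a new context; after a driver failure it
        // rebuilds the context from the checkpoint data instead.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

Whether buffered input data survives the failure additionally depends on the source and on write-ahead logs; the sketch only covers rebuilding the streaming context itself.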
Data interaction between various RDDs in Spark Streaming
Hi,

Suppose I want the data to be grouped by an Id, say 12345. A certain amount of data for 12345 comes out of one batch, and more data related to 12345 arrives 5 hours later. How do I group by 12345 and end up with a single RDD containing the full list?

Thanks,
Swetha
Re: [VOTE] Release Apache Spark 1.4.1 (RC3)
+1

Verified that the previous blockers SPARK-8781 and SPARK-8819 are now resolved.

2015-07-07 12:06 GMT-07:00 Patrick Wendell pwend...@gmail.com:

Please vote on releasing the following candidate as Apache Spark version 1.4.1!

This release fixes a handful of known issues in Spark 1.4.0, listed here:
http://s.apache.org/spark-1.4.1

The tag to be voted on is v1.4.1-rc3 (commit 3e8ae38):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8ae38944f13895daf328555c1ad22cd590b089

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
[published as version: 1.4.1]
https://repository.apache.org/content/repositories/orgapachespark-1123/
[published as version: 1.4.1-rc3]
https://repository.apache.org/content/repositories/orgapachespark-1124/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc3-docs/

Please vote on releasing this package as Apache Spark 1.4.1!

The vote is open until Friday, July 10, at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/
Re: Data interaction between various RDDs in Spark Streaming
updateStateByKey?

Thanks
Best Regards

On Wed, Jul 8, 2015 at 1:05 AM, swetha swethakasire...@gmail.com wrote:

Hi,

Suppose I want the data to be grouped by an Id, say 12345. A certain amount of data for 12345 comes out of one batch, and more data related to 12345 arrives 5 hours later. How do I group by 12345 and end up with a single RDD containing the full list?

Thanks,
Swetha
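For illustration, a minimal sketch of the updateStateByKey approach suggested above, which keeps a running list of values per id across batches. The socket source, the "<id>,<value>" input format, the batch interval, and the checkpoint path are placeholder assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GroupAcrossBatches {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("group-across-batches"), Seconds(10))
        ssc.checkpoint("hdfs:///tmp/checkpoints") // required by updateStateByKey

        // Assume each input line looks like "<id>,<value>".
        val pairs = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(id, value) = line.split(",", 2)
          (id, value)
        }

        // Merge this batch's values for an id into the running list kept in state.
        val grouped = pairs.updateStateByKey[Seq[String]] {
          (newValues: Seq[String], state: Option[Seq[String]]) =>
            Some(state.getOrElse(Seq.empty) ++ newValues)
        }

        grouped.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that the state grows with every batch unless it is eventually pruned, so for data arriving hours apart the retention policy matters.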
[RESULT] [VOTE] Release Apache Spark 1.4.1 (RC2)
Hey All,

This vote is cancelled in favor of RC3.

- Patrick

On Fri, Jul 3, 2015 at 1:15 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.4.1!

This release fixes a handful of known issues in Spark 1.4.0, listed here:
http://s.apache.org/spark-1.4.1

The tag to be voted on is v1.4.1-rc2 (commit 07b95c7):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=07b95c7adf88f0662b7ab1c47e302ff5e6859606

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
[published as version: 1.4.1]
https://repository.apache.org/content/repositories/orgapachespark-1120/
[published as version: 1.4.1-rc2]
https://repository.apache.org/content/repositories/orgapachespark-1121/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.4.1!

The vote is open until Monday, July 06, at 22:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/
Re: Unable to add to roles in JIRA
BTW Infra has the ability to create multiple groups. Maybe that's a better solution: have contributor1, contributor2, contributor3 ...

On Tue, Jul 7, 2015 at 1:42 PM, Sean Owen so...@cloudera.com wrote:

Yeah, I've just realized a problem: the permissions for Developer are not the same as for Contributor. It includes the ability to Assign, but doesn't seem to include other more basic permissions. I cleared room in Contributor in the meantime (no point in having Committers there; the Committer permission is a superset), and I think we can actually fix this long-term by just removing barely-active people from Contributor, since it won't matter (they usually only need to be in the group to be Assigned). I've also pinged the ticket to get more control over JIRA permissions so we can rectify more of this.

On Tue, Jul 7, 2015 at 9:39 PM, Reynold Xin r...@databricks.com wrote:

I've been adding people to the developer role to get around the JIRA limit.

On Tue, Jul 7, 2015 at 3:05 AM, Sean Owen so...@cloudera.com wrote:

PS: the resolution on this is just that we've hit a JIRA limit, since the Contributor role is so big now. We have a currently-unused Developer role that has barely different permissions. I propose to move people that I recognize as regular contributors into the Developer group to make room. Practically speaking, there's no difference; just a heads-up in case you see changes here. But for reference, new contributors should go into Contributors by default.

On Sun, Jun 28, 2015 at 11:27 AM, Sean Owen so...@cloudera.com wrote:

In case you've tried and failed to add a person to a role in JIRA...
https://issues.apache.org/jira/browse/INFRA-9891
Re: Unable to add to roles in JIRA
I've been adding people to the developer role to get around the JIRA limit.

On Tue, Jul 7, 2015 at 3:05 AM, Sean Owen so...@cloudera.com wrote:

PS: the resolution on this is just that we've hit a JIRA limit, since the Contributor role is so big now. We have a currently-unused Developer role that has barely different permissions. I propose to move people that I recognize as regular contributors into the Developer group to make room. Practically speaking, there's no difference; just a heads-up in case you see changes here. But for reference, new contributors should go into Contributors by default.

On Sun, Jun 28, 2015 at 11:27 AM, Sean Owen so...@cloudera.com wrote:

In case you've tried and failed to add a person to a role in JIRA...
https://issues.apache.org/jira/browse/INFRA-9891
spark - redshift !!!
Hi,

Can you help me with how to load data from an S3 bucket into Redshift? If you have sample code, could you please send it to me?

Thanks,
su
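One possible approach, sketched under the assumption that the spark-csv and spark-redshift packages are on the classpath and that S3 credentials are configured in the Hadoop configuration; the bucket names, table name, and JDBC URL below are placeholders, not anything from the original question.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object S3ToRedshift {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3-to-redshift"))
        val sqlContext = new SQLContext(sc)

        // Read CSV data from S3 using the spark-csv package.
        val df = sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load("s3n://my-bucket/input/")

        // Write to Redshift via the spark-redshift connector, which stages the
        // rows in a temporary S3 directory and issues a COPY on the Redshift side.
        df.write
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
          .option("dbtable", "my_table")
          .option("tempdir", "s3n://my-bucket/tmp/")
          .mode("append")
          .save()
      }
    }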
Re: Unable to add to roles in JIRA
PS: the resolution on this is just that we've hit a JIRA limit, since the Contributor role is so big now. We have a currently-unused Developer role that has barely different permissions. I propose to move people that I recognize as regular contributors into the Developer group to make room. Practically speaking, there's no difference; just a heads-up in case you see changes here. But for reference, new contributors should go into Contributors by default.

On Sun, Jun 28, 2015 at 11:27 AM, Sean Owen so...@cloudera.com wrote:

In case you've tried and failed to add a person to a role in JIRA...
https://issues.apache.org/jira/browse/INFRA-9891
TableScan vs PrunedScan
Hi All,

I wanted to experiment a little bit with TableScan and PrunedScan. My first test was to print the columns requested by various SQL queries. To make this easier, I took spark-csv and replaced TableScan with PrunedScan, changing the buildScan method of CsvRelation from

def buildScan = {

to

def buildScan(requiredColumns: Array[String]) = {

This was the only modification I made to CsvRelation.scala, apart from logging the contents of requiredColumns. I then ran a very simple SELECT query on the same CSV file. When CsvRelation used TableScan, everything worked correctly; but with PrunedScan it did not work and returned empty columns, or columns in the wrong order.

Why does this happen? Is it a bug? I thought that PrunedScan is supposed to work exactly the same as TableScan, so I could freely change TableScan to PrunedScan, and that the only difference is that buildScan of PrunedScan takes requiredColumns as a parameter. Can someone explain the behavior I saw?

I am using Spark 1.5 from trunk.

Thanks a lot,
Gil
Re: TableScan vs PrunedScan
Hi Gil,

You would need to prune the resulting Row as well, based on the requested columns.

Ram

Sent from my iPhone

On Jul 7, 2015, at 3:12 AM, Gil Vernik g...@il.ibm.com wrote:

Hi All,

I wanted to experiment a little bit with TableScan and PrunedScan. My first test was to print the columns requested by various SQL queries. To make this easier, I took spark-csv and replaced TableScan with PrunedScan, changing the buildScan method of CsvRelation from

def buildScan = {

to

def buildScan(requiredColumns: Array[String]) = {

This was the only modification I made to CsvRelation.scala, apart from logging the contents of requiredColumns. I then ran a very simple SELECT query on the same CSV file. When CsvRelation used TableScan, everything worked correctly; but with PrunedScan it did not work and returned empty columns, or columns in the wrong order.

Why does this happen? Is it a bug? I thought that PrunedScan is supposed to work exactly the same as TableScan, so I could freely change TableScan to PrunedScan, and that the only difference is that buildScan of PrunedScan takes requiredColumns as a parameter. Can someone explain the behavior I saw?

I am using Spark 1.5 from trunk.

Thanks a lot,
Gil
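A minimal sketch of what Ram describes: with PrunedScan, buildScan must emit Rows that contain only the requested columns, in the requested order. The RDD of parsed string tokens and the string-only fields are simplifying assumptions here, not the actual CsvRelation internals.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
    import org.apache.spark.sql.types.StructType

    // Illustrative relation: `lines` holds already-parsed tokens, one array per
    // record, in the same order as the fields of `schema`.
    class ExamplePrunedRelation(
        lines: RDD[Array[String]],
        override val schema: StructType,
        @transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedScan {

      override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
        // Map each requested column name to its index in the full schema,
        // preserving the order Catalyst asked for.
        val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
        lines.map { tokens =>
          // Emit only the requested fields, in the requested order
          // (all fields are strings in this simplified sketch).
          Row.fromSeq(indices.map(tokens(_)))
        }
      }
    }

If buildScan keeps returning full Rows while declaring PrunedScan, the planner assumes the Rows already match requiredColumns, which would explain the empty or misordered columns you observed.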
thrift server reliability issue
Hi everyone,

Found a Thrift server reliability issue on Spark 1.3.1 that causes the Thrift server to fail. When the Thrift server has too little memory allocated to the driver to process a request, its Spark SQL session exits with an OutOfMemory exception, causing the Thrift server to stop working. Is this a known issue?

Thanks,
Judy

--
Full stacktrace of the out of memory exception:

2015-07-08 03:30:18,011 ERROR actor.ActorSystemImpl (Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
    at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
    at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
    at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
    at scala.util.Try$.apply(Try.scala:161)
    at akka.serialization.Serialization.deserialize(Serialization.scala:98)
    at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
    at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
    at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
    at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
    at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Spark job hangs when History server events are written to hdfs
Hi,

I am running a long-running application on YARN using Spark, and I am facing issues with Spark's history server when the events are written to HDFS. It seems to work fine for some time, and then I see the following exception:

2015-06-01 00:00:03,247 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
    at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.util.FileLogger.flush(FileLogger.scala:203)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:90)
    at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:121)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:66)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1545)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
Caused by: java.io.IOException: All datanodes 192.168.162.54:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)

After this, the error keeps recurring and Spark reaches an unstable state where no job is able to make progress. FYI, HDFS was up and running before and after this error, and on restarting the application it runs fine for some hours before the same error comes back. Enough disk space was available on each data node.

Any suggestion or help would be appreciated.

Regards,
Pankaj