Regarding master node failure

2015-07-07 Thread swetha
Hi,

What happens if the master node fails in a Spark Streaming application? Would
the data be lost in that case?

Thanks,
Swetha
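
One common mitigation, sketched minimally below (the checkpoint directory, batch interval, and configuration values are illustrative assumptions, not a prescription): with checkpointing enabled and the receiver write-ahead log turned on, a restarted driver can recover streaming metadata and replay received data; a standalone master can additionally be made recoverable (e.g. via ZooKeeper).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative checkpoint location; any reliable store (e.g. HDFS) works.
val checkpointDir = "hdfs:///spark/streaming-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("ResilientStreamingApp")
    // Persist received blocks to a write-ahead log so receiver data can be replayed.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input DStreams and transformations here ...
  ssc
}

// On restart, rebuild the context (and its pending metadata) from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()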







Data interaction between various RDDs in Spark Streaming

2015-07-07 Thread swetha
Hi,

Suppose I want the data to be grouped by an id, say 12345. I have a certain
amount of data for 12345 coming out of one batch, and more data related to
12345 arriving 5 hours later. How do I group by 12345 and end up with a
single RDD of the full list?

Thanks,
Swetha







Re: [VOTE] Release Apache Spark 1.4.1 (RC3)

2015-07-07 Thread Andrew Or
+1

Verified that the previous blockers SPARK-8781 and SPARK-8819 are now
resolved.

2015-07-07 12:06 GMT-07:00 Patrick Wendell pwend...@gmail.com:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc3 (commit 3e8ae38):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 3e8ae38944f13895daf328555c1ad22cd590b089

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1123/
 [published as version: 1.4.1-rc3]
 https://repository.apache.org/content/repositories/orgapachespark-1124/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc3-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Friday, July 10, at 20:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/





Re: Data interaction between various RDDs in Spark Streaming

2015-07-07 Thread Akhil Das
updateStateByKey?

Thanks
Best Regards

On Wed, Jul 8, 2015 at 1:05 AM, swetha swethakasire...@gmail.com wrote:

 Hi,

 Suppose I want the data to be grouped by an id, say 12345. I have a certain
 amount of data for 12345 coming out of one batch, and more data related to
 12345 arriving 5 hours later. How do I group by 12345 and end up with a
 single RDD of the full list?

 Thanks,
 Swetha



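
A minimal sketch of the updateStateByKey approach Akhil suggests, assuming the stream has already been keyed by the id (the source, batch interval, and checkpoint path are illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GroupByIdAcrossBatches {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("GroupByIdAcrossBatches"), Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // required for stateful operations

    // Hypothetical source producing lines of the form "<id>,<payload>", e.g. "12345,click"
    val events = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(id, payload) = line.split(",", 2)
      (id, payload)
    }

    // Accumulate every value seen so far for each id, across batches.
    val grouped = events.updateStateByKey[List[String]] {
      (newValues: Seq[String], state: Option[List[String]]) =>
        Some(state.getOrElse(Nil) ++ newValues)
    }

    grouped.print() // each batch's RDD now carries the full list for "12345" so far
    ssc.start()
    ssc.awaitTermination()
  }
}

One caveat: the state for 12345 grows without bound in this sketch; in practice you would expire or prune entries (for example, return None once an id's data is complete) so the state does not accumulate forever.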




[RESULT] [VOTE] Release Apache Spark 1.4.1 (RC2)

2015-07-07 Thread Patrick Wendell
Hey All,

This vote is cancelled in favor of RC3.

- Patrick

On Fri, Jul 3, 2015 at 1:15 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc2 (commit 07b95c7):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 07b95c7adf88f0662b7ab1c47e302ff5e6859606

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1120/
 [published as version: 1.4.1-rc2]
 https://repository.apache.org/content/repositories/orgapachespark-1121/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Monday, July 06, at 22:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/




Re: Unable to add to roles in JIRA

2015-07-07 Thread Reynold Xin
BTW Infra has the ability to create multiple groups. Maybe that's a better
solution.

Have contributor1, contributor2, contributor3 ...

On Tue, Jul 7, 2015 at 1:42 PM, Sean Owen so...@cloudera.com wrote:

 Yeah, I've just realized a problem: the permissions for Developer
 are not the same as for Contributor. It includes the ability to Assign,
 but doesn't seem to include other, more basic permissions.

 I cleared room in Contributor in the meantime (no point in having
 Committers there; the Committer permission is a superset), and I think we
 can actually fix this long-term by just removing barely-active people
 from Contributor, since it won't matter (they usually only need to be in
 the group to be assigned issues).

 I've also pinged the ticket to get more control over JIRA permissions
 so we can rectify more of this.

 On Tue, Jul 7, 2015 at 9:39 PM, Reynold Xin r...@databricks.com wrote:
  I've been adding people to the developer role to get around the jira
 limit.
 
 
  On Tue, Jul 7, 2015 at 3:05 AM, Sean Owen so...@cloudera.com wrote:
 
  PS the resolution on this is just that we've hit a JIRA limit, since
  the Contributor role is so big now.
 
  We have a currently-unused Developer role that barely has different
  permissions. I propose to move people that I recognize as regular
  Contributors into the Developer group to make room. Practically
  speaking, there's no difference. Just a heads up in case you see
  changes here.
 
  But for reference, new contributors should go to Contributors by
 default.
 
  On Sun, Jun 28, 2015 at 11:27 AM, Sean Owen so...@cloudera.com wrote:
   In case you've tried and failed to add a person to a role in JIRA...
   https://issues.apache.org/jira/browse/INFRA-9891
 
 
 



Re: Unable to add to roles in JIRA

2015-07-07 Thread Reynold Xin
I've been adding people to the developer role to get around the jira limit.


On Tue, Jul 7, 2015 at 3:05 AM, Sean Owen so...@cloudera.com wrote:

 PS the resolution on this is just that we've hit a JIRA limit, since
 the Contributor role is so big now.

 We have a currently-unused Developer role that barely has different
 permissions. I propose to move people that I recognize as regular
 Contributors into the Developer group to make room. Practically
 speaking, there's no difference. Just a heads up in case you see
 changes here.

 But for reference, new contributors should go to Contributors by default.

 On Sun, Jun 28, 2015 at 11:27 AM, Sean Owen so...@cloudera.com wrote:
  In case you've tried and failed to add a person to a role in JIRA...
  https://issues.apache.org/jira/browse/INFRA-9891





spark - redshift !!!

2015-07-07 Thread spark user
Hi,

Can you help me with how to load data from an S3 bucket into Redshift? If you
have sample code, could you please send it to me?

Thanks,
su
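
Not an authoritative recipe, but a minimal sketch of one common route, assuming the third-party spark-csv and spark-redshift packages are on the classpath (the bucket, table, JDBC URL, and option names are illustrative and should be checked against those packages' documentation):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("S3ToRedshift"))
val sqlContext = new SQLContext(sc)

// 1. Read the source data from S3 (illustrative bucket, path, and format);
//    AWS credentials must already be configured for the S3 filesystem.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("s3n://my-bucket/input/data.csv")

// 2. Write it to Redshift; the connector stages the data in a temporary S3
//    directory and issues a COPY on the Redshift side.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/db?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .mode("append")
  .save()

Alternatively, Redshift's own COPY command over JDBC achieves the same bulk load without going through Spark at all.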

Re: Unable to add to roles in JIRA

2015-07-07 Thread Sean Owen
PS the resolution on this is just that we've hit a JIRA limit, since
the Contributor role is so big now.

We have a currently-unused Developer role that barely has different
permissions. I propose to move people that I recognize as regular
Contributors into the Developer group to make room. Practically
speaking, there's no difference. Just a heads up in case you see
changes here.

But for reference, new contributors should go to Contributors by default.

On Sun, Jun 28, 2015 at 11:27 AM, Sean Owen so...@cloudera.com wrote:
 In case you've tried and failed to add a person to a role in JIRA...
 https://issues.apache.org/jira/browse/INFRA-9891




TableScan vs PrunedScan

2015-07-07 Thread Gil Vernik
Hi All,

I wanted to experiment a little bit with TableScan and PrunedScan.
My first test was to print columns from various SQL queries. 
To make this test easier, I just took spark-csv and I replaced TableScan
with PrunedScan.
I then changed the buildScan method of CsvRelation from

def buildScan() = {

to 

def buildScan(requiredColumns: Array[String]) = { …

This was the only modification I made to CsvRelation.scala, and I added a
print of requiredColumns to the log.

I then took the same CSV file and ran a very simple SELECT query on it.
I noticed that when CsvRelation used TableScan, all worked correctly.
But when I used PrunedScan, it didn't work and returned empty columns or
columns in the wrong order.

Why does this happen? Is it a bug? I thought that PrunedScan was supposed to
work exactly the same as TableScan and that I could freely change TableScan
to PrunedScan, the only difference being that buildScan of PrunedScan takes
requiredColumns as a parameter.

Can someone explain the behavior I saw?

I am using Spark 1.5 from trunk.
Thanks a lot
Gil.

Re: TableScan vs PrunedScan

2015-07-07 Thread Ram Sriharsha
Hi Gil

You would need to prune the resulting Row as well based on the requested 
columns.

Ram

Sent from my iPhone

 On Jul 7, 2015, at 3:12 AM, Gil Vernik g...@il.ibm.com wrote:
 
 Hi All, 
 
 I wanted to experiment a little bit with TableScan and PrunedScan. 
 My first test was to print columns from various SQL queries.  
 To make this test easier, I just took spark-csv and I replaced TableScan with
 PrunedScan.
 I then changed the buildScan method of CsvRelation from

 def buildScan() = {

 to

 def buildScan(requiredColumns: Array[String]) = { …

 This was the only modification I made to CsvRelation.scala, and I added a print
 of requiredColumns to the log.

 I then took the same CSV file and ran a very simple SELECT query on it.
 I noticed that when CsvRelation used TableScan, all worked correctly.
 But when I used PrunedScan, it didn't work and returned empty columns or
 columns in the wrong order.

 Why does this happen? Is it a bug? I thought that PrunedScan was supposed to
 work exactly the same as TableScan and that I could freely change TableScan to
 PrunedScan, the only difference being that buildScan of PrunedScan takes
 requiredColumns as a parameter.

 Can someone explain the behavior I saw?
 
 I am using Spark 1.5 from trunk. 
 Thanks a lot 
 Gil.
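
A hedged sketch of what Ram means, using a deliberately simplified relation (the class name, schema, and parsing are illustrative assumptions, not the actual spark-csv code): with PrunedScan, the returned Rows must contain only the requested columns, in the requested order; returning full rows is what produces empty or shifted columns.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical, simplified CSV-backed relation with a fixed schema.
case class SimpleCsvRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedScan {

  override val schema = StructType(Seq(
    StructField("id", StringType),
    StructField("name", StringType),
    StructField("age", StringType)))

  // Build Rows containing ONLY the requested columns, in the requested order.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val indexOf = schema.fieldNames.zipWithIndex.toMap
    val indices = requiredColumns.map(indexOf)
    sqlContext.sparkContext.textFile(path).map { line =>
      val tokens = line.split(",")
      Row.fromSeq(indices.map(i => tokens(i)))
    }
  }
}

If the modified CsvRelation kept returning rows with every column, that would match the symptom Gil describes: Spark interprets the returned rows as containing only the pruned columns, so the values come back empty or out of order.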


thrift server reliability issue

2015-07-07 Thread Judy Nash
Hi everyone,

Found a Thrift server reliability issue on Spark 1.3.1 that causes the Thrift
server to fail.

When the Thrift server has too little memory allocated to the driver to process a
request, its Spark SQL session exits with an OutOfMemoryError, causing the Thrift
server to stop working.

Is this a known issue?

Thanks,
Judy

--
Full stacktrace of out of memory exception:
2015-07-08 03:30:18,011 ERROR actor.ActorSystemImpl (Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
    at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
    at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
    at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
    at scala.util.Try$.apply(Try.scala:161)
    at akka.serialization.Serialization.deserialize(Serialization.scala:98)
    at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
    at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
    at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
    at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
    at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Spark job hangs when History server events are written to hdfs

2015-07-07 Thread Pankaj Arora
Hi,

I am running a long-running application over YARN using Spark, and I am facing
issues with Spark's history server when the events are written to HDFS.
It seems to work fine for some time, and then I see the following exception.


2015-06-01 00:00:03,247 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
    at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.util.FileLogger.flush(FileLogger.scala:203)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:90)
    at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:121)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:66)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1545)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
Caused by: java.io.IOException: All datanodes 192.168.162.54:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)



After that, this error continues to appear and Spark reaches an unstable
state where no job is able to progress.

FYI:
HDFS was up and running before and after this error; on restarting the
application, it runs fine for some hours and then the same error comes back.
Enough disk space was available on each data node.

Any suggestion or help would be appreciated.

Regards
Pankaj
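
For reference, a minimal sketch of the event-log configuration being described, so readers can reproduce the setup (the application name and HDFS directory are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// Write application events to HDFS so the history server can replay them later.
val conf = new SparkConf()
  .setAppName("LongRunningApp")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark/event-logs") // illustrative path
val sc = new SparkContext(conf)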