Re: Map output statuses exceeds frameSize

2014-11-13 Thread pouryas
Has anyone experienced this before? Any help would be appreciated.






Map output statuses exceeds frameSize

2014-11-12 Thread pouryas
Hey all

I am doing a groupBy on nearly 2 TB of data and I am getting this error:

2014-11-13 00:25:30 ERROR org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 32163619 bytes which exceeds spark.akka.frameSize (10485760 bytes).
org.apache.spark.SparkException: Map output statuses were 32163619 bytes which exceeds spark.akka.frameSize (10485760 bytes).
    at org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:57)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




I did set the frameSize to 1000 in my driver's spark-defaults.conf file, and I
can see it in the Environment tab of the UI, so why does the error still
report the default value? Is this not the correct way of setting the
frameSize, or is this related to this bug?

https://issues.apache.org/jira/browse/SPARK-1239
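For reference, this is roughly how I am setting it; a sketch only, and note
the value is in megabytes while the error message prints bytes (10485760
bytes is the 10 MB default):

    # spark-defaults.conf on the driver
    spark.akka.frameSize   1000

    // equivalent programmatic form, set before the SparkContext is created
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("groupby-job")               // hypothetical app name
      .set("spark.akka.frameSize", "1000")     // MB, not bytes
    val sc = new SparkContext(conf)

The other workaround I have seen mentioned is shrinking the map output
statuses themselves by using fewer partitions, since their size grows with
the number of map tasks times the number of reduce tasks.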






Re: S3 - Extra $_folder$ files for every directory node

2014-09-30 Thread pouryas
I would also like to know of a way to keep those $_folder$ files out of S3.
I can go ahead and delete them afterwards, but it would be nice if Spark
handled this for you.
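For now I clean them up after the job finishes with something like the sketch
below, using the plain Hadoop FileSystem API (bucket and output path are
placeholders, and the glob only catches markers directly under the output
directory):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    // delete the zero-byte *_$folder$ marker keys left next to the output
    val out = new Path("s3n://my-bucket/output")
    val fs  = FileSystem.get(out.toUri, new Configuration())
    Option(fs.globStatus(new Path(out, "*_$folder$")))
      .getOrElse(Array.empty[FileStatus])
      .foreach(status => fs.delete(status.getPath, false))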






Re: Re:

2014-09-25 Thread pouryas
I had a similar problem writing to Cassandra using the Cassandra connector.
I am not sure whether this will work for you, but I reduced the number of
cores to 1 per machine and my job became stable. More explanation of my
issue here:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-Connector-Issue-and-performance-td15005.html
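Roughly what I did, as a sketch (the app name is a placeholder;
spark.cores.max caps the total cores a standalone-mode application takes
across the cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-writer")    // hypothetical
      .set("spark.cores.max", "5")       // 1 core x 5 machines kept my job stable
    val sc = new SparkContext(conf)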






Spark Cassandra Connector Issue and performance

2014-09-24 Thread pouryas
Hey all

I tried the Spark connector with Cassandra and ran into a problem that I was
blocked on for a couple of weeks. I managed to find a workaround, but I am
not sure whether the underlying issue is a bug in the connector or in Spark.

I had three tables in Cassandra (running on a 5-node cluster) and a large
Spark cluster (5 worker nodes, each with 32 cores and 240G of memory).

When I ran my job, which extracts data from S3 and writes to the 3 tables in
Cassandra using around 1 TB of memory and 160 cores, it would sometimes get
stuck on the last few tasks of a stage.

After playing around for a while I realised that reducing the number of cores
to 2 per machine (10 total) made the job stable. I gradually increased the
number of cores, and it hung again once I reached about 50 cores in total.

I would like to know if anyone else has experienced this, and whether it is
explainable.
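If anyone wants to try the mitigation without starving the whole app of
cores, here is a sketch of the knobs involved. The connector property names
are an assumption on my part and have changed between releases, so check the
reference docs for your connector version:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // cap total cores for the app (what actually stabilised my job)
      .set("spark.cores.max", "10")
      // assumed connector settings: fewer concurrent writes, smaller batches
      .set("spark.cassandra.output.concurrent.writes", "1")
      .set("spark.cassandra.output.batch.size.bytes", "65536")  // illustrative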


On another note, I would like to know whether people are seeing good
performance reading from Cassandra with Spark as opposed to reading data from
HDFS. It is an open question, but I would like to see how others are using it.








Optimal Cluster Setup for Spark

2014-09-24 Thread pouryas
Hi there

What is an optimal cluster setup for Spark? Given a fixed amount of
resources, would you favour more worker nodes with fewer resources each, or
fewer worker nodes with more resources each? Is this application-dependent?
If so, what are the things to consider, and what are good practices?
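
For concreteness, in standalone mode the same box can be carved up either way
through spark-env.sh; a sketch with illustrative sizes:

    # spark-env.sh -- same 32-core / 240G machine, two ways to expose it

    # Option A: one large worker per machine
    SPARK_WORKER_INSTANCES=1
    SPARK_WORKER_CORES=32
    SPARK_WORKER_MEMORY=200g   # leave headroom for the OS

    # Option B: four smaller workers (smaller JVM heaps, gentler GC pauses)
    SPARK_WORKER_INSTANCES=4
    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=50g

My understanding is that very large heaps suffer from long GC pauses, while
many small workers add scheduling and shuffle overhead, which is why I
suspect the answer is application-dependent.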

Cheers


