Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Anshul Singhle
Hi Akhil,

I discovered the reason for this problem. There was in issue with my
deployment (Google Cloud Platform). My spark master was on a different
"region" than the slaves. This resulted in huge scheduler delays.

Thanks,
Anshul

On Thu, Apr 30, 2015 at 1:39 PM, Akhil Das 
wrote:

> Does this speed up?
>
>
> val rdd = sc.parallelize(1 to 100*, 30)*
> rdd.count
>
>
>
>
> Thanks
> Best Regards
>
> On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle 
> wrote:
>
>> Hi,
>>
>> I'm running the following code in my cluster (standalone mode) via spark
>> shell -
>>
>> val rdd = sc.parallelize(1 to 100)
>> rdd.count
>>
>> This takes around 1.2s to run.
>>
>> Is this expected or am I configuring something wrong?
>>
>> I'm using about 30 cores with 512MB executor memory
>>
>> As expected, GC time is negligible. I'm just getting some scheduler delay
>> and 1s to launch the task
>>
>> Thanks,
>>
>> Anshul
>>
>
>


Re: java.io.IOException: No space left on device

2015-04-29 Thread Anshul Singhle
Do you have multiple disks? Maybe your work directory is not in the right
disk?

On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi  wrote:

> Hi,
>
> I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf
> output,the training data is a file containing 156060 (size 8.1M).
>
> The problem is that when trying to presist a partition into memory and
> there
> is not enought memory, the partition is persisted on disk and despite
> Having
> 229G of free disk space, I got " No space left on device"..
>
> This is how I'm running the program :
>
> ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
> local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
> testData.tsv
>
> And this is a part of the log:
>
>
>
> If you need more informations, please let me know.
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
yes
On 29 Apr 2015 03:31, "ayan guha"  wrote:

> Are your driver running on the same m/c as master?
> On 29 Apr 2015 03:59, "Anshul Singhle"  wrote:
>
>> Hi,
>>
>> I'm running short spark jobs on rdds cached in memory. I'm also using a
>> long running job context. I want to be able to complete my jobs (on the
>> cached rdd) in under 1 sec.
>> I'm getting the following job times with about 15 GB of data distributed
>> across 6 nodes. Each executor has about 20GB of memory available. My
>> context has about 26 cores in total.
>>
>> If number of partitions < no of cores -
>>
>> Some jobs run in 3s others take about 6s -- the time difference can be
>> explained by GC time.
>>
>> If number of partitions = no of cores -
>>
>> All jobs run in 4s. The initial tasks of each stage on every executor
>> take about 1s.
>>
>> If partitions > cores -
>>
>> Jobs take more time. The initial tasks of each stage on every executor
>> take about 1s. The other tasks run in 45-50 ms each. However, since the
>> initial tasks again take about 1s each, the total time in this case is
>> about 6s which is more than the previous case.
>>
>> Clearly the limiting factor here is the initial set of tasks. For every
>> case, these tasks take 1s to run, no matter the amount of partitions. Hence
>> best results are obtained with partitions = cores, because in that case.
>> every core gets 1 task which takes 1s to run.
>> In this case, I get 0 GC time. The only explanation is "scheduling delay"
>> which is about 0.2 - 0.3 seconds. I looked at my task size and result size
>> and that has no bearing on this delay. Also, I'm not getting the task size
>> warnings in the logs.
>>
>> For what I can understand, the first time a task runs on a core, it takes
>> 1s to run. Is this normal?
>>
>> Is it possible to get sub-second latencies?
>>
>> Can something be done about the scheduler delay?
>>
>> What other things can I look at to reduce this time?
>>
>> Regards,
>>
>> Anshul
>>
>>
>>


rdd.count with 100 elements taking 1 second to run

2015-04-28 Thread Anshul Singhle
Hi,

I'm running the following code in my cluster (standalone mode) via spark
shell -

val rdd = sc.parallelize(1 to 100)
rdd.count

This takes around 1.2s to run.

Is this expected or am I configuring something wrong?

I'm using about 30 cores with 512MB executor memory

As expected, GC time is negligible. I'm just getting some scheduler delay
and 1s to launch the task

Thanks,

Anshul


Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
Hi,

I'm running short spark jobs on rdds cached in memory. I'm also using a
long running job context. I want to be able to complete my jobs (on the
cached rdd) in under 1 sec.
I'm getting the following job times with about 15 GB of data distributed
across 6 nodes. Each executor has about 20GB of memory available. My
context has about 26 cores in total.

If number of partitions < no of cores -

Some jobs run in 3s others take about 6s -- the time difference can be
explained by GC time.

If number of partitions = no of cores -

All jobs run in 4s. The initial tasks of each stage on every executor take
about 1s.

If partitions > cores -

Jobs take more time. The initial tasks of each stage on every executor take
about 1s. The other tasks run in 45-50 ms each. However, since the initial
tasks again take about 1s each, the total time in this case is about 6s
which is more than the previous case.

Clearly the limiting factor here is the initial set of tasks. For every
case, these tasks take 1s to run, no matter the amount of partitions. Hence
best results are obtained with partitions = cores, because in that case.
every core gets 1 task which takes 1s to run.
In this case, I get 0 GC time. The only explanation is "scheduling delay"
which is about 0.2 - 0.3 seconds. I looked at my task size and result size
and that has no bearing on this delay. Also, I'm not getting the task size
warnings in the logs.

For what I can understand, the first time a task runs on a core, it takes
1s to run. Is this normal?

Is it possible to get sub-second latencies?

Can something be done about the scheduler delay?

What other things can I look at to reduce this time?

Regards,

Anshul


Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Anshul Singhle
Hi firemonk9,

What you're doing looks interesting. Can you share some more details?
Are you running the same spark context for each job, or are you running a
seperate spark context for each job?
Does your system need sharing of rdd's across multiple jobs? If yes, how do
you implement that?
Also what prompted you to run Yarn instead of standalone? Does this give
some performance benefit? Have you evaluated yarn vs mesos?
Also have you looked at spark jobserver by ooyala? It makes doing some if
the stuff I mentioned easier. IIRC it also works with yarn. Definitely
works with Mesos. Heres the link
https://github.com/spark-jobserver/spark-jobserver

Thanks
Anshul
On 23 Apr 2015 20:32, "Dean Wampler"  wrote:

> I strongly recommend spawning a new process for the Spark jobs. Much
> cleaner separation. Your driver program won't be clobbered if the Spark job
> dies, etc. It can even watch for failures and restart.
>
> In the Scala standard library, the sys.process package has classes for
> constructing and interoperating with external processes. Perhaps Java has
> something similar these days?
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Tue, Apr 21, 2015 at 2:15 PM, Steve Loughran 
> wrote:
>
>>
>>  On 21 Apr 2015, at 17:34, Richard Marscher 
>> wrote:
>>
>> - There are System.exit calls built into Spark as of now that could kill
>> your running JVM. We have shadowed some of the most offensive bits within
>> our own application to work around this. You'd likely want to do that or to
>> do your own Spark fork. For example, if the SparkContext can't connect to
>> your cluster master node when it is created, it will System.exit.
>>
>>
>> people can block "errant" System.exit calls by running under a
>> SecurityManager. Less than ideal (and there's a small performance hit) -but
>> possible
>>
>
>


Getting outofmemory errors on spark

2015-04-10 Thread Anshul Singhle
Hi,

I'm reading data stored in S3 and aggregating and storing it in Cassandra
using a spark job.

When I run the job with approx 3Mil records (about 3-4 GB of data) stored
in text files, I get the following error:

 (11529/14925)15/04/10 19:32:43 INFO TaskSetManager: Starting task 11609.0
in stage 4.0 (TID 56384,
spark-slaves-test-cluster-k0b6.c.silver-argon-837.internal,
PROCESS_LOCAL, 134 System information as of Fri Apr 10 19:08:57 UTC
201515/04/10 19:32:58 ERROR ActorSystemImpl: Uncaught fatal error from
thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down
ActorSystem [sparkDriv System load: 0.07 Processes: 155 Usage of /: 48.3%
of 9.81GB Users logged in:
015/04/10 19:32:58 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down
ActorSystem [sparkDriver]
*java.lang.OutOfMemoryError: GC overhead limit exceeded at*
 java.util.Arrays.copyOf(Arrays.java:2367) at java.lang.
AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130) at
java.lang.AbstractStringBuilder.ensureCapacityInternal(
AbstractStringBuilder.java:114) at java.lang.AbstractStringBuilder.append(
AbstractStringBuilder.java:535) at
java.lang.StringBuilder.append(StringBuilder.java:204)
at java.io.ObjectInputStream$BlockDataInputStream.
readUTFSpan(ObjectInputStream.java:3143) at java.io.ObjectInputStream$
BlockDataInputStream.readUTFBody(ObjectInputStream.java:3051) at
java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2864)
at java.io.ObjectInputStream.readUTF(ObjectInputStream.java:1072) at
java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:671) at
java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:830)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at
akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136) at
scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at
akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136) at
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161) at akka.serialization.
Serialization.deserialize(Serialization.scala:98) at
akka.remote.serialization.MessageContainerSerializer.fromBinary(
MessageContainerSerializer.scala:63) at akka.serialization.
Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104) at
scala.util.Try$.apply(Try.scala:161) at akka.serialization.
Serialization.deserialize(Serialization.scala:98) at
akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23) at
akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58) at
akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76) at
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

This error occurs in the final step of my script, when i'm storing the
processed records in Cassandra.

My memory-per-node is 10GB which means that *all my records should fit on
one machine.*

The script is in pyspark and I'm using a cluster with:

   - *Workers:* 5
   - *Cores:* 80 Total, 80 Used
   - *Memory:* 506.5 GB Total, 40.0 GB Used

Here is the relevant part of the code, for reference :

def connectAndSave(partition):
   cluster = Cluster(['10.240.1.17'])
   dbsession = cluster.connect("load_test")
   ret = map(lambda x : saveUserData(x,dbsession),partition)
   dbsession.shutdown()
   cluster.shutdown()



res = sessionsRdd.foreachPartition(lambda partition : connectAndSave(
partition))