I have what I think is a pretty simple task, and one that works pretty well
with Celery <http://www.celeryproject.org/>.  I wanted to see how easy it
would be to port to Spark, since I already run a Mesos cluster for something
else.  But I had a pretty hard time getting Spark configured to be as fast,
and I'd like to hear from other Spark users whether or not I'm doing it
wrong.


It's super simple.  Inputs are URLs; outputs are the responses from making
a POST to each URL.  URLs are < 100 bytes and the responses are about 16 KB.
In Celery, I would just use urllib2.Request.  The only caveat was that I had
to specify a timeout to prevent my workers from getting stuck for too long.
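
For reference, the Celery side is essentially just this (the broker URL,
timeout, and request body are illustrative, not my exact code):

    import urllib2

    from celery import Celery

    app = Celery('poster', broker='redis://localhost:6379/0')  # broker URL is illustrative

    @app.task
    def post_url(url, body):
        # POST the body to the URL; the timeout keeps a worker from hanging
        # forever on a slow or dead endpoint.
        req = urllib2.Request(url, data=body)
        return urllib2.urlopen(req, timeout=10).read()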

-----

Now I'm trying to do the same thing with pyspark.  I have a bunch of
questions about Spark, so I'm going to spell out my task step by step.


Here's my job config:
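
(Trimmed to the relevant bits; the master URL, app name, and memory value
below are placeholders rather than my real settings.)

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("mesos://master.example.com:5050")  # placeholder master URL
            .setAppName("http-poster")                     # placeholder app name
            .set("spark.executor.memory", "2g"))           # the setting I'm unsure about
    sc = SparkContext(conf=conf)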

How do I configure spark.executor.memory for a cluster where each machine
has a different amount of RAM made available to Mesos?  I'm doing this on
EC2; some instances have about 3.75 GB and others about 15 GB.  Is there
a single executor per machine, or can I specify a memory allocation per task?


I split my inputs into several smaller files and put them in S3 using s3://.
I tried both a single large file and many small files.  It seemed like
Spark was trying to load each file as a whole, which is why I use smaller
files now.  Is that conclusion incorrect?  I load the files using:
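
(Roughly like this; the bucket and path are placeholders:)

    # Each line of each input file is one URL to POST to.
    urls = sc.textFile("s3://mybucket/urls/")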

Is there a recommended input file size for S3?


Next, I use mapPartitions to make the requests and yield the results.
This is the part I had a hard time getting to perform like Celery.  I've
tried two things to work around the blocking urllib2 requests: one is
gevent and the other is multiprocessing.  I set up gevent by doing this at
the top of my script:
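
(That is, the usual monkey-patching before any other imports:)

    # Patch the standard library so urllib2's blocking socket calls
    # cooperate with the gevent event loop.
    from gevent import monkey
    monkey.patch_all()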

I run into the issue described in  pyspark + gevent
<http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCACNhcBE7OO47rMTMmRVCcCdhzUSRoSXRLNZ56ZKmQ+g=3qd...@mail.gmail.com%3E>
, but I just hit Ctrl-C and let it continue.  The throughput barely
increases, though, which is weird since it does in my Celery implementation.
I have no understanding of how gevent interacts with pyspark, so unless
someone else knows, I didn't want to delve into that too much and instead
switched to multiprocessing.  When I use a multiprocessing Pool, I get this
error:


So gevent didn't work as expected and multiprocessing just broke.  But I
just tried moving the gevent monkey_patch into the mapper function itself,
and *throughput like woah*.  So my understanding is that only the mapper
function is evaluated on the workers, and the top-level monkey_patch wasn't
being executed there?  What else can't I assume about the pyspark scripts I
want to execute that I could assume when using something like
multiprocessing?  Also, I'm not even sure I can trust gevent if the monkey
patch doesn't happen almost immediately.  Can I in pyspark?
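
For anyone who wants to reproduce this, the version that finally gave good
throughput looks roughly like the sketch below; the pool size, timeout, and
empty POST body are illustrative, and post_partition is just what I'm
calling the mapper here.

    def post_partition(url_iter):
        # Monkey-patch *inside* the mapper so the patch actually runs on the
        # worker process; patching at the top of the script didn't seem to
        # take effect there.
        from gevent import monkey
        monkey.patch_all()
        import gevent.pool
        import urllib2

        def post(url):
            req = urllib2.Request(url, data='')      # empty POST body, illustrative
            return urllib2.urlopen(req, timeout=10).read()

        # Make the requests for this partition concurrently instead of one
        # blocking call at a time.
        pool = gevent.pool.Pool(50)                  # concurrency level is illustrative
        for response in pool.imap_unordered(post, url_iter):
            yield response

    responses = urls.mapPartitions(post_partition)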

And once again, I'd like to hear how others solve this common use case.
It seems like almost a hello-world example for Spark.



