Hi, thank you for your advice. It might well work, but to specify my problem a bit more: think of my data as one generated item per parsed Wikipedia page. I get this generator from the parser, and I don't want to save its output to storage; I want to apply parallelize directly and create an RDD from it. Based on your advice, I'm now thinking that something like batching the generator, creating several RDDs, and then applying union on them might be the way to go.
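The batching idea could be sketched like this (a minimal sketch: `chunk_size`, `parsed_pages`, and the commented-out PySpark calls are illustrative assumptions, not tested code):

```python
from itertools import islice

def chunks(generator, chunk_size):
    """Yield successive lists of up to chunk_size items from a generator,
    so each batch is small enough to hold in driver memory."""
    it = iter(generator)
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            return
        yield batch

# Hypothetical PySpark usage (assumes an existing SparkContext `sc`
# and a page generator `parsed_pages`; chunk size is arbitrary):
#   rdds = [sc.parallelize(batch) for batch in chunks(parsed_pages, 1000)]
#   pages_rdd = sc.union(rdds)

# The chunking itself works on any iterable:
print(list(chunks(iter(range(7)), 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Note that this still pulls each batch through the driver one chunk at a time, so only one chunk needs to fit in driver RAM at once, but the materialized RDDs still occupy cluster memory.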
Originally I was thinking of calling the parsing function in flatMap on the RDD loaded from the XML file, but then I unfortunately hit this problem: http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim so now I am trying to parse the XML on the master node and put it directly into the RDD. ______________________________________________________________
From: Davies Liu <dav...@databricks.com> To: <jan.zi...@centrum.cz> Date: 06.10.2014 18:09 Subject: Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()
sc.parallelize() distributes a list of data into a number of partitions, but a generator cannot be cut and serialized automatically. If you can partition your generator, then you can try this:

sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x))

For example, if you want to generate xrange(M) where M is huge:

sc.parallelize(range(N), N).flatMap(lambda x: xrange(M / N * x, M / N * (x + 1)))

On Mon, Oct 6, 2014 at 7:16 AM, <jan.zi...@centrum.cz> wrote:
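The slicing arithmetic in that suggestion can be sanity-checked without a cluster. Here the flatMap is simulated with a plain list comprehension, and M and N are illustrative values (M should be divisible by N for the slices to cover the range exactly):

```python
M = 1000  # total number of items to generate (illustrative)
N = 10    # number of partitions (illustrative; divides M evenly)

def generate_partition(i):
    """The slice of range(M) that partition i would produce inside flatMap."""
    return range(M // N * i, M // N * (i + 1))

# Simulates sc.parallelize(range(N), N).flatMap(generate_partition):
result = [x for i in range(N) for x in generate_partition(i)]
assert result == list(range(M))  # the N slices cover range(M) exactly
```

The point of the pattern is that only the N partition indices are serialized from the driver; each executor materializes its own slice of the data locally.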
Hi, I would like to ask whether it is possible to use a generator that generates more data than the combined RAM of all the machines as the input to sc.parallelize(generator), i.e. to create an RDD this way. When I create an RDD by sc.textFile(file), where the file is even bigger than the data produced by the generator, everything works fine. Unfortunately, I need to use sc.parallelize(generator), and that makes the OS kill the Spark job. I get only this log before the job is killed:

14/10/06 13:34:16 INFO spark.SecurityManager: Changing view acls to: root,
14/10/06 13:34:16 INFO spark.SecurityManager: Changing modify acls to: root,
14/10/06 13:34:16 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, )
14/10/06 13:34:16 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/06 13:34:16 INFO Remoting: Starting remoting
14/10/06 13:34:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-172-31-25-197.ec2.internal:41016]
14/10/06 13:34:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@ip-172-31-25-197.ec2.internal:41016]
14/10/06 13:34:17 INFO util.Utils: Successfully started service 'sparkDriver' on port 41016.
14/10/06 13:34:17 INFO spark.SparkEnv: Registering MapOutputTracker
14/10/06 13:34:17 INFO spark.SparkEnv: Registering BlockManagerMaster
14/10/06 13:34:17 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-local-20141006133417-821e
14/10/06 13:34:17 INFO util.Utils: Successfully started service 'Connection manager for block manager' on port 42438.
14/10/06 13:34:17 INFO network.ConnectionManager: Bound socket to port 42438 with id = ConnectionManagerId(ip-172-31-25-197.ec2.internal,42438)
14/10/06 13:34:17 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
14/10/06 13:34:17 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/10/06 13:34:17 INFO storage.BlockManagerMasterActor: Registering block manager ip-172-31-25-197.ec2.internal:42438 with 267.3 MB RAM
14/10/06 13:34:17 INFO storage.BlockManagerMaster: Registered BlockManager
14/10/06 13:34:17 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-c4edda1c-0949-490d-8ff3-10993727c523
14/10/06 13:34:17 INFO spark.HttpServer: Starting HTTP Server
14/10/06 13:34:17 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/10/06 13:34:17 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:44768
14/10/06 13:34:17 INFO util.Utils: Successfully started service 'HTTP file server' on port 44768.
14/10/06 13:34:17 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/10/06 13:34:17 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/10/06 13:34:17 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
14/10/06 13:34:17 INFO ui.SparkUI: Started SparkUI at http://ec2-54-164-72-236.compute-1.amazonaws.com:4040
14/10/06 13:34:18 INFO util.Utils: Copying /root/generator_test.py to /tmp/spark-0bafac0c-6779-4910-b095-0ede226fa3ce/generator_test.py
14/10/06 13:34:18 INFO spark.SparkContext: Added file file:/root/generator_test.py at http://172.31.25.197:44768/files/generator_test.py with timestamp 1412602458065
14/10/06 13:34:18 INFO client.AppClient$ClientActor: Connecting to master spark://ec2-54-164-72-236.compute-1.amazonaws.com:7077...
14/10/06 13:34:18 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
14/10/06 13:34:18 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141006133418-0046
14/10/06 13:34:18 INFO client.AppClient$ClientActor: Executor added: app-20141006133418-0046/0 on worker-20141005074620-ip-172-31-30-40.ec2.internal-49979 (ip-172-31-30-40.ec2.internal:49979) with 1 cores
14/10/06 13:34:18 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141006133418-0046/0 on hostPort ip-172-31-30-40.ec2.internal:49979 with 1 cores, 512.0 MB RAM
14/10/06 13:34:18 INFO client.AppClient$ClientActor: Executor updated: app-20141006133418-0046/0 is now RUNNING
14/10/06 13:34:21 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-172-31-30-40.ec2.internal:50877/user/Executor#-1621441852] with ID 0
14/10/06 13:34:21 INFO storage.BlockManagerMasterActor: Registering block manager ip-172-31-30-40.ec2.internal:34460 with 267.3 MB RAM

Thank you in advance for any advice or suggestion.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org