Re: python : Out of memory: Kill process
Hi, I changed my process flow: I now process one file per hour instead of processing everything at the end of the day. This decreased the memory consumption. Regards, Eduardo

On Thu, Mar 26, 2015 at 3:16 PM, Davies Liu dav...@databricks.com wrote:

Could you narrow it down to the step that causes the OOM? Something like:

    log2 = self.sqlContext.jsonFile(path)
    log2.count()
    ...
    out.count()
    ...
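A minimal sketch of that per-hour flow (the hour-partitioned S3 layout is an assumption; adapt the path to however the logs are actually keyed):

    # process one hour of logs at a time instead of the whole day at once,
    # so only one hour's JSON ever needs to be resident in memory
    for hour in range(24):
        path = 's3n://' + AWS_BUCKET + '/' + date + '/%02d/*.log.gz' % hour
        log = sqlContext.jsonFile(path)
        log.registerTempTable('log_test')
        # run the query and persist this hour's result before moving on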
Re: python : Out of memory: Kill process
The last try was without log2.cache() and I am still getting out of memory. I am using the following conf, in case it helps:

    conf = (SparkConf()
            .setAppName("LoadS3")
            .set("spark.executor.memory", "13g")
            .set("spark.driver.memory", "13g")
            .set("spark.driver.maxResultSize", "2g")
            .set("spark.default.parallelism", "200")
            .set("spark.kryoserializer.buffer.mb", "512"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

On Thu, Mar 26, 2015 at 2:29 PM, Davies Liu dav...@databricks.com wrote: Could you try to remove the line `log2.cache()`?
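One knob missing from that conf that targets the Python side specifically is spark.python.worker.memory, which caps how much each Python worker buffers before spilling to disk (the default is 512m). Whether it helps in this particular case is an assumption; a minimal sketch:

    conf = (SparkConf()
            .setAppName("LoadS3")
            # cap Python worker memory so aggregations spill to disk sooner
            .set("spark.python.worker.memory", "1g"))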
Re: python : Out of memory: Kill process
I am running on EC2: 1 master (4 CPU, 15 GB RAM, 2 GB swap) and 2 slaves (4 CPU, 15 GB RAM each). The uncompressed dataset size is 15 GB.

On Thu, Mar 26, 2015 at 10:41 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Davies, I upgraded to 1.3.0 and am still getting Out of Memory. I ran the same code as before; do I need to make any changes?
Re: python : Out of memory: Kill process
Hi Davies, I upgraded to 1.3.0 and am still getting Out of Memory. I ran the same code as before; do I need to make any changes?

On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu dav...@databricks.com wrote: With batchSize = 1, I think it will become even worse. I'd suggest going with 1.3 and getting a taste of the new DataFrame API.
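Note that running the same code as before does not exercise the 1.3 DataFrame API Davies is pointing at. A minimal sketch of what the same query might look like with it (assuming the same log schema; untested):

    df = sqlContext.jsonFile(path)
    pairs = (df.filter((df.provider == provider) & (df.country != ''))
               .select('user', 'tax')
               .rdd
               .map(lambda row: (row.user, row.tax)))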
python : Out of memory: Kill process
Hi guys, I am running the following function with spark-submit and the OS is killing my process:

    def getRdd(self, date, provider):
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        log2 = self.sqlContext.jsonFile(path)
        log2.registerTempTable('log_test')
        log2.cache()
        # assumption: the garbled condition on country was country <> ''
        out = self.sqlContext.sql("SELECT user, tax FROM log_test WHERE provider = '"
                                  + provider + "' AND country <> ''"
                                  ).map(lambda row: (row.user, row.tax))
        print out
        return map(lambda (x, y): (x, list(y)),
                   sorted(out.groupByKey(2000).collect()))

The input dataset has 57 zip files (2 GB). The same process with a smaller dataset completed successfully. Any ideas on how to debug this are welcome. Regards, Eduardo
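The likely pressure point in the function above is groupByKey(2000).collect(): collect() ships every group back to the single driver process, so driver memory grows with the full dataset, which is exactly the kind of growth the OS OOM killer reacts to. A hedged sketch that keeps the grouping on the cluster and writes results out instead (the output path is hypothetical):

    grouped = out.groupByKey(100).mapValues(list)  # ~100 partitions suffice for ~2 GB
    # persist on the cluster instead of collect()ing everything to the driver
    grouped.saveAsTextFile('s3n://' + AWS_BUCKET + '/' + date + '-grouped')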
Re: python : Out of memory: Kill process
Hi Davies, I am running 1.1.0. For now I am following this thread, which recommends setting the batchsize parameter to 1: http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html If this does not work I will install 1.2.1 or 1.3. Regards

On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote:

What's the version of Spark you are running? There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.

[1] https://issues.apache.org/jira/browse/SPARK-6055

He also commented inline on the original code: on log2.cache(): you only visit the table once, so cache does not help here. On groupByKey(2000): 100 partitions (or less) will be enough for a 2 GB dataset.
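For reference, the batchsize knob discussed in that thread is a constructor argument of the Python SparkContext; a minimal sketch:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("LoadS3")
    # batchSize=1 pickles one object at a time instead of batching,
    # trading throughput for smaller serializer buffers
    sc = SparkContext(conf=conf, batchSize=1)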
Re: Play Scala Spark Example
The EC2 version is 1.1.0, and this is my build.sbt:

    libraryDependencies ++= Seq(
      jdbc,
      anorm,
      cache,
      "org.apache.spark" %% "spark-core" % "1.1.0",
      "com.typesafe.akka" %% "akka-actor" % "2.2.3",
      "com.typesafe.akka" %% "akka-slf4j" % "2.2.3",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.1.0",
      "org.apache.spark" %% "spark-sql" % "1.1.0",
      "org.apache.spark" %% "spark-mllib" % "1.1.0"
    )

On Sun, Jan 11, 2015 at 3:01 AM, Akhil Das ak...@sigmoidanalytics.com wrote: What is the Spark version running on the EC2 cluster? From the build file (https://github.com/knoldus/Play-Spark-Scala/blob/master/build.sbt) of your Play application, it seems that it uses Spark 1.0.1. Thanks, Best Regards
Play Scala Spark Example
Hi guys, I am running the following example: https://github.com/knoldus/Play-Spark-Scala on the same machine as the Spark master; the Spark cluster was launched with the EC2 script. I'm stuck with the errors below, any idea how to fix them? Regards, Eduardo

Calling the Play app prints the following exception:

    [error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@ip-10-28-236-122.ec2.internal:47481] -> [akka.tcp://driverPropsFetcher@ip-10-158-18-250.ec2.internal:52575]: Error [Shut down address: akka.tcp://driverPropsFetcher@ip-10-158-18-250.ec2.internal:52575] [
    akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@ip-10-158-18-250.ec2.internal:52575
    Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.

The master receives the Spark application and generates the following stderr log:

    15/01/09 13:31:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@ip-10-158-18-250.ec2.internal:37856]
    15/01/09 13:31:23 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@ip-10-158-18-250.ec2.internal:37856]
    15/01/09 13:31:23 INFO util.Utils: Successfully started service 'sparkExecutor' on port 37856.
    15/01/09 13:31:23 INFO util.AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@ip-10-28-236-122.ec2.internal:47481/user/MapOutputTracker
    15/01/09 13:31:23 INFO util.AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@ip-10-28-236-122.ec2.internal:47481/user/BlockManagerMaster
    15/01/09 13:31:23 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-local-20150109133123-3805
    15/01/09 13:31:23 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/spark-local-20150109133123-b05e
    15/01/09 13:31:23 INFO util.Utils: Successfully started service 'Connection manager for block manager' on port 36936.
    15/01/09 13:31:23 INFO network.ConnectionManager: Bound socket to port 36936 with id = ConnectionManagerId(ip-10-158-18-250.ec2.internal,36936)
    15/01/09 13:31:23 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
    15/01/09 13:31:23 INFO storage.BlockManagerMaster: Trying to register BlockManager
    15/01/09 13:31:23 INFO storage.BlockManagerMaster: Registered BlockManager
    15/01/09 13:31:23 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ip-10-28-236-122.ec2.internal:47481/user/HeartbeatReceiver
    15/01/09 13:31:54 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-158-18-250.ec2.internal:57671] -> [akka.tcp://sparkDriver@ip-10-28-236-122.ec2.internal:47481] disassociated! Shutting down.
Re: EC2 VPC script
I am running the master branch. I finally got it working by changing all occurrences of the *public_dns_name* property to *private_ip_address* in the spark_ec2.py script. My VPC instances always have a null value for *public_dns_name*. Note that my script now only works for VPC instances. Regards, Eduardo

On Sat, Dec 20, 2014 at 7:53 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: What version of the script are you running? What did you see in the EC2 web console when this happened? Sometimes instances just don't come up in a reasonable amount of time and you have to kill and restart the process. Does this always happen, or was it just once? Nick
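A minimal sketch of the idea behind that workaround (get_address is a hypothetical helper; the actual change replaced each use of public_dns_name inline in spark_ec2.py):

    def get_address(instance):
        # VPC-only instances often come up with no public DNS name,
        # so fall back to the boto instance's private IP
        return instance.public_dns_name or instance.private_ip_address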
EC2 VPC script
Hi guys, I ran the following command to launch a new cluster:

    ./spark-ec2 -k test -i test.pem -s 1 --vpc-id vpc-X --subnet-id subnet-X launch vpc_spark

The instances started OK, but the command never finishes, with the following output:

    Setting up security groups...
    Searching for existing cluster vpc_spark...
    Spark AMI: ami-5bb18832
    Launching instances...
    Launched 1 slaves in us-east-1a, regid = r-e9d603c4
    Launched master in us-east-1a, regid = r-89d104a4
    Waiting for cluster to enter 'ssh-ready' state...

Any ideas what happened? Regards, Eduardo
Java client connection
Hi guys, I am starting to work with Spark from Java, and when I run the following code:

    SparkConf conf = new SparkConf()
        .setMaster("spark://10.0.2.20:7077")
        .setAppName("SparkTest");
    JavaSparkContext sc = new JavaSparkContext(conf);

I receive the following errors and the Java process then exits:

    14/11/11 17:58:12 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@10.0.2.20:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@10.0.2.20:7077]
    14/11/11 17:58:18 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@10.0.2.20:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@10.0.2.20:7077]
    14/11/11 17:58:20 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@10.0.2.20:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@10.0.2.20:7077]

My Spark master runs on 10.0.2.20. From pyspark I can work properly. Regards, Eduardo