add support for separate GC log files for different executors
Hey, guys. Here's my problem: while using standalone mode, I always pass the following args to the executor:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/tmp/spark.executor.gc.log

But as we know, the HotSpot JVM does not support variable substitution in the -Xloggc parameter, so the GC log gets overwritten by later executors. May I create a new patch that adds variable substitution before the worker forks a new executor, to avoid the GC log being overwritten?

First thoughts: configure the executor JVM args like this:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/tmp/spark.executor.%applicationId%.gc.log

This would replace %applicationId% with the current application ID and pass the final args to the java command line. We could support more variables, such as executorId.

Thanks.

-- haitao.yao
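A minimal sketch of the substitution step being proposed. The %applicationId% / %executorId% placeholder syntax and the helper name are just illustrations from this thread, not an existing Spark API:

```scala
// Sketch: expand placeholders in the executor's JVM options before the
// worker builds the java command line. Placeholder syntax is hypothetical.
object GcLogOpts {
  def substitute(opts: Seq[String], appId: String, executorId: String): Seq[String] =
    opts.map(_.replace("%applicationId%", appId)
              .replace("%executorId%", executorId))

  def main(args: Array[String]): Unit = {
    val opts = Seq(
      "-XX:+PrintGCDetails",
      "-Xloggc:/tmp/spark.executor.%applicationId%.%executorId%.gc.log")
    // Each executor gets its own log path, so nothing is overwritten.
    println(substitute(opts, "app-20141105", "3").mkString(" "))
  }
}
```

Since the worker already knows the application ID and executor ID at fork time, the substitution could happen right where the command line is assembled.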
Re: spark_ec2.py for AWS region: cn-north-1, China
Done, JIRA link: https://issues.apache.org/jira/browse/SPARK-4241

Thanks.

2014-11-05 10:58 GMT+08:00 Nicholas Chammas:
> Oh, I can see that region via boto as well. Perhaps the doc is indeed out of date.
>
> Do you mind opening a JIRA issue <https://issues.apache.org/jira/secure/Dashboard.jspa> to track this request? I can do it if you've never opened a JIRA issue before.
>
> Nick
>
> On Tue, Nov 4, 2014 at 9:03 PM, haitao .yao wrote:
>> I'm afraid not. We have been using EC2 instances in the cn-north-1 region for a while, and the latest version of boto has added the region cn-north-1. Here's the output:
>>
>> >>> from boto import ec2
>> >>> ec2.regions()
>> [RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1, RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2, RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1, RegionInfo:eu-central-1, RegionInfo:sa-east-1]
>>
>> I do think the doc is out of date.
>>
>> 2014-11-05 9:45 GMT+08:00 Nicholas Chammas:
>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>>>
>>> cn-north-1 is not a supported region for EC2, as far as I can tell. There may be other AWS services that can use that region, but spark-ec2 relies on EC2.
>>>
>>> Nick
>>>
>>> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao wrote:
>>>> Hi,
>>>> Amazon AWS started to provide service for mainland China; the region name is cn-north-1. But the script Spark provides, spark_ec2.py, queries the AMI id from https://github.com/mesos/spark-ec2/tree/v4/ami-list, and there is no AMI information for the cn-north-1 region.
>>>> Can anybody add the AMI information and update the repo https://github.com/mesos/spark-ec2.git ?
>>>>
>>>> Thanks.
>>>>
>>>> -- haitao.yao

-- haitao.yao
Re: spark_ec2.py for AWS region: cn-north-1, China
I'm afraid not. We have been using EC2 instances in the cn-north-1 region for a while, and the latest version of boto has added the region cn-north-1. Here's the output:

>>> from boto import ec2
>>> ec2.regions()
[RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1, RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2, RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1, RegionInfo:eu-central-1, RegionInfo:sa-east-1]

I do think the doc is out of date.

2014-11-05 9:45 GMT+08:00 Nicholas Chammas:
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>
> cn-north-1 is not a supported region for EC2, as far as I can tell. There may be other AWS services that can use that region, but spark-ec2 relies on EC2.
>
> Nick
>
> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao wrote:
>> Hi,
>> Amazon AWS started to provide service for mainland China; the region name is cn-north-1. But the script Spark provides, spark_ec2.py, queries the AMI id from https://github.com/mesos/spark-ec2/tree/v4/ami-list, and there is no AMI information for the cn-north-1 region.
>> Can anybody add the AMI information and update the repo https://github.com/mesos/spark-ec2.git ?
>>
>> Thanks.
>>
>> -- haitao.yao

-- haitao.yao
spark_ec2.py for AWS region: cn-north-1, China
Hi,

Amazon AWS started to provide service for mainland China; the region name is cn-north-1. But the script Spark provides, spark_ec2.py, queries the AMI id from https://github.com/mesos/spark-ec2/tree/v4/ami-list, and there is no AMI information for the cn-north-1 region.

Can anybody add the AMI information and update the repo https://github.com/mesos/spark-ec2.git ?

Thanks.

-- haitao.yao
Re: Driver OOM while using reduceByKey
Thanks, it worked.

2014-05-30 1:53 GMT+08:00 Matei Zaharia:
> That hash map is just a list of where each task ran, it's not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks).
>
> Matei
>
> On May 29, 2014, at 2:03 AM, haitao .yao wrote:
>> Hi,
>> I used 1g of memory for the driver java process and got an OOM error on the driver side before reduceByKey. After analyzing the heap dump, the biggest object is org.apache.spark.MapStatus, which occupied over 900MB of memory.
>>
>> Here are my questions:
>>
>> 1. Are there any optimization switches I can tune to avoid this? I have already enabled compression on output with spark.io.compression.codec.
>>
>> 2. Why do the workers send all the data back to the driver to run reduceByKey? With the current implementation, if I use reduceByKey on TBs of data, that will be a disaster for the driver. Maybe I'm wrong about my assumption of the Spark implementation.
>>
>> And here's my code snippet:
>>
>> ```
>> val cntNew = spark.accumulator(0)
>> val cntOld = spark.accumulator(0)
>> val cntErr = spark.accumulator(0)
>>
>> val sequenceFileUrl = args(0)
>> val seq = spark.sequenceFile[Text, BytesWritable](sequenceFileUrl)
>> val stat = seq.map(pair => convertData(pair._2, cntNew, cntOld, cntErr))
>>   .reduceByKey(_ + _)
>> stat.saveAsSequenceFile(args(1))
>> ```
>>
>> Thanks.
>>
>> -- haitao.yao@China

-- haitao.yao@Beijing
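A back-of-envelope way to see why the suggestion helps: the driver keeps one MapStatus per map task, and each one records an output size for every reduce partition, so the bookkeeping grows roughly with mapTasks x reducePartitions. This helper only illustrates that arithmetic; it is not Spark code:

```scala
// Rough estimate of driver-side shuffle bookkeeping: one size entry per
// (map task, reduce partition) pair. Illustrative model only.
object MapStatusEstimate {
  def entries(mapTasks: Long, reducePartitions: Long): Long =
    mapTasks * reducePartitions

  def main(args: Array[String]): Unit = {
    // 10,000 maps x 10,000 reduces = 100 million entries to track...
    println(entries(10000L, 10000L))
    // ...while capping reduces at 100 (reduceByKey(_ + _, 100)) cuts
    // the bookkeeping a hundredfold.
    println(entries(10000L, 100L))
  }
}
```

So giving the driver more memory treats the symptom, while reducing the number of reduce tasks shrinks the bookkeeping itself.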
Driver OOM while using reduceByKey
Hi,

I used 1g of memory for the driver java process and got an OOM error on the driver side before reduceByKey. After analyzing the heap dump, the biggest object is org.apache.spark.MapStatus, which occupied over 900MB of memory.

Here are my questions:

1. Are there any optimization switches I can tune to avoid this? I have already enabled compression on output with spark.io.compression.codec.

2. Why do the workers send all the data back to the driver to run reduceByKey? With the current implementation, if I use reduceByKey on TBs of data, that will be a disaster for the driver. Maybe I'm wrong about my assumption of the Spark implementation.

And here's my code snippet:

```
val cntNew = spark.accumulator(0)
val cntOld = spark.accumulator(0)
val cntErr = spark.accumulator(0)

val sequenceFileUrl = args(0)
val seq = spark.sequenceFile[Text, BytesWritable](sequenceFileUrl)
val stat = seq.map(pair => convertData(pair._2, cntNew, cntOld, cntErr))
  .reduceByKey(_ + _)
stat.saveAsSequenceFile(args(1))
```

Thanks.

-- haitao.yao@China