
add support for separate GC log files for different executors

2014-11-05 Thread haitao .yao
Hey, guys. Here's my problem:
While using standalone mode, I always pass the following args to the
executor:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc
-Xloggc:/tmp/spark.executor.gc.log

But as we know, the HotSpot JVM does not support variable substitution in
the -Xloggc parameter, which causes the GC log to be overwritten by later
executors.

May I create a new patch that adds variable substitution before the worker
forks a new executor, to avoid the GC log being overwritten?

First thoughts: configure the executor JVM args like this:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc
-Xloggc:/tmp/spark.executor.%applicationId%.gc.log

This would replace %applicationId% with the current application ID
and pass the final args to the java command line.

We could support more variables, such as executorId.
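As a rough sketch of the substitution step (the object and placeholder names
here are illustrative, not Spark's actual Worker code), the worker could
rewrite the placeholders just before building the executor command line:

```scala
// Illustrative sketch only: substitute placeholders in executor JVM args
// before the worker forks the executor process. The helper and placeholder
// names are hypothetical, not part of Spark.
object ExecutorArgSubstitution {
  def substitute(args: Seq[String], appId: String, executorId: String): Seq[String] =
    args.map(
      _.replace("%applicationId%", appId)
       .replace("%executorId%", executorId))

  def main(a: Array[String]): Unit = {
    val opts = Seq(
      "-XX:+PrintGCDetails",
      "-Xloggc:/tmp/spark.executor.%applicationId%.%executorId%.gc.log")
    // Each executor would get its own GC log path after substitution.
    println(substitute(opts, "app-20141105", "0").mkString(" "))
  }
}
```

Since the substitution happens in the worker rather than in the JVM, it
sidesteps HotSpot's lack of variable expansion in -Xloggc entirely.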

Thanks.
-- 
haitao.yao


Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Done, JIRA link: https://issues.apache.org/jira/browse/SPARK-4241

Thanks.

2014-11-05 10:58 GMT+08:00 Nicholas Chammas :

> Oh, I can see that region via boto as well. Perhaps the doc is indeed out
> of date.
>
> Do you mind opening a JIRA issue
> <https://issues.apache.org/jira/secure/Dashboard.jspa> to track this
> request? I can do it if you've never opened a JIRA issue before.
>
> Nick
>
> On Tue, Nov 4, 2014 at 9:03 PM, haitao .yao  wrote:
>
>> I'm afraid not. We have been using EC2 instances in the cn-north-1 region
>> for a while, and the latest version of boto has added the region
>> cn-north-1. Here's the output:
>> >>> from boto import ec2
>> >>> ec2.regions()
>> [RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,
>> RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2,
>> RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1,
>> RegionInfo:eu-central-1, RegionInfo:sa-east-1]
>> >>>
>>
>> I do think the doc is out of date.
>>
>>
>>
>> 2014-11-05 9:45 GMT+08:00 Nicholas Chammas :
>>
>>>
>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>>>
>>> cn-north-1 is not a supported region for EC2, as far as I can tell.
>>> There may be other AWS services that can use that region, but spark-ec2
>>> relies on EC2.
>>>
>>> Nick
>>>
>>> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:
>>>
>>>> Hi,
>>>>    Amazon AWS has started to provide service in mainland China; the
>>>> region name is cn-north-1. But the script Spark provides, spark_ec2.py,
>>>> queries the AMI ID from https://github.com/mesos/spark-ec2/tree/v4/ami-list
>>>> and there's no AMI information for the cn-north-1 region.
>>>>    Can anybody add the AMI information and update the repo:
>>>> https://github.com/mesos/spark-ec2.git ?
>>>>
>>>>Thanks.
>>>>
>>>> --
>>>> haitao.yao
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> haitao.yao
>>
>>
>>
>>
>


-- 
haitao.yao


Re: Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Thanks, it worked.


2014-05-30 1:53 GMT+08:00 Matei Zaharia :

> That hash map is just a list of where each task ran, it’s not the actual
> data. How many map and reduce tasks do you have? Maybe you need to give the
> driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _,
> 100) to use only 100 tasks).
>
> Matei
>
> On May 29, 2014, at 2:03 AM, haitao .yao  wrote:
>
> > Hi,
> >
> >  I used 1g of memory for the driver Java process and got an OOM error on
> the driver side before reduceByKey. After analyzing the heap dump, the
> biggest object is org.apache.spark.MapStatus, which occupied over 900 MB
> of memory.
> >
> > Here's my question:
> >
> >
> > 1. Are there any optimization switches I can tune to avoid this? I have
> already enabled output compression with spark.io.compression.codec.
> >
> > 2. Why do the workers send all the data back to the driver to run
> reduceByKey? With the current implementation, using reduceByKey on TBs of
> data would be a disaster for the driver. Maybe I'm wrong about my
> assumption of the Spark implementation.
> >
> >
> > And here's my code snippet:
> >
> >
> > ```
> >
> > val cntNew = spark.accumulator(0)
> >
> > val cntOld = spark.accumulator(0)
> >
> > val cntErr = spark.accumulator(0)
> >
> >
> > val sequenceFileUrl = args(0)
> >
> > val seq = spark.sequenceFile[Text, BytesWritable](sequenceFileUrl)
> >
> > val stat = seq.map(pair => convertData(
> >
> >   pair._2, cntNew, cntOld, cntErr
> >
> > )).reduceByKey(_ + _)
> >
> > stat.saveAsSequenceFile(args(1))
> >
> > ```
> >
> >
> > Thanks.
> >
> >
> > --
> >
> > haitao.yao@China
>
>


-- 
haitao.yao@Beijing
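Matei's point above can be made concrete with a back-of-envelope estimate:
the driver keeps one MapStatus per map task, each recording an output size
per reduce partition, so the metadata grows roughly as mapTasks ×
reduceTasks. The per-entry byte cost below is an assumption for
illustration only, not Spark's actual MapStatus layout:

```scala
// Rough, illustrative estimate of driver-side shuffle metadata.
// Each map task reports one size entry per reduce partition;
// bytesPerEntry is an assumed cost, not Spark's real in-memory layout.
object ShuffleMetadataEstimate {
  val bytesPerEntry = 8L // assumption for illustration

  def estimateBytes(mapTasks: Long, reduceTasks: Long): Long =
    mapTasks * reduceTasks * bytesPerEntry

  def main(args: Array[String]): Unit = {
    // Many map tasks x many reduce tasks quickly dominates a 1g driver heap;
    // capping reduce tasks, e.g. reduceByKey(_ + _, 100), shrinks it sharply.
    println(estimateBytes(50000, 50000)) // 20000000000 bytes, ~20 GB
    println(estimateBytes(50000, 100))   // 40000000 bytes, ~40 MB
  }
}
```

This is why lowering the number of reduce tasks fixed the OOM even though
the actual data never passes through the driver.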

