Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-12-02 Thread Andy Davidson
Hi Ted and Felix



From:  Ted Yu <yuzhih...@gmail.com>
Date:  Sunday, November 29, 2015 at 10:37 AM
To:  Andrew Davidson <a...@santacruzintegration.com>
Cc:  Felix Cheung <felixcheun...@hotmail.com>, "user @spark"
<user@spark.apache.org>
Subject:  Re: possible bug spark/python/pyspark/rdd.py portable_hash()

> I think you should file a bug.


Please feel free to update this bug report

https://issues.apache.org/jira/browse/SPARK-12100


Andy


Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Andy Davidson
Hi Felix and Ted

This is how I am starting Spark

Should I file a bug?

Andy


export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*
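
One way to unblock python3 on the driver (a sketch only, not the official fix; it assumes the executors get the same value through spark-env.sh on each node, otherwise portable_hash() will still disagree across the cluster) is to pin the seed before launching:

# Sketch: the value 123 is arbitrary but must match on every node.
export PYTHONHASHSEED=123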


From:  Felix Cheung <felixcheun...@hotmail.com>
Date:  Saturday, November 28, 2015 at 12:11 AM
To:  Ted Yu <yuzhih...@gmail.com>
Cc:  Andrew Davidson <a...@santacruzintegration.com>, "user @spark"
<user@spark.apache.org>
Subject:  Re: possible bug spark/python/pyspark/rdd.py portable_hash()

> Ah, it's there in spark-submit and pyspark.
> Seems like it should be added for spark_ec2
> 
> _____
> From: Ted Yu <yuzhih...@gmail.com>
> Sent: Friday, November 27, 2015 11:50 AM
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark
> <user@spark.apache.org>
> 
> ec2/spark-ec2 calls ./ec2/spark_ec2.py
> 
> I don't see PYTHONHASHSEED defined in any of these scripts.
> 
> Andy reported this for ec2 cluster.
> 
> I think a JIRA should be opened.
> 
> On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung
> <felixcheun...@hotmail.com> wrote:
> 
>> May I ask how you are starting Spark?
>> It looks like PYTHONHASHSEED is being set:
>> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>> 
>> Date: Thu, 26 Nov 2015 11:30:09 -0800
>> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
>> From: a...@santacruzintegration.com
>> To: user@spark.apache.org
>> 
>> I am using spark-1.5.1-bin-hadoop2.6. I used
>> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster
>> and configured spark-env to use python3. I get an exception
>> 'Randomness of hash of string should be disabled via PYTHONHASHSEED'.
>> Is there any reason rdd.py should not just set PYTHONHASHSEED ?
>> 
>> Should I file a bug?
>> 
>> Kind regards
>> 
>> Andy
>> 
>> details
>> 
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>> 
>> Example does not work out of the box
>> 
>> subtract(other, numPartitions=None)
>> Return each value in self that is not contained in other.
>> 
>> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>> >>> y = sc.parallelize([("a", 3), ("c", None)])
>> >>> sorted(x.subtract(y).collect())
>> [('a', 1), ('b', 4), ('b', 5)]
>> 
>> It raises
>> 
>> if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>>     raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>> 
>> The following script fixes the problem
>> 
>> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate
>> Exception 'Randomness of hash of string should be disabled via
>> PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >>
>> /root/spark/conf/spark-env.sh
>> 
>> sudo pssh -i -h /root/spark-ec2/slaves cp
>> /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date
>> "+%Y-%m-%d:%H:%M"`
>> 
>> for i in `cat slaves`; do sudo scp spark-env.sh
>> root@$i:/root/spark/conf/spark-env.sh; done




Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Ted Yu
I think you should file a bug. 

> On Nov 29, 2015, at 9:48 AM, Andy Davidson <a...@santacruzintegration.com> 
> wrote:
> 
> Hi Felix and Ted
> 
> This is how I am starting Spark
> 
> Should I file a bug?
> 
> Andy
> 
> 
> export PYSPARK_PYTHON=python3.4
> export PYSPARK_DRIVER_PYTHON=python3.4
> export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  
> 
> $SPARK_ROOT/bin/pyspark \
> --master $MASTER_URL \
> --total-executor-cores $numCores \
> --driver-memory 2G \
> --executor-memory 2G \
> $extraPkgs \
> $*
> 
> From: Felix Cheung <felixcheun...@hotmail.com>
> Date: Saturday, November 28, 2015 at 12:11 AM
> To: Ted Yu <yuzhih...@gmail.com>
> Cc: Andrew Davidson <a...@santacruzintegration.com>, "user @spark" 
> <user@spark.apache.org>
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> 
> Ah, it's there in spark-submit and pyspark.
> Seems like it should be added for spark_ec2
> 
> 
> _____
> From: Ted Yu <yuzhih...@gmail.com>
> Sent: Friday, November 27, 2015 11:50 AM
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark 
> <user@spark.apache.org>
> 
> 
> ec2/spark-ec2 calls ./ec2/spark_ec2.py 
> 
> I don't see PYTHONHASHSEED defined in any of these scripts.
> 
> Andy reported this for ec2 cluster.
> 
> I think a JIRA should be opened.
> 
> 
>> On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <felixcheun...@hotmail.com> 
>> wrote: 
>> May I ask how you are starting Spark? 
>> It looks like PYTHONHASHSEED is being set: 
>> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED 
>> 
>>   
>> Date: Thu, 26 Nov 2015 11:30:09 -0800 
>> Subject: possible bug spark/python/pyspark/rdd.py portable_hash() 
>> From: a...@santacruzintegration.com 
>> To: user@spark.apache.org 
>> 
>> I am using spark-1.5.1-bin-hadoop2.6. I used 
>> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
>> spark-env to use python3. I get an exception 'Randomness of hash of 
>> string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py 
>> should not just set PYTHONHASHSEED ?
>> 
>>  Should I file a bug?
>> 
>> Kind regards
>> 
>> Andy
>> 
>> details
>> 
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>> 
>> Example does not work out of the box
>> 
>> subtract(other, numPartitions=None)
>> Return each value in self that is not contained in other.
>> 
>> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>> >>> y = sc.parallelize([("a", 3), ("c", None)])
>> >>> sorted(x.subtract(y).collect())
>> [('a', 1), ('b', 4), ('b', 5)]
>> It raises 
>> 
>> if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>>     raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>> 
>> 
>> The following script fixes the problem 
>> 
>> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate 
>> Exception 'Randomness of hash of string should be disabled via 
>> PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> 
>> /root/spark/conf/spark-env.sh 
>> 
>>  sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh 
>> /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"` 
>> 
>> for i in `cat slaves`; do sudo scp spark-env.sh 
>> root@$i:/root/spark/conf/spark-env.sh; done


RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Felix Cheung
Actually, upon closer look, PYTHONHASHSEED should be set (in the worker) when 
you create a SparkContext
 
https://github.com/apache/spark/blob/master/python/pyspark/context.py#L166
 
And it should also be set from spark-submit or pyspark.
 
Can you check sys.version and os.environ.get("PYTHONHASHSEED")?
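
For reference, a minimal check along those lines (a sketch, assuming a running 
pyspark shell where sc is already defined; the executor-side values are the 
ones portable_hash() depends on):

import os
import sys

print(sys.version)
print("driver PYTHONHASHSEED:", os.environ.get("PYTHONHASHSEED"))

# Ask each executor's Python worker which value it sees.
seeds = (sc.parallelize(range(4), 4)
           .map(lambda _: os.environ.get("PYTHONHASHSEED"))
           .collect())
print("executor PYTHONHASHSEED:", seeds)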
 
Date: Sun, 29 Nov 2015 09:48:19 -0800
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: felixcheun...@hotmail.com; yuzhih...@gmail.com
CC: user@spark.apache.org

Hi Felix and Ted
This is how I am starting Spark
Should I file a bug?
Andy

export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  
$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*
From:  Felix Cheung <felixcheun...@hotmail.com>
Date:  Saturday, November 28, 2015 at 12:11 AM
To:  Ted Yu <yuzhih...@gmail.com>
Cc:  Andrew Davidson <a...@santacruzintegration.com>, "user @spark" 
<user@spark.apache.org>
Subject:  Re: possible bug spark/python/pyspark/rdd.py portable_hash()


Ah, it's there in spark-submit and pyspark.
Seems like it should be added for spark_ec2



_
From: Ted Yu <yuzhih...@gmail.com>
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark 
<user@spark.apache.org>


ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.

Andy reported this for ec2 cluster.

I think a JIRA should be opened.
   On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED

   Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster 
and configured spark-env to use python3. I get an exception 
'Randomness of hash of string should be disabled via PYTHONHASHSEED'. 
Is there any reason rdd.py should not just set PYTHONHASHSEED ?

 Should I file a bug? 
 Kind regards 
 Andy 
 details 
 
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
 
Example does not work out of the box

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
 
The following script fixes the problem

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate 
Exception 'Randomness of hash of string should be disabled via 
PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp 
/root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date 
"+%Y-%m-%d:%H:%M"`

for i in `cat slaves`; do sudo scp spark-env.sh 
root@$i:/root/spark/conf/spark-env.sh; done

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-28 Thread Felix Cheung
Ah, it's there in spark-submit and pyspark.
Seems like it should be added for spark_ec2



_
From: Ted Yu <yuzhih...@gmail.com>
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark 
<user@spark.apache.org>


ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.

Andy reported this for ec2 cluster.

I think a JIRA should be opened.
  
   On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED

   Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster 
and configured spark-env to use python3. I get an exception 
'Randomness of hash of string should be disabled via PYTHONHASHSEED'. 
Is there any reason rdd.py should not just set PYTHONHASHSEED ?

 Should I file a bug? 
 Kind regards 
 Andy 
 details 
 
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
 
Example does not work out of the box

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
 
The following script fixes the problem

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate 
Exception 'Randomness of hash of string should be disabled via 
PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp 
/root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date 
"+%Y-%m-%d:%H:%M"`

for i in `cat slaves`; do sudo scp spark-env.sh 
root@$i:/root/spark/conf/spark-env.sh; done

RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Felix Cheung
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set: 
https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED

 
Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I get an exception 'Randomness of hash of string 
should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not 
just set PYTHONHASHSEED ?
Should I file a bug?
Kind regards
Andy
details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
Example does not work out of the box
subtract(other, numPartitions=None)
Return each value in self that is not contained in other.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises 
if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")


The following script fixes the problem 
sudo printf "\n# set PYTHONHASHSEED so python3 will not generate 
Exception 'Randomness of hash of string should be disabled via 
PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh 
/root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

for i in `cat slaves`; do sudo scp spark-env.sh 
root@$i:/root/spark/conf/spark-env.sh; done
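
The same workaround as one script (a sketch assuming the standard spark-ec2 
layout, run as root on the master; the value 123 is arbitrary but must be 
identical on every node):

#!/bin/bash
# Back up spark-env.sh, pin PYTHONHASHSEED, and push the file to every
# slave listed by spark-ec2.
set -e
CONF=/root/spark/conf/spark-env.sh
cp "$CONF" "$CONF-$(date +%Y-%m-%d:%H:%M)"
printf '\nexport PYTHONHASHSEED=123\n' >> "$CONF"
for host in $(cat /root/spark-ec2/slaves); do
    scp "$CONF" "root@$host:$CONF"
done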


  

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Ted Yu
ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.

Andy reported this for ec2 cluster.

I think a JIRA should be opened.
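
A quick way to double-check (a sketch, run from the root of a Spark checkout):

# No hits under ec2/ would confirm the variable is never set for
# spark-ec2 clusters, wherever else it may appear.
grep -rn "PYTHONHASHSEED" ec2/ bin/ python/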


On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung 
wrote:

> May I ask how you are starting Spark?
> It looks like PYTHONHASHSEED is being set:
> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>
>
> --
> Date: Thu, 26 Nov 2015 11:30:09 -0800
> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
> From: a...@santacruzintegration.com
> To: user@spark.apache.org
>
> I am using spark-1.5.1-bin-hadoop2.6. I used
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and
> configured spark-env to use python3. I get an exception 'Randomness of
> hash of string should be disabled via PYTHONHASHSEED'. Is there any
> reason rdd.py should not just set PYTHONHASHSEED ?
>
> Should I file a bug?
>
> Kind regards
>
> Andy
>
> details
>
>
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>
> Example does not work out of the box
>
> subtract(other, numPartitions=None)
>
> Return each value in self that is not contained in other.
>
> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
> >>> y = sc.parallelize([("a", 3), ("c", None)])
> >>> sorted(x.subtract(y).collect())
> [('a', 1), ('b', 4), ('b', 5)]
>
> It raises
>
> if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>     raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>
>
>
> The following script fixes the problem
>
> sudo printf "\n# set PYTHONHASHSEED so python3 will not generate
> Exception 'Randomness of hash of string should be disabled via
> PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>
> sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh
> /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>
> for i in `cat slaves`; do sudo scp spark-env.sh
> root@$i:/root/spark/conf/spark-env.sh; done