[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind to public IP on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148705#comment-15148705 ] Christopher Bourez edited comment on SPARK-13317 at 2/16/16 2:54 PM: - I'm trying my best, second time, but when I specify the public IP with {code}SPARK_PUBLIC_IP{code} in spark-env.sh and restart, I get an error during Spark context initialization in spark-shell: {code} 16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1. [... the same warning repeated 16 times ...]
16/02/16 14:40:51 ERROR SparkContext: Error initializing SparkContext. java.net.BindException: {code} Do you have any clue? was (Author: christopher5106): I'm trying my best, second time, but when I specify the public IP with {code}SPARK_PUBLIC_ID{code} I get an error during Spark context initialization in spark-shell: {code} 16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1. [... the same warning repeated 16 times ...] 16/02/16 14:40:51 ERROR SparkContext: Error initializing SparkContext. java.net.BindException: {code} Do you have any clue? > SPARK_LOCAL_IP does not bind to public IP on Slaves > --- > > Key: SPARK-13317 > URL: https://issues.apache.org/jira/browse/SPARK-13317 > Project: Spark > Issue Type: Bug > Components: Deploy, EC2 > Environment: Linux EC2, different VPC >Reporter: Christopher Bourez >Priority: Minor > > SPARK_LOCAL_IP does not bind to the provided IP on slaves. > When launching a job or a spark-shell from a second network, the returned IP > for the slave is still the first IP of the slave. > So the job fails with the message : > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > It is not a question of resources but the driver which cannot connect to the > slave given the wrong IP.
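The BindException is consistent with how these two variables differ: SPARK_LOCAL_IP is a bind address and must belong to a local network interface, while on EC2 the public IP is NAT-ed and is not on any interface; SPARK_PUBLIC_DNS, by contrast, is only an advertised address. A minimal conf/spark-env.sh sketch (the addresses below are placeholders, not taken from this cluster):

```shell
# conf/spark-env.sh -- placeholder addresses, for illustration only

# Bind address: must be assigned to a local interface, so on EC2 it has to be
# the private IP; binding to the NAT-ed public IP fails with a BindException.
export SPARK_LOCAL_IP=172.31.4.179

# Advertised address: used in links handed to other hosts (e.g. the web UI);
# it may be the public hostname even though no local interface carries it.
export SPARK_PUBLIC_DNS=ec2-54-194-99-236.eu-west-1.compute.amazonaws.com
```

This would explain why the web UI (which uses the advertised address) is reachable while binding to the public IP is not.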
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind to public IP on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148705#comment-15148705 ] Christopher Bourez commented on SPARK-13317: I'm trying my best, second time, but when I specify the public IP with {code}SPARK_PUBLIC_ID{code} I get an error during Spark context initialization in spark-shell: {code} 16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1. [... the same warning repeated 16 times ...]
16/02/16 14:40:51 ERROR SparkContext: Error initializing SparkContext. java.net.BindException: {code} Do you have any clue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147157#comment-15147157 ] Christopher Bourez commented on SPARK-13317: Let me give it a check again, but as far as I remember, {{SPARK_PUBLIC_DNS}} is already set by default to the public DNS in the conf of the slave; that's why the web UI works well.
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147138#comment-15147138 ] Christopher Bourez commented on SPARK-13317: To confirm: I stop everything with stop-all.sh, then I set SPARK_LOCAL_IP to the public IP on all instances, and then I run start-all.sh again. Am I missing something?
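The restart sequence described above can be sketched as follows for a standalone cluster (paths assume the layout created by spark-ec2; the IP is a placeholder, and note that on EC2 the bind address must be an address of a local interface, i.e. the private IP, not the public one):

```shell
# On the master (assumes password-less SSH to the slaves, as set up by spark-ec2):
/root/spark/sbin/stop-all.sh

# On each instance, set the bind address in conf/spark-env.sh.
# 172.31.4.179 is a placeholder private IP; binding to the NAT-ed public IP
# is expected to fail with a BindException.
echo 'export SPARK_LOCAL_IP=172.31.4.179' >> /root/spark/conf/spark-env.sh

# Then restart the whole cluster:
/root/spark/sbin/start-all.sh
```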
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146704#comment-15146704 ] Christopher Bourez commented on SPARK-13317: Because installing the notebooks Zeppelin or IScala on the cluster does not make a lot of sense.
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:02 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 
INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} These are private IPs that my MacBook cannot access, and when launching a job an error follows: {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried connecting to the slaves, setting SPARK_LOCAL_IP in the slaves' spark-env.sh, and stopping and restarting all slaves from the master; the Spark master still returns the private IPs of the slaves when I execute a job in client mode (spark-shell or Zeppelin on my MacBook). I think we should be able to work from different networks. Only the UI interfaces seem to be bound to the correct IP.
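For running in client mode from another network, one possible workaround sketch is to set the driver's advertised address explicitly so the executors can reach back to the driver (spark.driver.host is a standard Spark property; the IP below is a placeholder for a publicly reachable address of the laptop). This only addresses the executor-to-driver direction; the driver still needs routable worker addresses for the reverse direction reported in this issue:

```shell
# Run on the laptop; 203.0.113.7 stands in for a public IP of the laptop
# that the EC2 security group allows the workers to connect back to.
spark-shell \
  --master spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 \
  --conf spark.driver.host=203.0.113.7
```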
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:01 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 
INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} which are private IP that my macbook cannot access and when launching a job, an error follow : {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried to connect to the slaves, to set SPARK_LOCAL_IP in the slaves' spark-env.sh, stop and restart all slaves from the master, spark master still returns the private IP of the slaves. was (Author: christopher5106): I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO 
SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerM
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 7:00 PM: - I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 
INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} which are private IP that my macbook cannot access and when launching a job, an error follow : {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tried to connect to the slave, to set SPARK_LOCAL_IP in the slave's spark-env.sh, stop and restart all slaves from the master, spark master still returns the private IP. was (Author: christopher5106): I launch a cluster {code} ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster {code} which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to {code} spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 {code} I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO 
SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint:
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM: - I launch a cluster ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO 
SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058) 16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager {code} which are private IP that my macbook cannot access and when launching a job, an error follow : {code} 16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I tryied to connect to the slave, to set SPARK_LOCAL_IP in the slave's spark-env.sh, stop and restart all slaves from the master, spark master still returns the private IP. Thanks, was (Author: christopher5106): I launch a cluster ./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc If I launch a job in client mode from another network, for example in a Zeppelin notebook on my macbook, which configuration is equivalent to spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 I see in the logs : {code} 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted 
executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores 16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM 16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 w
[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez edited comment on SPARK-13317 at 2/14/16 6:59 PM:
-
I launch a cluster
{code}
./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster
{code}
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc.
If I launch a job in client mode from another network, for example from a Zeppelin notebook on my MacBook, whose configuration is equivalent to
{code}
spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
{code}
I see in the logs:
{code}
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
{code}
These are private IPs that my MacBook cannot access, and when launching a job, an error follows:
{code}
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
{code}
I tried to connect to the slave, to set SPARK_LOCAL_IP in the slave's spark-env.sh, and to stop and restart all slaves from the master; the Spark master still returns the private IP. Thanks,
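For context, the per-slave override being attempted above would be a spark-env.sh fragment of roughly this shape (a sketch only — the addresses are placeholders taken from the report; in Spark's standalone scripts, SPARK_LOCAL_IP controls the interface a daemon binds, while SPARK_PUBLIC_DNS controls the hostname it advertises to other machines):

```shell
# conf/spark-env.sh on a slave -- a sketch with placeholder addresses.
# SPARK_LOCAL_IP: interface the worker binds on this machine.
export SPARK_LOCAL_IP=172.31.4.179
# SPARK_PUBLIC_DNS: hostname the worker advertises to drivers on other networks.
export SPARK_PUBLIC_DNS=ec2-54-194-99-236.eu-west-1.compute.amazonaws.com
```

The report suggests the advertised address (not the bind address) is what the remote driver needs, which is why setting SPARK_LOCAL_IP alone did not change what the master returned.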
[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
[ https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146689#comment-15146689 ] Christopher Bourez commented on SPARK-13317:
I launch a cluster
./ec2/spark-ec2 -k sparkclusterkey -i ~/sparkclusterkey.pem --region=eu-west-1 --copy-aws-credentials --instance-type=m1.large -s 4 --hadoop-major-version=2 launch spark-cluster
which gives me a master at ec2-54-229-16-73.eu-west-1.compute.amazonaws.com and slaves at ec2-54-194-99-236.eu-west-1.compute.amazonaws.com etc.
If I launch a job in client mode from another network, for example from a Zeppelin notebook on my MacBook, whose configuration is equivalent to
spark-shell --master=spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077
I see in the logs:
```
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/0 on worker-20160214185030-172.31.4.179-34425 (172.31.4.179:34425) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/0 on hostPort 172.31.4.179:34425 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/1 on worker-20160214185030-172.31.4.176-47657 (172.31.4.176:47657) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/1 on hostPort 172.31.4.176:47657 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/2 on worker-20160214185031-172.31.4.177-41379 (172.31.4.177:41379) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/2 on hostPort 172.31.4.177:41379 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO AppClient$ClientEndpoint: Executor added: app-20160214185504-/3 on worker-20160214185032-172.31.4.178-34353 (172.31.4.178:34353) with 2 cores
16/02/14 19:55:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160214185504-/3 on hostPort 172.31.4.178:34353 with 2 cores, 1024.0 MB RAM
16/02/14 19:55:04 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:64058 with 511.5 MB RAM, BlockManagerId(driver, 192.168.1.11, 64058)
16/02/14 19:55:04 INFO BlockManagerMaster: Registered BlockManager
```
These are private IPs that my MacBook cannot access, and when launching a job, an error follows:
16/02/14 19:57:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I tried to connect to the slave, to set SPARK_LOCAL_IP in the slave's spark-env.sh, and to stop and restart all slaves from the master; the Spark master still returns the private IP. Thanks,
> SPARK_LOCAL_IP does not bind on Slaves
> --
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
> Issue Type: Bug
> Environment: Linux EC2, different VPC
> Reporter: Christopher Bourez
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP for the slave is still the first IP of the slave.
> So the job fails with the message:
> Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> It is not a question of resources: the driver cannot connect to the slave, given the wrong IP.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
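On the driver side there is a symmetric concern: in client mode the driver also advertises an address that the executors must be able to reach back. A hedged sketch of a launch that makes this explicit — `spark.driver.host` and `spark.driver.port` are standard Spark properties, but the host value below is a placeholder, not something from the report:

```shell
# Sketch: client-mode launch from outside the cluster's network.
# spark.driver.host is the address executors connect back to; it must be
# reachable from the workers' VPC. <laptop-public-ip> is a placeholder.
spark-shell \
  --master spark://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:7077 \
  --conf spark.driver.host=<laptop-public-ip> \
  --conf spark.driver.port=51000
```

Pinning the driver port also makes it possible to open just that port in the security group rather than a wide range.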
[jira] [Created] (SPARK-13317) SPARK_LOCAL_IP does not bind on Slaves
Christopher Bourez created SPARK-13317:
--
Summary: SPARK_LOCAL_IP does not bind on Slaves
Key: SPARK-13317
URL: https://issues.apache.org/jira/browse/SPARK-13317
Project: Spark
Issue Type: Bug
Environment: Linux EC2, different VPC
Reporter: Christopher Bourez

SPARK_LOCAL_IP does not bind to the provided IP on slaves.
When launching a job or a spark-shell from a second network, the returned IP for the slave is still the first IP of the slave.
So the job fails with the message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
It is not a question of resources: the driver cannot connect to the slave, given the wrong IP.
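One quick way to check which address each worker actually registered with is the standalone master's JSON view of its web UI (a sketch — the hostname is the master from the report, and the endpoint path is the master UI's JSON rendering in Spark 1.x):

```shell
# Ask the standalone master which addresses its workers registered with.
# The "workers" array in the response carries each worker's "host" and "port";
# in the report above these come back as private 172.31.x.x addresses.
curl http://ec2-54-229-16-73.eu-west-1.compute.amazonaws.com:8080/json
```

If the hosts listed there are private IPs, a remote driver cannot schedule on them regardless of how many resources are free, which matches the misleading "Initial job has not accepted any resources" warning.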
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144901#comment-15144901 ] Christopher Bourez commented on SPARK-12261:
Sean, how can I get the executor log in local mode? Thanks
> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: windows
> Reporter: zihao
>
> I tried to import a local text (over 100 MB) file via textFile in pyspark; when
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> Traceback (most recent call last):
> File "E:/spark_python/test3.py", line 9, in
> lines.take(5)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
> return f(*a, **kw)
> File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144873#comment-15144873 ] Christopher Bourez commented on SPARK-12261:
Here is what I see when I activate the logs:
16/02/12 18:09:22 ERROR TaskSetManager: Task 0 in stage 5.0 failed 1 times; aborting job
16/02/12 18:09:22 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
16/02/12 18:09:22 INFO TaskSchedulerImpl: Cancelling stage 5
16/02/12 18:09:22 INFO DAGScheduler: ResultStage 5 (runJob at PythonRDD.scala:393) failed in 0,280 s
16/02/12 18:09:22 INFO DAGScheduler: Job 5 failed: runJob at PythonRDD.scala:393, took 0,308529 s
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Documents\c.bourez\Documents\spark-1.5.2-bin-hadoop2.6\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "C:\Documents\c.bourez\Documents\spark-1.5.2-bin-hadoop2.6\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "C:\Documents\c.bourez\Documents\spark-1.5.2-bin-hadoop2.6\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Documents\c.bourez\Documents\spark-1.5.2-bin-hadoop2.6\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "C:\Documents\c.bourez\Documents\spark-1.5.2-bin-hadoop2.6\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost): java.net.SocketException: Connection reset by peer: socket write error
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
    at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
    at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
    at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144844#comment-15144844 ] Christopher Bourez commented on SPARK-12261:
Sean Owen, do you reconsider the status as a Spark issue?
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144793#comment-15144793 ] Christopher Bourez commented on SPARK-12261:
Dear Niall, your solution works very well :) Thank you a lot
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136229#comment-15136229 ] Christopher Bourez commented on SPARK-12261:
I'm still here if you need any more info about how to reproduce the case
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121182#comment-15121182 ] Christopher Bourez commented on SPARK-12261: There is a strange "remove broadcast variable" operation at the end of the 3 third sc.textFile().take(1) method execution; and then the next executions fail. Can there be a link with this problem : https://spark-project.atlassian.net/browse/SPARK-1065 ? > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0
> (TID 0, localhost): java.net.SocketException: Connection reset by peer:
> socket write error
> Then I ran the same code for a small text file; this time .take() worked fine.
> How can I solve this problem?
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121114#comment-15121114 ]

Christopher Bourez commented on SPARK-12261:

I recompiled Spark on Windows but the problem remains. The first three times I launch the textFile command followed by take(1), it works, but after that it no longer does. The memory (between Python and the JVM) does not seem to be released. I tried to re-initialise the context (sc.stop(); del sc; sc = SparkContext('local', 'test')) and to force garbage collection (import gc; gc.collect()), but nothing changes: the memory is not released. It seems OOM errors are quite common on Windows/PySpark.
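The re-initialisation sequence described above can be sketched as follows. This is only a sketch of what the comment reports trying (the `restart_context` helper and its `make_context` factory argument are illustrative, not part of PySpark), and per the report this sequence did not actually release the memory on Windows:

```python
import gc

def restart_context(make_context, sc=None):
    """Stop an existing SparkContext, drop the Python reference, force a
    garbage-collection pass, then build a fresh context via make_context().
    This mirrors the sc.stop(); del sc; gc.collect() sequence from the comment."""
    if sc is not None:
        sc.stop()    # shut down the JVM-side context
        del sc       # drop the Python-side reference
    gc.collect()     # collect any remaining Python-side garbage
    return make_context()

# Usage with PySpark (requires a local Spark install):
# from pyspark import SparkContext
# sc = restart_context(lambda: SparkContext('local', 'test'), sc)
```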
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117345#comment-15117345 ]

Christopher Bourez commented on SPARK-12261:

The suggested solution "Increase driver memory" does not work.
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116907#comment-15116907 ]

Christopher Bourez edited comment on SPARK-12261 at 1/26/16 8:11 AM:

To reproduce, you can follow these steps:
- create an AWS WorkSpace with Windows 7 (which I can share with you if you'd like), Standard instance, 2 GiB RAM
On this instance:
- download Spark (1.5 or 1.6, same problem) with Hadoop 2.6
- install the Java 8 JDK
- download Python 2.7.8
- download the sample file https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv
- launch PySpark: bin\pyspark --master local[1]
- run: sc.textFile("test.csv").take(1) => fails (worked only a very few times)
- run: sc.textFile("test.csv", 2000).take(1) => works

The sample file is 13 MB and was created randomly:

for i in {0..30}; do
  VALUE="$RANDOM"
  for j in {0..6}; do VALUE="$VALUE;$RANDOM"; done
  echo $VALUE >> test.csv
done

Running PySpark with more memory, bin\pyspark --master local[1] --conf spark.driver.memory=3g, displays more memory in http://localhost:4040/executors but does not change the problem.

Full video of the problem: https://s3-eu-west-1.amazonaws.com/christopherbourez/public/video.mov
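For Windows users without bash, the $RANDOM loop above can be approximated in Python. This is a sketch: the `make_csv` helper and its default row/column counts mirror the bash loop (31 lines of 8 values), and `rows` would need to be scaled up until the file reaches the roughly 13 MB size mentioned in the report:

```python
import random

def make_csv(path, rows=31, cols=8):
    """Write `rows` lines of `cols` semicolon-separated random integers in
    [0, 32767], mimicking the bash $RANDOM loop from the comment."""
    with open(path, "w") as f:
        for _ in range(rows):
            f.write(";".join(str(random.randint(0, 32767)) for _ in range(cols)) + "\n")

make_csv("test.csv")
```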
[jira] [Closed] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Bourez closed SPARK-12980.
--

> pyspark crash for large dataset - clone
> ---
>
> Key: SPARK-12980
> URL: https://issues.apache.org/jira/browse/SPARK-12980
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.5.2
> Environment: windows
> Reporter: Christopher Bourez
>
> I installed Spark 1.6 on many different computers.
> On Windows, the PySpark textFile method, followed by take(1), does not work on a file of 13 MB.
> If I set numPartitions to 2000 or take a smaller file, the method works well.
> PySpark is given all the RAM of the machine with --conf spark.driver.memory=5g in local mode.
> On Mac OS, I'm able to launch the exact same program with PySpark with 16 GB RAM on a much bigger file, of 5 GB. Memory is correctly allocated, released, etc.
> On Ubuntu, no trouble; I can also launch a cluster:
> http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
> The error message on Windows is: java.net.SocketException: Connection reset by peer: socket write error
> Configuration is: Java 8 64-bit, Python 2.7.11, on Windows 7 Enterprise SP1 v2.42.01
> What could be the reason for the Windows Spark textFile method to fail?
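The numPartitions workaround described in this issue can be sketched as below. This is a sketch, not a fix: the `read_first_line` helper is illustrative, it assumes a local Spark install, and the import is guarded so the snippet stays loadable where pyspark is absent:

```python
try:
    from pyspark import SparkConf, SparkContext
except ImportError:      # pyspark not installed; the sketch stays importable
    SparkContext = None

def read_first_line(path, partitions=2000):
    """take(1) with an explicit minPartitions, the workaround reported to
    avoid the 'Connection reset by peer' socket write error on Windows."""
    if SparkContext is None:
        return None
    sc = SparkContext(conf=SparkConf().setMaster("local[1]"))
    try:
        # Many small partitions keep each task's result small. The reporter
        # set driver memory on the pyspark command line instead
        # (bin\pyspark --master local[1] --conf spark.driver.memory=3g).
        return sc.textFile(path, partitions).take(1)
    finally:
        sc.stop()
```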
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115433#comment-15115433 ]

Christopher Bourez commented on SPARK-12261:

I think the issue is not resolved.

I installed Spark 1.6 on many different computers. On Windows, the PySpark textFile method, followed by take(1), does not work on a file of 13 MB. If I set numPartitions to 2000 or take a smaller file, the method works well. PySpark is given all the RAM of the machine with --conf spark.driver.memory=5g in local mode.

On Mac OS, I'm able to launch the exact same program with PySpark with 16 GB RAM on a much bigger file, of 5 GB. Memory is correctly allocated, released, etc.

On Ubuntu, no trouble; I can also launch a cluster: http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

The error message on Windows is: java.net.SocketException: Connection reset by peer: socket write error

What could be the reason for the Windows Spark textFile method to fail?
[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Bourez updated SPARK-12980:
---
Description:

I installed Spark 1.6 on many different computers.
On Windows, the PySpark textFile method, followed by take(1), does not work on a file of 13 MB.
If I set numPartitions to 2000 or take a smaller file, the method works well.
PySpark is given all the RAM of the machine with --conf spark.driver.memory=5g in local mode.
On Mac OS, I'm able to launch the exact same program with PySpark with 16 GB RAM on a much bigger file, of 5 GB. Memory is correctly allocated, released, etc.
On Ubuntu, no trouble; I can also launch a cluster:
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
The error message on Windows is: java.net.SocketException: Connection reset by peer: socket write error
Configuration is: Java 8 64-bit, Python 2.7.11, on Windows 7 Enterprise SP1 v2.42.01
What could be the reason for the Windows Spark textFile method to fail?
[jira] [Created] (SPARK-12980) pyspark crash for large dataset - clone
Christopher Bourez created SPARK-12980:
--

Summary: pyspark crash for large dataset - clone
Key: SPARK-12980
URL: https://issues.apache.org/jira/browse/SPARK-12980
Project: Spark
Issue Type: Bug
Affects Versions: 1.5.2
Environment: windows
Reporter: Christopher Bourez