[jira] [Commented] (SPARK-10635) pyspark - running on a different host
    [ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908738#comment-14908738 ]

Ben Duffield commented on SPARK-10635:
--------------------------------------

OK, good flag that there are other places where this would need to be considered. How open would you be to a PR which addresses this? I.e. sure, it's an assumption now, but could we move away from that?

> pyspark - running on a different host
> -------------------------------------
>
>                 Key: SPARK-10635
>                 URL: https://issues.apache.org/jira/browse/SPARK-10635
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g.
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark
> launch the driver itself, but instead construct the SparkContext with the
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:
> {code}
> def _load_from_socket(port, serializer):
>     sock = socket.socket()
>     sock.settimeout(3)
>     try:
>         sock.connect((host, port))
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10635) pyspark - running on a different host
    [ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802950#comment-14802950 ]

Ben Duffield edited comment on SPARK-10635 at 9/17/15 2:04 PM:
---------------------------------------------------------------

Curious as to why you believe this to be hard to support? We've been using this at many places for quite a long time without issue prior to 1.4.

I guess there's the question of how to plumb the correct host to _load_from_socket. I'm also not aware of the reasons for changing the ServerSocket in PythonRDD's serveIterator to listen explicitly on localhost. I believe these are the only two places that need to change, though.

The alternative is for us to put proxying into the application itself (the application acting as driver) and then monkeypatch pyspark as before, but this isn't ideal.


was (Author: bavardage):
Curious as to why you believe this to be hard to support? We've been using this at many places for quite a long time without issue prior to 1.4.

I guess there's the question of how to plumb the correct host to _load_from_socket. I'm also not aware of the reasons for changing the ServerSocket in PythonRDD's serveIterator to listen explicitly on localhost. I believe these are the only two places that need to change, though.

The alternative is for us to put proxying into the application itself (the application acting as driver) and then monkeypatch pyspark as before, but this isn't ideal.
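
To make the proxying alternative mentioned in this comment concrete, here is a minimal sketch in plain Python sockets. It is not taken from the thread. It assumes the proxy runs inside the application acting as driver, listens on an address the remote pyspark process can reach, and relays the bytes served on the driver's localhost-only port. The names `relay_once` and `start_proxy`, and every address used, are hypothetical.

```python
import socket
import threading


def relay_once(listen_sock, target_host, target_port, bufsize=65536):
    """Accept one client and stream everything the target sends back to it."""
    client, _ = listen_sock.accept()
    listen_sock.close()  # single-use: one result stream per listening port
    with client, socket.create_connection((target_host, target_port)) as upstream:
        while True:
            chunk = upstream.recv(bufsize)
            if not chunk:  # target closed its side: end of stream
                break
            client.sendall(chunk)


def start_proxy(bind_host, target_host, target_port):
    """Listen on bind_host (ephemeral port) and relay one connection.

    Returns the port that a remote client should connect to.
    """
    listen_sock = socket.socket()
    listen_sock.bind((bind_host, 0))  # port 0 lets the OS pick a free port
    listen_sock.listen(1)
    port = listen_sock.getsockname()[1]
    threading.Thread(
        target=relay_once,
        args=(listen_sock, target_host, target_port),
        daemon=True,
    ).start()
    return port
```

The relay is one-directional on purpose, since the reading side in the issue description only consumes a stream. A real deployment would also need to communicate the proxy's port to the pyspark side and restrict who may connect.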
[jira] [Commented] (SPARK-10635) pyspark - running on a different host
    [ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802950#comment-14802950 ]

Ben Duffield commented on SPARK-10635:
--------------------------------------

Curious as to why you believe this to be hard to support? We've been using this at many places for quite a long time without issue prior to 1.4.

I guess there's the question of how to plumb the correct host to _load_from_socket. I'm also not aware of the reasons for changing the ServerSocket in PythonRDD's serveIterator to listen explicitly on localhost. I believe these are the only two places that need to change, though.

The alternative is for us to put proxying into the application itself (the application acting as driver) and then monkeypatch pyspark as before, but this isn't ideal.
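
As an illustration of why a server socket bound explicitly to localhost matters here, the following sketch uses Python's standard `socket` module rather than the Scala `ServerSocket` discussed in the comment. The helper name `listen_on` is hypothetical. A socket bound to the loopback address only accepts connections from the same host, while one bound to the wildcard address can also be reached from other hosts.

```python
import socket


def listen_on(bind_host, backlog=1):
    """Open a TCP server socket on an ephemeral port bound to bind_host."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_host, 0))  # port 0 lets the OS pick a free port
    srv.listen(backlog)
    return srv


# Bound to the loopback address: only clients on the same host can connect.
local_only = listen_on("127.0.0.1")

# Bound to the wildcard address: clients on other hosts can connect as well,
# subject to firewall rules.
any_host = listen_on("0.0.0.0")

print(local_only.getsockname())
print(any_host.getsockname())

local_only.close()
any_host.close()
```

Binding to loopback is the more conservative default, which may be one reason for the change the comment asks about. Supporting a remote pyspark process would require both a reachable bind address on the serving side and the matching host on the connecting side.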
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
     [ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Duffield updated SPARK-10635:
---------------------------------
    Description:
At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:
{code}
def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{code}

  was:
At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:
{code}
def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{/code}
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
     [ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Duffield updated SPARK-10635:
---------------------------------
    Description:
At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:
{code}
def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{code}

  was:
At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:
{code}
def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{code}
[jira] [Updated] (SPARK-10635) pyspark - running on a different host
     [ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Duffield updated SPARK-10635:
---------------------------------
    Description:
At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:
{code}
def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
{/code}

  was:
At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:

def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
[jira] [Created] (SPARK-10635) pyspark - running on a different host
Ben Duffield created SPARK-10635:

             Summary: pyspark - running on a different host
                 Key: SPARK-10635
                 URL: https://issues.apache.org/jira/browse/SPARK-10635
             Project: Spark
          Issue Type: Improvement
            Reporter: Ben Duffield

At various points we assume we only ever talk to a driver on the same host.
e.g. https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. do not let pyspark launch the driver itself, but instead construct the SparkContext with the gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:

def _load_from_socket(port, serializer):
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
pyspark.rdd._load_from_socket = _load_from_socket
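
The snippet in the description relies on a `host` variable defined elsewhere in the reporter's application. The following is a hedged sketch of the same monkeypatch with the host made explicit. It assumes a PySpark version in which `pyspark.rdd._load_from_socket` takes `(port, serializer)`, as in the description, and other versions may differ. The factory name `make_load_from_socket` and the hostname in the usage comment are hypothetical.

```python
import socket


def make_load_from_socket(host, timeout=3):
    """Build a replacement for pyspark.rdd._load_from_socket that connects to host."""

    def _load_from_socket(port, serializer):
        sock = socket.socket()
        sock.settimeout(timeout)
        try:
            # Connect to the driver's host instead of assuming localhost.
            sock.connect((host, port))
            rf = sock.makefile("rb", 65536)
            for item in serializer.load_stream(rf):
                yield item
        finally:
            sock.close()

    return _load_from_socket


# Hypothetical usage, assuming pyspark is importable and the driver runs on
# "driver.example.internal" (a made-up hostname):
#
#   import pyspark.rdd
#   pyspark.rdd._load_from_socket = make_load_from_socket("driver.example.internal")
```

Passing the host through a closure avoids a module-level global, but it only changes the connecting side. The serving side still has to listen on an address that the remote host can reach.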