peay created SPARK-21551:
----------------------------

             Summary: pyspark's collect fails when getaddrinfo is too slow
                 Key: SPARK-21551
                 URL: https://issues.apache.org/jira/browse/SPARK-21551
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0
            Reporter: peay
            Priority: Critical


PySpark's {{RDD.collect}}, {{DataFrame.toLocalIterator}}, and 
{{DataFrame.collect}} all work by starting an ephemeral server in the driver 
and having the Python process connect to it to download the data.

All three are implemented along the lines of:

{code}
port = self._jdf.collectToPython()
return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
{code}

The server has **a hardcoded timeout of 3 seconds** 
(https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
 -- i.e., the Python process has 3 seconds to connect to it from the very 
moment the driver server starts.
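
To illustrate the race, here is a minimal, self-contained sketch in plain Python. It is not Spark's code: {{serve_once}} and {{fetch}} are hypothetical names, and the server only mimics the behaviour described above (an ephemeral port, an accept timeout, a single client). A client that is delayed past the timeout finds the server already gone.

```python
import socket
import threading
import time


def serve_once(timeout_s):
    """Start an ephemeral server that waits at most timeout_s for one client.

    Hypothetical stand-in for the driver-side server; not Spark's code.
    Returns the port, the server thread, and a dict that receives the outcome.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    server.settimeout(timeout_s)   # the clock starts now, not when the client is ready
    port = server.getsockname()[1]
    result = {}

    def run():
        try:
            conn, _ = server.accept()
            conn.sendall(b"data")
            conn.close()
            result["status"] = "served"
        except socket.timeout:
            result["status"] = "timed out"
        finally:
            server.close()

    thread = threading.Thread(target=run)
    thread.start()
    return port, thread, result


def fetch(port, delay_s):
    """Connect after delay_s seconds; the delay stands in for a slow getaddrinfo."""
    time.sleep(delay_s)
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=1) as sock:
            return sock.recv(16)
    except OSError:
        return None  # server already gave up and closed its socket
```

With a short timeout such as {{serve_once(0.3)}}, {{fetch(port, 0.0)}} receives the payload, while {{fetch(port, 1.0)}} returns {{None}} because the server has already timed out.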

In general, that seems fine, but I have been encountering frequent timeouts 
leading to {{Exception: could not open socket}}.

After investigating a bit, it turns out that {{_load_from_socket}} makes a call 
to {{getaddrinfo}}:

{code}
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC,
                                  socket.SOCK_STREAM):
        ...  # connect (elided here)
{code}

I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
take only a couple of milliseconds, about 10% of them take between 2 and 10 
seconds, causing about 10% of jobs to fail. I don't think we can always 
expect {{getaddrinfo}} to return instantly. More generally, the Python process 
may occasionally pause for a couple of seconds, which may not leave it enough 
time to connect to the server.
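
For anyone who wants to check their own machine, the following is a rough measurement sketch. The helper names {{time_getaddrinfo}} and {{fraction_over}} are made up for this example, and the numbers it produces depend entirely on the local resolver configuration.

```python
import socket
import time


def time_getaddrinfo(host="localhost", port=0, samples=20):
    """Return the wall-clock duration, in seconds, of each getaddrinfo call."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        # Same address family and socket type as the call in _load_from_socket.
        socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        durations.append(time.perf_counter() - start)
    return durations


def fraction_over(durations, budget_s):
    """Fraction of calls that took longer than budget_s seconds."""
    if not durations:
        return 0.0
    return sum(d > budget_s for d in durations) / len(durations)
```

For example, {{fraction_over(time_getaddrinfo(samples=200), 2.0)}} estimates how often a lookup alone would use up most of a 3-second budget.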

Especially since the server timeout is hardcoded, I think it would be best to 
set a rather generous value (15 seconds?) to avoid such situations.

A {{getaddrinfo}}-specific fix could avoid making the call every single time, 
or make it before starting up the driver server.
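
One possible shape for the first option, as a sketch only: resolve "localhost" once, cache the result, and substitute the real port on each connection. {{resolve_localhost}} and {{connect_to_local_port}} are hypothetical names and not part of PySpark; caching assumes the resolution of "localhost" does not change during the life of the process.

```python
import functools
import socket


@functools.lru_cache(maxsize=None)
def resolve_localhost():
    """Resolve "localhost" once per process (hypothetical helper).

    Port 0 is a placeholder; the real port is filled in per connection.
    """
    infos = socket.getaddrinfo("localhost", 0, socket.AF_UNSPEC,
                               socket.SOCK_STREAM)
    return tuple(infos)


def connect_to_local_port(port, timeout_s=3):
    """Try each cached address in turn and return the first connected socket."""
    last_error = None
    for af, socktype, proto, _canonname, sockaddr in resolve_localhost():
        # sockaddr is (host, port) for IPv4 and
        # (host, port, flowinfo, scope_id) for IPv6; swap in the real port.
        target = (sockaddr[0], port) + tuple(sockaddr[2:])
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(timeout_s)
            sock.connect(target)
            return sock
        except OSError as exc:
            last_error = exc
            if sock is not None:
                sock.close()
    raise Exception("could not open socket") from last_error
```

Calling {{resolve_localhost()}} before asking the driver to start its server would also cover the second option, since the slow lookup then happens before the timeout clock starts.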
 
cc SPARK-677 [~davies]



