[ https://issues.apache.org/jira/browse/SPARK-38858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-38858.
----------------------------------
    Resolution: Not A Problem

> PythonException - socke.timeout: timed out - socket.py line 707
> ---------------------------------------------------------------
>
>                 Key: SPARK-38858
>                 URL: https://issues.apache.org/jira/browse/SPARK-38858
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.2.1
>         Environment: Intel i7 core
> 64 GB RAM (30 GB assigned to Spark executor memory)
> 4 cores
> Windows 11
> Working in Jupyter notebook
> Python version - 3.9.7
> Apache Spark version - 3.2.1
>            Reporter: Alex Veale
>            Priority: Major
>              Labels: test
>         Attachments: socketError - timed out.png, socketpy.png
>
>
> I have a database of about 8 million residential addresses. I perform
> 3 separate cleaning operations on the data using UDFs and regular
> expressions (Python re package). I then create an additional column by
> splitting the 'cleaned' address on commas and taking the element at the
> last index as the suburb. I use this column as the key to join the
> original data frame to a supplementary one containing suburb and country
> pairs. Finally, I create another column containing the final address:
> the 'unsplit clean' address column concatenated with the country column
> pulled in by the join.
> When I try to display the result by calling show, I get the desired
> result if I show only the first 1000 records or fewer. However, if I try
> to show more records, or add a filter to display only the records that
> have been modified, I get a socket timeout error.
> I have tried increasing the socket's send and receive buffer sizes to the
> maximum of 1048576 bytes, increasing the Spark executor heartbeat
> interval (7200s) and the Spark network timeout (3600s), and
> repartitioning the data to 16 and 32 partitions. None of these had any
> impact on the result.
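
For readers following the description, the per-row logic of the pipeline can be sketched in plain Python. This is only an illustration, not the reporter's code: the three cleaning rules, the function names, the sample rows and the suburb-to-country table are all hypothetical (the report does not give the actual regular expressions or data), and a dictionary lookup stands in for the data frame join, assumed here to behave like a left join.

```python
import re

# Hypothetical cleaning rules: the report mentions three regex-based
# cleaning operations but does not say what they are.
def collapse_whitespace(address):
    # Squeeze runs of whitespace into a single space.
    return re.sub(r"\s+", " ", address).strip()

def normalise_commas(address):
    # Trim each comma-separated segment and drop empty ones.
    parts = [part.strip() for part in address.split(",")]
    return ", ".join(part for part in parts if part)

def strip_trailing_punctuation(address):
    # Remove stray punctuation left at the end of the address.
    return re.sub(r"[\s,.;]+$", "", address)

def clean_address(address):
    # Apply the three cleaning operations in sequence.
    for step in (collapse_whitespace, normalise_commas, strip_trailing_punctuation):
        address = step(address)
    return address

def extract_suburb(cleaned):
    # Split on commas and take the element at the last index.
    return cleaned.split(",")[-1].strip()

def build_final_address(raw, suburb_to_country):
    cleaned = clean_address(raw)
    suburb = extract_suburb(cleaned)
    # Assumed left-join behaviour: rows without a matching suburb are kept.
    country = suburb_to_country.get(suburb)
    if country is None:
        return cleaned
    return f"{cleaned}, {country}"

# Made-up sample data, for illustration only.
suburb_to_country = {"Sandton": "South Africa", "Richmond": "Australia"}
rows = ["12  Main Rd ,  Sandton,", "5 High St,,Richmond", "1 Unknown Ave, Nowhere"]
final = [build_final_address(row, suburb_to_country) for row in rows]
```

In PySpark, each of these functions would run as a Python UDF, so every row is serialised between the JVM and a Python worker process; that exchange is where a socket timeout of the kind reported would surface.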