[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681892#comment-16681892
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

Yep, the PySpark job completes fine after we removed the IPv6 references in /etc/hosts.

Thank you both 

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens in a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
>     item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
>     single_obj.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
>     main_df = self.generate_dataframe_main()
>   File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 98, in generate_dataframe_main
>     raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 16, in get_raw_data_with_metadata_and_aggregation
>     main_df = self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
>     return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 136, in get_dataframe_from_binary_value_iteration
>     combine_df = self.get_dataframe_from_binary_value(input_df=input_df, binary_value=count)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 154, in get_dataframe_from_binary_value
>     if len(results_of_filter_df.take(1)) == 0:
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 504, in take
>     return self.limit(num).collect()
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 467, in collect
>     return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 148, in _load_from_socket
>     sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in __init__
>     _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
>     port, auth_secret = sock_info
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         sock = socket.socket(af, socktype, proto)
>         try:
>             sock.settimeout(15)
>             sock.connect(sa)
>         except socket.error:
>             sock.close()
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     # The RDD materialization time is unpredictable, if we set a timeout for socket reading
>     # operation, it will very possibly fail. See SPARK-18281.
>     sock.settimeout(None)
>     sockfile = sock.makefile("rwb", 65536)
>     do_server_auth(sockfile, auth_secret)
>     # The socket will be automatically closed when garbage-collected.
>     return serializer.load_stream(sockfile)
> {code}
> the culprit is this line in lib/spark2/python/pyspark/rdd.py:
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by the socket.AF_UNSPEC third argument to the
> socket.getaddrinfo() call.
> I tried calling a similar socket.getaddrinfo call locally, outside of PySpark,
> and it worked fine.
> RHEL 

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680920#comment-16680920
 ] 

Yuanjian Li commented on SPARK-25958:
-

[~Tagar] Yep, you should also comment out the IPv6 hosts in /etc/hosts, as described
in note 2 here: https://wiki.archlinux.org/index.php/IPv6#Disable_IPv6
[~hyukjin.kwon] Thanks for your reply; I think this is not an issue in Spark itself.
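
For anyone hitting this, a quick sanity check is to confirm the mismatch: the kernel has IPv6 disabled while name resolution still hands out ::1 for localhost. A minimal sketch (assumptions: a Linux host, the /proc path below, and an arbitrary port 8000; getaddrinfo only resolves the name, nothing has to be listening):
{code}
import os
import socket

# When IPv6 is disabled via sysctl this file reads "1"; when the ipv6 module is
# disabled at boot the file may not exist at all.
ipv6_proc = "/proc/sys/net/ipv6/conf/all/disable_ipv6"
ipv6_disabled = (not os.path.exists(ipv6_proc)) or open(ipv6_proc).read().strip() == "1"

# Which address families does "localhost" resolve to (e.g. ::1 from /etc/hosts)?
families = set(res[0] for res in
               socket.getaddrinfo("localhost", 8000, socket.AF_UNSPEC, socket.SOCK_STREAM))

if ipv6_disabled and socket.AF_INET6 in families:
    print("Mismatch: IPv6 is disabled but 'localhost' still resolves to an IPv6 "
          "address - comment out the ::1 line in /etc/hosts")
{code}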

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680811#comment-16680811
 ] 

Hyukjin Kwon commented on SPARK-25958:
--

[~XuanYuan], FWIW, I used that way to fix a similar problem before.

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679261#comment-16679261
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

[~XuanYuan] Interesting; here's our /etc/hosts:
{quote}127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
{quote}

Notice we have IPv6 entries there, even though IPv6 is disabled for us.

I will comment out `::1` and try again. 
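
For reference, a minimal sketch of that edit (based on the file quoted above; only the `::1` line changes):
{code}
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
# ::1       localhost localhost.localdomain localhost6 localhost6.localdomain6
{code}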

Was it the fix for you too? 

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679169#comment-16679169
 ] 

Yuanjian Li commented on SPARK-25958:
-

We also hit this problem internally and fixed it by re-configuring `/etc/hosts`.
If something is wrong with the host config, the `[Errno 97]` error can be
reproduced with this code:
{code}
import socket 
socket.create_connection(('localhost', 8000)) 
{code}
What's currently in `/etc/hosts` on the problematic host?
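
A companion diagnostic, in the same spirit as the snippet above (port 8000 is arbitrary; getaddrinfo only resolves the name, nothing has to be listening), that shows which address families the resolver returns for localhost. An AF_INET6 entry on a host with IPv6 disabled is what produces the Errno 97:
{code}
import socket

for af, socktype, proto, canonname, sa in socket.getaddrinfo(
        "localhost", 8000, socket.AF_UNSPEC, socket.SOCK_STREAM):
    name = {socket.AF_INET: "AF_INET", socket.AF_INET6: "AF_INET6"}.get(af, str(af))
    # An AF_INET6 / ('::1', ...) entry on an IPv6-disabled box is the one that
    # makes socket.socket(af, ...) fail with [Errno 97].
    print("%s %s" % (name, sa))
{code}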

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679263#comment-16679263
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

I just removed the IPv6 reference `::1` from /etc/hosts, and your sample code stopped
reporting "OSError: [Errno 97] Address family not supported by protocol".

Will try to rerun the job now. 

Thank you.

 

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678894#comment-16678894
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

We do have IPv6 disabled on our Hadoop servers, but that failing line in
lib/spark2/python/pyspark/rdd.py just connects to "localhost":

 
{code}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}
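
One possible explanation (my reading of the _load_from_socket code quoted in the issue description above, not a confirmed Spark behaviour): with ::1 mapped to localhost in /etc/hosts, getaddrinfo can return an AF_INET6 entry even though the kernel has IPv6 disabled, and since the socket.socket(af, socktype, proto) call sits before the try/except in that loop, the resulting [Errno 97] is never caught and the loop never falls back to IPv4. A minimal standalone sketch of that failure mode (the port is arbitrary):
{code}
import socket

port = 8000  # arbitrary; the failure happens at socket creation, before any connect

for af, socktype, proto, canonname, sa in socket.getaddrinfo(
        "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
    # In _load_from_socket this call is *outside* the try/except, so on an
    # IPv6-disabled kernel an AF_INET6 entry raises here instead of being skipped:
    sock = socket.socket(af, socktype, proto)   # -> error: [Errno 97] for AF_INET6
    sock.close()
    break
{code}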
