[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-11-07 Thread David Vogelbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679253#comment-16679253
 ] 

David Vogelbacher commented on SPARK-24437:
---

[~eyalfa] There might be hundreds of cached dataframes at the same time (they 
do get unpersisted after a while, but only once they are very unlikely to be 
used again).
The thing here is that the cached dataframes are generally quite small 
(~100,000 rows), but they may be produced by a series of joins. So at times the 
broadcasted data for a specific cached dataframe is bigger than the cached 
dataframe itself.

This might be a bit of an unusual use case. I am aware of the workarounds you 
proposed, but they would significantly hurt performance (disabling broadcast 
joins, for example, is not something I want to do). 

In this specific case (where the cached dataframes are smaller than the 
broadcasted data), it would really be desirable to clean up the broadcasted 
data rather than have it stick around on the driver until the dataframe gets 
uncached.
I still don't quite understand why garbage collecting the broadcasted item 
would lead to failures when executing the plan later (in case parts of the 
cached data got evicted), as executing the plan could always just recompute the 
broadcasted variable? [~mgaido]
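
To make the scenario concrete, here is a rough sketch of the pattern (a minimal, 
hypothetical example; the table sizes and explicit broadcast() hints are made up 
and are not our actual code):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# A small fact table joined against a couple of broadcast-hinted dimensions.
# The cached result is tiny, but the hashed relations built for the broadcast
# sides stay referenced on the driver until the dataframe is uncached.
facts = spark.range(100000).withColumnRenamed("id", "key")
dim_a = spark.range(50000).withColumnRenamed("id", "key")
dim_b = spark.range(50000).withColumnRenamed("id", "key")

cached_df = (facts
             .join(broadcast(dim_a), "key")
             .join(broadcast(dim_b), "key")
             .cache())
cached_df.count()      # materializes the cache and the broadcast relations

# ... hundreds of dataframes like this later ...
cached_df.unpersist()  # only now can the broadcast data be cleaned up
{code}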


> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (Spark Thrift Server).
> With each query execution that requires a broadcast join, an 
> UnsafeHashedRelation is registered for cleanup with the ContextCleaner. However, 
> the reference to the UnsafeHashedRelation is held by some other collection and 
> never becomes eligible for GC, so the ContextCleaner is not able to clean it up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25970) Add Instrumentation to PrefixSpan

2018-11-07 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-25970:


 Summary: Add Instrumentation to PrefixSpan
 Key: SPARK-25970
 URL: https://issues.apache.org/jira/browse/SPARK-25970
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


Add Instrumentation to PrefixSpan



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25676) Refactor BenchmarkWideTable to use main method

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679312#comment-16679312
 ] 

Apache Spark commented on SPARK-25676:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22970

> Refactor BenchmarkWideTable to use main method
> --
>
> Key: SPARK-25676
> URL: https://issues.apache.org/jira/browse/SPARK-25676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25955) Porting JSON test for CSV functions

2018-11-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25955.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22960
[https://github.com/apache/spark/pull/22960]

> Porting JSON test for CSV functions
> ---
>
> Key: SPARK-25955
> URL: https://issues.apache.org/jira/browse/SPARK-25955
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV 
> functions from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678894#comment-16678894
 ] 

Ruslan Dautkhanov edited comment on SPARK-25958 at 11/7/18 10:35 PM:
-

We do have IPv6 disabled on our Hadoop servers, but the failing line in 
*lib/spark2/python/pyspark/rdd.py* just connects to "localhost":

 
{code:java}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}
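
For illustration, here is a small standalone sketch (hypothetical, not the rdd.py 
code) of what I suspect happens: when "localhost" still resolves to ::1 but IPv6 
is disabled in the kernel, getaddrinfo() can return an AF_INET6 entry, and 
creating a socket for that family fails with errno 97 (EAFNOSUPPORT), which 
matches the traceback. Skipping unsupported families avoids the failure:

{code:python}
import errno
import socket

port = 12345  # arbitrary port, just to exercise getaddrinfo()
for af, socktype, proto, canonname, sa in socket.getaddrinfo(
        "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
    try:
        sock = socket.socket(af, socktype, proto)
    except socket.error as e:
        if e.errno == errno.EAFNOSUPPORT:  # Errno 97 on Linux
            print("skipping address family %s for %s" % (af, sa))
            continue
        raise
    print("address family %s (%s) is usable" % (af, sa))
    sock.close()
    break
{code}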


was (Author: tagar):
We do have IPv6 disabled on our Hadoop servers, but the failing line in 
lib/spark2/python/pyspark/rdd.py just connects to "localhost":

 
{code:java}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> 

[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678894#comment-16678894
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

We do have IPv6 disabled on our Hadoop servers, but the failing line in 
lib/spark2/python/pyspark/rdd.py just connects to "localhost":

 
{code:java}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> return serializer.load_stream(sockfile)
> {code}
> the culprit is in lib/spark2/python/pyspark/rdd.py, in this line:
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused 

[jira] [Updated] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25958:
--
Issue Type: Bug  (was: New Feature)

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> return serializer.load_stream(sockfile)
> {code}
> the culprit is in lib/spark2/python/pyspark/rdd.py, in this line:
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by the socket.AF_UNSPEC third argument to the 
> socket.getaddrinfo() call.
> I tried a similar socket.getaddrinfo() call locally outside of PySpark 
> and it worked fine.
> RHEL 7.5.



--
This message was sent by Atlassian JIRA

[jira] [Assigned] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25956:


Assignee: (was: Apache Spark)

> Make Scala 2.12 as default Scala version in Spark 3.0
> -
>
> Key: SPARK-25956
> URL: https://issues.apache.org/jira/browse/SPARK-25956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>
> Scala 2.11 is unlikely to support Java 11 
> (https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence, 
> we will make Scala 2.12 the default Scala version in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678848#comment-16678848
 ] 

Apache Spark commented on SPARK-25956:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22967

> Make Scala 2.12 as default Scala version in Spark 3.0
> -
>
> Key: SPARK-25956
> URL: https://issues.apache.org/jira/browse/SPARK-25956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>
> Scala 2.11 is unlikely to support Java 11 
> (https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence, 
> we will make Scala 2.12 the default Scala version in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25956:


Assignee: Apache Spark

> Make Scala 2.12 as default Scala version in Spark 3.0
> -
>
> Key: SPARK-25956
> URL: https://issues.apache.org/jira/browse/SPARK-25956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> Scala 2.11 is unlikely to support Java 11 
> (https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence, 
> we will make Scala 2.12 the default Scala version in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25925) Spark 2.3.1 retrieves all partitions from Hive Metastore by default

2018-11-07 Thread Adam Budde (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678861#comment-16678861
 ] 

Adam Budde commented on SPARK-25925:


[~axenol] I would definitely support making the documentation clearer in this 
instance.
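
For anyone else hitting this, here is a minimal sketch of the workaround described 
in the issue below (the database/table names are made up, and NEVER_INFER is only 
safe when the Parquet schema is already all lower-case and matches the metastore 
schema):

{code:python}
from pyspark.sql import SparkSession

# Keep convertMetastoreParquet and metastorePartitionPruning enabled, but skip
# the schema inference that triggers a full partition fetch from the metastore.
spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
         .config("spark.sql.hive.metastorePartitionPruning", "true")
         .getOrCreate())

# With pruning intact, explain() no longer needs to list every partition.
spark.sql("SELECT * FROM mydb.big_partitioned_table WHERE dt = '2018-11-07'").explain()
{code}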

> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> ---
>
> Key: SPARK-25925
> URL: https://issues.apache.org/jira/browse/SPARK-25925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Alex Ivanov
>Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by 
> default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate 
> side-effect. I realize it's set to INFER_AND_SAVE for a reason, namely 
> https://issues.apache.org/jira/browse/SPARK-19611; however, it also causes 
> an issue.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all partitions to be retrieved for the table from Hive 
> Metastore. This is a problem because even running *explain* on a simple query 
> on a table with thousands of partitions seems to hang, and is very difficult 
> to debug.
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> see that it works, and call it a day, thereby forgoing the benefits of using 
> Parquet support in Spark directly. In our experience, this causes significant 
> slow-downs on at least some queries.
> This Jira is mostly to document the issue, even if it cannot be addressed, so 
> that people who inevitably run into this behavior can see the resolution, 
> which is changing the parameter to *NEVER_INFER*, provided there are no 
> issues with Parquet-Hive schema compatibility, i.e. all of the schema is in 
> lower-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678849#comment-16678849
 ] 

Apache Spark commented on SPARK-25956:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22967

> Make Scala 2.12 as default Scala version in Spark 3.0
> -
>
> Key: SPARK-25956
> URL: https://issues.apache.org/jira/browse/SPARK-25956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>
> Scala 2.11 is unlikely to support Java 11 
> (https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence, 
> we will make Scala 2.12 the default Scala version in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25897) Cannot run k8s integration tests in sbt

2018-11-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25897.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22909
[https://github.com/apache/spark/pull/22909]

> Cannot run k8s integration tests in sbt
> ---
>
> Key: SPARK-25897
> URL: https://issues.apache.org/jira/browse/SPARK-25897
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the k8s integration tests use maven, which makes it a little 
> awkward to run them if you use sbt for your day-to-day development. We should 
> hook them up to the sbt build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678728#comment-16678728
 ] 

Ryan Blue commented on SPARK-25966:
---

[~andrioni], were there any failed tasks or executors in the job that wrote 
this file? It looks to me like a problem in closing the file or with an 
executor dying before finishing a file. If that happened and the data wasn't 
cleaned up, then it could lead to this problem.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> 

[jira] [Commented] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678700#comment-16678700
 ] 

kevin yu commented on SPARK-25967:
--

Hello Victor: I see; per the SQL:2003 standard, the TRIM function removes spaces 
by default. 

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only leading and trailing whitespace. Removing 
> tabs as well would let the function be used for the same use case, e.g. 
> artifact cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread Victor Sahin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678657#comment-16678657
 ] 

Victor Sahin edited comment on SPARK-25967 at 11/7/18 7:13 PM:
---

That is not very intuitive to specify manually, especially for the default use 
case where, say, someone is mining text and wants to get rid of any and all 
blank characters surrounding a substring.

Also, regexp_replace is better suited for manually trimming only specific items, 
IMO.


was (Author: vsahin):
That is not very intuitive to specify manually, especially for the default use 
case where, say, someone is mining text and wants to get rid of any and all 
blank characters surrounding a substring.

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only leading and trailing whitespace. Removing 
> tabs as well would let the function be used for the same use case, e.g. 
> artifact cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread Victor Sahin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678657#comment-16678657
 ] 

Victor Sahin commented on SPARK-25967:
--

That is not very intuitive to specify manually, especially for the default use 
case where, say, someone is mining text and wants to get rid of any and all 
blank characters surrounding a substring.

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only leading and trailing whitespace. Removing 
> tabs as well would let the function be used for the same use case, e.g. 
> artifact cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678614#comment-16678614
 ] 

kevin yu commented on SPARK-25967:
--

Hello Victor: You can specify tabs as the characters to remove in the trim 
function; the detailed syntax is here:

https://spark.apache.org/docs/2.3.0/api/sql/#trim
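
A quick sketch of that syntax from PySpark (the column name and sample value are 
made up):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("\t  hello world \t",)], ["s"])

# SQL-style TRIM with an explicit set of trim characters (space and tab).
df.selectExpr("trim(BOTH ' \t' FROM s) AS trimmed").show(truncate=False)
{code}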

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only leading and trailing whitespace. Removing 
> tabs as well would let the function be used for the same use case, e.g. 
> artifact cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678595#comment-16678595
 ] 

Cheng Lian commented on SPARK-25966:


[~andrioni], I just realized that I might have misunderstood this part of your 
statement:
{quote}
This job used to work fine with Spark 2.2.1
[...]
{quote}
I thought you could read the same problematic files using Spark 2.2.1. Now I 
guess you probably only meant that the same job worked fine with Spark 2.2.1 
previously (with different sets of historical files).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   

[jira] [Issue Comment Deleted] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-07 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-25959:
---
Comment: was deleted

(was: Thanks. I will analyze the issue. )

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
>
> I tried to implement GBT and found that the feature importances computed while 
> the model was fit are different once the same model is saved to storage 
> and loaded back. 
>  
> I also found that once the persisted model is loaded, saved again, 
> and reloaded, the feature importances remain the same. 
>  
> Not sure if it's a bug in storing and reading the model the first time, or if I 
> am missing some parameter that needs to be set before saving the model (so the 
> model picks up some defaults, causing the feature importances to change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-

Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
column(s)/row group(s) than Spark 2.2.1 for the same job, and those extra 
column(s)/row group(s) happened to contain some corrupted data, which would 
also indicate an optimizer side issue (predicate push-down and column pruning).
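
In case it helps, a minimal sketch of that fallback check from PySpark (the S3 
path is just a placeholder):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fall back to the plain parquet-mr reader instead of the vectorized one.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("s3a://my-bucket/path/to/suspect-parquet/")  # placeholder path
df.count()  # forces a full scan of every row group
{code}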


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would also indicate 
an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at 

[jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-

Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, and those extra 
columns/row groups happened to contain some corrupted data, which would also 
indicate an optimizer side issue (predicate push-down and column pruning).


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
column(s)/row group(s) than Spark 2.2.1 for the same job, and those extra 
column(s)/row group(s) happened to contain some corrupted data, which would 
also indicate an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian commented on SPARK-25966:


Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would also indicate 
an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678527#comment-16678527
 ] 

Xiao Li commented on SPARK-25966:
-

Do you still have the file that failed your job? Can you read it with a previous 
version of Spark?

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> 

[jira] [Created] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread Victor Sahin (JIRA)
Victor Sahin created SPARK-25967:


 Summary: sql.functions.trim() should remove trailing and leading 
tabs
 Key: SPARK-25967
 URL: https://issues.apache.org/jira/browse/SPARK-25967
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.2, 2.2.0, 2.1.0
Reporter: Victor Sahin


sql.functions.trim removes only leading and trailing spaces, not tabs. Removing tabs 
as well would make the function usable for the same use case.
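
In the meantime, a minimal workaround sketch (assuming an existing SparkSession named 
{{spark}}; the column names are hypothetical):
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("\tpadded value \t").toDF("raw")
// Strip all leading/trailing whitespace, including tabs, with a regex:
val cleaned = df.withColumn("clean", regexp_replace($"raw", "^\\s+|\\s+$", ""))
// Spark 2.3+ also offers trim(col, trimString), e.g. trim($"raw", " \t")
cleaned.show(false)
{code}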



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread Victor Sahin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Sahin updated SPARK-25967:
-
Description: sql.functions.trim removes only leading and trailing spaces, not 
tabs. Removing tabs as well would make the function usable for the same use case, 
e.g. artifact cleaning.  (was: sql.functions.trim removes only trailing and 
leading whitespaces. Removing tabs as well helps use the function for the same 
use case.)

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only leading and trailing spaces, not tabs. Removing 
> tabs as well would make the function usable for the same use case, e.g. artifact 
> cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678475#comment-16678475
 ] 

Yuming Wang commented on SPARK-25966:
-

Thanks, [~andrioni]. Is there an easy way to reproduce it?

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
>   at 

[jira] [Updated] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25958:
--
Description: 
The following error happens on a heavy Spark job after 4 hours of runtime.
{code}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 
97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
line 53, in create_persistent_data
single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", 
line 21, in create_persistent_data
main_df = self.generate_dataframe_main()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", 
line 98, in generate_dataframe_main
raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 16, in get_raw_data_with_metadata_and_aggregation
main_df = 
self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
return_df = self.get_dataframe_from_binary_value_iteration(input_df)
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 136, in get_dataframe_from_binary_value_iteration
combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
binary_value=count)
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 154, in get_dataframe_from_binary_value
if len(results_of_filter_df.take(1)) == 0:
  File 
"/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 
504, in take
return self.limit(num).collect()
  File 
"/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 
467, in collect
return list(_load_from_socket(sock_info, 
BatchedSerializer(PickleSerializer(
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
148, in _load_from_socket
sock = socket.socket(af, socktype, proto)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
__init__
_sock = _realsocket(family, type, proto)
error: [Errno 97] Address family not supported by protocol
{code}
Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
{code}
def _load_from_socket(sock_info, serializer):
    port, auth_secret = sock_info
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(15)
            sock.connect(sa)
        except socket.error:
            sock.close()
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    # operation, it will very possibly fail. See SPARK-18281.
    sock.settimeout(None)

    sockfile = sock.makefile("rwb", 65536)
    do_server_auth(sockfile, auth_secret)

    # The socket will be automatically closed when garbage-collected.
    return serializer.load_stream(sockfile)
{code}
the culprit is this line in lib/spark2/python/pyspark/rdd.py:
{code}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}
so the error "[Errno 97] *Address family* not supported by protocol" seems to be 
caused by passing socket.AF_UNSPEC as the third argument to the 
socket.getaddrinfo() call.

I tried a similar socket.getaddrinfo() call locally outside of PySpark and it 
worked fine.

RHEL 7.5.

  was:
Following error happens on a heavy Spark job after 4 hours of runtime.. 

{code:python}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 
97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
line 53, in create_persistent_data
single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", 
line 21, in create_persistent_data
main_df = self.generate_dataframe_main()
  File 

[jira] [Issue Comment Deleted] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.

2018-11-07 Thread bharath kumar avusherla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-23050:

Comment: was deleted

(was: [~ste...@apache.org], I can start working on it.)

> Structured Streaming with S3 file source duplicates data because of eventual 
> consistency.
> -
>
> Key: SPARK-23050
> URL: https://issues.apache.org/jira/browse/SPARK-23050
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yash Sharma
>Priority: Major
>
> Spark Structured streaming with S3 file source duplicates data because of 
> eventual consistency.
> Reproducing the scenario -
> - Structured streaming reading from S3 source. Writing back to S3.
> - Spark tries to commitTask on completion of a task, by verifying if all the 
> files have been written to Filesystem. 
> {{ManifestFileCommitProtocol.commitTask}}.
> - [Eventual consistency issue] Spark finds that the file is not present and 
> fails the task. {{org.apache.spark.SparkException: Task failed while writing 
> rows. No such file or directory 
> 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}}
> - By this time S3 eventually gets the file.
> - Spark reruns the task and completes the task, but gets a new file name this 
> time. {{ManifestFileCommitProtocol.newTaskTempFile. 
> part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}}
> - Data duplicates in results and the same data is processed twice and written 
> to S3.
> - There is no data duplication if spark is able to list presence of all 
> committed files and all tasks succeed.
> Code:
> {code}
> query = selected_df.writeStream \
> .format("parquet") \
> .option("compression", "snappy") \
> .option("path", "s3://path/data/") \
> .option("checkpointLocation", "s3://path/checkpoint/") \
> .start()
> {code}
> Same sized duplicate S3 Files:
> {code}
> $ aws s3 ls s3://path/data/ | grep part-00256
> 2018-01-11 03:37:00  17070 
> part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet
> 2018-01-11 03:37:10  17070 
> part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet
> {code}
> Exception on S3 listing and task failure:
> {code}
> [Stage 5:>(277 + 100) / 
> 597]18/01/11 03:36:59 WARN TaskSetManager: Lost task 256.0 in stage 5.0 (TID  
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:509)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitTask(ManifestFileCommitProtocol.scala:109)
>   at 
> 

[jira] [Created] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Alessandro Andrioni (JIRA)
Alessandro Andrioni created SPARK-25966:
---

 Summary: "EOF Reached the end of stream with bytes left to read" 
while reading/writing to Parquets
 Key: SPARK-25966
 URL: https://issues.apache.org/jira/browse/SPARK-25966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
top of a Mesos cluster. Both input and output Parquet files are on S3.
Reporter: Alessandro Andrioni


I was persistently getting the following exception while trying to run one of our 
Spark jobs on Spark 2.4.0. It went away after I regenerated all the input Parquet 
files from scratch (they were generated by another Spark job, also on Spark 2.4.0).
Is there a chance that Spark (quite rarely) writes corrupted Parquet files?
{code:java}
org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
(...)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 in 
stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
Reached the end of stream with 996 bytes left to read
at 
org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
at 
org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
at 
org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
at 
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
at 

[jira] [Updated] (SPARK-25908) Remove old deprecated items in Spark 3

2018-11-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25908:
--
Description: 
There are many deprecated methods and classes in Spark. They _can_ be removed 
in Spark 3, and for those that have been deprecated a long time (i.e. since 
Spark <= 2.0), we should probably do so. This addresses most of these cases, 
the easiest ones, those that are easy to remove and are old:
 - Remove some AccumulableInfo .apply() methods
 - Remove non-label-specific multiclass precision/recall/fScore in favor of 
accuracy
 - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
deprecated)
 - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
deprecated)
 - Remove unused Python StorageLevel constants
 - Remove Dataset unionAll in favor of union
 - Remove unused multiclass option in libsvm parsing
 - Remove references to deprecated spark configs like spark.yarn.am.port
 - Remove TaskContext.isRunningLocally
 - Remove ShuffleMetrics.shuffle* methods
 - Remove BaseReadWrite.context in favor of session
 - Remove Column.!== in favor of =!=
 - Remove Dataset.explode
 - Remove Dataset.registerTempTable
 - Remove SQLContext.getOrCreate, setActive, clearActive, constructors

Not touched yet:
 - everything else in MLLib
 - HiveContext
 - Anything deprecated more recently than 2.0.0, generally

  was:
There are many deprecated methods and classes in Spark. They _can_ be removed 
in Spark 3, and for those that have been deprecated a long time (i.e. since 
Spark <= 2.0), we should probably do so. This addresses most of these cases, 
the easiest ones, those that are easy to remove and are old:

- Remove some AccumulableInfo .apply() methods
- Remove non-label-specific multiclass precision/recall/fScore in favor of 
accuracy
- Remove toDegrees/toRadians in favor of degrees/radians
- Remove approxCountDistinct in favor of approx_count_distinct
- Remove unused Python StorageLevel constants
- Remove Dataset unionAll in favor of union
- Remove unused multiclass option in libsvm parsing
- Remove references to deprecated spark configs like spark.yarn.am.port
- Remove TaskContext.isRunningLocally
- Remove ShuffleMetrics.shuffle* methods
- Remove BaseReadWrite.context in favor of session
- Remove Column.!== in favor of =!=
- Remove Dataset.explode
- Remove Dataset.registerTempTable
- Remove SQLContext.getOrCreate, setActive, clearActive, constructors

Not touched yet:

- everything else in MLLib
- HiveContext
- Anything deprecated more recently than 2.0.0, generally


> Remove old deprecated items in Spark 3
> --
>
> Key: SPARK-25908
> URL: https://issues.apache.org/jira/browse/SPARK-25908
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> There are many deprecated methods and classes in Spark. They _can_ be removed 
> in Spark 3, and for those that have been deprecated a long time (i.e. since 
> Spark <= 2.0), we should probably do so. This addresses most of these cases, 
> the easiest ones, those that are easy to remove and are old:
>  - Remove some AccumulableInfo .apply() methods
>  - Remove non-label-specific multiclass precision/recall/fScore in favor of 
> accuracy
>  - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
> deprecated)
>  - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
> deprecated)
>  - Remove unused Python StorageLevel constants
>  - Remove Dataset unionAll in favor of union
>  - Remove unused multiclass option in libsvm parsing
>  - Remove references to deprecated spark configs like spark.yarn.am.port
>  - Remove TaskContext.isRunningLocally
>  - Remove ShuffleMetrics.shuffle* methods
>  - Remove BaseReadWrite.context in favor of session
>  - Remove Column.!== in favor of =!=
>  - Remove Dataset.explode
>  - Remove Dataset.registerTempTable
>  - Remove SQLContext.getOrCreate, setActive, clearActive, constructors
> Not touched yet:
>  - everything else in MLLib
>  - HiveContext
>  - Anything deprecated more recently than 2.0.0, generally
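
A hedged migration sketch for a few of the removals listed above (df1 and df2 are 
hypothetical DataFrames with compatible schemas and an "id" column):
{code}
import org.apache.spark.sql.functions._

val unioned  = df1.union(df2)                                   // replaces the removed unionAll
val distinct = df1.agg(approx_count_distinct(col("id")))        // replaces approxCountDistinct
val neq      = df1.filter(col("id") =!= lit(0))                 // replaces Column.!==
df1.createOrReplaceTempView("t")                                // replaces registerTempTable
val angles   = df1.select(degrees(col("id")), radians(col("id")))  // replace toDegrees/toRadians
{code}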



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25964) Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25964:


Assignee: Apache Spark

> Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution 
> instructions
> -
>
> Key: SPARK-25964
> URL: https://issues.apache.org/jira/browse/SPARK-25964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Trivial
>
> 1. OrcReadBenchmark is under hive module, so the way to run it should be 
> ```
> build/sbt "hive/test:runMain "
> ```
> 2. The benchmark "String with Nulls Scan" should be with case "String with 
> Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)"
> 3. Add the null value percentages in the test case names of 
> DataSourceReadBenchmark, for the  benchmark "String with Nulls Scan" .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25964) Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25964:


Assignee: (was: Apache Spark)

> Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution 
> instructions
> -
>
> Key: SPARK-25964
> URL: https://issues.apache.org/jira/browse/SPARK-25964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Trivial
>
> 1. OrcReadBenchmark is under hive module, so the way to run it should be 
> ```
> build/sbt "hive/test:runMain "
> ```
> 2. The benchmark "String with Nulls Scan" should be with case "String with 
> Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)"
> 3. Add the null value percentages in the test case names of 
> DataSourceReadBenchmark, for the  benchmark "String with Nulls Scan" .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25965) Add read benchmark for Avro

2018-11-07 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25965:
--

 Summary: Add read benchmark for Avro
 Key: SPARK-25965
 URL: https://issues.apache.org/jira/browse/SPARK-25965
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Add a read benchmark for Avro, which has been missing for a while.
The benchmark will be similar to DataSourceReadBenchmark and OrcReadBenchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25885) HighlyCompressedMapStatus deserialization optimization

2018-11-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25885.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22894
[https://github.com/apache/spark/pull/22894]

> HighlyCompressedMapStatus deserialization optimization
> --
>
> Key: SPARK-25885
> URL: https://issues.apache.org/jira/browse/SPARK-25885
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Artem Kupchinskiy
>Assignee: Artem Kupchinskiy
>Priority: Minor
> Fix For: 3.0.0
>
>
> HighlyCompressedMapStatus uses an unnecessary level of indirection during 
> deserialization and construction. It uses an ArrayBuffer as interim storage 
> before the actual map construction. Since both methods can be application 
> hot spots under certain workloads, it is worth getting rid of that 
> intermediate level.
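
An illustrative sketch of the idea only, not the actual HighlyCompressedMapStatus 
code (names and element types are hypothetical):
{code}
import scala.collection.mutable

// Before: deserialize entries into an interim ArrayBuffer, then convert it to a map
// (an extra allocation and copy on a potentially hot path).
def readViaBuffer(n: Int, read: () => (Int, Byte)): Map[Int, Byte] = {
  val buf = mutable.ArrayBuffer.empty[(Int, Byte)]
  (0 until n).foreach(_ => buf += read())
  buf.toMap
}

// After: feed entries straight into a map builder while deserializing.
def readDirect(n: Int, read: () => (Int, Byte)): Map[Int, Byte] = {
  val b = Map.newBuilder[Int, Byte]
  (0 until n).foreach(_ => b += read())
  b.result()
}
{code}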



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25885) HighlyCompressedMapStatus deserialization optimization

2018-11-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25885:
-

Assignee: Artem Kupchinskiy

> HighlyCompressedMapStatus deserialization optimization
> --
>
> Key: SPARK-25885
> URL: https://issues.apache.org/jira/browse/SPARK-25885
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Artem Kupchinskiy
>Assignee: Artem Kupchinskiy
>Priority: Minor
> Fix For: 3.0.0
>
>
> HighlyCompressedMapStatus uses an unnecessary level of indirection during 
> deserialization and construction. It uses an ArrayBuffer as interim storage 
> before the actual map construction. Since both methods can be application 
> hot spots under certain workloads, it is worth getting rid of that 
> intermediate level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25964) Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678353#comment-16678353
 ] 

Apache Spark commented on SPARK-25964:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22965

> Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution 
> instructions
> -
>
> Key: SPARK-25964
> URL: https://issues.apache.org/jira/browse/SPARK-25964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Trivial
>
> 1. OrcReadBenchmark is under hive module, so the way to run it should be 
> ```
> build/sbt "hive/test:runMain "
> ```
> 2. The benchmark "String with Nulls Scan" should be with case "String with 
> Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)"
> 3. Add the null value percentages in the test case names of 
> DataSourceReadBenchmark, for the  benchmark "String with Nulls Scan" .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25964) Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions

2018-11-07 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25964:
--

 Summary: Revise OrcReadBenchmark/DataSourceReadBenchmark case 
names and execution instructions
 Key: SPARK-25964
 URL: https://issues.apache.org/jira/browse/SPARK-25964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


1. OrcReadBenchmark is under the hive module, so the way to run it should be 
```
build/sbt "hive/test:runMain "
```

2. The benchmark "String with Nulls Scan" should use the case name "String with 
Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)".

3. Add the null value percentages to the test case names of the "String with 
Nulls Scan" benchmark in DataSourceReadBenchmark.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25963) Optimize generate followed by window

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25963:


Assignee: Apache Spark

> Optimize generate followed by window
> 
>
> Key: SPARK-25963
> URL: https://issues.apache.org/jira/browse/SPARK-25963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Assignee: Apache Spark
>Priority: Minor
>
> We've noticed that for our use-cases when we have explode followed by a 
> window function we can almost always optimize it by adding repartition by the 
> windows' partition before the explode.
> for example:
> {code:java}
> import org.apache.spark.sql.functions._
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val tokens = spark.range(N).selectExpr(
> "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
> // .repartition("cust_id")
> .selectExpr("*", "explode(split(url, '/')) as token")
> import org.apache.spark.sql.expressions._
> val w = Window.partitionBy("key", "token")
> val res = tokens.withColumn("cnt", count("token").over(w))
> res.explain(true)
> {code}
> {noformat}
> == Optimized Logical Plan ==
> Window [count(token#11) windowspecdefinition(key#6L, token#11, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS cnt#17L], [key#6L, token#11]
> +- Generate explode([asd,asd,asd,asd,asd,asd]), false, [token#11]
>+- Project [FLOOR((cast(id#4L as double) / 4.0)) AS key#6L, 
> asd/asd/asd/asd/asd/asd AS url#7]
>   +- Range (0, 4096, step=1, splits=Some(1))
> {noformat}
> currently all the data will be exploded in the first stage, then shuffled and 
> then aggregated.
> we can achieve exactly the same computation if we first shuffle the data and 
> in the second stage explode and aggregate.
> I have a PR that tries to resolve this. I'm just not sure I thought about all 
> the cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25963) Optimize generate followed by window

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678149#comment-16678149
 ] 

Apache Spark commented on SPARK-25963:
--

User 'uzadude' has created a pull request for this issue:
https://github.com/apache/spark/pull/22964

> Optimize generate followed by window
> 
>
> Key: SPARK-25963
> URL: https://issues.apache.org/jira/browse/SPARK-25963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that for our use-cases when we have explode followed by a 
> window function we can almost always optimize it by adding repartition by the 
> windows' partition before the explode.
> for example:
> {code:java}
> import org.apache.spark.sql.functions._
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val tokens = spark.range(N).selectExpr(
> "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
> // .repartition("cust_id")
> .selectExpr("*", "explode(split(url, '/')) as token")
> import org.apache.spark.sql.expressions._
> val w = Window.partitionBy("key", "token")
> val res = tokens.withColumn("cnt", count("token").over(w))
> res.explain(true)
> {code}
> {noformat}
> == Optimized Logical Plan ==
> Window [count(token#11) windowspecdefinition(key#6L, token#11, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS cnt#17L], [key#6L, token#11]
> +- Generate explode([asd,asd,asd,asd,asd,asd]), false, [token#11]
>+- Project [FLOOR((cast(id#4L as double) / 4.0)) AS key#6L, 
> asd/asd/asd/asd/asd/asd AS url#7]
>   +- Range (0, 4096, step=1, splits=Some(1))
> {noformat}
> currently all the data will be exploded in the first stage, then shuffled and 
> then aggregated.
> we can achieve exactly the same computation if we first shuffle the data and 
> in the second stage explode and aggregate.
> I have a PR that tries to resolve this. I'm just not sure I thought about all 
> the cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25963) Optimize generate followed by window

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678148#comment-16678148
 ] 

Apache Spark commented on SPARK-25963:
--

User 'uzadude' has created a pull request for this issue:
https://github.com/apache/spark/pull/22964

> Optimize generate followed by window
> 
>
> Key: SPARK-25963
> URL: https://issues.apache.org/jira/browse/SPARK-25963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that for our use-cases when we have explode followed by a 
> window function we can almost always optimize it by adding repartition by the 
> windows' partition before the explode.
> for example:
> {code:java}
> import org.apache.spark.sql.functions._
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val tokens = spark.range(N).selectExpr(
> "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
> // .repartition("cust_id")
> .selectExpr("*", "explode(split(url, '/')) as token")
> import org.apache.spark.sql.expressions._
> val w = Window.partitionBy("key", "token")
> val res = tokens.withColumn("cnt", count("token").over(w))
> res.explain(true)
> {code}
> {noformat}
> == Optimized Logical Plan ==
> Window [count(token#11) windowspecdefinition(key#6L, token#11, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS cnt#17L], [key#6L, token#11]
> +- Generate explode([asd,asd,asd,asd,asd,asd]), false, [token#11]
>+- Project [FLOOR((cast(id#4L as double) / 4.0)) AS key#6L, 
> asd/asd/asd/asd/asd/asd AS url#7]
>   +- Range (0, 4096, step=1, splits=Some(1))
> {noformat}
> currently all the data will be exploded in the first stage, then shuffled and 
> then aggregated.
> we can achieve exactly the same computation if we first shuffle the data and 
> in the second stage explode and aggregate.
> I have a PR that tries to resolve this. I'm just not sure I thought about all 
> the cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25963) Optimize generate followed by window

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25963:


Assignee: (was: Apache Spark)

> Optimize generate followed by window
> 
>
> Key: SPARK-25963
> URL: https://issues.apache.org/jira/browse/SPARK-25963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that for our use-cases when we have explode followed by a 
> window function we can almost always optimize it by adding repartition by the 
> windows' partition before the explode.
> for example:
> {code:java}
> import org.apache.spark.sql.functions._
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val tokens = spark.range(N).selectExpr(
> "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
> // .repartition("cust_id")
> .selectExpr("*", "explode(split(url, '/')) as token")
> import org.apache.spark.sql.expressions._
> val w = Window.partitionBy("key", "token")
> val res = tokens.withColumn("cnt", count("token").over(w))
> res.explain(true)
> {code}
> {noformat}
> == Optimized Logical Plan ==
> Window [count(token#11) windowspecdefinition(key#6L, token#11, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS cnt#17L], [key#6L, token#11]
> +- Generate explode([asd,asd,asd,asd,asd,asd]), false, [token#11]
>+- Project [FLOOR((cast(id#4L as double) / 4.0)) AS key#6L, 
> asd/asd/asd/asd/asd/asd AS url#7]
>   +- Range (0, 4096, step=1, splits=Some(1))
> {noformat}
> currently all the data will be exploded in the first stage, then shuffled and 
> then aggregated.
> we can achieve exactly the same computation if we first shuffle the data and 
> in the second stage explode and aggregate.
> I have a PR that tries to resolve this. I'm just not sure I thought about all 
> the cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25963) Optimize generate followed by window

2018-11-07 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-25963:
--

 Summary: Optimize generate followed by window
 Key: SPARK-25963
 URL: https://issues.apache.org/jira/browse/SPARK-25963
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0, 2.4.0
Reporter: Ohad Raviv


We've noticed that, for our use cases, when an explode is followed by a window 
function we can almost always optimize it by adding a repartition by the window's 
partition keys before the explode.

for example:
{code:java}
import org.apache.spark.sql.functions._
val N = 1 << 12

spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

val tokens = spark.range(N).selectExpr(
"floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
// .repartition("cust_id")
.selectExpr("*", "explode(split(url, '/')) as token")

import org.apache.spark.sql.expressions._

val w = Window.partitionBy("key", "token")
val res = tokens.withColumn("cnt", count("token").over(w))

res.explain(true)
{code}

{noformat}
== Optimized Logical Plan ==
Window [count(token#11) windowspecdefinition(key#6L, token#11, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS cnt#17L], [key#6L, token#11]
+- Generate explode([asd,asd,asd,asd,asd,asd]), false, [token#11]
   +- Project [FLOOR((cast(id#4L as double) / 4.0)) AS key#6L, 
asd/asd/asd/asd/asd/asd AS url#7]
  +- Range (0, 4096, step=1, splits=Some(1))
{noformat}

Currently all the data will be exploded in the first stage, then shuffled and 
then aggregated.
We can achieve exactly the same computation if we first shuffle the data and, in 
the second stage, explode and aggregate.

I have a PR that tries to resolve this. I'm just not sure I thought about all 
the cases.
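
A sketch of the manual workaround described above, reusing the hypothetical columns 
from the snippet (assuming an existing SparkSession named {{spark}}): repartition by 
the window's partition key before the explode, so the exchange happens on the smaller 
pre-explode data and the window should not need another shuffle afterwards.
{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val N = 1 << 12
val tokens = spark.range(N).selectExpr(
    "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
  .repartition(col("key"))   // shuffle before the explode
  .selectExpr("*", "explode(split(url, '/')) as token")

val w = Window.partitionBy("key", "token")
tokens.withColumn("cnt", count("token").over(w)).explain(true)
{code}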





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25904) Avoid allocating arrays too large for JVMs

2018-11-07 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-25904.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22818
[https://github.com/apache/spark/pull/22818]

> Avoid allocating arrays too large for JVMs
> --
>
> Key: SPARK-25904
> URL: https://issues.apache.org/jira/browse/SPARK-25904
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> In a few places Spark can try to allocate arrays as big as {{Int.MaxValue}}, 
> but that's actually too big for the JVM.  We should consistently use 
> {{ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}} instead.
> In some cases this changes defaults for configs, in some cases it bounds 
> a config, and in others it just improves error messages for things that still 
> won't work.
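
A minimal sketch of the pattern (the actual call sites in the change differ; the 
helper name is hypothetical):
{code}
import org.apache.spark.unsafe.array.ByteArrayMethods

// Cap a requested buffer size at the JVM-safe maximum instead of Int.MaxValue.
def safeArrayLength(requested: Long): Int = {
  require(requested >= 0, s"negative size: $requested")
  math.min(requested, ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH.toLong).toInt
}

// e.g. new Array[Byte](safeArrayLength(someSize)) stays within what the JVM can allocate
{code}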



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25904) Avoid allocating arrays too large for JVMs

2018-11-07 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-25904:


Assignee: Imran Rashid

> Avoid allocating arrays too large for JVMs
> --
>
> Key: SPARK-25904
> URL: https://issues.apache.org/jira/browse/SPARK-25904
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> In a few places Spark can try to allocate arrays as big as {{Int.MaxValue}}, 
> but that's actually too big for the JVM.  We should consistently use 
> {{ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}} instead.
> In some cases this changes defaults for configs, in some cases it bounds 
> a config, and in others it just improves error messages for things that still 
> won't work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25962) Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678035#comment-16678035
 ] 

Apache Spark commented on SPARK-25962:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22963

> Specify minimum versions for both pydocstyle and flake8 in 'lint-python' 
> script
> ---
>
> Key: SPARK-25962
> URL: https://issues.apache.org/jira/browse/SPARK-25962
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the 'lint-python' script does not specify minimum versions for either 
> pydocstyle or flake8. It should set them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25962) Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678034#comment-16678034
 ] 

Apache Spark commented on SPARK-25962:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22963

> Specify minimum versions for both pydocstyle and flake8 in 'lint-python' 
> script
> ---
>
> Key: SPARK-25962
> URL: https://issues.apache.org/jira/browse/SPARK-25962
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the 'lint-python' script does not specify minimum versions for either 
> pydocstyle or flake8. It should set them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25962) Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25962:


Assignee: (was: Apache Spark)

> Specify minimum versions for both pydocstyle and flake8 in 'lint-python' 
> script
> ---
>
> Key: SPARK-25962
> URL: https://issues.apache.org/jira/browse/SPARK-25962
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the 'lint-python' script does not specify minimum versions for either 
> pydocstyle or flake8. It should set them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25962) Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25962:


Assignee: Apache Spark

> Specify minimum versions for both pydocstyle and flake8 in 'lint-python' 
> script
> ---
>
> Key: SPARK-25962
> URL: https://issues.apache.org/jira/browse/SPARK-25962
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the 'lint-python' script does not specify minimum versions for 
> either pydocstyle or flake8. It should pin minimum versions for both.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25962) Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script

2018-11-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-25962:


 Summary: Specify minimum versions for both pydocstyle and flake8 
in 'lint-python' script
 Key: SPARK-25962
 URL: https://issues.apache.org/jira/browse/SPARK-25962
 Project: Spark
  Issue Type: Test
  Components: Build
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Currently, the 'lint-python' script does not specify minimum versions for 
either pydocstyle or flake8. It should pin minimum versions for both.
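
As a rough illustration only, here is a minimal sketch of what such a version 
gate could look like, assuming the script can shell out to each tool's 
--version flag. The minimum versions below are placeholders for illustration, 
not the values the actual change settles on.

{code:python}
# Hypothetical version gate for a lint script: compare each tool's reported
# version against a minimum before running it. Minimums here are placeholders.
import re
import subprocess
import sys

MINIMUM = {"pydocstyle": (3, 0, 0), "flake8": (3, 5, 0)}  # placeholder values

def installed_version(tool):
    # Both tools print a version string for --version; parse the first X.Y.Z.
    try:
        out = subprocess.run([tool, "--version"],
                             capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
    return tuple(int(part) for part in match.groups()) if match else None

for tool, minimum in MINIMUM.items():
    found = installed_version(tool)
    if found is None or found < minimum:
        wanted = ".".join(str(part) for part in minimum)
        print("%s >= %s is required, found %s" % (tool, wanted, found))
        sys.exit(1)
{code}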



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677959#comment-16677959
 ] 

Apache Spark commented on SPARK-25921:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22962

> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> Running a barrier job after a normal Spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  
> Here is some of the trace
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 372, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 367, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
> return f(iterator)
>   File "", line 4, in task
> AttributeError: 'TaskContext' object has no attribute 'barrier'
> {code}
>  
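
A hedged mitigation sketch while the fix is pending, assuming the stale 
context is tied to Python worker reuse: disabling spark.python.worker.reuse 
forces a fresh worker, and therefore a fresh context, per task. This trades 
worker startup cost for correctness and is only a stopgap, not the fix.

{code:python}
# Possible mitigation sketch (assumption: the plain TaskContext comes from a
# reused Python worker). Disabling reuse starts a fresh worker per task.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.python.worker.reuse", "false")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# Re-running the barrier reproduction above with this setting should hand the
# task a BarrierTaskContext rather than a plain TaskContext, if the worker
# reuse hypothesis holds.
{code}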



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25921:


Assignee: (was: Apache Spark)

> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> Running a barrier job after a normal Spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  
> Here is some of the trace
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 372, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 367, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
> return f(iterator)
>   File "", line 4, in task
> AttributeError: 'TaskContext' object has no attribute 'barrier'
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677954#comment-16677954
 ] 

Apache Spark commented on SPARK-25921:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22962

> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> Running a barrier job after a normal Spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  
> Here is some of the trace
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 372, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 367, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
> return f(iterator)
>   File "", line 4, in task
> AttributeError: 'TaskContext' object has no attribute 'barrier'
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25921:


Assignee: Apache Spark

> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Critical
>
> Running a barrier job after a normal Spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  
> Here is some of the trace
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 372, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 367, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
> return f(iterator)
>   File "", line 4, in task
> AttributeError: 'TaskContext' object has no attribute 'barrier'
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25961) Random numbers are not supported when handling data skew

2018-11-07 Thread zengxl (JIRA)
zengxl created SPARK-25961:
--

 Summary: Random numbers are not supported when handling data skew
 Key: SPARK-25961
 URL: https://issues.apache.org/jira/browse/SPARK-25961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
 Environment: spark on yarn 2.3.1
Reporter: zengxl


When joining two tables, one of them has null values in the join key. Adding a 
random number to the join key to spread out the skew is rejected with:

Error in query: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found

Looking at the source, the query is validated in 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis, and the expression is 
rejected because random numbers are nondeterministic:

case o if o.expressions.exists(!_.deterministic) &&
  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
  // The rule above is used to check Aggregate operator.
  failAnalysis(
    s"""nondeterministic expressions are only allowed in
       |Project, Filter, Aggregate or Window, found:
       | ${o.expressions.map(_.sql).mkString(",")}
       |in operator ${operator.simpleString}
     """.stripMargin)

Would it be enough to add the Join case to this check? I have not tested it yet:

case o if o.expressions.exists(!_.deterministic) &&
  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] && !o.isInstanceOf[Join] =>
  // The rule above is used to check Aggregate operator.
  failAnalysis(
    s"""nondeterministic expressions are only allowed in
       |Project, Filter, Aggregate or Window or Join, found:
       | ${o.expressions.map(_.sql).mkString(",")}
       |in operator ${operator.simpleString}
     """.stripMargin)

 

My SQL:

SELECT
T1.CUST_NO AS CUST_NO ,
T3.CON_LAST_NAME AS CUST_NAME ,
T3.CON_SEX_MF AS SEX_CODE ,
T3.X_POSITION AS POST_LV_CODE 
FROM tmp.ICT_CUST_RANGE_INFO T1
LEFT join tmp.F_CUST_BASE_INFO_ALL T3 ON CASE WHEN coalesce(T1.CUST_NO,'') ='' 
THEN concat('cust_no',RAND()) ELSE T1.CUST_NO END = T3.BECIF and 
T3.DATE='20181105'
WHERE T1.DATE='20181105'
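
One workaround that should pass the analysis check, sketched in PySpark under 
the assumption that the tables and columns match the SQL above: materialize 
the salted key in a Project (withColumn) first, where nondeterministic 
expressions are allowed, and then join on the already-computed column. The 
column name JOIN_KEY is made up for illustration.

{code:python}
# Sketch: move RAND() into a Project before the join so the analyzer accepts
# it, then join on the precomputed salted key.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

t1 = spark.table("tmp.ICT_CUST_RANGE_INFO").where(F.col("DATE") == "20181105")
t3 = spark.table("tmp.F_CUST_BASE_INFO_ALL").where(F.col("DATE") == "20181105")

# Salt only the null/empty keys so they no longer collapse onto one partition.
t1_salted = t1.withColumn(
    "JOIN_KEY",
    F.when(F.coalesce(F.col("CUST_NO"), F.lit("")) == "",
           F.concat(F.lit("cust_no"), F.rand().cast("string")))
     .otherwise(F.col("CUST_NO")))

result = (t1_salted
          .join(t3, t1_salted["JOIN_KEY"] == t3["BECIF"], "left")
          .select(t1_salted["CUST_NO"],
                  t3["CON_LAST_NAME"].alias("CUST_NAME"),
                  t3["CON_SEX_MF"].alias("SEX_CODE"),
                  t3["X_POSITION"].alias("POST_LV_CODE")))
{code}

This only works because the salted keys of the form cust_no<random> are never 
expected to match anything in T3, so the randomness merely spreads the 
null/empty rows across partitions without changing the join result.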



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org