[jira] [Commented] (SPARK-39457) Support pure IPV6 environment without IPV4

2022-06-13 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553901#comment-17553901
 ] 

Ruslan Dautkhanov commented on SPARK-39457:
---

Is there a dependency on Hadoop to support IPv6 too? HADOOP-11890 

> Support pure IPV6 environment without IPV4
> --
>
> Key: SPARK-39457
> URL: https://issues.apache.org/jira/browse/SPARK-39457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: DB Tsai
>Priority: Major
>  Labels: releasenotes
>
> Spark doesn't fully work in a pure IPV6 environment that doesn't have IPV4 at 
> all. This is an umbrella Jira tracking support for pure IPV6 deployments. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26413) SPIP: RDD Arrow Support in Spark Core and PySpark

2021-11-08 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440731#comment-17440731
 ] 

Ruslan Dautkhanov commented on SPARK-26413:
---

[https://github.com/apache/spark/pull/34505] is in. 

Part of https://issues.apache.org/jira/browse/SPARK-37227 

 

> SPIP: RDD Arrow Support in Spark Core and PySpark
> -
>
> Key: SPARK-26413
> URL: https://issues.apache.org/jira/browse/SPARK-26413
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.1.0
>Reporter: Richard Whitcomb
>Priority: Minor
>
> h2. Background and Motivation
> Arrow is becoming a standard interchange format for columnar Structured 
> Data.  This is already true in Spark with the use of arrow in the pandas udf 
> functions in the dataframe API.
> However the current implementation of arrow in spark is limited to two use 
> cases.
>  * Pandas UDF that allows for operations on one or more columns in the 
> DataFrame API.
>  * Collect as Pandas which pulls back the entire dataset to the driver in a 
> Pandas Dataframe.
> What is still hard however is making use of all of the columns in a Dataframe 
> while staying distributed across the workers.  The only way to do this 
> currently is to drop down into RDDs and collect the rows into a dataframe. 
> However pickling is very slow and the collecting is expensive.
> The proposal is to extend spark in a way that allows users to operate on an 
> Arrow Table fully while still making use of Spark's underlying technology.  
> Some examples of possibilities with this new API. 
>  * Pass the Arrow Table with Zero Copy to PyTorch for predictions.
>  * Pass to Nvidia Rapids for an algorithm to be run on the GPU.
>  * Distribute data across many GPUs making use of the new Barriers API.
> h2. Target users and personas
> ML, Data Scientists, and future library authors.
> h2. Goals
>  * Conversion from any Dataset[Row] or PySpark Dataframe to RDD[Table]
>  * Conversion back from any RDD[Table] to Dataset[Row], RDD[Row], Pyspark 
> Dataframe
>  * Open the possibilities to tighter integration between Arrow/Pandas/Spark 
> especially at a library level.
> h2. Non-Goals
>  * Not creating a new API but instead using existing APIs.
> h2. Proposed API changes
> h3. Data Objects
> case class ArrowTable(schema: Schema, batches: Iterable[ArrowRecordBatch])
> h3. Dataset.scala
> {code:java}
> // Converts a Dataset to an RDD of Arrow Tables
> // Each RDD row is an Iterable of Arrow Batches.
> def arrowRDD: RDD[ArrowTable]
>  
> // Utility Function to convert to RDD Arrow Table for PySpark
> private[sql] def javaToPythonArrow: JavaRDD[Array[Byte]]
> {code}
> h3. RDD.scala
> {code:java}
>  // Converts RDD[ArrowTable] to a Dataframe by inspecting the Arrow Schema
>  def arrowToDataframe(implicit ev: T <:< ArrowTable): Dataframe
>   
>  // Converts RDD[ArrowTable] to an RDD of Rows
>  def arrowToRDD(implicit ev: T <:< ArrowTable): RDD[Row]{code}
> h3. Serializers.py
> {code:python}
> # Serializer that takes serialized Arrow Tables and returns a pyarrow Table.
> class ArrowSerializer(FramedSerializer)
> {code}
> h3. RDD.py
> {code}
> # New RDD class that has an RDD[ArrowTable] behind it and uses the new 
> # ArrowSerializer instead of the normal Pickle Serializer
> class ArrowRDD(RDD){code}
>  
> h3. Dataframe.py
> {code}
> # New function that converts a pyspark dataframe into an ArrowRDD
> def arrow(self):
> {code}
>  
> h2. Example API Usage
> h3. Pyspark
> {code}
> # Select a Single Column Using Pandas
> def map_table(arrow_table):
>   import pyarrow as pa
>   pdf = arrow_table.to_pandas()
>   pdf = pdf[['email']]
>   return pa.Table.from_pandas(pdf)
> # Convert to Arrow RDD, map over tables, convert back to dataframe
> df.arrow.map(map_table).dataframe 
> {code}
> h3. Scala
>  
> {code:java}
> // Find N Centroids using Cuda Rapids kMeans
> def runCuKmeans(table: ArrowTable, clusters: Int): ArrowTable
>  
> // Convert Dataset[Row] to RDD[ArrowTable] and back to Dataset[Row]
> df.arrowRDD.map(table => runCuKmeans(table, N)).arrowToDataframe.show(10)
> {code}
>  
> h2. Implementation Details
> As mentioned in the first section, the goal is to make it easier for Spark 
> users to interact with Arrow tools and libraries.  This however does come 
> with some considerations from a Spark perspective.
>  Arrow is column based instead of Row based.  In the above API proposal of 
> RDD[ArrowTable] each RDD row will in fact be a block of data.  Another 
> proposal in this regard is to introduce a new parameter to Spark called 
> arrow.sql.execution.arrow.maxRecordsPerTable.  The goal of this parameter is 
> to decide how many records are included in a single Arrow Table.  If set to 
> -1 the entire partition will be included in the table else to 

[jira] [Commented] (SPARK-32399) Support full outer join in shuffled hash join

2020-10-14 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214440#comment-17214440
 ] 

Ruslan Dautkhanov commented on SPARK-32399:
---

Here's another view, if that's helpful, that shows join keys, build side, and 
type of join in a bit more detail - 

!Screen Shot 2020-10-14 at 11.08.37 PM.png|width=461,height=433!

> Support full outer join in shuffled hash join
> -
>
> Key: SPARK-32399
> URL: https://issues.apache.org/jira/browse/SPARK-32399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: Screen Shot 2020-10-14 at 11.08.37 PM.png, Screen Shot 
> 2020-10-14 at 12.30.07 PM.png
>
>
> Currently, for a SQL full outer join, Spark always does a sort merge join, no 
> matter how large the join children are. Inspired by recent discussion 
> in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and 
> [https://github.com/apache/spark/pull/29181], I think we can support full 
> outer join in shuffled hash join as follows: when looking up stream side 
> keys in the build side {{HashedRelation}}, mark the matched rows inside the build side 
> {{HashedRelation}}, and after reading all rows from the stream side, output all 
> non-matching rows from the build side based on the modified {{HashedRelation}}.
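
The approach in the description above (probe the build-side hash table, remember which build rows were hit, then emit the unmatched build rows at the end) can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: `full_outer_hash_join` and the list-of-dicts row format are hypothetical, and SQL NULL-key semantics are ignored.

```python
from collections import defaultdict


def full_outer_hash_join(stream_rows, build_rows, key):
    """Full outer join of two lists of dicts on `key`.

    Returns a list of (stream_row, build_row) pairs; a missing side is None.
    """
    # Build phase: hash the build side by join key, storing row positions.
    table = defaultdict(list)
    for idx, row in enumerate(build_rows):
        table[row[key]].append(idx)

    # One "matched" flag per build-side row (the extra info kept with the table).
    matched = [False] * len(build_rows)
    out = []

    # Probe phase: every stream row looks up its key and marks what it hits.
    for s in stream_rows:
        hits = table.get(s[key], [])
        if not hits:
            out.append((s, None))  # stream row without a partner
        for idx in hits:
            matched[idx] = True
            out.append((s, build_rows[idx]))

    # Final phase: emit build rows that no stream row ever matched.
    for idx, b in enumerate(build_rows):
        if not matched[idx]:
            out.append((None, b))
    return out
```

The flags cost one boolean per build-side row, which is why the sketch keeps them next to the hash table rather than re-scanning the stream side.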






[jira] [Updated] (SPARK-32399) Support full outer join in shuffled hash join

2020-10-14 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-32399:
--
Attachment: Screen Shot 2020-10-14 at 11.08.37 PM.png







[jira] [Commented] (SPARK-32399) Support full outer join in shuffled hash join

2020-10-14 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214426#comment-17214426
 ] 

Ruslan Dautkhanov commented on SPARK-32399:
---

[~chengsu] thank you for all the Shuffled Hash Join improvements.

I've tested Full Outer Join case using master and found that both SMJ and SHJ 
performed exactly the same - 14 minutes of runtime for both types of joins.

One table has just 300k records and the other has 25B records / 2 TB of data. 
I ran the tests multiple times and the result is consistent. 

What I don't understand about SHJ is that it still seems to do a complete shuffle 
of the larger table:

!Screen Shot 2020-10-14 at 12.30.07 PM.png|width=583,height=514!

Is this expected? 

To be honest, I have not used SHJ, as SMJ was a safer bet before all the new 
improvements in Spark 3.1. Let me know what I'm missing. The join is based on a 
composite key, so I'm not sure if it has anything to do with this. Thanks!

 

> Support full outer join in shuffled hash join
> -
>
> Key: SPARK-32399
> URL: https://issues.apache.org/jira/browse/SPARK-32399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: Screen Shot 2020-10-14 at 12.30.07 PM.png
>
>
> Currently, for a SQL full outer join, Spark always does a sort merge join, no 
> matter how large the join children are. Inspired by recent discussion 
> in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and 
> [https://github.com/apache/spark/pull/29181], I think we can support full 
> outer join in shuffled hash join as follows: when looking up stream side 
> keys in the build side {{HashedRelation}}, mark the matched rows inside the build side 
> {{HashedRelation}}, and after reading all rows from the stream side, output all 
> non-matching rows from the build side based on the modified {{HashedRelation}}.






[jira] [Updated] (SPARK-32399) Support full outer join in shuffled hash join

2020-10-14 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-32399:
--
Attachment: Screen Shot 2020-10-14 at 12.30.07 PM.png







[jira] [Comment Edited] (SPARK-32760) Support for INET data type

2020-09-01 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188612#comment-17188612
 ] 

Ruslan Dautkhanov edited comment on SPARK-32760 at 9/1/20, 4:29 PM:


[~smilegator] understood. Would be great to consider separating logical and 
physical datatypes like it is done in Parquet for example. It might be easier 
to add higher-level / logical data types then? IPv4 address for example fits 
nicely into parquet's _INT64_ physical data type. Feel free to close if it's 
not feasible near-term. Thanks.


was (Author: tagar):
[~smilegator] understood. Would be great to consider separating logical and 
physical datatypes like it is done in Parquet for example. It might be easier 
to add higher-level data types then? IPv4 address for example fits nicely into 
parquet's _INT64_ data type. Feel free to close if it's not feasible near-term. 
Thanks.

> Support for INET data type
> --
>
> Key: SPARK-32760
> URL: https://issues.apache.org/jira/browse/SPARK-32760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for `INET` data type 
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers that are interested in similar, native support for IP 
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most matches 
> (such as whether an IP address belongs to a subnet) can't take advantage 
> of parquet bloom filters. 






[jira] [Commented] (SPARK-32760) Support for INET data type

2020-09-01 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188612#comment-17188612
 ] 

Ruslan Dautkhanov commented on SPARK-32760:
---

[~smilegator] understood. Would be great to consider separating logical and 
physical datatypes like it is done in Parquet for example. It might be easier 
to add higher-level data types then? IPv4 address for example fits nicely into 
parquet's _INT64_ data type. Feel free to close if it's not feasible near-term. 
Thanks.
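
To illustrate the point about a logical type sitting on top of an integer physical type, here is a minimal sketch using only Python's standard `ipaddress` module. The helper names `ipv4_to_int` and `int_to_ipv4` are hypothetical; nothing here is a Spark or Parquet API.

```python
import ipaddress


def ipv4_to_int(text):
    """Pack a dotted-quad IPv4 string into an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def int_to_ipv4(value):
    """Render an integer in [0, 2**32) back as a dotted-quad string."""
    return str(ipaddress.IPv4Address(value))
```

Every IPv4 address maps to a value below 2**32, so it fits in a signed 64-bit integer column without overflow, and the conversion round-trips.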







[jira] [Resolved] (SPARK-32759) Support for INET data type

2020-08-31 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov resolved SPARK-32759.
---
Resolution: Duplicate

> Support for INET data type
> --
>
> Key: SPARK-32759
> URL: https://issues.apache.org/jira/browse/SPARK-32759
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for `INET` data type 
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers that are interested in similar, native support for IP 
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most matches 
> (such as whether an IP address belongs to a subnet) can't take advantage 
> of parquet bloom filters. 
>  






[jira] [Created] (SPARK-32760) Support for INET data type

2020-08-31 Thread Ruslan Dautkhanov (Jira)
Ruslan Dautkhanov created SPARK-32760:
-

 Summary: Support for INET data type
 Key: SPARK-32760
 URL: https://issues.apache.org/jira/browse/SPARK-32760
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Ruslan Dautkhanov


PostgreSQL has support for `INET` data type 

[https://www.postgresql.org/docs/9.1/datatype-net-types.html]

We have a few customers that are interested in similar, native support for IP 
addresses, just like in PostgreSQL.

The issue with storing IP addresses as strings is that most matches 
(such as whether an IP address belongs to a subnet) can't take advantage of 
parquet bloom filters. 
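
As an illustration of why a native representation helps with subnet matches: once an address is an integer, "belongs to a subnet" becomes a plain range comparison instead of string matching. A minimal sketch with Python's standard `ipaddress` module follows; `subnet_bounds` and `in_subnet` are hypothetical helper names, and the example assumes IPv4.

```python
import ipaddress


def subnet_bounds(cidr):
    """Return the (lowest, highest) integer address covered by a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return int(net.network_address), int(net.broadcast_address)


def in_subnet(addr_int, cidr):
    """True if the integer-encoded address falls inside the CIDR block."""
    low, high = subnet_bounds(cidr)
    return low <= addr_int <= high
```

For example, `in_subnet(int(ipaddress.ip_address("10.1.2.3")), "10.0.0.0/8")` is a comparison of three integers; whether a given storage format can use such a predicate for pruning is a separate question that this sketch does not answer.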






[jira] [Created] (SPARK-32759) Support for INET data type

2020-08-31 Thread Ruslan Dautkhanov (Jira)
Ruslan Dautkhanov created SPARK-32759:
-

 Summary: Support for INET data type
 Key: SPARK-32759
 URL: https://issues.apache.org/jira/browse/SPARK-32759
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Ruslan Dautkhanov


PostgreSQL has support for `INET` data type 

[https://www.postgresql.org/docs/9.1/datatype-net-types.html]

We have a few customers that are interested in similar, native support for IP 
addresses, just like in PostgreSQL.

The issue with storing IP addresses as strings is that most matches 
(such as whether an IP address belongs to a subnet) can't take advantage of 
parquet bloom filters. 

 






[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-08-13 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177202#comment-17177202
 ] 

Ruslan Dautkhanov commented on SPARK-28367:
---

[~gsomogyi] thanks! yep would be great to learn how this is done on the Flink 
side. 

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment
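
The failure mode described above is an unbounded wait. As a generic illustration only (this is not the Kafka consumer API and not Spark code), a wait can be bounded with a deadline so the caller gets control back when the condition never becomes true. `wait_until`, `ready`, `timeout_s` and `interval_s` are hypothetical names.

```python
import time


def wait_until(ready, timeout_s, interval_s=0.05):
    """Call `ready()` repeatedly until it returns True or the deadline passes.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if ready():
            return True
        if time.monotonic() >= deadline:
            return False  # give up instead of spinning forever
        time.sleep(interval_s)
```

A caller can then decide what to do on `False` (retry, fail the task, surface an error) rather than hanging indefinitely.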






[jira] [Updated] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2020-07-13 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-32294:
--
Description: 
`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
GroupedData, the whole group is passed to Pandas UDF at once, which can cause 
various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
2Gb limitation on Netty allocator side) - 
https://issues.apache.org/jira/browse/ARROW-4890 

Would be great to consider feeding GroupedData into a pandas UDF in batches to 
solve this issue. 

cc [~hyukjin.kwon] 

 

  was:
`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
GroupedData, the whole group is passed to Pandas UDF as once, which can cause 
various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
2Gb limitation on Netty allocator side) - 
https://issues.apache.org/jira/browse/ARROW-4890 

Would be great to consider feeding GroupedData into a pandas UDF in batches to 
solve this issue. 

cc [~hyukjin.kwon] 

 


> GroupedData Pandas UDF 2Gb limit
> 
>
> Key: SPARK-32294
> URL: https://issues.apache.org/jira/browse/SPARK-32294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
> GroupedData, the whole group is passed to Pandas UDF at once, which can cause 
> various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
> 2Gb limitation on Netty allocator side) - 
> https://issues.apache.org/jira/browse/ARROW-4890 
> Would be great to consider feeding GroupedData into a pandas UDF in batches 
> to solve this issue. 
> cc [~hyukjin.kwon] 
>  






[jira] [Created] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2020-07-13 Thread Ruslan Dautkhanov (Jira)
Ruslan Dautkhanov created SPARK-32294:
-

 Summary: GroupedData Pandas UDF 2Gb limit
 Key: SPARK-32294
 URL: https://issues.apache.org/jira/browse/SPARK-32294
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0, 3.1.0
Reporter: Ruslan Dautkhanov


`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
GroupedData, the whole group is passed to Pandas UDF at once, which can cause 
various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
2Gb limitation on Netty allocator side) - 
https://issues.apache.org/jira/browse/ARROW-4890 

Would be great to consider feeding GroupedData into a pandas UDF in batches to 
solve this issue. 

cc [~hyukjin.kwon] 
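
The batching idea suggested above can be sketched in plain Python: instead of materializing a whole group, hand it over in chunks of at most a fixed number of records. This is only an illustration of the idea; `iter_batches` and `max_records` are hypothetical names and not part of the PySpark or Arrow APIs.

```python
from itertools import islice


def iter_batches(rows, max_records):
    """Yield lists of at most `max_records` items from any iterable."""
    if max_records <= 0:
        raise ValueError("max_records must be positive")
    it = iter(rows)
    while True:
        # islice pulls only the next chunk, so the full group is never held at once.
        batch = list(islice(it, max_records))
        if not batch:
            return
        yield batch
```

For example, `list(iter_batches(range(5), 2))` yields three batches, the last one holding the single leftover record.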

 






[jira] [Comment Edited] (SPARK-28266) data duplication when `path` serde property is present

2020-01-13 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014467#comment-17014467
 ] 

Ruslan Dautkhanov edited comment on SPARK-28266 at 1/13/20 4:46 PM:


Thank you for checking [~dongjoon]

That may have been a Cloudera Distribution of Spark issue all along (I did have 
a support case with Cloudera last year on this and it did not go anywhere on 
Spark side - Cloudera were fixing that from another side, by fixing `path` 
correctly on tables that were replicated )

I have moved on and no longer have access to a Cloudera environment. 


was (Author: tagar):
Thank you for checking [~dongjoon]

That may have been a Cloudera Distribution of Spark issue all along (I did have 
a support case with Cloudera last year on this and it did not go anywhere on 
Spark side - Cloudera were fixing that from another side, by fixing `path` 
correctly on tables that were replicated )

I have moved on and not longer have access to a Cloudera environment. 

> data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (All is good at this point. Now exit the session and run, for example in Hive:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.
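
The symptom in the reproducer above (the same location listed twice, and the count doubling) can be mimicked with a toy model. This is not Spark's file-index code; `count_rows`, `dedupe_paths` and the `rows_by_path` mapping are hypothetical and only show why scanning a repeated root double-counts rows.

```python
def count_rows(root_paths, rows_by_path):
    """Sum row counts over every listed root, repeats included."""
    return sum(rows_by_path[p] for p in root_paths)


def dedupe_paths(root_paths):
    """Drop repeated roots while keeping first-seen order."""
    return list(dict.fromkeys(root_paths))


# The same table location listed twice, as in the plan output above.
location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
roots = [location, location]
rows = {location: 1}
doubled = count_rows(roots, rows)              # each root is scanned once per listing
single = count_rows(dedupe_paths(roots), rows)  # repeated root removed first
```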






[jira] [Commented] (SPARK-28266) data duplication when `path` serde property is present

2020-01-13 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014467#comment-17014467
 ] 

Ruslan Dautkhanov commented on SPARK-28266:
---

Thank you for checking [~dongjoon]

That may have been a Cloudera Distribution of Spark issue all along (I did have 
a support case with Cloudera last year on this and it did not go anywhere on 
Spark side - Cloudera were fixing that from another side, by fixing `path` 
correctly on tables that were replicated )

I have moved on and no longer have access to a Cloudera environment. 






[jira] [Commented] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component

2019-12-23 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002402#comment-17002402
 ] 

Ruslan Dautkhanov commented on SPARK-29224:
---

E.g. would this work with 0.1m or 1m sparse features?

> Implement Factorization Machines as a ml-pipeline component
> ---
>
> Key: SPARK-29224
> URL: https://issues.apache.org/jira/browse/SPARK-29224
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: mob-ai
>Assignee: mob-ai
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: url_loss.xlsx
>
>
> Factorization Machines are widely used in advertising and recommendation 
> systems to estimate CTR (click-through rate).
> Advertising and recommendation systems usually have a lot of data, so we need 
> Spark to estimate the CTR, and Factorization Machines are a common ML model to 
> estimate CTR.
> Goal: Implement Factorization Machines as a ml-pipeline component
> Requirements:
> 1. loss function supports: logloss, mse
> 2. optimizer: mini batch SGD
> References:
> 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International 
> Conference on Data Mining (ICDM), pp. 995–1000, 2010.
> https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
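
For readers unfamiliar with the model, here is a minimal sketch of second-order factorization machine scoring on a sparse example, following the formulation in the Rendle reference above. It is plain Python and not the Spark ML implementation; `fm_predict`, `fm_predict_naive` and the dict-based parameter layout are hypothetical. The fast version uses the identity that the pairwise term equals half of (square of sum minus sum of squares) per latent factor, so the cost depends on the number of non-zero features rather than on the total feature count.

```python
def fm_predict(x, w0, w, v, k):
    """Score one sparse example.

    x: dict feature_index -> value (non-zeros only)
    w0: bias; w: dict feature_index -> linear weight
    v: dict feature_index -> list of k latent factors
    """
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * xi for i, xi in x.items())
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, v, k):
    """Same score computed with the explicit sum over feature pairs."""
    idx = sorted(x)
    linear = sum(w[i] * x[i] for i in idx)
    pairwise = 0.0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dot = sum(v[i][f] * v[j][f] for f in range(k))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise
```

Both functions return the same score up to floating-point rounding; the first one does O(k * nnz) work per example, where nnz is the number of non-zero features.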






[jira] [Commented] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component

2019-12-23 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002398#comment-17002398
 ] 

Ruslan Dautkhanov commented on SPARK-29224:
---

That's great.

Out of curiosity, what's the largest number of features this was tested with?

 

> Implement Factorization Machines as a ml-pipeline component
> ---
>
> Key: SPARK-29224
> URL: https://issues.apache.org/jira/browse/SPARK-29224
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: mob-ai
>Assignee: mob-ai
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: url_loss.xlsx
>
>
> Factorization Machines are widely used in advertising and recommendation 
> systems to estimate CTR (click-through rate).
> Advertising and recommendation systems usually have a lot of data, so we need 
> Spark to estimate the CTR, and Factorization Machines are a common ML model 
> for estimating CTR.
> Goal: Implement Factorization Machines as a ml-pipeline component
> Requirements:
> 1. loss function supports: logloss, mse
> 2. optimizer: mini batch SGD
> References:
> 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International 
> Conference on Data Mining (ICDM), pp. 995–1000, 2010.
> https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf






[jira] [Reopened] (SPARK-21488) Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view

2019-12-04 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-21488:
---

> Make saveAsTable() and createOrReplaceTempView() return dataframe of created 
> table/ created view
> 
>
> Key: SPARK-21488
> URL: https://issues.apache.org/jira/browse/SPARK-21488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Minor
>  Labels: bulk-closed
>
> It would be great to make saveAsTable() return dataframe of created table, 
> so you could pipe result further as for example
> {code}
> mv_table_df = (sqlc.sql('''
> SELECT ...
> FROM 
> ''')
> .write.format("parquet").mode("overwrite")
> .saveAsTable('test.parquet_table')
> .createOrReplaceTempView('mv_table')
> )
> {code}
> ... The code above currently fails, as expected, with:
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
> {noformat}
> If this is implemented, we can skip a step like
> {code}
> sqlc.sql('SELECT * FROM 
> test.parquet_table').createOrReplaceTempView('mv_table')
> {code}
> We have this pattern very frequently. 
> Further improvement can be made if createOrReplaceTempView also returns 
> dataframe object, so in one pipeline of functions 
> we can 
> - create an external table 
> - create a dataframe reference to this newly created table, for Spark SQL and 
> as a Spark variable.
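
Until something like this is implemented, the extra step can be wrapped in a small helper. The sketch below is hypothetical: `save_and_register` is not a Spark API, and it assumes `spark` is a `SparkSession` (or any object exposing `table()`) and `df` is a DataFrame. It relies only on `DataFrameWriter.saveAsTable`, `table()`, and `DataFrame.createOrReplaceTempView`.

```python
def save_and_register(df, spark, table_name, view_name,
                      fmt="parquet", mode="overwrite"):
    """Save df as a table, then return a DataFrame that reads the saved table.

    Hypothetical helper (not part of PySpark). saveAsTable() returns None,
    so the saved table is re-read through spark.table() before the temp
    view is registered on it.
    """
    df.write.format(fmt).mode(mode).saveAsTable(table_name)
    saved = spark.table(table_name)
    saved.createOrReplaceTempView(view_name)
    return saved
```

With the names from the description, usage would look like `mv_table_df = save_and_register(sqlc.sql('SELECT ...'), sqlc, 'test.parquet_table', 'mv_table')`.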






[jira] [Commented] (SPARK-21488) Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view

2019-12-04 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988011#comment-16988011
 ] 

Ruslan Dautkhanov commented on SPARK-21488:
---

[~zsxwing] any chance this can be added to Spark 3.0?

I can try to create a PR for this. Many of our users still report this as 
relevant, since it would streamline their code in many places.

We always have a good mix of Spark SQL and Spark API calls in many places,
and this would be a huge win for code readability. 

 

> Make saveAsTable() and createOrReplaceTempView() return dataframe of created 
> table/ created view
> 
>
> Key: SPARK-21488
> URL: https://issues.apache.org/jira/browse/SPARK-21488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Minor
>  Labels: bulk-closed
>
> It would be great to make saveAsTable() return dataframe of created table, 
> so you could pipe result further as for example
> {code}
> mv_table_df = (sqlc.sql('''
> SELECT ...
> FROM 
> ''')
> .write.format("parquet").mode("overwrite")
> .saveAsTable('test.parquet_table')
> .createOrReplaceTempView('mv_table')
> )
> {code}
> ... The code above currently fails, as expected, with:
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
> {noformat}
> If this is implemented, we can skip a step like
> {code}
> sqlc.sql('SELECT * FROM 
> test.parquet_table').createOrReplaceTempView('mv_table')
> {code}
> We have this pattern very frequently. 
> Further improvement can be made if createOrReplaceTempView also returns 
> dataframe object, so in one pipeline of functions 
> we can 
> - create an external table 
> - create a dataframe reference to this newly created table, for Spark SQL and 
> as a Spark variable.






[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2019-12-04 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987999#comment-16987999
 ] 

Ruslan Dautkhanov commented on SPARK-19842:
---

From the design document:

"""

This alternative proposes to use the KEY_CONSTRAINTS catalog table when Spark 
upgrades to Hive 2.1. Therefore, this proposal will introduce a dependency on 
Hive metastore 2.1. 

""" 

It seems Spark 3.0 is moving towards Hive 2.1, which has FK support. Would it 
be possible to add FKs and related optimizations to Spark 3.0 too? 

Thanks!

 

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]
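
To make the join-elimination idea concrete, here is a minimal sketch using SQLite from the Python standard library. It is not Spark code and does not show how Catalyst would implement the rewrite; the `customer` and `orders` tables are hypothetical. It only demonstrates the equivalence an optimizer could exploit: when every `orders.c_id` is non-null and references a unique `customer.c_id`, a join that selects no columns from `customer` returns the same rows as a scan of `orders` alone.

```python
import sqlite3


def join_elimination_demo():
    """Return (rows with the join, rows without it) for a PK/FK pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (c_id INTEGER PRIMARY KEY, c_name TEXT);
        CREATE TABLE orders (
            o_id   INTEGER PRIMARY KEY,
            c_id   INTEGER NOT NULL REFERENCES customer(c_id),
            amount INTEGER
        );
        INSERT INTO customer VALUES (1, 'a'), (2, 'b');
        INSERT INTO orders VALUES (10, 1, 5), (11, 1, 7), (12, 2, 3);
        """
    )
    # Original query: joins the parent table but uses none of its columns.
    with_join = conn.execute(
        "SELECT o.o_id, o.amount FROM orders o "
        "JOIN customer c ON o.c_id = c.c_id ORDER BY o.o_id"
    ).fetchall()
    # Rewritten query: the join is dropped, relying on the constraint
    # holding in the data (it is not enforced here, only assumed).
    without_join = conn.execute(
        "SELECT o_id, amount FROM orders ORDER BY o_id"
    ).fetchall()
    conn.close()
    return with_join, without_join
```

The constraint in this sketch is informational in the same sense as in the proposal: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys` is turned on, so the rewrite is only safe because the loaded data is known to satisfy it.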






[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-11-11 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971723#comment-16971723
 ] 

Ruslan Dautkhanov commented on SPARK-22340:
---

Glad to see this is solved. 

A nice side effect should be somewhat better performance in some cases 
involving heavy Python-Java communication on multi-NUMA / multi-socket 
configurations. With static threads, the Linux kernel will actually have a 
chance to schedule threads on processors/cores that are more local to the 
data's NUMA placement. 

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Mortenson
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 
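
The "mimic the TLS behavior on the Python side" idea from the description can be illustrated without Spark. The sketch below is a pure-Python assumption of what that could look like, not the way the issue was eventually fixed: `set_job_group`, `current_job_group`, and `submit_job` are hypothetical names, and `submit_job` stands in for every point where a job would be handed to the JVM.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One slot per Python thread, independent of which Java thread serves the call.
_job_group = threading.local()


def set_job_group(group_id, description):
    """Remember the job group for the calling Python thread only."""
    _job_group.value = (group_id, description)


def current_job_group():
    """Return the calling thread's job group, or None if it never set one."""
    return getattr(_job_group, "value", None)


def submit_job(name):
    """Hypothetical submission point: tag the job with the caller's group."""
    return (name, current_job_group())


def run_jobs():
    set_job_group("hello", "hello jobs")
    return [submit_job("x"), submit_job("y")]


def run_in_worker():
    """Run run_jobs() on a worker thread and return the tagged jobs."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(run_jobs).result()
```

Both jobs submitted from the worker thread carry the `hello` group, while the main thread, which never called `set_job_group`, sees no group. The brittleness the reporter mentions is visible here too: every submission path has to go through something like `submit_job` for the tagging to be reliable.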






[jira] [Commented] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-10-16 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953163#comment-16953163
 ] 

Ruslan Dautkhanov commented on SPARK-29041:
---

[~hyukjin.kwon] thanks for getting back on this. I see discussion in the PR 
regarding Python 2 and Python 3, but no discussion regarding applying that 
patch to Spark 2.3. What am I missing? 

Thanks.

> Allow createDataFrame to accept bytes as binary type
> 
>
> Key: SPARK-29041
> URL: https://issues.apache.org/jira/browse/SPARK-29041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> spark.createDataFrame([[b"abcd"]], "col binary")
> {code}
> simply fails as below:
> in Python 3
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj
> TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'>
> {code}
> in Python 2:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj
> TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'>
> {code}
> {{bytes}} should also be accepted as binary type
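
On releases without the fix, one possible workaround is to convert `bytes` values to `bytearray` before calling `createDataFrame`. This is a sketch under an assumption that should be verified on the version in use: that the type verifier in the affected releases accepts `bytearray` for `BinaryType`. The helper name `bytes_to_bytearray` is hypothetical, and the sketch targets Python 3 only, because in Python 2 `bytes` is the same type as `str` and every string would be converted.

```python
def bytes_to_bytearray(rows):
    """Return a copy of rows with every bytes value wrapped in bytearray.

    rows is a list of row lists, as passed to createDataFrame in the
    reproducer. Non-bytes values are left untouched.
    """
    return [
        [bytearray(value) if isinstance(value, bytes) else value
         for value in row]
        for row in rows
    ]
```

With the reproducer above, the call would become `spark.createDataFrame(bytes_to_bytearray([[b"abcd"]]), "col binary")`.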






[jira] [Commented] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-09-30 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941289#comment-16941289
 ] 

Ruslan Dautkhanov commented on SPARK-29041:
---

Thank you [~hyukjin.kwon]

Our users say this issue exists in 2.3 too. Would it be possible to apply that 
patch to the 2.3 branch as well?

 

> Allow createDataFrame to accept bytes as binary type
> 
>
> Key: SPARK-29041
> URL: https://issues.apache.org/jira/browse/SPARK-29041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> spark.createDataFrame([[b"abcd"]], "col binary")
> {code}
> simply fails as below:
> in Python 3
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj
> TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'>
> {code}
> in Python 2:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj
> TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'>
> {code}
> {{bytes}} should also be accepted as binary type






[jira] [Updated] (SPARK-28266) data duplication when `path` serde property is present

2019-07-12 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-28266:
--
Summary: data duplication when `path` serde property is present  (was: data 
correctness issue: data duplication when `path` serde property is present)

> data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run in Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<id:int>
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.






[jira] [Commented] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-11 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883222#comment-16883222
 ] 

Ruslan Dautkhanov commented on SPARK-28266:
---

Another interesting Spark bug, found while trying to fix this issue:

If the `spark.sql.sources.provider` table property IS present and the `path` 
serde property IS NOT present, then Spark will happily always return 0 (zero) 
records, irrespective of the files that `LOCATION` points at.
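
The two cases reported in the comments on this issue (2019-07-10 and 2019-07-11) can be summarized as a small lookup. This is only a restatement of those observations in code, not Spark logic; the function name and return strings are made up for illustration, and combinations not mentioned in the thread are deliberately left unclassified.

```python
def reported_symptom(has_provider_property, has_path_serde_property):
    """Map the two metadata flags to the symptom reported in this thread.

    has_provider_property   : `spark.sql.sources.provider` table property set
    has_path_serde_property : `path` serde property set
    """
    if has_path_serde_property and not has_provider_property:
        return "duplicated rows"
    if has_provider_property and not has_path_serde_property:
        return "zero rows"
    return "not reported in this thread"
```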

 

> data correctness issue: data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run in Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<id:int>
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.






[jira] [Commented] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-10 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882381#comment-16882381
 ] 

Ruslan Dautkhanov commented on SPARK-28266:
---

This issue happens when the `spark.sql.sources.provider` table property is NOT 
present and the `path` serde property is present.

Spark duplicates records in this case.

 

> data correctness issue: data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run in Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<id:int>
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.






[jira] [Commented] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-10 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882200#comment-16882200
 ] 

Ruslan Dautkhanov commented on SPARK-28266:
---

I suspect the change in SPARK-22158 causes this.

> data correctness issue: data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run in Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<id:int>
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.






[jira] [Commented] (SPARK-22158) convertMetastore should not ignore storage properties

2019-07-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881602#comment-16881602
 ] 

Ruslan Dautkhanov commented on SPARK-22158:
---

[~dongjoon] I may have misreported it - sorry. 

[~waleedfateem] ran some tests. I thought 2.2.0 was affected as well, but 
you're probably right that 2.2.1 is the first one affected.
Cloudera has pointed to this Jira.

Thank you. 

> convertMetastore should not ignore storage properties
> -
>
> Key: SPARK-22158
> URL: https://issues.apache.org/jira/browse/SPARK-22158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> From the beginning, convertMetastoreOrc ignores table properties and uses an 
> empty map instead. It's the same with convertMetastoreParquet.
> {code}
> val options = Map[String, String]()
> {code}
> - SPARK-14070: 
> https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650
> - master: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197






[jira] [Comment Edited] (SPARK-22158) convertMetastore should not ignore storage properties

2019-07-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881437#comment-16881437
 ] 

Ruslan Dautkhanov edited comment on SPARK-22158 at 7/9/19 6:57 PM:
---

[~dongjoon] can you please check if PR-20522 causes SPARK-28266 data 
correctness regression?

Thank you.


was (Author: tagar):
[~dongjoon] can you please check if this causes SPARK-28266 data correctness 
regression? 

Thank you.

> convertMetastore should not ignore storage properties
> -
>
> Key: SPARK-22158
> URL: https://issues.apache.org/jira/browse/SPARK-22158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> From the beginning, convertMetastoreOrc ignores table properties and uses an 
> empty map instead. It's the same with convertMetastoreParquet.
> {code}
> val options = Map[String, String]()
> {code}
> - SPARK-14070: 
> https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650
> - master: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197






[jira] [Commented] (SPARK-22158) convertMetastore should not ignore storage properties

2019-07-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881437#comment-16881437
 ] 

Ruslan Dautkhanov commented on SPARK-22158:
---

[~dongjoon] can you please check if this causes SPARK-28266 data correctness 
regression? 

Thank you.

> convertMetastore should not ignore storage properties
> -
>
> Key: SPARK-22158
> URL: https://issues.apache.org/jira/browse/SPARK-22158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> From the beginning, convertMetastoreOrc ignores table properties and uses an 
> empty map instead. It's the same with convertMetastoreParquet.
> {code}
> val options = Map[String, String]()
> {code}
> - SPARK-14070: 
> https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650
> - master: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197






[jira] [Updated] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-08 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-28266:
--
Summary: data correctness issue: data duplication when `path` serde 
property is present  (was: data correctness issue: data duplication when `path` 
serde peroperty is present)

> data correctness issue: data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates the returned data when the `path` serde property is present 
> on a Parquet table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (All is good at this point. Now exit the session and run the following, in Hive for example.)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create Parquet tables in Hive with the `path` 
> serde property, and this duplicates data in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.






[jira] [Created] (SPARK-28266) data correctness issue: data duplication when `path` serde peroperty is present

2019-07-05 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-28266:
-

 Summary: data correctness issue: data duplication when `path` 
serde peroperty is present
 Key: SPARK-28266
 URL: https://issues.apache.org/jira/browse/SPARK-28266
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Spark Core
Affects Versions: 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 
2.2.3, 2.2.2, 2.2.1, 2.2.0, 2.3.4, 2.4.4, 3.0.0
Reporter: Ruslan Dautkhanov


Spark duplicates the returned data when the `path` serde property is present 
on a Parquet table. 

Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.

Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 at 
least).

Reproducer:

{code:python}
>>> spark.sql("create table ruslan_test.test55 as select 1 as id")
DataFrame[]

>>> spark.table("ruslan_test.test55").explain()

== Physical Plan ==
HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]

>>> spark.table("ruslan_test.test55").count()
1

{code}

(All is good at this point. Now exit the session and run the following, in Hive for example.)

{code:sql}
ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
{code}

So LOCATION and serde `path` property would point to the same location.
Now see count returns two records instead of one:

{code:python}
>>> spark.table("ruslan_test.test55").count()
2

>>> spark.table("ruslan_test.test55").explain()
== Physical Plan ==
*(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: Parquet, 
Location: 
InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct
>>>

{code}

Also notice that the presence of `path` serde property makes TABLE location 
show up twice - 
{quote}
InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
hdfs://epsdatalake/hive..., 
{quote}

We have some applications that create Parquet tables in Hive with the `path` 
serde property, and this duplicates data in query results. 

Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
not Spark 2.2 and later releases.
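
A toy model of the symptom in plain Python, not Spark code. It assumes (not 
verified here against Spark's source) that the table LOCATION and the `path` 
serde property both end up in the list of scan roots, which is what the 
InMemoryFileIndex output above suggests. The names scan_roots, count_rows and 
distinct_roots are hypothetical and exist only for this sketch.

```python
def scan_roots(location, serde_properties):
    """Collect the directories to scan: LOCATION plus an optional `path` property."""
    roots = [location]
    extra = serde_properties.get("path")
    if extra is not None:
        roots.append(extra)
    return roots


def count_rows(roots, rows_by_path):
    """Sum the rows found under every root, counting repeated roots repeatedly."""
    return sum(rows_by_path[root] for root in roots)


def distinct_roots(roots):
    """Drop repeated roots while keeping order (trailing slashes ignored)."""
    return list(dict.fromkeys(root.rstrip("/") for root in roots))


location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
rows_by_path = {location: 1}  # the table holds a single row

roots = scan_roots(location, {"path": location})
duplicated = count_rows(roots, rows_by_path)
deduplicated = count_rows(distinct_roots(roots), rows_by_path)
```

With the same directory listed twice, the naive count doubles, while 
de-duplicating the roots first gives the count that Hive and Impala return.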







[jira] [Issue Comment Deleted] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2019-05-30 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-22151:
--
Comment: was deleted

(was: Is there is a workaround for this in Apache Livy? 
We're still on Spark 2.3 ..)

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set pythonpath via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> the yarn Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv it puts it into the local env. 
> So the python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit






[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2019-05-30 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852381#comment-16852381
 ] 

Ruslan Dautkhanov commented on SPARK-22151:
---

Is there a workaround for this in Apache Livy? 
We're still on Spark 2.3.

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set pythonpath via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> the yarn Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv it puts it into the local env. 
> So the python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit






[jira] [Comment Edited] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2019-05-15 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840018#comment-16840018
 ] 

Ruslan Dautkhanov edited comment on SPARK-15463 at 5/15/19 4:00 PM:


[~hyukjin.kwon] would it be possible to make csvParsing optional (and have only 
schema inference)? 

We have an RDD whose columns are already stored separately in a tuple, all of 
them as strings. It would be great to infer the schema without parsing each 
record from a single String as CSV.

The current workaround is to glue all the strings together (with proper 
quoting, escaping, etc.) just so that Spark can split them back into columns 
here 

[https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156]

and finally run inferSchema on that list of columns. 

infer() already accepts `RDD[Array[String]]`, but the current API only accepts 
`RDD[String]` (or `Dataset[String]`):

[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]

I don't think calling infer() directly from 
[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30]
 is a public API? It seems like a better workaround than collapsing all-string 
columns into CSV and having Spark parse it internally only to infer the data 
types of those columns. 

Thank you for any leads.
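
To make the request concrete, here is a minimal pure-Python sketch of inferring 
column types from rows that are already split into string fields, with no CSV 
parsing step. It is not Spark's CSVInferSchema; the function names and the 
three-type ordering (int, double, string) are assumptions made up for this 
illustration.

```python
# Wider types get a higher rank; merging two types keeps the wider one.
TYPE_RANK = {"int": 0, "double": 1, "string": 2}


def infer_field(value):
    """Return the narrowest type name that can represent one string field."""
    for name, cast in (("int", int), ("double", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def merge_types(left, right):
    """Pick the wider of two inferred type names."""
    return left if TYPE_RANK[left] >= TYPE_RANK[right] else right


def infer_schema(rows):
    """Infer one type per column from an iterable of pre-split string rows."""
    schema = None
    for row in rows:
        types = [infer_field(value) for value in row]
        if schema is None:
            schema = types
        else:
            schema = [merge_types(a, b) for a, b in zip(schema, types)]
    return schema or []


rows = [("1", "2.5", "a"), ("2", "3", "b")]
schema = infer_schema(rows)
```

The point of the sketch is that inference only needs the per-column strings, so 
joining them into one CSV line and splitting them again adds no information.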

 


was (Author: tagar):
[~hyukjin.kwon] would it be possible to make csvParsing optional (and have only 
? 

We have an RDD with columns stored separately in a tuple .. but all strings. 
Would be great to infer schema without parsing a single String as a csv.

Current workaround is to glue all strings together (with proper quoting, 
escaping etc) just so that bring columns back here 

[https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156]

into a list of columns and finally use inferSchema 

infer() already accepts `RDD[Array[String]] ` but current API only accepts 
`RDD[String]` (or `Dataset[String]`)

[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]

I don't think it's a public API to use infer() directly from 
[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30]
 ? It seems to be a better workaround than collapsing all-string columns to a 
csv, parse it internally by Spark only to infer data types of those columns. 

Thank you for any leads.

 

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.0
>
>
> I currently use Databrick's spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].






[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2019-05-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840018#comment-16840018
 ] 

Ruslan Dautkhanov commented on SPARK-15463:
---

[~hyukjin.kwon] would it be possible to make csvParsing optional (and have only 
? 

We have an RDD whose columns are already stored separately in a tuple, all of 
them as strings. It would be great to infer the schema without parsing each 
record from a single String as CSV.

The current workaround is to glue all the strings together (with proper 
quoting, escaping, etc.) just so that Spark can split them back into columns 
here 

[https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156]

and finally run inferSchema on that list of columns. 

infer() already accepts `RDD[Array[String]]`, but the current API only accepts 
`RDD[String]` (or `Dataset[String]`):

[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]

I don't think calling infer() directly from 
[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30]
 is a public API? It seems like a better workaround than collapsing all-string 
columns into CSV and having Spark parse it internally only to infer the data 
types of those columns. 

Thank you for any leads.

 

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.0
>
>
> I currently use Databrick's spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].






[jira] [Commented] (SPARK-15719) Disable writing Parquet summary files by default

2019-04-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814061#comment-16814061
 ] 

Ruslan Dautkhanov commented on SPARK-15719:
---

[~lian cheng] quick question on this part from the description -

{quote}
when schema merging is enabled, we need to read footers of all files anyway to 
do the merge
{quote}
Is that still accurate in current Spark 2.3/2.4? 
I was looking at ParquetFileFormat.inferSchema, and it does look at the 
`_common_metadata` and `_metadata` files here - 

https://github.com/apache/spark/blob/v2.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L231

Or would Spark still need to look at files in all partitions, though not 
necessarily at every Parquet file? 

Thank you.

> Disable writing Parquet summary files by default
> 
>
> Key: SPARK-15719
> URL: https://issues.apache.org/jira/browse/SPARK-15719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Parquet summary files are not particularly useful nowadays since
> # when schema merging is disabled, we assume the schema of all Parquet part-files 
> is identical, thus we can read the footer from any part-file.
> # when schema merging is enabled, we need to read footers of all files anyway 
> to do the merge.
> On the other hand, writing summary files can be expensive because footers of 
> all part-files must be read and merged. This is particularly costly when 
> appending a small dataset to a large existing Parquet dataset.






[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign keys

2019-04-04 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810124#comment-16810124
 ] 

Ruslan Dautkhanov commented on SPARK-21784:
---

Any chance this can be part of Spark 3.0 release?


> Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign 
> keys
> --
>
> Key: SPARK-21784
> URL: https://issues.apache.org/jira/browse/SPARK-21784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Currently Spark SQL does not have DDL support to define primary key and 
> foreign key constraints. This Jira is to add DDL support to define primary 
> key and foreign key informational constraints using ALTER TABLE syntax. These 
> constraints will be used in query optimization; more details are in the spec 
> in SPARK-19842.
> *Syntax :*
> {code}
> ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
>   (PRIMARY KEY (col_names) |
>   FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
>   [VALIDATE | NOVALIDATE] [RELY | NORELY]
> {code}
> Examples :
> {code:sql}
> ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
> ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES 
> employee(empno) NOVALIDATE NORELY
> {code}
> *Constraint name generated by the system:*
> {code:sql}
> ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
> ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) 
> VALIDATE RELY;
> {code}






[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache

2019-02-25 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777060#comment-16777060
 ] 

Ruslan Dautkhanov commented on SPARK-26764:
---

That seems to be closely related to Hive materialized views - implemented in 
Hive 3.2
HIVE-10459 



> [SPIP] Spark Relational Cache
> -
>
> Key: SPARK-26764
> URL: https://issues.apache.org/jira/browse/SPARK-26764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Adrian Wang
>Priority: Major
> Attachments: Relational+Cache+SPIP.pdf
>
>
> In modern database systems, relational cache is a common technology to boost 
> ad-hoc queries. While Spark provides cache natively, Spark SQL should be able 
> to utilize the relationship between relations to boost all possible queries. 
> In this SPIP, we will make Spark be able to utilize all defined cached 
> relations if possible, without explicit substitution in user query, as well 
> as keep some user defined cache available in different sessions. Materialized 
> views in many database systems provide similar function.






[jira] [Comment Edited] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695427#comment-16695427
 ] 

Ruslan Dautkhanov edited comment on SPARK-26019 at 11/22/18 12:42 AM:
--

Thank you [~irashid]

I confirm that swapping those two lines doesn't fix things.

Fixing race condition that happens in accumulators.py: _start_update_server()
 # SocketServer:TCPServer defaults bind_and_activate to True
 [https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L413]

 # Also {{handle()}} is defined in derived class _UpdateRequestHandler here
 
[https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232]

Please help review [https://github.com/apache/spark/pull/23113] 

Basically, the fix is to bind and activate SocketServer.TCPServer only in the 
dedicated thread that serves AccumulatorServer, 
 to avoid the race condition that could happen if we start listening and 
accepting connections in the main thread. 

I manually verified and it fixes things for us.
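
For readers who want to see the idea in isolation, here is a minimal sketch 
using Python 3's standard socketserver module. It is not the code from the PR 
and not PySpark's AccumulatorServer; EchoHandler, start_server_in_thread and 
round_trip are hypothetical names used only here. The server is constructed 
with bind_and_activate=False, so binding and listening happen inside the 
serving thread rather than in the main thread.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    """Hypothetical handler: reads one line and writes it back."""

    def handle(self):
        line = self.rfile.readline()
        self.wfile.write(line)


def start_server_in_thread(handler_cls):
    # bind_and_activate=False: the constructor neither binds nor listens.
    server = socketserver.TCPServer(
        ("127.0.0.1", 0), handler_cls, bind_and_activate=False
    )
    ready = threading.Event()

    def serve():
        # Bind and start listening only inside the dedicated thread.
        server.server_bind()
        server.server_activate()
        ready.set()
        server.serve_forever()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    if not ready.wait(timeout=5):
        raise RuntimeError("server thread did not start listening in time")
    return server, thread


def round_trip(message):
    """Send one line to a freshly started server and return the echoed line."""
    server, thread = start_server_in_thread(EchoHandler)
    try:
        host, port = server.server_address  # updated by server_bind()
        with socket.create_connection((host, port), timeout=5) as conn:
            conn.sendall(message + b"\n")
            return conn.makefile("rb").readline()
    finally:
        server.shutdown()
        server.server_close()
        thread.join(timeout=5)
```

The Event is only there so the caller can learn the port after the thread has 
bound the socket; whether the real fix needs such a handshake is not stated in 
this thread.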

Thank you.


was (Author: tagar):
Thank you [~irashid]

I confirm that swapping those two lines doesn't fix things.

Fixing race condition that happens in accumulators.py: _start_update_server()
 # SocketServer:TCPServer defaults bind_and_activate to True
[https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L413]

 # Also {{handle()}} is defined in derived class _UpdateRequestHandler here
[https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232]

Please help review [https://github.com/apache/spark/pull/23113] 

Basically fix is to bind and activate SocketServer.TCPServer only in that 
dedicated thread to serve AccumulatorServer, 
to avoid race condition that happens that could happen if we start listening 
and accepting connections in main thread. 

I manually verified and it fixes things for us.

Thank you.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> The error seems to be flaky: it didn't happen on the next rerun.






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695427#comment-16695427
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Thank you [~irashid]

I confirm that swapping those two lines doesn't fix things.

Fixing race condition that happens in accumulators.py: _start_update_server()
 # SocketServer:TCPServer defaults bind_and_activate to True
[https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L413]

 # Also {{handle()}} is defined in derived class _UpdateRequestHandler here
[https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232]

Please help review [https://github.com/apache/spark/pull/23113] 

Basically, the fix is to bind and activate SocketServer.TCPServer only in the 
dedicated thread that serves AccumulatorServer, 
to avoid the race condition that could happen if we start listening and 
accepting connections in the main thread. 

I manually verified and it fixes things for us.

Thank you.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> The error seems to be flaky: it didn't happen on the next rerun.






[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
>     self.handle()
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
>     if func():
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> The error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> The error seems to be flaky: it did not happen on the next rerun.






[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692716#comment-16692716
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~viirya] the exception stack shows that the error is raised from the 
BaseRequestHandler class constructor in SocketServer.py. Here is an excerpt 
from the full exception stack above:

{code:python}
...
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
...
{code}

Notice that the constructor calls `self.handle()`: 
https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L655

`handle()` is defined in the derived class _UpdateRequestHandler here: 
https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232
and it expects `auth_token` to be set:
https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L254 
That is exactly where the exception happens. 

[~irashid] was right - those two lines have to be swapped.
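
A minimal sketch of the ordering hazard described above, written for Python 3 (where the module is named socketserver rather than SocketServer). BrokenHandler, FixedHandler and try_handler are hypothetical names used only for illustration; this is not the Spark code, and it assumes nothing more than a token that is assigned after the parent constructor has already run:

```python
import socketserver


class BrokenHandler(socketserver.BaseRequestHandler):
    """Hypothetical handler: the token is assigned after the parent constructor."""

    auth_token = None  # class-level default, so handle() sees None if it runs too early

    def __init__(self, request, client_address, server, auth_token):
        # BaseRequestHandler.__init__ calls setup(), handle() and finish() right away,
        # so handle() runs before the assignment on the following line.
        socketserver.BaseRequestHandler.__init__(self, request, client_address, server)
        self.auth_token = auth_token

    def handle(self):
        # Stand-in for reading len(auth_token) bytes from the socket.
        self.token_length = len(self.auth_token)


class FixedHandler(BrokenHandler):
    """Same handler with the two constructor lines swapped."""

    def __init__(self, request, client_address, server, auth_token):
        self.auth_token = auth_token
        socketserver.BaseRequestHandler.__init__(self, request, client_address, server)


def try_handler(handler_class):
    """Build a handler with dummy connection arguments and report the outcome."""
    try:
        handler = handler_class(None, None, None, "secret")
    except TypeError as exc:
        return "TypeError: %s" % exc
    return "ok, token length %d" % handler.token_length
```

With these definitions, try_handler(BrokenHandler) reports a TypeError about 'NoneType' having no len(), while try_handler(FixedHandler) completes normally.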

[~hyukjin.kwon] it's odd that you closed this jira even though I said it 
reproduces for me every time (100% of runs), 
and even [posted a reproducer 
here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=13197858=16692219].
[~saivarunvishal] also said it happens for him in SPARK-26113, and you closed 
that jira as well. 
This does not seem to be in line with https://spark.apache.org/contributing.html - 
"Contributing Bug Reports". Please let me know what I am missing here.

I called out [~bersprockets] because we use the Cloudera distribution of Spark, 
and Cloudera has a few patches on top of open-source Spark. 
I wanted to make sure the problem is not specific to the Cloudera distro. We have 
also worked with Bruce on several other Spark issues, and I noticed he is in the 
watchers list on this jira... Now I see that this issue is not Cloudera-specific, 
though. 






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692377#comment-16692377
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~irashid] thanks a lot for looking at this! 
It makes sense to swap those two lines so that the parent class constructor is 
called after auth_token has been initialized. 

We're using Cloudera's Spark, and the pyspark dependencies are inside a zip 
file, in an "immutable" parcel... 
Unfortunately there is no quick way to test it, as the change has to be 
propagated to all worker nodes. [~bersprockets] any ideas how to test this? 
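
Short of patching the parcel, one thing that can at least be checked is which version of accumulators.py a given pyspark.zip actually contains. The following is only a sketch using the standard zipfile module; read_member and lines_containing are hypothetical helper names, and the zip location on any particular node is an assumption that has to be filled in by hand:

```python
import zipfile


def read_member(zip_path, member):
    """Return the decoded text of one file stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member).decode("utf-8")


def lines_containing(text, needle):
    """Return (line_number, line) pairs for every line that mentions `needle`."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]
```

For example, lines_containing(read_member(path_to_pyspark_zip, "pyspark/accumulators.py"), "auth_token") lists every place where the token is assigned or read, which shows the order of the two lines in question without unpacking the archive.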

Thank you.





[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692237#comment-16692237
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

cc [~lucacanali] 




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692233#comment-16692233
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Sorry, no: it was broken by this change - 
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650#diff-c3339bbf2b850b79445b41e9eecf57c4R249
 






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692226#comment-16692226
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

It might have been broken by the change in https://github.com/apache/spark/pull/22635 




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692219#comment-16692219
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~hyukjin.kwon] today I reproduced this myself for the first time, and we are 
still receiving reports from our other users as well. 

Here's the code, on Spark 2.3.2 + Python 2.7.15.

Execute it on a freshly created Spark session:

{code:python}

def python_major_version():
    import sys
    return sys.version_info[0]


print(python_major_version())

# The error happens on the next line (sc is the session's SparkContext)!
print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())

{code}

It always reproduces for me.

Notice that just rerunning the same code makes this error disappear.






[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---

Reproduced myself 




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-16 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690123#comment-16690123
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

That user said he has seen this error 4-5 times, and that just rerunning the 
same code makes it disappear.






[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-15 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689056#comment-16689056
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

[~hyukjin.kwon] I didn't request an investigation. I hope that creating a jira 
and explaining how the problem happens may help somebody else solve their 
problem too, no? 

In case you haven't noticed, this jira has a sequence of SQLs attached as a txt 
file that trigger this problem. There are a couple of other jiras, SPARK-13480 
and SPARK-12940, that seem relevant but were also closed as "could not 
reproduce". I think there is a long-standing problem where Catalyst 
over-optimizes and cuts some columns from the lineage. 

I thought that by reporting problems here we help make Spark better, no? 
Unfortunately, closing a jira as "can't reproduce" doesn't make the problem 
disappear.

Having said that, I will try to make a reproducible case and upload it here, in 
addition to the SQLs that are already attached.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  

[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-15 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689052#comment-16689052
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

No, it was the only instance I had of this problem. I will check again with the 
user who ran into this. 

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.
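
The failing line in the traceback can be reproduced in isolation. The following is a minimal sketch, assuming only what the `TypeError` itself indicates, namely that `auth_token` was `None` when the handler ran. The helper names `read_token` and `read_token_guarded` are hypothetical and are not part of PySpark; the sketch does not explain why the token was unset.

```python
import io


def read_token(rfile, auth_token):
    """Mirror the failing line: read as many bytes as the expected token is long."""
    return rfile.read(len(auth_token))


def read_token_guarded(rfile, auth_token):
    """Same read, but fail with an explicit message when no token is configured."""
    if auth_token is None:
        raise ValueError("auth_token is not set; cannot authenticate the update")
    return rfile.read(len(auth_token))


stream = io.BytesIO(b"secret-token-and-more")

# With auth_token=None, len() raises before anything is read from the stream.
try:
    read_token(stream, None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

A guard like `read_token_guarded` would not fix the underlying cause, but it would turn the opaque `TypeError` into a message that points at the missing token.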



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-15 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov resolved SPARK-26019.
---
Resolution: Cannot Reproduce

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.






[jira] [Updated] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-26041:
--
Environment: 
Spark 2.3.2 

Hadoop 2.6

When we materialize one of intermediate dataframes as a parquet table, and read 
it back in, this error doesn't happen (exact same downflow queries ). 

 

  was:
Spark 2.3.2 

PySpark 2.7.15 + Hadoop 2.6

When we materialize one of intermediate dataframes as a parquet table, and read 
it back in, this error doesn't happen (exact same downflow queries ). 
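
The workaround described in the Environment field could be sketched in PySpark roughly as follows. This is only an illustration under assumptions: the helper name `materialize_via_parquet` and the example path are hypothetical, and `spark` and `df` stand for an existing `SparkSession` and the intermediate DataFrame. The only Spark calls used are `DataFrame.write.mode(...).parquet(...)` and `SparkSession.read.parquet(...)`.

```python
def materialize_via_parquet(spark, df, path):
    """Write df to Parquet at path and return a DataFrame read back from it.

    The returned DataFrame's plan starts from a file scan rather than from
    the chain of operators that produced df.
    """
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)


# Hypothetical usage inside an existing PySpark job:
# intermediate_df = materialize_via_parquet(
#     spark, intermediate_df, "/tmp/spark_26041_intermediate"
# )
```

Presumably this helps because the optimizer no longer sees the upstream group-bys and joins when planning the downstream queries, but that is an inference from the observed behaviour, not a confirmed root cause.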

 


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, executor 153): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> 

[jira] [Comment Edited] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686907#comment-16686907
 ] 

Ruslan Dautkhanov edited comment on SPARK-26041 at 11/14/18 5:45 PM:
-

thanks for checking this [~mgaido] 

just attached txt file that shows sequence of dataframe creation and last 
failing dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.



was (Author: tagar):
thank for checking this [~mgaido] 

just attached txt file that shows sequence of dataframe creation and last 
failing dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, executor 153): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> 

[jira] [Comment Edited] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686907#comment-16686907
 ] 

Ruslan Dautkhanov edited comment on SPARK-26041 at 11/14/18 5:45 PM:
-

thank for checking this [~mgaido] 

just attached txt file that shows sequence of dataframe creation and last 
failing dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.



was (Author: tagar):
thank for checking this [~mgaido] 

just attached sql that shows sequence of dataframe creation and last failing 
dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, executor 153): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686907#comment-16686907
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

thank for checking this [~mgaido] 

just attached sql that shows sequence of dataframe creation and last failing 
dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, executor 153): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  

[jira] [Updated] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-26041:
--
Attachment: SPARK-26041.txt

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, executor 153): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686851#comment-16686851
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

Thanks for referencing that jira, [~mgaido].

From its description, SPARK-26057 seems to be specific to Spark 2.4 only. 
We see this problem in Spark 2.3.1 and in Spark 2.3.2. 

Can you check whether https://github.com/apache/spark/pull/23035 is applicable to 
Spark 2.3 too? Thanks.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>

[jira] [Updated] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-26041:
--
Affects Version/s: 2.3.0
   2.3.1

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-13 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685592#comment-16685592
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

There are a couple of related jiras that were closed as "cannot reproduce", 
for example SPARK-13480 and SPARK-12940.

In some cases this problem doesn't happen for us either. For example, 
materializing one of the intermediate dataframes as a parquet table makes the 
workflow work normally.
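
The workaround described above can be sketched as follows. This is a minimal illustration, assuming `spark` is a SparkSession and `df` is the intermediate dataframe; the helper name `materialize_as_parquet` and its `path` argument are hypothetical and not part of the original report. It relies only on the standard `DataFrame.write.mode(...).parquet(path)` and `spark.read.parquet(path)` calls.

```python
def materialize_as_parquet(spark, df, path):
    """Write df to parquet at path and return a dataframe read back from it.

    Hypothetical helper for illustration: `spark` is expected to be a
    SparkSession and `df` a DataFrame.
    """
    # Overwrite any earlier copy so reruns of the workflow start clean.
    df.write.mode("overwrite").parquet(path)
    # The returned dataframe is read from the files just written, so the
    # downstream queries are built on top of a plain parquet scan.
    return spark.read.parquet(path)
```

The downstream queries would then use the returned dataframe in place of the original intermediate one.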

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-13 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685589#comment-16685589
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

The issue might have been introduced by SPARK-9830. 

Comment from [~marmbrus] in the older SPARK-13087: 

{quote}
but it seems there should be a deeper fix that prevents the problem instead of 
covering for it.

[~yhuai] I think this problem crept in with the changes for SPARK-9830
{quote}

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>

[jira] [Created] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-13 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-26041:
-

 Summary: catalyst cuts out some columns from dataframes: 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute
 Key: SPARK-26041
 URL: https://issues.apache.org/jira/browse/SPARK-26041
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Spark Core
Affects Versions: 2.4.0, 2.3.2
 Environment: Spark 2.3.2 

PySpark 2.7.15 + Hadoop 2.6

When we materialize one of the intermediate dataframes as a parquet table and read 
it back in, this error doesn't happen (with the exact same downstream queries). 

 
Reporter: Ruslan Dautkhanov


There is a workflow with a number of group-by's, joins, `exists` and `in`s 
between a set of dataframes. 

We are getting the following exception, and the reason is that Catalyst cuts some 
columns out of the dataframes: 

{noformat}
Unhandled error: , An error occurred while 
calling o1187.cache.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
2011.0 (TID 832340, pc1udatahad23, executor 153): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Binding attribute, tree: part_code#56012
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage39.processNext(Unknown Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 

[jira] [Created] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-12 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-26019:
-

 Summary: pyspark/accumulators.py: "TypeError: object of type 
'NoneType' has no len()" in authenticate_and_accum_updates()
 Key: SPARK-26019
 URL: https://issues.apache.org/jira/browse/SPARK-26019
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0, 2.3.2
Reporter: Ruslan Dautkhanov


Started happening after 2.3.1 -> 2.3.2 upgrade.

 
{code:python}
Exception happened during processing of request from ('127.0.0.1', 43418)

Traceback (most recent call last):
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
    self.finish_request(request, client_address)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
    poll(authenticate_and_accum_updates)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
    if func():
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
    received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()
 
{code}
 
The error happens here:
https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254

The PySpark code was just running a simple pipeline of 
binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
and then converting it to a dataframe and running a count on it.

The error seems to be flaky: it did not happen on the next rerun.
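
The traceback above reduces to calling `len()` on a token that is `None`. Below is a minimal sketch of that failure mode and of a defensive check, assuming Python 3. The helper `read_auth_token` is hypothetical (it is not the PySpark code), and an in-memory stream stands in for the handler's `rfile`.

```python
import io


def read_auth_token(rfile, auth_token):
    """Read as many bytes from rfile as the expected token has.

    Hypothetical helper for illustration: it raises a descriptive error
    instead of the bare TypeError when no token is available.
    """
    if auth_token is None:
        raise ValueError("auth_token is None; nothing to compare against")
    return rfile.read(len(auth_token))


# Normal case: the stream yields exactly len(auth_token) bytes.
stream = io.BytesIO(b"secret-and-more")
received = read_auth_token(stream, b"secret")

# Failure case from the report: len() applied to None raises TypeError.
try:
    len(None)
except TypeError as exc:
    message = str(exc)
```

This only shows where the exception comes from; it does not explain why the token would be `None` in the reported run.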







[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681892#comment-16681892
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

Yep, the PySpark job completes fine after we removed the IPv6 references in 
/etc/hosts. 

Thank you both.
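
For context, the reported error is raised by `socket.socket(af, socktype, proto)` itself, which in the `_load_from_socket` code quoted below sits outside the `try` block. Here is a minimal sketch, assuming Python 3, of a connection loop that skips an address family the host cannot create sockets for. The name `connect_first_usable` and the injectable `getaddrinfo` / `socket_factory` parameters are hypothetical and exist only to keep the example testable; this is an illustration, not the change made in Spark.

```python
import socket


def connect_first_usable(host, port, timeout=15,
                         getaddrinfo=socket.getaddrinfo,
                         socket_factory=socket.socket):
    """Return a connected socket for the first address that works.

    Socket creation happens inside the try block, so an unsupported
    address family (e.g. Errno 97 for AF_INET6) moves on to the next
    candidate instead of aborting the loop.
    """
    last_error = None
    for af, socktype, proto, _canonname, sa in getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = None
        try:
            sock = socket_factory(af, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sa)
            return sock
        except OSError as exc:  # socket.error is an alias of OSError
            last_error = exc
            if sock is not None:
                sock.close()
    raise OSError("could not open socket: %s" % last_error)
```

With the defaults, `connect_first_usable("localhost", port)` tries each address returned for `localhost` in order, so a stale IPv6 entry would be skipped rather than fatal.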

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime.
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
>     item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
>     single_obj.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
>     main_df = self.generate_dataframe_main()
>   File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 98, in generate_dataframe_main
>     raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 16, in get_raw_data_with_metadata_and_aggregation
>     main_df = self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
>     return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 136, in get_dataframe_from_binary_value_iteration
>     combine_df = self.get_dataframe_from_binary_value(input_df=input_df, binary_value=count)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 154, in get_dataframe_from_binary_value
>     if len(results_of_filter_df.take(1)) == 0:
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 504, in take
>     return self.limit(num).collect()
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 467, in collect
>     return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 148, in _load_from_socket
>     sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in __init__
>     _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
>     port, auth_secret = sock_info
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         sock = socket.socket(af, socktype, proto)
>         try:
>             sock.settimeout(15)
>             sock.connect(sa)
>         except socket.error:
>             sock.close()
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     # The RDD materialization time is unpredicable, if we set a timeout for socket reading
>     # operation, it will very possibly fail. See SPARK-18281.
>     sock.settimeout(None)
>     sockfile = sock.makefile("rwb", 65536)
>     do_server_auth(sockfile, auth_secret)
>     # The socket will be automatically closed when garbage-collected.
>     return serializer.load_stream(sockfile)
> {code}
> The culprit is this line in lib/spark2/python/pyspark/rdd.py:
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> So the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by socket.AF_UNSPEC, the third argument to the
> socket.getaddrinfo() call.
> I tried to call similar socket.getaddrinfo call locally outside 

[jira] [Resolved] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov resolved SPARK-25958.
---
Resolution: Not A Problem

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime.
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
>     item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
>     single_obj.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
>     main_df = self.generate_dataframe_main()
>   File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 98, in generate_dataframe_main
>     raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 16, in get_raw_data_with_metadata_and_aggregation
>     main_df = self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
>     return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 136, in get_dataframe_from_binary_value_iteration
>     combine_df = self.get_dataframe_from_binary_value(input_df=input_df, binary_value=count)
>   File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 154, in get_dataframe_from_binary_value
>     if len(results_of_filter_df.take(1)) == 0:
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 504, in take
>     return self.limit(num).collect()
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 467, in collect
>     return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 148, in _load_from_socket
>     sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in __init__
>     _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
>     port, auth_secret = sock_info
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         sock = socket.socket(af, socktype, proto)
>         try:
>             sock.settimeout(15)
>             sock.connect(sa)
>         except socket.error:
>             sock.close()
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     # The RDD materialization time is unpredictable; if we set a timeout for the
>     # socket reading operation, it will very possibly fail. See SPARK-18281.
>     sock.settimeout(None)
>     sockfile = sock.makefile("rwb", 65536)
>     do_server_auth(sockfile, auth_secret)
>     # The socket will be automatically closed when garbage-collected.
>     return serializer.load_stream(sockfile)
> {code}
> The culprit is this line in lib/spark2/python/pyspark/rdd.py:
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> So the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by socket.AF_UNSPEC, the third argument to the
> socket.getaddrinfo() call.
> I tried a similar socket.getaddrinfo() call locally, outside of PySpark,
> and it worked fine.
> RHEL 7.5.
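
In the quoted `_load_from_socket` loop, `socket.socket(af, socktype, proto)` is called before the `try` block, so an address family the kernel rejects raises Errno 97 instead of falling through to the next candidate. The sketch below shows one way such a loop could tolerate that. It is not PySpark code: `open_first_usable` and its `socket_factory` parameter are hypothetical names introduced here for illustration, and the code targets Python 3 (the traceback above is from Python 2.7).

```python
import socket


def open_first_usable(addrinfos, socket_factory=socket.socket, timeout=15.0):
    """Return the first socket that can be created and connected.

    Hypothetical helper, not PySpark code. `addrinfos` is a list of tuples in
    the shape returned by socket.getaddrinfo(). Socket creation happens inside
    the try block, so an unsupported address family is skipped rather than
    aborting the loop.
    """
    last_error = None
    for af, socktype, proto, _canonname, sa in addrinfos:
        sock = None
        try:
            sock = socket_factory(af, socktype, proto)  # may raise Errno 97
            sock.settimeout(timeout)
            sock.connect(sa)
        except OSError as exc:  # socket.error is an alias of OSError on Python 3
            last_error = exc
            if sock is not None:
                sock.close()
            continue
        return sock
    raise OSError("could not open socket: %s" % last_error)
```

A caller would pass it the result of `socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)`; the factory parameter exists only so the loop can be exercised without a real network.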





[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681735#comment-16681735
 ] 

Ruslan Dautkhanov commented on SPARK-24244:
---

[~maxgekk] great improvement 

is this new option available in PySpark too?

 

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or
> indexes for parsing, like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> We need to modify *UnivocityParser* to extract only the needed columns from
> requiredSchema.
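
As a rough illustration of selecting columns by index, here is a minimal sketch using Python's standard `csv` module. It is not the uniVocity or Spark implementation: `read_selected_columns` is a hypothetical helper, and unlike the parser described above it still tokenizes every field before discarding the unwanted ones.

```python
import csv
import io


def read_selected_columns(text, indexes):
    """Yield rows that contain only the requested column indexes, in the given order."""
    for row in csv.reader(io.StringIO(text)):
        # Keep only the requested positions; other columns are dropped.
        yield [row[i] for i in indexes]
```

Called with the indexes `(4, 0, 1)` from the quoted Java snippet, each yielded row holds the fifth, first, and second fields of the input row, in that order.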






[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679261#comment-16679261
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

[~XuanYuan] interesting.. here's our /etc/hosts:
{quote}127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
{quote}

Notice we have ipv6 stuff there, but ipv6 is disabled for us.

I will comment out `::1` and try again. 

Was it the fix for you too? 
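
To spot this kind of entry quickly, a hosts file can be scanned for every address that maps `localhost`, along with its IP version. This is a minimal sketch under the assumption that the file uses the usual `address name [alias ...]` layout with `#` comments; `localhost_entries` is a hypothetical helper, and addresses carrying a zone suffix are not handled.

```python
import ipaddress


def localhost_entries(hosts_text, name="localhost"):
    """Return (address, ip_version) pairs for hosts-file lines that map `name`."""
    entries = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        address, *names = line.split()
        if name in names:
            entries.append((address, ipaddress.ip_address(address).version))
    return entries
```

Any pair with version 6 on a host where IPv6 is disabled is a candidate for the mismatch discussed in this thread.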


[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679263#comment-16679263
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

I just removed the IPv6 reference ::1 from /etc/hosts, and your sample code
stopped reporting "OSError: [Errno 97] Address family not supported by protocol".

Will try to rerun the job now. 

Thank you.
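
A quick way to confirm whether a host can create sockets of a given family at all is to try it directly. This is only a sketch of such a probe (the sample code referred to above is not reproduced in this thread), and `family_supported` is a hypothetical name.

```python
import socket


def family_supported(family):
    """Return True if a TCP socket of the given address family can be created."""
    try:
        sock = socket.socket(family, socket.SOCK_STREAM)
    except OSError:
        # Includes Errno 97 (EAFNOSUPPORT) when the family is disabled.
        return False
    sock.close()
    return True
```

If `family_supported(socket.AF_INET6)` is false while `localhost` still resolves to `::1`, the connect loop in `rdd.py` can be handed an address family it cannot use.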

 


[jira] [Comment Edited] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678894#comment-16678894
 ] 

Ruslan Dautkhanov edited comment on SPARK-25958 at 11/7/18 10:35 PM:
-

We do have ipv6 disabled on our hadoop servers, but that failing line in 
*lib/spark2/python/pyspark/rdd.py* just connects to "localhost"..

 
{code:java}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}


was (Author: tagar):
We do have ipv6 disabled on our hadoop servers, but that failing line in 
lib/spark2/python/pyspark/rdd.py just connects to "localhost"..

 
{code:java}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}


[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678894#comment-16678894
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

We do have ipv6 disabled on our hadoop servers, but that failing line in 
lib/spark2/python/pyspark/rdd.py just connects to "localhost"..

 
{code:java}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}
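
Connecting to "localhost" is not enough to rule IPv6 out: with `socket.AF_UNSPEC`, `getaddrinfo` reports whatever families the resolver finds for the name, which, as this thread suggests, can include IPv6 entries from /etc/hosts even when IPv6 is disabled. The sketch below lists the families reported for a host and shows that passing `socket.AF_INET` restricts the result to IPv4. `resolved_families` is a hypothetical helper written for Python 3.

```python
import socket


def resolved_families(host, port, family=socket.AF_UNSPEC):
    """Return the sorted names of the address families getaddrinfo reports for host."""
    infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    # The first element of each result tuple is the address family.
    return sorted({socket.AddressFamily(info[0]).name for info in infos})
```

Comparing `resolved_families("localhost", port)` with `resolved_families("localhost", port, socket.AF_INET)` on an affected host shows whether an IPv6 candidate is being offered.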


[jira] [Updated] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25958:
--
Issue Type: Bug  (was: New Feature)


[jira] [Updated] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-07 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25958:
--
Description: 
The following error happens on a heavy Spark job after 4 hours of runtime.
{code}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
    item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
    single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
    main_df = self.generate_dataframe_main()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 98, in generate_dataframe_main
    raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 16, in get_raw_data_with_metadata_and_aggregation
    main_df = self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
    return_df = self.get_dataframe_from_binary_value_iteration(input_df)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 136, in get_dataframe_from_binary_value_iteration
    combine_df = self.get_dataframe_from_binary_value(input_df=input_df, binary_value=count)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 154, in get_dataframe_from_binary_value
    if len(results_of_filter_df.take(1)) == 0:
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 504, in take
    return self.limit(num).collect()
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 467, in collect
    return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer(
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 148, in _load_from_socket
    sock = socket.socket(af, socktype, proto)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in __init__
    _sock = _realsocket(family, type, proto)
error: [Errno 97] Address family not supported by protocol
{code}
Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
{code}
def _load_from_socket(sock_info, serializer):
    port, auth_secret = sock_info
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(15)
            sock.connect(sa)
        except socket.error:
            sock.close()
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    # operation, it will very possibly fail. See SPARK-18281.
    sock.settimeout(None)

    sockfile = sock.makefile("rwb", 65536)
    do_server_auth(sockfile, auth_secret)

    # The socket will be automatically closed when garbage-collected.
    return serializer.load_stream(sockfile)
{code}
The culprit is this line in lib/spark2/python/pyspark/rdd.py:
{code}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}
so the error "error: [Errno 97] *Address family* not supported by protocol"
seems to be caused by passing socket.AF_UNSPEC as the third argument to the 
socket.getaddrinfo() call.

I tried a similar socket.getaddrinfo() call locally, outside of PySpark, and 
it worked fine.

Environment: RHEL 7.5.
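
A local check of this kind could be sketched roughly as follows. This is only an illustration, not code from PySpark: the helper name probe_address_families is hypothetical. It asks getaddrinfo() for every address family (AF_UNSPEC) and then tries to create, but not connect, a socket for each result, which is the step that fails in the traceback above.

```python
import socket


def probe_address_families(host, port):
    """Try to create (not connect) a socket for each getaddrinfo() result.

    Hypothetical helper for illustration. Returns a list of
    (family, sockaddr, error) tuples; error is None when socket.socket()
    succeeded for that address family.
    """
    results = []
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(af, socktype, proto)
        except socket.error as exc:
            # For example [Errno 97] when the address family is resolvable
            # but not usable on this host.
            results.append((af, sa, exc))
        else:
            sock.close()
            results.append((af, sa, None))
    return results
```

Calling probe_address_families("localhost", 0) on the affected node and printing the tuples would show which address families resolve for localhost and whether a socket can be created for each of them.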

  was:
The following error happens on a heavy Spark job after 4 hours of runtime.

{code:python}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
    item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
    single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
    main_df = self.generate_dataframe_main()
  File 

[jira] [Created] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-06 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-25958:
-

 Summary: error: [Errno 97] Address family not supported by 
protocol in dataframe.take()
 Key: SPARK-25958
 URL: https://issues.apache.org/jira/browse/SPARK-25958
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 2.3.2, 2.3.1
Reporter: Ruslan Dautkhanov


The following error happens on a heavy Spark job after 4 hours of runtime.

{code:python}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
    item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
    single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
    main_df = self.generate_dataframe_main()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 98, in generate_dataframe_main
    raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 16, in get_raw_data_with_metadata_and_aggregation
    main_df = self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
    return_df = self.get_dataframe_from_binary_value_iteration(input_df)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 136, in get_dataframe_from_binary_value_iteration
    combine_df = self.get_dataframe_from_binary_value(input_df=input_df, binary_value=count)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 154, in get_dataframe_from_binary_value
    if len(results_of_filter_df.take(1)) == 0:
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 504, in take
    return self.limit(num).collect()
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 467, in collect
    return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer(
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 148, in _load_from_socket
    sock = socket.socket(af, socktype, proto)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in __init__
    _sock = _realsocket(family, type, proto)
error: [Errno 97] Address family not supported by protocol
{code}

Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148: 

{code:python}

def _load_from_socket(sock_info, serializer):
    port, auth_secret = sock_info
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(15)
            sock.connect(sa)
        except socket.error:
            sock.close()
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    # operation, it will very possibly fail. See SPARK-18281.
    sock.settimeout(None)

    sockfile = sock.makefile("rwb", 65536)
    do_server_auth(sockfile, auth_secret)

    # The socket will be automatically closed when garbage-collected.
    return serializer.load_stream(sockfile)
{code}

The culprit is in the line 

{code:python}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}

so the error "error: [Errno 97] *Address family* not supported by protocol" 
seems to be caused by passing socket.AF_UNSPEC as the third argument to the 
socket.getaddrinfo() call.

I tried a similar socket.getaddrinfo() call locally, outside of PySpark, and 
it worked fine.

Environment: RHEL 7.5.
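
In the quoted _load_from_socket, the socket.socket() call sits outside the try block, so an address family that getaddrinfo() returns but the host cannot use raises immediately instead of falling through to the next candidate. One possible defensive variant is sketched below. It is an illustration only, not necessarily how Spark handles this: connect_first_usable is a hypothetical name, and the authentication and deserialization steps are left out.

```python
import socket


def connect_first_usable(host, port, timeout=15):
    """Return a connected socket, skipping address families that fail.

    Hypothetical helper for illustration only.
    """
    last_error = None
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = None
        try:
            # socket.socket() itself can raise (e.g. Errno 97), so it is
            # created inside the try block here.
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sa)
        except socket.error as exc:
            last_error = exc
            if sock is not None:
                sock.close()
            continue
        # Connected: drop the timeout again, as the quoted code does.
        sock.settimeout(None)
        return sock
    raise Exception("could not open socket: %s" % last_error)
```

With this shape, an unusable IPv6 result for localhost would be skipped and the IPv4 result tried next; the exception is raised only if no candidate can be connected.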







[jira] [Comment Edited] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.

2018-10-29 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667721#comment-16667721
 ] 

Ruslan Dautkhanov edited comment on SPARK-25863 at 10/29/18 8:37 PM:
-

[~mgaido], I will try to get a reproducer, but it might be a tough task. I am 
not sure yet, as I guess it might depend on the data too. 

If this helps, I can tell that this Spark job was operating on a very wide 
table (thousands of columns), and the SQL itself was generated and has a long 
SELECT clause with a lot of CASE statements, so I can imagine Spark's code 
generation had a hard time cranking through this.

Is there some debugging we can enable that would help get to the root 
cause?

On that line in particular:
[https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1475]
Would it make sense to just add an `if`, so that zero is returned when 
codeSizes is empty? If codeSizes is never supposed to be empty, an assert 
before that line would help with debugging. 
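
The two options can be illustrated with a small Python analogue. This is not the Scala code from CodeGenerator.scala: the function names are hypothetical, and Python's max() raising ValueError on an empty list stands in for Scala's empty.max.

```python
def max_code_size(code_sizes):
    """Option 1: return 0 instead of failing when the list is empty."""
    if not code_sizes:
        return 0
    return max(code_sizes)


def max_code_size_strict(code_sizes):
    """Option 2: fail early with a descriptive message if the list is empty."""
    assert code_sizes, "code_sizes is unexpectedly empty"
    return max(code_sizes)
```

The first variant hides the empty case, while the second keeps the failure but makes its cause visible in the error message.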

Thank you.


was (Author: tagar):
[~mgaido], I will try to get a reproducer, but it might be a tough task. I am 
not sure yet, as I guess it might depend on the data too. 

If this helps, I can tell that this Spark job was operating on a very wide 
table (thousands of columns), and the SQL itself was generated and has a long 
SELECT clause with a lot of CASE statements, so I can imagine Spark's code 
generation had a hard time cranking through this.

Is there some debugging we can enable that would help get to the root 
cause?

Thank you.

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> 

[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2018-10-29 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667721#comment-16667721
 ] 

Ruslan Dautkhanov commented on SPARK-25863:
---

[~mgaido], I will try to get a reproducer, but it might be a tough task. I am 
not sure yet, as I guess it might depend on the data too. 

If this helps, I can tell that this Spark job was operating on a very wide 
table (thousands of columns), and the SQL itself was generated and has a long 
SELECT clause with a lot of CASE statements, so I can imagine Spark's code 
generation had a hard time cranking through this.

Is there some debugging we can enable that would help get to the root 
cause?

Thank you.

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Commented] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2018-10-29 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667478#comment-16667478
 ] 

Ruslan Dautkhanov commented on SPARK-22505:
---

Thank you! That worked

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice `should_be_int` has `string` datatype, according to documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.






[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2018-10-28 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1736#comment-1736
 ] 

Ruslan Dautkhanov commented on SPARK-25863:
---

It seems the error happens here:

[https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1475]

but this is as far as I can go. Any ideas why this happens? Thanks!

 

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at 

[jira] [Updated] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1

2018-10-28 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25863:
--
Affects Version/s: 2.3.1

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2018-10-28 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1726#comment-1726
 ] 

Ruslan Dautkhanov commented on SPARK-25863:
---

This happens only on one of our heaviest Spark jobs. 

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Created] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1

2018-10-28 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-25863:
-

 Summary: java.lang.UnsupportedOperationException: empty.max at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
 Key: SPARK-25863
 URL: https://issues.apache.org/jira/browse/SPARK-25863
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Spark Core
Affects Versions: 2.3.2
Reporter: Ruslan Dautkhanov


Failing task : 
{noformat}
An error occurred while calling o2875.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in 
stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
21413.0 (TID 4057314, pc1udatahad117, executor 431): 
java.lang.UnsupportedOperationException: empty.max
at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
at scala.collection.AbstractTraversable.max(Traversable.scala:104)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

{noformat}
 

Driver stack trace:
{noformat}
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1609)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1597)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1596)
at 

[jira] [Commented] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661282#comment-16661282
 ] 

Ruslan Dautkhanov commented on SPARK-25814:
---

Thank you [~vanzin]! I will try to tune those down and see if this helps.

 

> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
> We're looking into an issue where even a huge amount of Spark driver memory 
> eventually gets exhausted and GC makes the driver stop responding.
> We used the [JXRay.com|http://jxray.com/] tool and found that most of the 
> driver heap is used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
>     -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
> Is there a way to tune this particular Spark driver's memory region down?
>  
>  
> !image-2018-10-23-14-06-53-722.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25814:
--
Priority: Major  (was: Critical)

> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
> We're looking into an issue where even a huge amount of Spark driver memory 
> eventually gets exhausted and GC makes the driver stop responding.
> We used the [JXRay.com|http://jxray.com/] tool and found that most of the 
> driver heap is used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
>     -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
> Is there a way to tune this particular Spark driver's memory region down?
>  
>  
> !image-2018-10-23-14-06-53-722.png!
>  






[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25814:
--
Description: 
We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the [JXRay.com|http://jxray.com/] tool and found that most of the 
driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

Is there a way to tune this particular Spark driver's memory region down?

 

 

!image-2018-10-23-14-06-53-722.png!

 

  was:
We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the jxray.com tool and found that most of the driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

Is there a way to tune this particular Spark driver's memory region down?

 

 

!image-2018-10-23-14-06-53-722.png!

 


> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
> We're looking into an issue where even a huge amount of Spark driver memory 
> eventually gets exhausted and GC makes the driver stop responding.
> We used the [JXRay.com|http://jxray.com/] tool and found that most of the 
> driver heap is used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
>     -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
> Is there a way to tune this particular Spark driver's memory region down?
>  
>  
> !image-2018-10-23-14-06-53-722.png!
>  






[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25814:
--
Description: 
We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the jxray.com tool and found that most of the driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

Is there a way to tune this particular Spark driver's memory region down?

 

 

!image-2018-10-23-14-06-53-722.png!

 

  was:
We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the jxray.com tool and found that most of the driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

 

Is there a way to tune this particular Spark driver's memory region down?

 

!image-2018-10-23-14-03-12-258.png!

 

!image-2018-10-23-14-06-53-722.png!

 


> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
> We're looking into an issue where even a huge amount of Spark driver memory 
> eventually gets exhausted and GC makes the driver stop responding.
> We used the jxray.com tool and found that most of the driver heap is used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
>     -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
> Is there a way to tune this particular Spark driver's memory region down?
>  
>  
> !image-2018-10-23-14-06-53-722.png!
>  






[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25814:
--
Description: 
We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the jxray.com tool and found that most of the driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

 

Is there a way to tune this particular Spark driver's memory region down?

 

!image-2018-10-23-14-03-12-258.png!

 

!image-2018-10-23-14-06-53-722.png!

 

  was:
We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the jxray.com tool and found that most of the driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

 

Is there a way to tune this particular Spark driver's memory region down?

 

!image-2018-10-23-14-03-12-258.png!

 


> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
> We're looking into an issue where even a huge amount of Spark driver memory 
> eventually gets exhausted and GC makes the driver stop responding.
> We used the jxray.com tool and found that most of the driver heap is used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
>     -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
>  
> Is there a way to tune this particular Spark driver's memory region down?
>  
> !image-2018-10-23-14-03-12-258.png!
>  
> !image-2018-10-23-14-06-53-722.png!
>  






[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25814:
--
Attachment: image-2018-10-23-14-06-53-722.png

> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
> We're looking into an issue where even a huge amount of Spark driver memory 
> eventually gets exhausted and GC makes the driver stop responding.
> We used the jxray.com tool and found that most of the driver heap is used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
>     -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
>  
> Is there a way to tune this particular Spark driver's memory region down?
>  
> !image-2018-10-23-14-03-12-258.png!
>  






[jira] [Created] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2018-10-23 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-25814:
-

 Summary: spark driver runs out of memory on 
org.apache.spark.util.kvstore.InMemoryStore
 Key: SPARK-25814
 URL: https://issues.apache.org/jira/browse/SPARK-25814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2, 2.2.2
Reporter: Ruslan Dautkhanov
 Attachments: image-2018-10-23-14-06-53-722.png

We're looking into an issue where even a huge amount of Spark driver memory 
eventually gets exhausted and GC makes the driver stop responding.

We used the jxray.com tool and found that most of the driver heap is used by 

 
{noformat}
org.apache.spark.status.AppStatusStore
  -> org.apache.spark.status.ElementTrackingStore
    -> org.apache.spark.util.kvstore.InMemoryStore
 
{noformat}
 

 

Is there a way to tune this particular Spark driver's memory region down?

 

!image-2018-10-23-14-03-12-258.png!

 






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2018-10-22 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659374#comment-16659374
 ] 

Ruslan Dautkhanov commented on SPARK-13587:
---

We're using conda environments shared across worker nodes through NFS. Has 
anyone used something like this?

Another option that's more directly related to this jira's description is 
`conda-pack`, using YARN's `--archives` option to distribute the packed 
environment:

{code:bash}
$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment \
script.py
{code}

More details - https://conda.github.io/conda-pack/spark.html



> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy to switch between environments)
> Python now has 2 different virtualenv implementations. One is native 
> virtualenv; the other is through conda. This jira is trying to bring these 2 
> tools to a distributed environment






[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-10-21 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658401#comment-16658401
 ] 

Ruslan Dautkhanov commented on SPARK-22947:
---

Perhaps at least part of the implementation can be based on the 
https://github.com/twosigma/flint library?
Particularly its merge-join: 
https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/function/join/RangeMergeJoin.scala

ts-flint has the concept of temporal joins with tolerances:
https://github.com/twosigma/flint#temporal-join-functions
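
To make the "temporal join with tolerance" idea concrete, here is a minimal 
pure-Python sketch of a backward as-of join. It only illustrates the semantics; 
it is not Flint's or Spark's implementation. The function name `asof_join` and 
the tuple-based row layout are assumptions made for this example. The sample 
rows are the ones from the "join without key" example in the issue description.

```python
from bisect import bisect_right
from datetime import date, timedelta


def asof_join(left, right, tolerance):
    """Backward as-of join of two lists of (time, value) rows sorted by time.

    Each left row is paired with the value of the latest right row whose time
    is <= the left time and at most `tolerance` older; otherwise with None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        # Index of the last right row with time <= t (-1 if there is none).
        i = bisect_right(right_times, t) - 1
        if i >= 0 and t - right_times[i] <= tolerance:
            joined.append((t, value, right[i][1]))
        else:
            joined.append((t, value, None))
    return joined


# Hypothetical in-memory stand-ins for dfA (time, quantity) and dfB (time, price).
df_a = [(date(2016, 1, 1), 100), (date(2016, 1, 2), 50),
        (date(2016, 1, 4), -50), (date(2016, 1, 5), 100)]
df_b = [(date(2015, 12, 31), 100.0), (date(2016, 1, 4), 105.0),
        (date(2016, 1, 5), 102.0)]

result = asof_join(df_a, df_b, tolerance=timedelta(days=1))
for row in result:
    print(row)
```

With a one-day tolerance the 2016-01-02 row gets no match, because the closest 
earlier row in `df_b` is two days old.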



> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will enable many use cases of Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing a new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define distribution, partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP and should be considered in a 
> future SPIP as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., begin/end of each 
> partition, to reduce sorting/shuffling
> * Define an API for users to implement an as-of join time spec in a business 
> calendar (e.g., lookback one business day; this is very common in financial 
> data analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has 
> a begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * serves as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that a time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, 

[jira] [Commented] (SPARK-25643) Performance issues querying wide rows

2018-10-16 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652039#comment-16652039
 ] 

Ruslan Dautkhanov commented on SPARK-25643:
---

[~viirya] we confirm this problem on our production workloads too. Realizing 
wide tables that have columnar backends is super expensive. In the comments of 
SPARK-25164 you can see that *even a simple query fetching 70k rows 
takes 20 minutes* on a table with 10m records. 

It would be great if Spark had an optimization to first realize only the 
columns required by the `where` clause, and realize the rest of the columns 
only after filtering - it seems this would fix this huge performance overhead 
on wide datasets. 
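
A rough illustration of the idea, in plain Python rather than Spark internals: 
evaluate the predicate on the one column it needs, and build full rows only for 
the positions that pass. The column-oriented `dict` layout and the name 
`scan_with_late_materialization` are assumptions for this sketch; it does not 
reflect how FileSourceScanExec is actually structured.

```python
def scan_with_late_materialization(columns, predicate_col, predicate):
    """Filter a column-oriented table, building rows only for matches.

    `columns` maps column name -> list of values; all lists have equal length.
    """
    # Step 1: touch only the column used by the filter.
    matching = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    # Step 2: realize the remaining (possibly thousands of) columns
    # only for the row positions that survived the filter.
    return [{name: values[i] for name, values in columns.items()}
            for i in matching]


# A tiny "wide" table: one key column plus 5 payload columns (made-up data).
table = {"id": [1, 2, 3, 4]}
for c in range(5):
    table["col%d" % c] = [c * 10 + r for r in range(4)]

rows = scan_with_late_materialization(table, "id", lambda v: v == 3)
print(rows)
```

The cost of step 2 then scales with the number of matching rows rather than 
with the total number of rows in the table.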

A key piece of [~bersprockets]'s findings is:

{quote}According to initial profiling, it appears that most time is spent 
realizing the entire row in the scan, just so the filter can look at a tiny 
subset of columns and almost certainly throw the row away .. The profiling 
shows 74% of time is spent in FileSourceScanExec{quote}



> Performance issues querying wide rows
> -
>
> Key: SPARK-25643
> URL: https://issues.apache.org/jira/browse/SPARK-25643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Querying a small subset of rows from a wide table (e.g., a table with 6000 
> columns) can be quite slow in the following case:
>  * the table has many rows (most of which will be filtered out)
>  * the projection includes every column of a wide table (i.e., select *)
>  * predicate push down is not helping: either matching rows are sprinkled 
> fairly evenly throughout the table, or predicate push down is switched off
> Even if the filter involves only a single column and the returned result 
> includes just a few rows, the query can run much longer compared to an 
> equivalent query against a similar table with fewer columns.
> According to initial profiling, it appears that most time is spent realizing 
> the entire row in the scan, just so the filter can look at a tiny subset of 
> columns and almost certainly throw the row away. The profiling shows 74% of 
> time is spent in FileSourceScanExec, and that time is spent across numerous 
> writeFields_0_xxx method calls.
> If Spark must realize the entire row just to check a tiny subset of columns, 
> this all sounds reasonable. However, I wonder if there is an optimization 
> here where we can avoid realizing the entire row until after the filter has 
> selected the row.






[jira] [Comment Edited] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2018-10-05 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263733#comment-16263733
 ] 

Ruslan Dautkhanov edited comment on SPARK-22505 at 10/5/18 8:26 PM:


[~hyukjin.kwon], I tried to convert an RDD containing CSV data and it doesn't 
seem to infer data types correctly:

{code:python}
rdd1 = sc.parallelize([('1','a'),('2','b'),('3','c')])
df = spark.read.csv(rdd1)
{code}

and 

{code:python}
rdd2 = sc.parallelize([('1,a'),('2,b'),('3,c')])
df = spark.read.csv(rdd2) 
{code}

In both cases `df.printSchema()` prints 

{noformat}
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
{noformat}

whereas the first column should have `int` as its inferred data type.
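
As a stopgap, the kind of inference expected here can be approximated on the 
Python side before the data reaches Spark. The sketch below is only an 
illustration under stated assumptions: the helpers `infer_caster` and 
`cast_rows` are hypothetical and not part of PySpark. It picks the narrowest of 
`int`, `float` and `str` that parses every value in a column and casts the rows 
accordingly.

```python
def infer_caster(values):
    """Return int, float or str: the narrowest type that parses every value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
        except ValueError:
            # At least one value does not parse; try the next, wider type.
            continue
        return caster
    return str


def cast_rows(rows):
    """Cast each column of a list of string tuples to its inferred type."""
    columns = list(zip(*rows))
    casters = [infer_caster(col) for col in columns]
    return [tuple(c(v) for c, v in zip(casters, row)) for row in rows]


rows = [('1', 'a'), ('2', 'b'), ('3', 'c')]
typed = cast_rows(rows)
print(typed)
```

Passing `typed` instead of the raw string tuples to 
`sc.parallelize(...).toDF(...)` should then give Spark real Python integers to 
infer the column type from.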


was (Author: tagar):
that's great. thank you [~hyukjin.kwon]

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice that `should_be_int` has `string` datatype, although according to the documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-10-05 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640045#comment-16640045
 ] 

Ruslan Dautkhanov commented on SPARK-25164:
---

Thank you [~bersprockets] - SPARK-25643 would be a huge improvement for wider 
datasets, and would also help query performance on normal dataframes.

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for 
> each column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.
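
The quadratic behaviour described above can be reproduced with a small toy model. This is only a sketch in plain Python, not the Parquet or Spark code: `Schema`, `init_naive` and `init_cached` are hypothetical names, and the counter stands in for the `ColumnDescriptor` constructions.

```python
class Schema:
    """Toy stand-in for a Parquet schema; counts descriptor constructions."""

    def __init__(self, col_count):
        self.col_count = col_count
        self.descriptors_built = 0

    def get_columns(self):
        # Rebuilds the whole list on every call, like the getColumns() quoted above.
        columns = []
        for i in range(self.col_count):
            self.descriptors_built += 1
            columns.append(("col%d" % i, "double"))
        return columns


def init_naive(schema):
    # One get_columns() call per column: col_count * col_count constructions.
    return [schema.get_columns()[i] for i in range(schema.col_count)]


def init_cached(schema):
    # Build the list once, then index into it: col_count constructions.
    column_cache = schema.get_columns()
    return [column_cache[i] for i in range(schema.col_count)]


naive, cached = Schema(300), Schema(300)
assert init_naive(naive) == init_cached(cached)
print(naive.descriptors_built, cached.descriptors_built)  # 90000 300
```

Scaling the same count to the 6000 columns and 67 files from the description gives 6000 * 6000 * 67 = 2,412,000,000 constructions without the cache versus 6000 * 67 = 402,000 with it. The reported timings are consistent with the 22% figure: (6.4 - 5) / 6.4 is roughly 0.219.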






[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-13 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614020#comment-16614020
 ] 

Ruslan Dautkhanov commented on SPARK-25164:
---

Hi [~bersprockets]

 

Thanks a lot for the detailed response.

I totally see what you're saying.

It's interesting that Spark realizes all rows even though the where filter has 
a predicate on just one column.

I am wondering whether it's feasible to lazily realize the list of columns in 
the select clause only after filtering is complete.

It seems this could be a huge performance improvement for wider tables like 
this.

In other words, Spark would first realize the columns specified in the where 
clause, and only after filtering realize the rest of the columns needed for 
the select clause.
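
To illustrate the idea, here is a toy sketch over an in-memory column store. It is plain Python and says nothing about how Spark's Parquet reader actually works or whether this is feasible there; `scan_eager`, `scan_lazy` and the sample `table` are made up for the example.

```python
def scan_eager(table, filter_col, predicate):
    """Materialize every column of every row, then apply the filter."""
    cells_read = 0
    rows = []
    for r in range(len(table[filter_col])):
        row = {name: values[r] for name, values in table.items()}
        cells_read += len(row)
        if predicate(row[filter_col]):
            rows.append(row)
    return rows, cells_read


def scan_lazy(table, filter_col, predicate):
    """Read the filter column first, then the other columns for matches only."""
    matches = [r for r, v in enumerate(table[filter_col]) if predicate(v)]
    cells_read = len(table[filter_col])
    rows = []
    for r in matches:
        rows.append({name: values[r] for name, values in table.items()})
        cells_read += len(table) - 1  # the filter column was already read
    return rows, cells_read


table = {"id1": [1, 2, 1, 3], "a": [10, 20, 30, 40], "b": [0.1, 0.2, 0.3, 0.4]}
eager_rows, eager_cells = scan_eager(table, "id1", lambda v: v == 1)
lazy_rows, lazy_cells = scan_lazy(table, "id1", lambda v: v == 1)
print(eager_cells, lazy_cells)  # 12 8
```

Both scans return the same rows; the lazy one touches fewer cells, and the gap grows with the number of columns and the selectivity of the filter.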

Thoughts? 

Thank you!
Ruslan

 







[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-13 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614020#comment-16614020
 ] 

Ruslan Dautkhanov edited comment on SPARK-25164 at 9/13/18 8:19 PM:


Hi [~bersprockets]

Thanks a lot for the detailed response.

I totally see what you're saying.

It's interesting that Spark realizes all rows even though the where filter has 
a predicate on just one column.

I am wondering whether it's feasible to lazily realize the list of columns in 
the select clause only after filtering is complete.

It seems this could be a huge performance improvement for wider tables like 
this.

In other words, Spark would first realize the columns specified in the where 
clause, and only after filtering realize the rest of the columns needed for 
the select clause.

Thoughts? 

Thank you!
Ruslan

 









[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-11 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611089#comment-16611089
 ] 

Ruslan Dautkhanov edited comment on SPARK-25164 at 9/11/18 7:00 PM:


Thanks [~bersprockets]

Very good find! Thanks.

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 
minutes". 

PR-22188 gives a 21-44% improvement, reducing total runtime to 11-16 minutes.
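
As a sanity check on that range (my own arithmetic, assuming the 21-44% savings apply to the 20-minute baseline quoted from SPARK-24316):

```python
baseline_minutes = 20.0
savings = (0.21, 0.44)

# Remaining runtime after each saving is applied to the baseline.
remaining_minutes = [baseline_minutes * (1.0 - s) for s in savings]

for s, m in zip(savings, remaining_minutes):
    print("%d%% saving -> %.1f minutes" % (round(s * 100), m))
```

That gives about 15.8 and 11.2 minutes, which matches the 11-16 minute range.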

It seems *reading 70k rows for over 10 minutes* with multiple executors is 
still quite slow. 

Do you think there might be another issue? It seems the time complexity of 
reading Parquet files is O(num_columns * num_parquet_files)?
Is there any way to optimize this further?

Thanks.

 


was (Author: tagar):
Thanks [~bersprockets]

Very good find ! Thanks.

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 
minutes". 

This PR-22188 gives 21-44% improvement, reducing total runtime to 11-16 minutes.

It seems *saving 70k rows for over 10 minutes* with multiple executors is still 
quite slow. 

Do you think there might be other issue? So it seems time complexity of reading 
parquet files is O(num_columns * num_parquet_files)?
Is there is any way to optimize this further?

Thanks.

 







[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-11 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611089#comment-16611089
 ] 

Ruslan Dautkhanov commented on SPARK-25164:
---

Thanks [~bersprockets]

Very good find! Thanks.

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 
minutes". 

PR-22188 gives a 21-44% improvement, reducing total runtime to 11-16 minutes.

It seems *saving 70k rows for over 10 minutes* with multiple executors is still 
quite slow. 

Do you think there might be another issue? It seems the time complexity of 
reading Parquet files is O(num_columns * num_parquet_files)?
Is there any way to optimize this further?

Thanks.

 







[jira] [Commented] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table

2018-09-04 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603590#comment-16603590
 ] 

Ruslan Dautkhanov commented on SPARK-24316:
---

Thanks [~bersprockets] 

Is the Cloudera spark.2.3.cloudera3 parcel based on upstream Spark 2.3.*2*?

We still see this issue with Cloudera's latest Spark 2.3 parcel ("2.3 
release 3").

 

> Spark sql queries stall for  column width more than 6k for parquet based table
> --
>
> Key: SPARK-24316
> URL: https://issues.apache.org/jira/browse/SPARK-24316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0
>Reporter: Bimalendu Choudhary
>Priority: Major
>
> When we create a table from a DataFrame using Spark SQL with around 6k columns 
> or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
> same query on the same table created through Hive with the same data takes 
> just 5 minutes.
>  
> Instrumenting the code, we see that the executors are looping in the while 
> loop of the function initializeInternal(). The majority of the time is spent 
> in the for loop in the code below, which loops through the columns, and the 
> executor appears to be stalled for a long time.
>   
> {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}
> private void initializeInternal() ..
>  ..
>  for (int i = 0; i < requestedSchema.getFieldCount(); ++i)
>  { ... }
> }
> {code}
>  
> When Spark SQL creates a table, it also stores the metadata in the 
> TBLPROPERTIES in JSON format. We see that if we remove this metadata from the 
> table, the queries become fast, which is the case when we create the same 
> table through Hive. The exact same table takes 5 times longer with the 
> JSON metadata than without it.
>  
> So it looks like as the number of columns grows beyond 5k to 6k, processing 
> and comparing the metadata becomes more and more expensive, and performance 
> degrades drastically.
> To recreate the problem, simply run the following query:
> import org.apache.spark.sql.SparkSession
> val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
> resp_data.write.format("csv").save("/tmp/filename")
>  
> The table should be created by Spark SQL from a DataFrame so that the JSON 
> metadata is stored. For example:
> val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv")
> dff.createOrReplaceTempView("my_temp_table")
> val tmp = spark.sql("Create table tableName stored as parquet as select * from my_temp_table")
>  
>  
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
> resp_data = spark.sql("select * from test").limit(2000)
> print(resp_data.count())
> resp_data.write.option('header', False).mode('overwrite').csv('/tmp/2.csv')
>  
>  
> Performance seems to be even slower in the loop if the schema does not 
> match or the fields are empty, and the code goes into the if condition where 
> the missing column is marked true:
> missingColumns[i] = true;
>  





