[jira] [Created] (SPARK-30525) HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore

2020-01-16 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-30525:
-

 Summary: HiveTableScanExec do not need to prune partitions again 
after pushing down to hive metastore
 Key: SPARK-30525
 URL: https://issues.apache.org/jira/browse/SPARK-30525
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Hu Fuwang


If spark.sql.hive.metastorePartitionPruning is true, HiveTableScanExec pushes partition
pruning down to the Hive metastore and then prunes the returned partitions again with
the partition filters, because some predicates, e.g. "b like 'xyz'", are not supported
by the Hive metastore. This problem is now fixed in
HiveExternalCatalog.listPartitionsByFilter, which returns exactly the partitions we want,
so the second round of pruning in HiveTableScanExec is no longer necessary.
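
As a hedged illustration (the table and column names are invented), this is the kind of
query where the metastore push-down and the follow-up pruning currently overlap:
{code:java}
// Hedged sketch, assuming a Hive table partitioned by a string column "b".
// With spark.sql.hive.metastorePartitionPruning=true, the LIKE predicate is handed to
// HiveExternalCatalog.listPartitionsByFilter, which now returns exactly the matching
// partitions, so the second pruning pass in HiveTableScanExec adds nothing.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("SELECT * FROM partitioned_tbl WHERE b LIKE 'xyz'").explain(true)
{code}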



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30525) HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore

2020-01-16 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30525:
--
Description: In HiveTableScanExec, it will push down to hive metastore for 
partition pruning if _spark.sql.hive.metastorePartitionPruning_ is true, and 
then it will prune the returned partitions again using partition filters, 
because some predicates, eg. "b like 'xyz'", are not supported in hive 
metastore. But now this problem is already fixed in 
HiveExternalCatalog.listPartitionsByFilter, the 
HiveExternalCatalog.listPartitionsByFilter can return exactly what we want now. 
So it is not necessary any more to double prune in HiveTableScanExec.  (was: In 
HiveTableScanExec, it will push down to hive metastore for partition pruning if 
spark.sql.hive.metastorePartitionPruning is true, and then it will prune the 
returned partitions again using partition filters, because some predicates, eg. 
"b like 'xyz'", are not supported in hive metastore. But now this problem is 
already fixed in HiveExternalCatalog.listPartitionsByFilter, the 
HiveExternalCatalog.listPartitionsByFilter can return exactly what we want now. 
So it is not necessary any more to double prune in HiveTableScanExec.)

> HiveTableScanExec do not need to prune partitions again after pushing down to 
> hive metastore
> 
>
> Key: SPARK-30525
> URL: https://issues.apache.org/jira/browse/SPARK-30525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Priority: Major
>
> In HiveTableScanExec, it will push down to hive metastore for partition 
> pruning if _spark.sql.hive.metastorePartitionPruning_ is true, and then it 
> will prune the returned partitions again using partition filters, because 
> some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But 
> now this problem is already fixed in 
> HiveExternalCatalog.listPartitionsByFilter, the 
> HiveExternalCatalog.listPartitionsByFilter can return exactly what we want 
> now. So it is not necessary any more to double prune in HiveTableScanExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30525) HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore

2020-01-16 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30525:
--
Description: In HiveTableScanExec, it will push down to hive metastore for 
partition pruning if spark.sql.hive.metastorePartitionPruning is true, and then 
it will prune the returned partitions again using partition filters, because 
some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But 
now this problem is already fixed in 
HiveExternalCatalog.listPartitionsByFilter, the 
HiveExternalCatalog.listPartitionsByFilter can return exactly what we want now. 
So it is not necessary any more to double prune in HiveTableScanExec.  (was: In 
HiveTableScanExec, it will push down to hive metastore for partition pruning if 

spark.sql.hive.metastorePartitionPruning is true, and then it will prune the 
returned partitions again using partition filters, because some predicates, eg. 
"b like 'xyz'", are not supported in hive metastore. But now this problem is 
already fixed in HiveExternalCatalog.listPartitionsByFilter, the 
HiveExternalCatalog.listPartitionsByFilter can return exactly what we want now. 
So it is not necessary any more to double prune in HiveTableScanExec.)

> HiveTableScanExec do not need to prune partitions again after pushing down to 
> hive metastore
> 
>
> Key: SPARK-30525
> URL: https://issues.apache.org/jira/browse/SPARK-30525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Priority: Major
>
> In HiveTableScanExec, it will push down to hive metastore for partition 
> pruning if spark.sql.hive.metastorePartitionPruning is true, and then it will 
> prune the returned partitions again using partition filters, because some 
> predicates, eg. "b like 'xyz'", are not supported in hive metastore. But now 
> this problem is already fixed in HiveExternalCatalog.listPartitionsByFilter, 
> the HiveExternalCatalog.listPartitionsByFilter can return exactly what we 
> want now. So it is not necessary any more to double prune in 
> HiveTableScanExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30526) Can I translate Spark documents into Chinese ?

2020-01-16 Thread WangQiang Yang (Jira)
WangQiang Yang created SPARK-30526:
--

 Summary: Can I translate Spark documents into Chinese ?
 Key: SPARK-30526
 URL: https://issues.apache.org/jira/browse/SPARK-30526
 Project: Spark
  Issue Type: Question
  Components: Documentation
Affects Versions: 2.4.4
Reporter: WangQiang Yang


Can I translate the Spark documentation into Chinese for everyone to learn from? Will I
face legal risks?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event

2020-01-16 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S reopened SPARK-23626:
-

The old PR was closed due to inactivity. Reopening with a new PR in order to conclude
this.

>  DAGScheduler blocked due to JobSubmitted event
> ---
>
> Key: SPARK-23626
> URL: https://issues.apache.org/jira/browse/SPARK-23626
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.1, 2.3.3, 2.4.3, 3.0.0
>Reporter: Ajith S
>Priority: Major
>
> DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted 
> events have to be processed, as DAGSchedulerEventProcessLoop is single threaded 
> and will block other events in the queue, like TaskCompletion.
> The JobSubmitted event is time consuming depending on the nature of the job 
> (for example: calculating parent stage dependencies, shuffle dependencies, 
> partitions) and thus it blocks all the other events from being processed.
>  
> I see multiple JIRA referring to this behavior
> https://issues.apache.org/jira/browse/SPARK-2647
> https://issues.apache.org/jira/browse/SPARK-4961
>  
> Similarly, in my cluster the partition calculation of some jobs is time consuming 
> (similar to the stack at SPARK-2647), hence it slows down the Spark 
> DAGSchedulerEventProcessLoop, which causes user jobs to slow down even if 
> their tasks finish within seconds, as TaskCompletion events are processed 
> at a slower rate due to the blockage.
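
A minimal toy sketch (illustrative only, not Spark code) of why a single-threaded event
loop makes cheap events wait behind an expensive JobSubmitted handler:
{code:java}
import java.util.concurrent.LinkedBlockingQueue

// Toy model of a single-threaded event loop such as DAGSchedulerEventProcessLoop.
sealed trait Event
case object JobSubmitted extends Event    // expensive: stage/partition computation
case object TaskCompletion extends Event  // cheap, but queued behind the slow event

object SingleThreadedLoopDemo extends App {
  val queue = new LinkedBlockingQueue[Event]()
  val loop = new Thread(() => while (true) {
    queue.take() match {
      case JobSubmitted   => Thread.sleep(5000) // simulate slow JobSubmitted handling
      case TaskCompletion => println("task completion handled")
    }
  })
  loop.setDaemon(true)
  loop.start()
  queue.put(JobSubmitted)    // blocks the loop for 5 seconds
  queue.put(TaskCompletion)  // handled only after the slow event finishes
  Thread.sleep(6000)         // keep the demo alive long enough to observe the delay
}
{code}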



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30527) Add IsNotNull filter when use In, InSet and InSubQuery

2020-01-16 Thread ulysses you (Jira)
ulysses you created SPARK-30527:
---

 Summary: Add IsNotNull filter when use In, InSet and InSubQuery
 Key: SPARK-30527
 URL: https://issues.apache.org/jira/browse/SPARK-30527
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: ulysses you






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

2020-01-16 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-30395.

Resolution: Duplicate

Duplicate with https://issues.apache.org/jira/browse/SPARK-30276

> When one or more DISTINCT aggregate expressions operate on the same field, 
> the DISTINCT aggregate expression allows the use of the FILTER clause
> 
>
> Key: SPARK-30395
> URL: https://issues.apache.org/jira/browse/SPARK-30395
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986
> This ticket will support:
> When one or more DISTINCT aggregate expressions operate on the same field, 
> the DISTINCT aggregate expression allows the use of the FILTER clause
> such as:
>  
> {code:java}
> select sum(distinct id) filter (where sex = 'man') from student;
> select class_id, sum(distinct id) filter (where sex = 'man') from student 
> group by class_id;
> select count(id) filter (where class_id = 1), sum(distinct id) filter (where 
> sex = 'man') from student;
> select class_id, count(id) filter (where class_id = 1), sum(distinct id) 
> filter (where sex = 'man') from student group by class_id;
> select sum(distinct id), sum(distinct id) filter (where sex = 'man') from 
> student;
> select class_id, sum(distinct id), sum(distinct id) filter (where sex = 
> 'man') from student group by class_id;
> select class_id, count(id), count(id) filter (where class_id = 1), 
> sum(distinct id), sum(distinct id) filter (where sex = 'man') from student 
> group by class_id;
> {code}
>  
>  but not support:
>  
> {code:java}
> select class_id, count(distinct sex), sum(distinct id) filter (where sex = 
> 'man') from student group by class_id;
> select class_id, count(distinct sex) filter (where class_id = 1), 
> sum(distinct id) filter (where sex = 'man') from student group by 
> class_id;{code}
> https://issues.apache.org/jira/browse/SPARK-30396 used for later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30396) When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause

2020-01-16 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-30396.

Resolution: Duplicate

Duplicate with https://issues.apache.org/jira/browse/SPARK-30276

> When there are multiple DISTINCT aggregate expressions acting on different 
> fields, any DISTINCT aggregate expression allows the use of the FILTER clause
> 
>
> Key: SPARK-30396
> URL: https://issues.apache.org/jira/browse/SPARK-30396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986
> This ticket will support:
> When there are multiple DISTINCT aggregate expressions acting on different 
> fields, any DISTINCT aggregate expression allows the use of the FILTER clause
>  such as:
> select class_id, count(distinct sex), sum(distinct id) filter (where sex = 
> 'man') from student group by class_id;
> select class_id, count(distinct sex) filter (where class_id = 1), 
> sum(distinct id) filter (where sex = 'man') from student group by class_id;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016826#comment-17016826
 ] 

Gabor Somogyi commented on SPARK-12312:
---

The PR has been discontinued for a long time, so I've picked this up and am working on it.

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30528) DPP issues

2020-01-16 Thread Mayur Bhosale (Jira)
Mayur Bhosale created SPARK-30528:
-

 Summary: DPP issues
 Key: SPARK-30528
 URL: https://issues.apache.org/jira/browse/SPARK-30528
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.0.0
Reporter: Mayur Bhosale


In DPP, the heuristic that decides whether DPP is going to be beneficial relies on the
sizes of the tables in the right subtree of the join. This might not be a correct
estimate, especially when detailed column-level stats are not available.
{code:java}
// the pruning overhead is the total size in bytes of all scan relations
val overhead = 
otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
{code}
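For illustration only, a hedged arithmetic sketch of the heuristic above with invented
sizes; with coarse table-level stats the overhead term only sees the leaf scan sizes,
never the join that sits above them:
{code:java}
// Hedged arithmetic sketch of the heuristic above, with invented sizes.
val partPlanSizeInBytes = 100L * 1024 * 1024 * 1024          // 100 GB partitioned fact table
val leafSizesInBytes    = Seq(10L * 1024 * 1024 * 1024,      // 10 GB store_returns_partitioned
                              2L * 1024 * 1024)              //  2 MB date_dim
val filterRatio         = 0.5f
val overhead            = leafSizesInBytes.map(_.toFloat).sum
val pruningHasBenefit   = filterRatio * partPlanSizeInBytes.toFloat > overhead
// pruningHasBenefit == true, yet the subquery still executes the join above these
// leaves, which the overhead term never accounts for.
{code}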
Also, DPP executes the entire right side of the join as a subquery, because of which the
tables in the right subtree of the join are scanned multiple times. This can cause issues
when the join is not a Broadcast Hash Join (BHJ) and the subquery result is not reused.
Also, I couldn’t figure out why the results from the subquery get reused only for BHJ.


Consider a query,
{code:java}
SELECT * 
FROM   store_sales_partitioned 
   JOIN (SELECT * 
 FROM   store_returns_partitioned, 
date_dim 
 WHERE  sr_returned_date_sk = d_date_sk) ret_date 
 ON ss_sold_date_sk = d_date_sk 
WHERE  d_fy_quarter_seq > 0 
{code}
DPP will kick in for both joins, and this is how the Spark plan will look.

!image-2020-01-16-17-42-22-271.png!

Some of the observations - 
 * Based on the heuristic, DPP goes ahead with pruning if the cost of scanning 
the tables in the right subtree of the join is less than the benefit due to 
pruning (the cost arises because multiple scans will be needed for an SMJ). But 
the heuristic simply checks whether the benefit offsets the cost of the multiple 
scans, and does not take into consideration other operations in the right 
subtree, such as joins, which can be quite expensive. This issue will be 
particularly prominent when detailed column-level stats are not available. In 
the example above, the pruningHasBenefit decision was made on the basis of the 
sizes of the tables store_returns_partitioned and date_dim, but did not take 
into consideration the join between them that happens before the join with the 
store_sales_partitioned table.

 * Multiple scans are needed when the join is an SMJ, as the exchanges are not 
reused. This is because an Aggregate gets added on top of the right subtree, 
which is executed as a subquery in order to prune only the required columns. 
Here, scanning all the columns (as the right subtree of the join would anyway) 
and reusing the same exchange might be more helpful, as it avoids duplicate 
scans.

This was just a representative example, but in general, for cases such as the one below, 
DPP can cause performance issues.

!image-2020-01-16-17-36-06-781.png!




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30528) DPP issues

2020-01-16 Thread Mayur Bhosale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Bhosale updated SPARK-30528:
--
Attachment: plan.png
cases.png

> DPP issues
> --
>
> Key: SPARK-30528
> URL: https://issues.apache.org/jira/browse/SPARK-30528
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: Mayur Bhosale
>Priority: Major
>  Labels: performance
> Attachments: cases.png, plan.png
>
>
> In DPP, heuristics to decide if DPP is going to benefit relies on the sizes 
> of the tables in the right subtree of the join. This might not be a correct 
> estimate especially when the detailed column level stats are not available.
> {code:java}
> // the pruning overhead is the total size in bytes of all scan relations
> val overhead = 
> otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
> filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
> {code}
> Also, DPP executes the entire right side of the join as a subquery because of 
> which multiple scans happen for the tables in the right subtree of the join. 
> This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
> the subquery result does not happen. Also, I couldn’t figure out, why do the 
> results from the subquery get re-used only for BHJ?
> Consider a query,
> {code:java}
> SELECT * 
> FROM   store_sales_partitioned 
>JOIN (SELECT * 
>  FROM   store_returns_partitioned, 
> date_dim 
>  WHERE  sr_returned_date_sk = d_date_sk) ret_date 
>  ON ss_sold_date_sk = d_date_sk 
> WHERE  d_fy_quarter_seq > 0 
> {code}
> DPP will kick-in for both the joins and this is how the Spark plan will look 
> like.
> !image-2020-01-16-17-42-22-271.png!
> Some of the observations - 
>  * Based on heuristics, DPP would go ahead with pruning if the cost of 
> scanning the tables in the right sub-tree of the join is less than the 
> benefit due to pruning. This is due to the reason that multiple scans will be 
> needed for an SMJ. But heuristics simply checks if the benefits offset the 
> cost of multiple scans and do not take into consideration other operations 
> like Join, etc in the right subtree which can be quite expensive. This issue 
> will be particularly prominent when detailed column level stats are not 
> available. In the example above, a decision that pruningHasBenefit was made 
> on the basis of sizes of the tables store_returns_partitioned and date_dim 
> but did not take into consideration the join between them before the join 
> happens with the store_sales_partitioned table.
>  * Multiple scans are needed when the join is SMJ as the reuse of the 
> exchanges does not happen. This is because Aggregate gets added on top of the 
> right subtree to be executed as a subquery in order to prune only required 
> columns. Here, scanning all the columns as the right subtree of the join 
> would, and reusing the same exchange might be more helpful as it avoids 
> duplicate scans.
> This was just a representative example, but in-general for cases such as 
> below, DPP can cause performance issues.
> !image-2020-01-16-17-36-06-781.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30528) DPP issues

2020-01-16 Thread Mayur Bhosale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Bhosale updated SPARK-30528:
--
Description: 
In DPP, heuristics to decide if DPP is going to benefit relies on the sizes of 
the tables in the right subtree of the join. This might not be a correct 
estimate especially when the detailed column level stats are not available.
{code:java}
// the pruning overhead is the total size in bytes of all scan relations
val overhead = 
otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
{code}
Also, DPP executes the entire right side of the join as a subquery because of 
which multiple scans happen for the tables in the right subtree of the join. 
This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
the subquery result does not happen. Also, I couldn’t figure out, why do the 
results from the subquery get re-used only for BHJ?

 

Consider a query,
{code:java}
SELECT * 
FROM   store_sales_partitioned 
   JOIN (SELECT * 
 FROM   store_returns_partitioned, 
date_dim 
 WHERE  sr_returned_date_sk = d_date_sk) ret_date 
 ON ss_sold_date_sk = d_date_sk 
WHERE  d_fy_quarter_seq > 0 
{code}
DPP will kick-in for both the join. (Please check the image attached below for 
the plan)

Some of the observations -
 * Based on heuristics, DPP would go ahead with pruning if the cost of scanning 
the tables in the right sub-tree of the join is less than the benefit due to 
pruning. This is due to the reason that multiple scans will be needed for an 
SMJ. But heuristics simply checks if the benefits offset the cost of multiple 
scans and do not take into consideration other operations like Join, etc in the 
right subtree which can be quite expensive. This issue will be particularly 
prominent when detailed column level stats are not available. In the example 
above, a decision that pruningHasBenefit was made on the basis of sizes of the 
tables store_returns_partitioned and date_dim but did not take into 
consideration the join between them before the join happens with the 
store_sales_partitioned table.

 * Multiple scans are needed when the join is SMJ as the reuse of the exchanges 
does not happen. This is because Aggregate gets added on top of the right 
subtree to be executed as a subquery in order to prune only required columns. 
Here, scanning all the columns as the right subtree of the join would, and 
reusing the same exchange might be more helpful as it avoids duplicate scans.

This was just a representative example, but in-general for cases such as in the 
image below, DPP can cause performance issues.

 

  was:
In DPP, heuristics to decide if DPP is going to benefit relies on the sizes of 
the tables in the right subtree of the join. This might not be a correct 
estimate especially when the detailed column level stats are not available.
{code:java}
// the pruning overhead is the total size in bytes of all scan relations
val overhead = 
otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
{code}
Also, DPP executes the entire right side of the join as a subquery because of 
which multiple scans happen for the tables in the right subtree of the join. 
This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
the subquery result does not happen. Also, I couldn’t figure out, why do the 
results from the subquery get re-used only for BHJ?


Consider a query,
{code:java}
SELECT * 
FROM   store_sales_partitioned 
   JOIN (SELECT * 
 FROM   store_returns_partitioned, 
date_dim 
 WHERE  sr_returned_date_sk = d_date_sk) ret_date 
 ON ss_sold_date_sk = d_date_sk 
WHERE  d_fy_quarter_seq > 0 
{code}
DPP will kick-in for both the joins and this is how the Spark plan will look 
like.

!image-2020-01-16-17-42-22-271.png!

Some of the observations - 
 * Based on heuristics, DPP would go ahead with pruning if the cost of scanning 
the tables in the right sub-tree of the join is less than the benefit due to 
pruning. This is due to the reason that multiple scans will be needed for an 
SMJ. But heuristics simply checks if the benefits offset the cost of multiple 
scans and do not take into consideration other operations like Join, etc in the 
right subtree which can be quite expensive. This issue will be particularly 
prominent when detailed column level stats are not available. In the example 
above, a decision that pruningHasBenefit was made on the basis of sizes of the 
tables store_returns_partitioned and date_dim but did not take into 
consideration the join between them before the join happens with the 
store_sales_partitioned table.

 * Multiple scans are needed when the joi

[jira] [Commented] (SPARK-30337) Convert case class with var to normal class in spark-sql-kafka module

2020-01-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016866#comment-17016866
 ] 

Gabor Somogyi commented on SPARK-30337:
---

[~hyukjin.kwon] [~kabhwan] shall we resolve this jira since the linked PR is merged?

> Convert case class with var to normal class in spark-sql-kafka module
> -
>
> Key: SPARK-30337
> URL: https://issues.apache.org/jira/browse/SPARK-30337
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> There was a review comment in SPARK-25151 pointed out this, but we decided to 
> mark it as TODO as it was having 300+ comments and didn't want to drag it 
> more.
> This issue tracks the effort on addressing TODO comments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30483) Job History does not show pool properties table

2020-01-16 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016873#comment-17016873
 ] 

pavithra ramachandran commented on SPARK-30483:
---

Issue is resolved in master and 2.4

 

[https://github.com/apache/spark/commit/6d90298438e627187088a5d8c53d470646d051f4]

> Job History does not show pool properties table
> ---
>
> Key: SPARK-30483
> URL: https://issues.apache.org/jira/browse/SPARK-30483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Stage will show the Pool Name column, but when the user clicks the Pool Name hyperlink it will not redirect to the Pool Properties table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30337) Convert case class with var to normal class in spark-sql-kafka module

2020-01-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-30337.
--
Resolution: Invalid

We just removed TODO.

> Convert case class with var to normal class in spark-sql-kafka module
> -
>
> Key: SPARK-30337
> URL: https://issues.apache.org/jira/browse/SPARK-30337
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> There was a review comment in SPARK-25151 pointed out this, but we decided to 
> mark it as TODO as it was having 300+ comments and didn't want to drag it 
> more.
> This issue tracks the effort on addressing TODO comments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30507) TableCalalog reserved properties shoudn't be changed with options or tblpropeties

2020-01-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30507.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27197
[https://github.com/apache/spark/pull/27197]

> TableCalalog reserved properties shoudn't be changed with options or 
> tblpropeties
> -
>
> Key: SPARK-30507
> URL: https://issues.apache.org/jira/browse/SPARK-30507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Instead of using `OPTIONS (k='v')` or TBLPROPERTIES  (k='v'), if k is a 
> reserved TableCatalog property, we should use its specific syntax to 
> add/modify/delete it.
> e.g. provider is a reserved property, we should use the USING clause to 
> specify it, and should not allow ALTER TABLE ... UNSET 
> TBLPROPERTIES('provider') to delete it.
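
As a hedged sketch of the intended usage described above (the table name is invented):
the reserved provider property is set through its dedicated USING clause, and unsetting
it through TBLPROPERTIES is expected to be rejected.
{code:java}
// Provider is specified via its dedicated syntax, not OPTIONS/TBLPROPERTIES.
spark.sql("CREATE TABLE demo_tbl (id INT) USING parquet")

// Expected to be disallowed for the reserved property (assumption based on the description):
// spark.sql("ALTER TABLE demo_tbl UNSET TBLPROPERTIES ('provider')")
{code}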



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30507) TableCalalog reserved properties shoudn't be changed with options or tblpropeties

2020-01-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30507:
---

Assignee: Kent Yao

> TableCalalog reserved properties shoudn't be changed with options or 
> tblpropeties
> -
>
> Key: SPARK-30507
> URL: https://issues.apache.org/jira/browse/SPARK-30507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Instead of using `OPTIONS (k='v')` or TBLPROPERTIES  (k='v'), if k is a 
> reserved TableCatalog property, we should use its specific syntax to 
> add/modify/delete it.
> e.g. provider is a reserved property, we should use the USING clause to 
> specify it, and should not allow ALTER TABLE ... UNSET 
> TBLPROPERTIES('provider') to delete it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2020-01-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016974#comment-17016974
 ] 

Sean R. Owen commented on SPARK-27750:
--

I don't think Spark or a resource manager can save you from this entirely. What 
would you do - not launch an app because some other app might want resources 
later? And how would Spark even know there are other drivers? RMs can use pools 
to prevent too many resources going to one user.

> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-01-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30524:
---

Assignee: Ke Jia

> Disable OptimizeSkewJoin rule if introducing additional shuffle.
> 
>
> Key: SPARK-30524
> URL: https://issues.apache.org/jira/browse/SPARK-30524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> OptimizeSkewedJoin will break the outputPartitioning of the original SMJ, and 
> it may introduce an additional shuffle after OptimizeSkewedJoin is applied. This 
> PR will disable the "OptimizeSkewedJoin" rule if it introduces an additional shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-01-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30524.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27226
[https://github.com/apache/spark/pull/27226]

> Disable OptimizeSkewJoin rule if introducing additional shuffle.
> 
>
> Key: SPARK-30524
> URL: https://issues.apache.org/jira/browse/SPARK-30524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> OptimizeSkewedJoin will break the outputPartitioning of the original SMJ, and 
> it may introduce an additional shuffle after OptimizeSkewedJoin is applied. This 
> PR will disable the "OptimizeSkewedJoin" rule if it introduces an additional shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30529) Improve error messages when Executor dies before registering with driver

2020-01-16 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30529:
-

 Summary: Improve error messages when Executor dies before 
registering with driver
 Key: SPARK-30529
 URL: https://issues.apache.org/jira/browse/SPARK-30529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


Currently, when you give the executor a bad configuration for accelerator-aware
scheduling, the executors can die, but it's hard for the user to know why. The executor
dies and logs what went wrong in its log files, but many times it is hard to find those
logs because the executor hasn't registered yet. Since it hasn't registered, the executor
doesn't show up in the UI where its log files could be found.

One specific example is giving a discovery script that doesn't find all the GPUs:

20/01/16 08:59:24 INFO YarnCoarseGrainedExecutorBackend: Connecting to driver: 
spark://CoarseGrainedScheduler@10.28.9.112:44403
20/01/16 08:59:24 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Resource: gpu, with 
addresses: 0 is less than what the user requested: 2)
 at scala.Predef$.require(Predef.scala:281)
 at 
org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1(ResourceUtils.scala:251)
 at 
org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1$adapted(ResourceUtils.scala:248)

 

Figure out a better way of logging or letting the user know what error occurred when the
executor dies before registering.
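
For context, a hedged sketch of the kind of resource configuration involved (the
discovery script path is made up): the executor requests 2 GPUs, and if the script
reports only one address the requirement check above fails and the executor dies before
registering.
{code:java}
import org.apache.spark.SparkConf

// Hedged example configuration; the discovery script path is hypothetical.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "2")                              // 2 GPUs requested
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh") // reports only 1
  .set("spark.task.resource.gpu.amount", "1")
{code}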



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results

2020-01-16 Thread Jason Darrell Lowe (Jira)
Jason Darrell Lowe created SPARK-30530:
--

 Summary: CSV load followed by "is null" filter produces incorrect 
results
 Key: SPARK-30530
 URL: https://issues.apache.org/jira/browse/SPARK-30530
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jason Darrell Lowe


Filtering with "is null" on values loaded from a CSV file has regressed recently and now
produces incorrect results.

Given a CSV file with the contents:
{noformat:title=floats.csv}
100.0,1.0,
200.0,,
300.0,3.0,
1.0,4.0,
,4.0,
500.0,,
,6.0,
-500.0,50.5
 {noformat}
Filtering this data for the first column being null should return exactly two 
rows, but it is returning extraneous rows with nulls:
{noformat}
scala> val schema = StructType(Array(StructField("floats", FloatType, 
true),StructField("more_floats", FloatType, true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(floats,FloatType,true), 
StructField(more_floats,FloatType,true))

scala> val df = spark.read.schema(schema).csv("floats.csv")
df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]

scala> df.filter("floats is null").show
+------+-----------+
|floats|more_floats|
+------+-----------+
|  null|       null|
|  null|       null|
|  null|       null|
|  null|       null|
|  null|        4.0|
|  null|       null|
|  null|        6.0|
+------+-----------+
{noformat}
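
For reference, only the two rows of floats.csv whose first column is empty should survive
the filter, so the expected output is:
{noformat}
+------+-----------+
|floats|more_floats|
+------+-----------+
|  null|        4.0|
|  null|        6.0|
+------+-----------+
{noformat}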



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2020-01-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017085#comment-17017085
 ] 

Wenchen Fan commented on SPARK-27750:
-

do you hit this issue in the real world?

> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results

2020-01-16 Thread Jason Darrell Lowe (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017088#comment-17017088
 ] 

Jason Darrell Lowe commented on SPARK-30530:


The regressed behavior was introduced by this commit:

{noformat}
commit 4e50f0291f032b4a5c0b46ed01fdef14e4cbb050
Author: Maxim Gekk 
Date: Thu Jan 16 13:10:08 2020 +0900

[SPARK-30323][SQL] Support filters pushdown in CSV datasource
{noformat}

[~maxgekk] would you take a look?


> CSV load followed by "is null" filter produces incorrect results
> 
>
> Key: SPARK-30530
> URL: https://issues.apache.org/jira/browse/SPARK-30530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> Trying to filter on is null from values loaded from a CSV file has regressed 
> recently and now produces incorrect results.
> Given a CSV file with the contents:
> {noformat:title=floats.csv}
> 100.0,1.0,
> 200.0,,
> 300.0,3.0,
> 1.0,4.0,
> ,4.0,
> 500.0,,
> ,6.0,
> -500.0,50.5
>  {noformat}
> Filtering this data for the first column being null should return exactly two 
> rows, but it is returning extraneous rows with nulls:
> {noformat}
> scala> val schema = StructType(Array(StructField("floats", FloatType, 
> true),StructField("more_floats", FloatType, true)))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(floats,FloatType,true), 
> StructField(more_floats,FloatType,true))
> scala> val df = spark.read.schema(schema).csv("floats.csv")
> df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]
> scala> df.filter("floats is null").show
> +--+---+
> |floats|more_floats|
> +--+---+
> |  null|   null|
> |  null|   null|
> |  null|   null|
> |  null|   null|
> |  null|4.0|
> |  null|   null|
> |  null|6.0|
> +--+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-16 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017281#comment-17017281
 ] 

Marcelo Masiero Vanzin commented on SPARK-27868:


You shouldn't have reverted the whole change. The documentation and extra logging are
still really useful.

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default 
> -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 
> -1); }
> {code}
> The default value actually causes the default value from the JRE to be used, 
> which is 50 according to the docs.
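
A hedged example of setting the backlog explicitly instead of relying on the implicit
JRE default (the value 4096 is arbitrary):
{code:java}
import org.apache.spark.SparkConf

// Hedged sketch: explicitly size the shuffle listen backlog rather than falling
// back to the JRE default of 50.
val conf = new SparkConf()
  .set("spark.shuffle.io.backLog", "4096")
{code}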



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30531) Duplicate query plan on Spark UI SQL page

2020-01-16 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-30531:
-

 Summary: Duplicate query plan on Spark UI SQL page
 Key: SPARK-30531
 URL: https://issues.apache.org/jira/browse/SPARK-30531
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Enrico Minack


When you save a Spark UI SQL query page to disk and then display the html file 
with your browser, the query plan will be rendered a second time. This change 
avoids rendering the plan visualization when it exists already.

 

!https://user-images.githubusercontent.com/44700269/72543429-fcb8d980-3885-11ea-82aa-c0b3638847e5.png!

The fix does not call {{renderPlanViz()}} when the plan exists already:

!https://user-images.githubusercontent.com/44700269/72543641-57523580-3886-11ea-8cdf-5fb0cdffa983.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results

2020-01-16 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017290#comment-17017290
 ] 

Maxim Gekk commented on SPARK-30530:


[~jlowe] Thank you for the bug report. I will take a look at it.

> CSV load followed by "is null" filter produces incorrect results
> 
>
> Key: SPARK-30530
> URL: https://issues.apache.org/jira/browse/SPARK-30530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> Trying to filter on is null from values loaded from a CSV file has regressed 
> recently and now produces incorrect results.
> Given a CSV file with the contents:
> {noformat:title=floats.csv}
> 100.0,1.0,
> 200.0,,
> 300.0,3.0,
> 1.0,4.0,
> ,4.0,
> 500.0,,
> ,6.0,
> -500.0,50.5
>  {noformat}
> Filtering this data for the first column being null should return exactly two 
> rows, but it is returning extraneous rows with nulls:
> {noformat}
> scala> val schema = StructType(Array(StructField("floats", FloatType, 
> true),StructField("more_floats", FloatType, true)))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(floats,FloatType,true), 
> StructField(more_floats,FloatType,true))
> scala> val df = spark.read.schema(schema).csv("floats.csv")
> df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]
> scala> df.filter("floats is null").show
> +--+---+
> |floats|more_floats|
> +--+---+
> |  null|   null|
> |  null|   null|
> |  null|   null|
> |  null|   null|
> |  null|4.0|
> |  null|   null|
> |  null|6.0|
> +--+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks intentionally killed speculative tasks as pending leads to holding idle executors

2020-01-16 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks fail or get killed, they are still considered pending and count 
towards the calculation of the number of needed executors.
h3. Symptom

In one of our production jobs (running 4 tasks per executor), we found that it was
holding 6 executors at the end with only 2 tasks running (1 speculative). With more
logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because their corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see that when running the last task, we hold 38 executors (see attachment),
which is exactly (152 + 3) / 4 = 38.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads Spark to hold more executors than it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too

 

 

 

  was:
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see when running the last task, we would be hold 38 executors (see 
attachment), which is exactly (152 + 3) / 4 = 38.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too

 

 

 


> Spark marks intentionally killed speculative tasks as pending leads to 
> holding idle executors
> -
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks fail/get killed, they are still considered as pending 
> and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> foun

[jira] [Updated] (SPARK-30511) Spark marks intentionally killed speculative tasks as pending leads to holding idle executors

2020-01-16 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Summary: Spark marks intentionally killed speculative tasks as pending 
leads to holding idle executors  (was: Spark marks ended speculative tasks as 
pending leads to holding idle executors)

> Spark marks intentionally killed speculative tasks as pending leads to 
> holding idle executors
> -
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because of corresponding original tasks had finished.
> An easy repro of the issue (`--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
> cluster mode):
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index < 300 && index >= 150) {
> Thread.sleep(index * 1000) // Fake running tasks
> } else if (index == 300) {
> Thread.sleep(1000 * 1000) // Fake long running tasks
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}
> You will see when running the last task, we would be hold 38 executors (see 
> attachment), which is exactly (152 + 3) / 4 = 38.
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_, but never decremented.  
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads to Spark to hold more executors that it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-28403 too
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30491) Enable dependency audit files to tell dependency classifier

2020-01-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-30491:
-
Description: 
Current dependency audit files under `dev/deps` only show jar names, and there isn't a
simple rule for parsing a jar name to get the values of the different fields. For
example, `hadoop2` is the classifier of `avro-mapred-1.8.2-hadoop2.jar`; in contrast,
`incubating` is part of the version of `htrace-core-3.1.0-incubating.jar`.

Thus, I propose to enable dependency audit files to tell the value of artifact 
id, version, and classifier of a dependency. For example, 
`avro-mapred-1.8.2-hadoop2.jar` should be expanded to 
`avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar` where `avro-mapred` 
is the artifact id, `1.8.2` is the version, and `hadoop2` is the classifier.

In this way, dependency audit files are able to be consumed by automated tests 
or downstream tools. There is a good example of the downstream tool that would 
be enabled:
Say we have a Spark application that depends on a third-party dependency `foo`, 
which pulls in `jackson` as a transient dependency. Unfortunately, `foo` 
depends on a different version of `jackson` than Spark. So, in the pom of this 
Spark application, we use the dependency management section to pin the version 
of `jackson`. By doing this, we are lifting `jackson` to the top-level 
dependency of my application and I want to have a way to keep tracking what 
Spark uses. What we can do is to cross-check my Spark application's classpath 
with what Spark uses. Then, with a test written in my code base, whenever my 
application bumps Spark version, this test will check what we define in the 
application and what Spark has, and then remind us to change our application's 
pom if needed. In my case, I am fine to directly access git to get these audit 
files.

 

  was:
Dependency audit files under `dev/deps` only show jar names. Given that, it is 
not trivial to figure out the dependency classifiers.

For example, `avro-mapred-1.8.2-hadoop2.jar` is made up of artifact id 
`avro-mapred`, version `1.8.2`, and classifier `hadoop2`. In contrast, 
`htrace-core-3.1.0-incubating.jar` is made up of artifact id `htrace-core` and 
version `3.1.0-incubating`. 

All in all, the classifier can't be determined from its position in the jar name; 
however, as part of the identifier of a dependency, it should be clearly 
distinguishable.

 


> Enable dependency audit files to tell dependency classifier
> ---
>
> Key: SPARK-30491
> URL: https://issues.apache.org/jira/browse/SPARK-30491
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> Current dependency audit files under `dev/deps` only show jar names, and there 
> isn't a simple rule for parsing a jar name to get the values of its different 
> fields. For example, `hadoop2` is the classifier of 
> `avro-mapred-1.8.2-hadoop2.jar`; in contrast, `incubating` is part of the 
> version of `htrace-core-3.1.0-incubating.jar`.
> Thus, I propose to enable dependency audit files to tell the artifact id, 
> version, and classifier of a dependency. For example, 
> `avro-mapred-1.8.2-hadoop2.jar` would be expanded to 
> `avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar`, where `avro-mapred` 
> is the artifact id, `1.8.2` is the version, and `hadoop2` is the classifier.
> In this way, dependency audit files can be consumed by automated tests or 
> downstream tools. A good example of a downstream tool that would be enabled:
> Say we have a Spark application that depends on a third-party dependency `foo`, 
> which pulls in `jackson` as a transitive dependency. Unfortunately, `foo` 
> depends on a different version of `jackson` than Spark, so in the pom of this 
> Spark application we use the dependency management section to pin the version 
> of `jackson`. By doing this, we lift `jackson` to a top-level dependency of the 
> application, and we want a way to keep tracking what Spark uses. What we can do 
> is cross-check the application's classpath with what Spark uses. Then, with a 
> test written in our code base, whenever the application bumps its Spark 
> version, the test checks what we define in the application against what Spark 
> has and reminds us to update the application's pom if needed. In this case, we 
> are fine with directly accessing git to get these audit files.
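For illustration, a minimal sketch (a hypothetical helper, not part of the 
proposal itself) of how a downstream test could split such an expanded audit 
entry back into its fields:
{code:scala}
// Hypothetical parser for the proposed "artifactId/version/classifier/jarName" layout.
// A dependency without a classifier would use "artifactId/version/jarName".
case class DepCoordinate(artifactId: String, version: String, classifier: Option[String])

def parseAuditEntry(entry: String): DepCoordinate = entry.split("/") match {
  case Array(artifactId, version, classifier, _jarName) =>
    DepCoordinate(artifactId, version, Some(classifier))
  case Array(artifactId, version, _jarName) =>
    DepCoordinate(artifactId, version, None)
  case _ =>
    throw new IllegalArgumentException(s"Unexpected audit entry: $entry")
}

// parseAuditEntry("avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar")
//   == DepCoordinate("avro-mapred", "1.8.2", Some("hadoop2"))
{code}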
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30532) DataFrameStatFunctions.approxQuantile doesn't work with TABLE.COLUMN syntax

2020-01-16 Thread Chris Suchanek (Jira)
Chris Suchanek created SPARK-30532:
--

 Summary: DataFrameStatFunctions.approxQuantile doesn't work with 
TABLE.COLUMN syntax
 Key: SPARK-30532
 URL: https://issues.apache.org/jira/browse/SPARK-30532
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Chris Suchanek


DataFrameStatFunctions.approxQuantile doesn't work with a fully qualified 
column name (i.e. TABLE_NAME.COLUMN_NAME), which is often how you refer to a 
column when working with joined dataframes that have ambiguous column names.


See the code below for an example.
{code:java}

import scala.util.Random
val l = (0 to 1000).map(_ => Random.nextGaussian() * 1000)
val df1 = sc.parallelize(l).toDF("num").as("tt1")
val df2 = sc.parallelize(l).toDF("num").as("tt2")
val dfx = df2.crossJoin(df1)

dfx.stat.approxQuantile("tt1.num", Array(0.1), 0.0)
// throws: java.lang.IllegalArgumentException: Field "tt1.num" does not exist.
Available fields: num

dfx.stat.approxQuantile("num", Array(0.1), 0.0)
// throws: org.apache.spark.sql.AnalysisException: Reference 'num' is 
ambiguous, could be: tt2.num, tt1.num.;{code}
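A possible workaround until this is addressed (a sketch only, assuming the 
ambiguous columns can first be projected to unique names):
{code:scala}
import org.apache.spark.sql.functions.col

// Project the qualified columns to unique, unqualified names, then call approxQuantile.
val dfy = dfx.select(col("tt1.num").as("num1"), col("tt2.num").as("num2"))
dfy.stat.approxQuantile("num1", Array(0.1), 0.0)
{code}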
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels

2020-01-16 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30533:
--

 Summary: Add classes to represent Java Regressors and 
RegressionModels
 Key: SPARK-30533
 URL: https://issues.apache.org/jira/browse/SPARK-30533
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz


Right now PySpark provides classes representing Java {{Classifiers}} and 
{{ClassifierModels}}, but lacks their regression counterparts.

We should provide these for consistency, feature parity, and as a prerequisite for 
SPARK-29212.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2020-01-16 Thread Samuel Shepard (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017438#comment-17017438
 ] 

Samuel Shepard commented on SPARK-6235:
---

[~irashid] I followed your suggestion of looking in the user archive and [found 
an old PR|https://github.com/apache/spark/pull/17907] that tried to fix the 
PCA call itself. It was closed, but I linked it back here. [~srowen] is also 
on the thread. I leave this comment to help direct users to a workaround as 
much as to encourage a future fix.

Thanks for all you guys do.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-16 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-30512:
--
Comment: was deleted

(was: [https://github.com/apache/spark/pull/27240])

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> We have been seeing a large number of SASL authentication (RPC requests) 
> timing out with the external shuffle service.
>  The issue and all the analysis we did is described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the netty pipeline and realized that even the 
> channel registration is delayed by 30 seconds. 
>  In the Spark External Shuffle Service, the boss event group and the worker 
> event group are the same, which is causing this delay.
> {code:java}
> EventLoopGroup bossGroup =
>   NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
> conf.getModuleName() + "-server");
> EventLoopGroup workerGroup = bossGroup;
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
>   .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When the load at the shuffle service increases, since the worker threads are 
> busy with existing channels, registering new channels gets delayed.
> The fix is simple. I created a dedicated boss thread event loop group with 1 
> thread.
> {code:java}
> EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
>   conf.getModuleName() + "-boss");
> EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
> conf.serverThreads(),
> conf.getModuleName() + "-server");
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
> {code}
> This fixed the issue.
>  We just need 1 thread in the boss group because there is only a single 
> server bootstrap.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-16 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017447#comment-17017447
 ] 

Chandni Singh commented on SPARK-30512:
---

[https://github.com/apache/spark/pull/27240]

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> We have been seeing a large number of SASL authentication (RPC requests) 
> timing out with the external shuffle service.
>  The issue and all the analysis we did is described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the netty pipeline and realized that even the 
> channel registration is delayed by 30 seconds. 
>  In the Spark External Shuffle Service, the boss event group and the worker 
> event group are the same, which is causing this delay.
> {code:java}
> EventLoopGroup bossGroup =
>   NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
> conf.getModuleName() + "-server");
> EventLoopGroup workerGroup = bossGroup;
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
>   .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When the load at the shuffle service increases, since the worker threads are 
> busy with existing channels, registering new channels gets delayed.
> The fix is simple. I created a dedicated boss thread event loop group with 1 
> thread.
> {code:java}
> EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
>   conf.getModuleName() + "-boss");
> EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
> conf.serverThreads(),
> conf.getModuleName() + "-server");
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
> {code}
> This fixed the issue.
>  We just need 1 thread in the boss group because there is only a single 
> server bootstrap.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2020-01-16 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017448#comment-17017448
 ] 

t oo commented on SPARK-27750:
--

Yes, I hit this. 'spark standalone' is the RM; it has no pools. I am thinking one 
way to do this is a new config, e.g. cores_reserved_for_apps=8 (changeable); then 
those 8 cores cannot be consumed by drivers.

> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29708) Different answers in aggregates of duplicate grouping sets

2020-01-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29708:
--
Fix Version/s: 2.4.5

> Different answers in aggregates of duplicate grouping sets
> --
>
> Key: SPARK-29708
> URL: https://issues.apache.org/jira/browse/SPARK-29708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> A query below with multiple grouping sets seems to have different answers 
> between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), 
> unsortable_col xid);
> postgres=# insert into gstest4
> postgres-# values (1,1,b'0000','1'), (2,2,b'0001','1'),
> postgres-#(3,4,b'0010','2'), (4,8,b'0011','2'),
> postgres-#(5,16,b'0000','2'), (6,32,b'0001','2'),
> postgres-#(7,64,b'0010','1'), (8,128,b'0011','1');
> INSERT 0 8
> postgres=# select unsortable_col, count(*)
> postgres-#   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
> postgres-#   order by text(unsortable_col);
>  unsortable_col | count 
> +---
>   1 | 8
>   1 | 8
>   2 | 8
>   2 | 8
> (4 rows)
> {code}
> {code:java}
> scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* 
> bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""")
> scala> sql("""
>  | insert into gstest4
>  | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1),
>  |(3,4,tinyint('2'),2), (4,8,tinyint('3'),2),
>  |(5,16,tinyint('0'),2), (6,32,tinyint('1'),2),
>  |(7,64,tinyint('2'),1), (8,128,tinyint('3'),1)
>  | """)
> res21: org.apache.spark.sql.DataFrame = []
> scala> 
> scala> sql("""
>  | select unsortable_col, count(*)
>  |   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
>  |   order by string(unsortable_col)
>  | """).show
> +--++
> |unsortable_col|count(1)|
> +--++
> | 1|   8|
> | 2|   8|
> +--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels

2020-01-16 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-30533:
---
Parent: SPARK-28958
Issue Type: Sub-task  (was: Bug)

> Add classes to represent Java Regressors and RegressionModels
> -
>
> Key: SPARK-30533
> URL: https://issues.apache.org/jira/browse/SPARK-30533
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Right now PySpark provides classes representing Java {{Classifiers}} and 
> {{ClassifierModels}}, but lacks their regression counterparts.
> We should provide these for consistency, feature parity, and as a prerequisite 
> for SPARK-29212.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29708) Different answers in aggregates of duplicate grouping sets

2020-01-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29708:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4

> Different answers in aggregates of duplicate grouping sets
> --
>
> Key: SPARK-29708
> URL: https://issues.apache.org/jira/browse/SPARK-29708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> A query below with multiple grouping sets seems to have different answers 
> between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), 
> unsortable_col xid);
> postgres=# insert into gstest4
> postgres-# values (1,1,b'0000','1'), (2,2,b'0001','1'),
> postgres-#(3,4,b'0010','2'), (4,8,b'0011','2'),
> postgres-#(5,16,b'0000','2'), (6,32,b'0001','2'),
> postgres-#(7,64,b'0010','1'), (8,128,b'0011','1');
> INSERT 0 8
> postgres=# select unsortable_col, count(*)
> postgres-#   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
> postgres-#   order by text(unsortable_col);
>  unsortable_col | count 
> +---
>   1 | 8
>   1 | 8
>   2 | 8
>   2 | 8
> (4 rows)
> {code}
> {code:java}
> scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* 
> bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""")
> scala> sql("""
>  | insert into gstest4
>  | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1),
>  |(3,4,tinyint('2'),2), (4,8,tinyint('3'),2),
>  |(5,16,tinyint('0'),2), (6,32,tinyint('1'),2),
>  |(7,64,tinyint('2'),1), (8,128,tinyint('3'),1)
>  | """)
> res21: org.apache.spark.sql.DataFrame = []
> scala> 
> scala> sql("""
>  | select unsortable_col, count(*)
>  |   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
>  |   order by string(unsortable_col)
>  | """).show
> +--++
> |unsortable_col|count(1)|
> +--++
> | 1|   8|
> | 2|   8|
> +--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29708) Different answers in aggregates of duplicate grouping sets

2020-01-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017478#comment-17017478
 ] 

Dongjoon Hyun commented on SPARK-29708:
---

This is backported to branch-2.4 via https://github.com/apache/spark/pull/27229 
.

> Different answers in aggregates of duplicate grouping sets
> --
>
> Key: SPARK-29708
> URL: https://issues.apache.org/jira/browse/SPARK-29708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> A query below with multiple grouping sets seems to have different answers 
> between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), 
> unsortable_col xid);
> postgres=# insert into gstest4
> postgres-# values (1,1,b'0000','1'), (2,2,b'0001','1'),
> postgres-#(3,4,b'0010','2'), (4,8,b'0011','2'),
> postgres-#(5,16,b'0000','2'), (6,32,b'0001','2'),
> postgres-#(7,64,b'0010','1'), (8,128,b'0011','1');
> INSERT 0 8
> postgres=# select unsortable_col, count(*)
> postgres-#   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
> postgres-#   order by text(unsortable_col);
>  unsortable_col | count 
> +---
>   1 | 8
>   1 | 8
>   2 | 8
>   2 | 8
> (4 rows)
> {code}
> {code:java}
> scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* 
> bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""")
> scala> sql("""
>  | insert into gstest4
>  | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1),
>  |(3,4,tinyint('2'),2), (4,8,tinyint('3'),2),
>  |(5,16,tinyint('0'),2), (6,32,tinyint('1'),2),
>  |(7,64,tinyint('2'),1), (8,128,tinyint('3'),1)
>  | """)
> res21: org.apache.spark.sql.DataFrame = []
> scala> 
> scala> sql("""
>  | select unsortable_col, count(*)
>  |   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
>  |   order by string(unsortable_col)
>  | """).show
> +--++
> |unsortable_col|count(1)|
> +--++
> | 1|   8|
> | 2|   8|
> +--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29950) Deleted excess executors can connect back to driver in K8S with dyn alloc on

2020-01-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29950.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26586
[https://github.com/apache/spark/pull/26586]

> Deleted excess executors can connect back to driver in K8S with dyn alloc on
> 
>
> Key: SPARK-29950
> URL: https://issues.apache.org/jira/browse/SPARK-29950
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> {{ExecutorPodsAllocator}} currently has code to delete excess pods that the 
> K8S server hasn't started yet, and aren't needed anymore due to downscaling.
> The problem is that there is a race between K8S starting the pod and the 
> Spark code deleting it. This may cause the pod to connect back to Spark and 
> do a lot of initialization, sometimes even being considered for task 
> allocation, just to be killed almost immediately.
> This doesn't cause any problems that I could detect in my tests, but it wastes 
> resources, and causes logs to contain misleading messages about the executor 
> being killed. It would be nice to avoid that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29950) Deleted excess executors can connect back to driver in K8S with dyn alloc on

2020-01-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29950:
--

Assignee: Marcelo Masiero Vanzin

> Deleted excess executors can connect back to driver in K8S with dyn alloc on
> 
>
> Key: SPARK-29950
> URL: https://issues.apache.org/jira/browse/SPARK-29950
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>
> {{ExecutorPodsAllocator}} currently has code to delete excess pods that the 
> K8S server hasn't started yet, and aren't needed anymore due to downscaling.
> The problem is that there is a race between K8S starting the pod and the 
> Spark code deleting it. This may cause the pod to connect back to Spark and 
> do a lot of initialization, sometimes even being considered for task 
> allocation, just to be killed almost immediately.
> This doesn't cause any problems that I could detect in my tests, but it wastes 
> resources, and causes logs to contain misleading messages about the executor 
> being killed. It would be nice to avoid that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-16 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30534:
-

 Summary: Use mvn in `dev/scalastyle`
 Key: SPARK-30534
 URL: https://issues.apache.org/jira/browse/SPARK-30534
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.4.4, 3.0.0
Reporter: Dongjoon Hyun


This issue aims to use `mvn` instead of `sbt`.

As of now, Apache Spark sbt build is broken by the Maven Central repository 
policy.

- 
https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an

> Effective January 15, 2020, The Central Maven Repository no longer supports 
> insecure
> communication over plain HTTP and requires that all requests to the 
> repository are 
> encrypted over HTTPS.

We can reproduce this locally by the following.

$ rm -rf ~/.m2/repository/org/apache/apache/18/
$ build/sbt clean

This issue aims to recover GitHub Action `lint-scala` first by using mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30535) Migrate ALTER TABLE commands to the new resolution framework

2020-01-16 Thread Terry Kim (Jira)
Terry Kim created SPARK-30535:
-

 Summary: Migrate ALTER TABLE commands to the new resolution 
framework
 Key: SPARK-30535
 URL: https://issues.apache.org/jira/browse/SPARK-30535
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30535) Migrate ALTER TABLE commands to the new resolution framework

2020-01-16 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-30535:
--
Description: Migrate ALTER TABLE commands to the new resolution framework 
introduced in SPARK-30214

> Migrate ALTER TABLE commands to the new resolution framework
> 
>
> Key: SPARK-30535
> URL: https://issues.apache.org/jira/browse/SPARK-30535
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> Migrate ALTER TABLE commands to the new resolution framework introduced in 
> SPARK-30214



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-01-16 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017558#comment-17017558
 ] 

Terry Kim commented on SPARK-30420:
---

This is resolved by:

https://github.com/apache/spark/pull/27095 and 
https://github.com/apache/spark/pull/27125

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30534:
-

Assignee: Dongjoon Hyun

> Use mvn in `dev/scalastyle`
> ---
>
> Key: SPARK-30534
> URL: https://issues.apache.org/jira/browse/SPARK-30534
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to use `mvn` instead of `sbt`.
> As of now, Apache Spark sbt build is broken by the Maven Central repository 
> policy.
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are 
> > encrypted over HTTPS.
> We can reproduce this locally by the following.
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
> This issue aims to recover GitHub Action `lint-scala` first by using mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30534.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27242
[https://github.com/apache/spark/pull/27242]

> Use mvn in `dev/scalastyle`
> ---
>
> Key: SPARK-30534
> URL: https://issues.apache.org/jira/browse/SPARK-30534
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> This issue aims to use `mvn` instead of `sbt`.
> As of now, Apache Spark sbt build is broken by the Maven Central repository 
> policy.
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are 
> > encrypted over HTTPS.
> We can reproduce this locally by the following.
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
> This issue aims to recover GitHub Action `lint-scala` first by using mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017615#comment-17017615
 ] 

Dongjoon Hyun commented on SPARK-27868:
---

Oh, do you want me to revert it back?
Or I can make a follow-up PR to sync with master after the ongoing PR is merged 
to master.

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default 
> -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 
> -1); }
> {code}
> The default value actually causes the default value from the JRE to be used, 
> which is 50 according to the docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30536) Sort-merge join operator spilling performance improvements

2020-01-16 Thread Sinisa Knezevic (Jira)
Sinisa Knezevic created SPARK-30536:
---

 Summary: Sort-merge join operator spilling performance improvements
 Key: SPARK-30536
 URL: https://issues.apache.org/jira/browse/SPARK-30536
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.4.4, 2.4.3
Reporter: Sinisa Knezevic


Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQLs 
(for example, query 14) are not able to run even with an extremely large Spark 
executor memory. The Spark spilling feature has to be enabled in order to be able 
to process these SQLs, and processing becomes extremely slow when spilling is 
enabled. The Spark spilling feature is enabled via two parameters: 
“spark.sql.sortMergeJoinExec.buffer.in.memory.threshold” and 
“spark.sql.sortMergeJoinExec.buffer.spill.threshold”.

“spark.sql.sortMergeJoinExec.buffer.in.memory.threshold” – when this threshold 
is reached, the data will be moved from the ExternalAppendOnlyUnsafeRowArray 
object into an UnsafeExternalSorter object.

“spark.sql.sortMergeJoinExec.buffer.spill.threshold” – when this threshold is 
reached, the data will be spilled from the UnsafeExternalSorter object onto the 
disk.
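
For reference, a hedged example of setting these thresholds on a session (the 
values are illustrative only, not recommendations):

{code:scala}
// Illustrative values only; tune them for the workload.
spark.conf.set("spark.sql.sortMergeJoinExec.buffer.in.memory.threshold", "16384")
spark.conf.set("spark.sql.sortMergeJoinExec.buffer.spill.threshold", "1000000")
{code}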

 

During execution of a sort-merge join (Left Semi Join), for each left join row 
the “right matches” are found and stored into an ExternalAppendOnlyUnsafeRowArray 
object. In the case of query 14 there are millions of rows of “right matches”. 
To run this query, spilling is enabled and data is moved from 
ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter and then spilled onto 
the disk. When millions of rows are processed on the left side of the join, an 
iterator on top of the spilled “right matches” rows is created each time, which 
means the iterator over the right matches (spilled on disk) is created millions 
of times. The current Spark implementation creates the iterator on top of the 
spilled rows and produces I/O, which results in millions of I/O operations when 
millions of rows are processed.

 

To avoid this performance bottleneck, this JIRA introduces the following solution:

1. Implement lazy initialization of UnsafeSorterSpillReader, the iterator on top 
of spilled rows:
    … During SortMergeJoin (Left Semi Join) execution, the iterator on the 
spilled data is created but no iteration over the data is done.
    … Lazy initialization of UnsafeSorterSpillReader will enable efficient 
processing of SortMergeJoin even if data is spilled onto disk; unnecessary I/O 
will be avoided.

2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from 
1 MB to 1 KB:
    … The UnsafeSorterSpillReader constructor takes a lot of time due to the 
default 1 MB memory read buffer.
    … The code already has logic to grow the memory read buffer if it cannot fit 
the data, so decreasing the initial size to 1 KB is safe and has a positive 
performance impact.

3. Improve memory utilization when spilling is enabled in 
ExternalAppendOnlyUnsafeRowArray:

    … In the current implementation, when spilling is enabled, an 
UnsafeExternalSorter object is created, the data is moved from the 
ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then 
the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before the 
ExternalAppendOnlyUnsafeRowArray object is emptied, both objects are in memory 
with the same data. That requires double the memory and duplicates the data, 
which can be avoided.

    … In the proposed solution, when 
spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new 
rows into the ExternalAppendOnlyUnsafeRowArray object stops. An 
UnsafeExternalSorter object is created and new rows are added into that object; 
the ExternalAppendOnlyUnsafeRowArray object retains all rows already added to it. 
This approach enables better memory utilization and avoids unnecessary movement 
of data from one object to another.

 

Testing this solution with query 14 and spilling to disk enabled showed a 500X 
performance improvement, and it did not degrade the performance of the other SQLs 
from the TPC-DS benchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30499) Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone

2020-01-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30499:


Assignee: Maxim Gekk

> Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone
> ---
>
> Key: SPARK-30499
> URL: https://issues.apache.org/jira/browse/SPARK-30499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The SQL config has been deprecated since Spark 2.3, and should be removed in 
> Spark 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30499) Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone

2020-01-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30499.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27218
[https://github.com/apache/spark/pull/27218]

> Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone
> ---
>
> Key: SPARK-30499
> URL: https://issues.apache.org/jira/browse/SPARK-30499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The SQL config has been deprecated since Spark 2.3, and should be removed in 
> Spark 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled

2020-01-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30537:


 Summary: toPandas gets wrong dtypes when applied on empty DF when 
Arrow enabled
 Key: SPARK-30537
 URL: https://issues.apache.org/jira/browse/SPARK-30537
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Hyukjin Kwon


The same issue as SPARK-29188 persists when Arrow optimization is enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30538) A not very elegant way to control small output files

2020-01-16 Thread angerszhu (Jira)
angerszhu created SPARK-30538:
-

 Summary: A not very elegant way to control small output files
 Key: SPARK-30538
 URL: https://issues.apache.org/jira/browse/SPARK-30538
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30539) DataFrame.tail in PySpark API

2020-01-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30539:


 Summary: DataFrame.tail in PySpark API
 Key: SPARK-30539
 URL: https://issues.apache.org/jira/browse/SPARK-30539
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


SPARK-30185 added the DataFrame.tail API. It would be good for the PySpark side to 
have it too.
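
For reference, the existing Scala API that a PySpark counterpart would mirror (a 
usage sketch, not the proposed PySpark implementation):

{code:scala}
val df = spark.range(0, 10).toDF("id")
// Returns the last 3 rows as an Array[Row], analogous to head(n) but taken from the end.
df.tail(3)
{code}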



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30449) Introducing get_dummies method in pyspark

2020-01-16 Thread Krishna Kumar Tiwari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Kumar Tiwari updated SPARK-30449:
-
Issue Type: New Feature  (was: Task)

> Introducing get_dummies method in pyspark
> -
>
> Key: SPARK-30449
> URL: https://issues.apache.org/jira/browse/SPARK-30449
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Krishna Kumar Tiwari
>Priority: Major
>
> Introduce a get_dummies method in pyspark, the same as in pandas.
> Often, when using a categorical variable, we want to flatten the data with 
> one-hot encoding to generate columns and fill the matrix; get_dummies is 
> very useful in that scenario.
>  
> The objective here is to introduce get_dummies to pyspark.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29231) Constraints should be inferred from cast equality constraint

2020-01-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29231:

Summary: Constraints should be inferred from cast equality constraint  
(was: Can not infer an additional set of constraints if contains CAST)

> Constraints should be inferred from cast equality constraint
> 
>
> Key: SPARK-29231
> URL: https://issues.apache.org/jira/browse/SPARK-29231
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> scala> spark.sql("create table t1(c11 int, c12 decimal) ")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("create table t2(c21 bigint, c22 decimal) ")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1").explain
> == Physical Plan ==
> SortMergeJoin [cast(c11#0 as bigint)], [c21#2L], LeftOuter
> :- *(2) Sort [cast(c11#0 as bigint) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(cast(c11#0 as bigint), 200), true, [id=#30]
> : +- *(1) Filter (isnotnull(c11#0) AND (c11#0 = 1))
> :+- Scan hive default.t1 [c11#0, c12#1], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c11#0, 
> c12#1], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort [c21#2L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c21#2L, 200), true, [id=#37]
>   +- *(3) Filter isnotnull(c21#2L)
>  +- Scan hive default.t2 [c21#2L, c22#3], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c21#2L, 
> c22#3], Statistics(sizeInBytes=8.0 EiB)
> {code}
> PostgreSQL supports this feature:
> {code:sql}
> postgres=# create table t1(c11 int4, c12 decimal);
> CREATE TABLE
> postgres=# create table t2(c21 int8, c22 decimal);
> CREATE TABLE
> postgres=# explain select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1;
>QUERY PLAN
> 
>  Nested Loop Left Join  (cost=0.00..51.43 rows=36 width=76)
>Join Filter: (t1.c11 = t2.c21)
>->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
>  Filter: (c11 = 1)
>->  Materialize  (cost=0.00..25.03 rows=6 width=40)
>  ->  Seq Scan on t2  (cost=0.00..25.00 rows=6 width=40)
>Filter: (c21 = 1)
> (7 rows)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29572) add v1 read fallback API in DS v2

2020-01-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29572.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26231
[https://github.com/apache/spark/pull/26231]

> add v1 read fallback API in DS v2
> -
>
> Key: SPARK-29572
> URL: https://issues.apache.org/jira/browse/SPARK-29572
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30325) markPartitionCompleted cause task status inconsistent

2020-01-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30325:
--
Affects Version/s: 2.4.1
   2.4.2
   2.4.3

> markPartitionCompleted cause task status inconsistent
> -
>
> Key: SPARK-30325
> URL: https://issues.apache.org/jira/browse/SPARK-30325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: haiyangyu
>Assignee: haiyangyu
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
> Attachments: image-2019-12-21-17-11-38-565.png, 
> image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, 
> image-2019-12-21-17-17-42-244.png
>
>
> h3. Corner case
> The bug occurs in the following corner case:
>  # A stage hits fetchFailed while some of its tasks haven't finished, so the 
> scheduler resubmits a new stage as a retry with those unfinished tasks.
>  # An unfinished task of the original stage finishes while the same task on the 
> retry stage hasn't finished yet, which marks the task partition on the retry 
> stage as successful.  !image-2019-12-21-17-11-38-565.png|width=427,height=154!
>  # The executor running those 'successful' tasks crashes, which causes the 
> taskSetManager to run executorLost and reschedule the tasks on that executor. 
> This decrements copiesRunning twice, because those 'successful' tasks are not 
> finished, so the variable copiesRunning ends up at -1. 
> !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139!
>  # 'dequeueTaskFromList' uses copiesRunning equal to 0 as the basis for 
> rescheduling tasks; since it is now -1, the tasks can't be rescheduled, and the 
> app hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30325) markPartitionCompleted cause task status inconsistent

2020-01-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017758#comment-17017758
 ] 

Dongjoon Hyun commented on SPARK-30325:
---

Hi, All.
Since there is no UT, it's difficult to validate. According to the commit 
history, this bug seems to exist with SPARK-24755 and before it. Could you 
update `Affects Version/s:` more? For example, 2.3.4 and 2.2.3?

> markPartitionCompleted cause task status inconsistent
> -
>
> Key: SPARK-30325
> URL: https://issues.apache.org/jira/browse/SPARK-30325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: haiyangyu
>Assignee: haiyangyu
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
> Attachments: image-2019-12-21-17-11-38-565.png, 
> image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, 
> image-2019-12-21-17-17-42-244.png
>
>
> h3. Corner case
> The bug occurs in the following corner case:
>  # A stage hits fetchFailed while some of its tasks haven't finished, so the 
> scheduler resubmits a new stage as a retry with those unfinished tasks.
>  # An unfinished task of the original stage finishes while the same task on the 
> retry stage hasn't finished yet, which marks the task partition on the retry 
> stage as successful.  !image-2019-12-21-17-11-38-565.png|width=427,height=154!
>  # The executor running those 'successful' tasks crashes, which causes the 
> taskSetManager to run executorLost and reschedule the tasks on that executor. 
> This decrements copiesRunning twice, because those 'successful' tasks are not 
> finished, so the variable copiesRunning ends up at -1. 
> !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139!
>  # 'dequeueTaskFromList' uses copiesRunning equal to 0 as the basis for 
> rescheduling tasks; since it is now -1, the tasks can't be rescheduled, and the 
> app hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30325) markPartitionCompleted cause task status inconsistent

2020-01-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017758#comment-17017758
 ] 

Dongjoon Hyun edited comment on SPARK-30325 at 1/17/20 7:15 AM:


Hi, All.
Since there is no UT, it's difficult to validate. According to the commit 
history, this bug seems to exist with SPARK-24755 and before it. Could you 
update `Affects Version/s:` more if it does? For example, 2.3.4 and 2.2.3?


was (Author: dongjoon):
Hi, All.
Since there is no UT, it's difficult to validate. According to the commit 
history, this bug seems to exist with SPARK-24755 and before it. Could you 
update `Affects Version/s:` more? For example, 2.3.4 and 2.2.3?

> markPartitionCompleted cause task status inconsistent
> -
>
> Key: SPARK-30325
> URL: https://issues.apache.org/jira/browse/SPARK-30325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: haiyangyu
>Assignee: haiyangyu
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
> Attachments: image-2019-12-21-17-11-38-565.png, 
> image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, 
> image-2019-12-21-17-17-42-244.png
>
>
> h3. Corner case
> The bug occurs in the following corner case:
>  # A stage hits fetchFailed while some of its tasks haven't finished, so the 
> scheduler resubmits a new stage as a retry with those unfinished tasks.
>  # An unfinished task of the original stage finishes while the same task on the 
> retry stage hasn't finished yet, which marks the task partition on the retry 
> stage as successful.  !image-2019-12-21-17-11-38-565.png|width=427,height=154!
>  # The executor running those 'successful' tasks crashes, which causes the 
> taskSetManager to run executorLost and reschedule the tasks on that executor. 
> This decrements copiesRunning twice, because those 'successful' tasks are not 
> finished, so the variable copiesRunning ends up at -1. 
> !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139!
>  # 'dequeueTaskFromList' uses copiesRunning equal to 0 as the basis for 
> rescheduling tasks; since it is now -1, the tasks can't be rescheduled, and the 
> app hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30540) HistoryServer application link is incorrect when one application having multiple attempts

2020-01-16 Thread Junfan Zhang (Jira)
Junfan Zhang created SPARK-30540:


 Summary: HistoryServer application link is incorrect when one 
application having multiple attempts
 Key: SPARK-30540
 URL: https://issues.apache.org/jira/browse/SPARK-30540
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.4
Reporter: Junfan Zhang


Code is 
[here|https://github.com/apache/spark/blob/0a1e01c30e2c00d079c7a135a11a5664928c5c1c/core/src/main/resources/org/apache/spark/ui/static/historypage-template.html#L81].

When an application has multiple attempts, the application link is incorrect: no 
matter which attempt is clicked, the link still points to the largest attemptId.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-01-16 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-30541:


 Summary: Flaky test: 
org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
 Key: SPARK-30541
 URL: https://issues.apache.org/jira/browse/SPARK-30541
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


The test suite has been failing intermittently as of now:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]

 

org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it is 
a sbt.testing.SuiteSelector)
  
{noformat}
Error Details
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 3939 times over 
1.000122353532 minutes. Last failure message: KeeperErrorCode = AuthFailed 
for /brokers/ids.

Stack Trace
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 3939 times over 
1.000122353532 minutes. Last failure message: KeeperErrorCode = AuthFailed 
for /brokers/ids.
at 
org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
at 
org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
at 
org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: sbt.ForkMain$ForkError: 
org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
AuthFailed for /brokers/ids
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at 
kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
at 
kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
at 
org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
at 
org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
at 
org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
... 20 more
{noformat}
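
For context, the failure happens while KafkaTestUtils.setup waits for the brokers to 
register in ZooKeeper inside a ScalaTest eventually block. A rough sketch of that 
pattern (the timeout and interval values are assumptions, not the suite's actual 
settings):

{code:scala}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// `eventually` retries the block until it passes or the timeout elapses; the
// TestFailedDueToTimeoutException above means the last attempt still threw
// (here, ZooKeeper AuthFailed while listing /brokers/ids).
def waitForBrokers(listBrokerIds: () => Seq[Int]): Unit =
  eventually(timeout(1.minute), interval(10.milliseconds)) {
    assert(listBrokerIds().nonEmpty, "no Kafka brokers registered in ZooKeeper yet")
  }
{code}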



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-16 Thread Sivakumar (Jira)
Sivakumar created SPARK-30542:
-

 Summary: Two Spark structured streaming jobs cannot write to same 
base path
 Key: SPARK-30542
 URL: https://issues.apache.org/jira/browse/SPARK-30542
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Sivakumar


Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

Since a _spark_metadata directory is created by default by the first job, the second 
job cannot use the same directory as its base path: the _spark_metadata directory 
already created by the other job makes it throw an exception.

Is there any workaround for this other than creating separate base paths for 
both jobs?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.
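
For what it is worth, a commonly used workaround (sketched below with assumed, 
illustrative paths; not an official recommendation from this ticket) is to give each 
query its own output subdirectory and checkpoint under the shared base path, so the 
two _spark_metadata logs never collide, and to read both outputs back through the 
parent directory:

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("two-stream-writers").getOrCreate()

// Each query writes to its own subdirectory (and checkpoint), so each maintains
// its own _spark_metadata log. Paths are illustrative.
def startWriter(streamingDf: DataFrame, name: String) =
  streamingDf.writeStream
    .format("parquet")
    .option("checkpointLocation", s"/data/base/_checkpoints/$name")
    .option("path", s"/data/base/$name")
    .start()

// Batch consumers can still read everything under the shared base path, e.g.:
// spark.read.parquet("/data/base/job1", "/data/base/job2")
{code}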

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-16 Thread Sivakumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sivakumar updated SPARK-30542:
--
Description: 
Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

Since a _spark_metadata directory is created by default by the first job, the second 
job cannot use the same directory as its base path: the _spark_metadata directory 
already created by the other job makes it throw an exception.

Is there any workaround for this other than creating separate base paths for 
both jobs?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.

 

  was:
Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

Since a _spark_metadata directory is created by default by the first job, the second 
job cannot use the same directory as its base path: the _spark_metadata directory 
already created by the other job makes it throw an exception.

Is there any workaround for this other than creating separate base paths for 
both jobs?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.

 


> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> I have two structured streaming jobs which should write data to the same base 
> directory.
> Since a _spark_metadata directory is created by default by the first job, the 
> second job cannot use the same directory as its base path: the _spark_metadata 
> directory already created by the other job makes it throw an exception.
> Is there any workaround for this other than creating separate base paths for 
> both jobs?
> If I had to change the base path for both jobs, my whole framework would be 
> impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30543) RandomForest add Param bootstrap to control sampling method

2020-01-16 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30543:


 Summary: RandomForest add Param bootstrap to control sampling 
method
 Key: SPARK-30543
 URL: https://issues.apache.org/jira/browse/SPARK-30543
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Currently, RandomForest with numTrees=1 builds a single tree directly on the original 
dataset, while with numTrees>1 it builds each tree on a bootstrap sample.

This design lets a DecisionTreeModel be trained through the RandomForest 
implementation; however, it is somewhat surprising.

In scikit-learn, there is a param bootstrap to control whether bootstrap samples are 
used.
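
A sketch of the current behaviour and of the proposed knob (the setBootstrap setter 
below is hypothetical, mirroring scikit-learn's bootstrap flag; it is not part of the 
current API):

{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier

// Today: numTrees = 1 trains the single tree on the original dataset (no bootstrap),
// while numTrees > 1 trains each tree on a bootstrap sample of the input.
val singleTree = new RandomForestClassifier().setNumTrees(1)
val forest     = new RandomForestClassifier().setNumTrees(20)

// Proposed (hypothetical) param decoupling the sampling method from numTrees:
// new RandomForestClassifier().setNumTrees(20).setBootstrap(false)
{code}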



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29231) Constraints should be inferred from cast equality constraint

2020-01-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29231:

Comment: was deleted

(was: I'm working on.)

> Constraints should be inferred from cast equality constraint
> 
>
> Key: SPARK-29231
> URL: https://issues.apache.org/jira/browse/SPARK-29231
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> scala> spark.sql("create table t1(c11 int, c12 decimal) ")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("create table t2(c21 bigint, c22 decimal) ")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1").explain
> == Physical Plan ==
> SortMergeJoin [cast(c11#0 as bigint)], [c21#2L], LeftOuter
> :- *(2) Sort [cast(c11#0 as bigint) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(cast(c11#0 as bigint), 200), true, [id=#30]
> : +- *(1) Filter (isnotnull(c11#0) AND (c11#0 = 1))
> :+- Scan hive default.t1 [c11#0, c12#1], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c11#0, 
> c12#1], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort [c21#2L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c21#2L, 200), true, [id=#37]
>   +- *(3) Filter isnotnull(c21#2L)
>  +- Scan hive default.t2 [c21#2L, c22#3], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c21#2L, 
> c22#3], Statistics(sizeInBytes=8.0 EiB)
> {code}
> PostgreSQL supports this feature:
> {code:sql}
> postgres=# create table t1(c11 int4, c12 decimal);
> CREATE TABLE
> postgres=# create table t2(c21 int8, c22 decimal);
> CREATE TABLE
> postgres=# explain select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1;
>QUERY PLAN
> 
>  Nested Loop Left Join  (cost=0.00..51.43 rows=36 width=76)
>Join Filter: (t1.c11 = t2.c21)
>->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
>  Filter: (c11 = 1)
>->  Materialize  (cost=0.00..25.03 rows=6 width=40)
>  ->  Seq Scan on t2  (cost=0.00..25.00 rows=6 width=40)
>Filter: (c21 = 1)
> (7 rows)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-16 Thread Sivakumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sivakumar updated SPARK-30542:
--
Description: 
Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

Since a _spark_metadata directory is created by default by the first job, the second 
job cannot use the same directory as its base path: the _spark_metadata directory 
already created by the other job makes it throw an exception.

Is there any workaround for this other than creating separate base paths for 
both jobs?

Is it possible to create the _spark_metadata directory elsewhere, or to disable it 
without any data loss?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.

 

  was:
Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

Since a _spark_metadata directory is created by default by the first job, the second 
job cannot use the same directory as its base path: the _spark_metadata directory 
already created by the other job makes it throw an exception.

Is there any workaround for this other than creating separate base paths for 
both jobs?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.

 


> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> I have two structured streaming jobs which should write data to the same base 
> directory.
> Since a _spark_metadata directory is created by default by the first job, the 
> second job cannot use the same directory as its base path: the _spark_metadata 
> directory already created by the other job makes it throw an exception.
> Is there any workaround for this other than creating separate base paths for 
> both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to disable it 
> without any data loss?
> If I had to change the base path for both jobs, my whole framework would be 
> impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org