[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-07-15 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544825#comment-16544825
 ] 

Xiao Li commented on SPARK-24498:
-

[~kiszk] Based on my initial understanding, the code generated by the JDK 
compiler can be better optimized by the JIT in many cases. Is my understanding 
right?

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time to compile than Janino; in other cases, Janino is better. We should 
> support both for our runtime codegen. Janino will still be our default 
> runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696
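
For a rough sense of what the alternative backend would use, here is a minimal 
sketch of compiling a generated source string through the JDK's standard 
javax.tools API; the GeneratedIterator source is invented, and this is not 
Spark's actual integration point:
{code}
import java.nio.file.Files
import javax.tools.ToolProvider

// Stand-in for Spark's codegen output; the class is invented.
val source =
  """public class GeneratedIterator {
    |  public int next(int i) { return i + 1; }
    |}""".stripMargin

val dir = Files.createTempDirectory("codegen")
val srcFile = dir.resolve("GeneratedIterator.java")
Files.write(srcFile, source.getBytes("UTF-8"))

// Null on JRE-only runtimes, which is one deployment caveat for this feature.
val jdkCompiler = ToolProvider.getSystemJavaCompiler
val exitCode = jdkCompiler.run(null, null, null, srcFile.toString)
assert(exitCode == 0, s"JDK compilation failed with exit code $exitCode")
{code}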






[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-15 Thread James (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544820#comment-16544820
 ] 

James commented on SPARK-21097:
---

Hi, [~bradkaiser]

 

Could you please let me know what "row processing time delay" means in your 
benchmark?

I also want to know why, when you set the processing time to 0 us, dynamic 
allocation without recovery is much worse than static allocation.

 

Thanks

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> we are either prevented from de-allocating any of their executors, or we are 
> forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> Spark config. When an executor reaches its configured idle timeout, instead 
> of just killing it on the spot, we will stop sending it new tasks, replicate 
> all of its RDD blocks onto other executors, and then kill it. If there is an 
> issue while we replicate the data (an error, it takes too long, or there 
> isn't enough space), we will fall back to the original behavior: drop the 
> data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases.
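
To make the opt-in concrete: a minimal configuration sketch, where 
{{spark.dynamicAllocation.enabled}} and the idle timeout are real settings but 
the cache-recovery key is hypothetical and only illustrates this proposal:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notebook-session")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Hypothetical flag for this proposal; the final name may differ.
  .config("spark.dynamicAllocation.recoverCachedData", "true")
  .getOrCreate()

// With the flag on, an idle executor's cached blocks would be replicated to
// other executors before the executor is released.
val df = spark.range(100000000L).toDF("id").cache()
df.count()
{code}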






[jira] [Issue Comment Deleted] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-15 Thread James (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James updated SPARK-21097:
--
Comment: was deleted

(was: Hi, [~bradkaiser]

 

Could you please let me know what is the meaning of row processing time delay 
in your benchmark?

 

Thanks)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> we are either prevented from de-allocating any of their executors, or we are 
> forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> Spark config. When an executor reaches its configured idle timeout, instead 
> of just killing it on the spot, we will stop sending it new tasks, replicate 
> all of its RDD blocks onto other executors, and then kill it. If there is an 
> issue while we replicate the data (an error, it takes too long, or there 
> isn't enough space), we will fall back to the original behavior: drop the 
> data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases.






[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-15 Thread James (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544816#comment-16544816
 ] 

James commented on SPARK-21097:
---

Hi, [~bradkaiser]

 

Could you please let me know what "row processing time delay" means in your 
benchmark?

 

Thanks

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> we are either prevented from de-allocating any of their executors, or we are 
> forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> Spark config. When an executor reaches its configured idle timeout, instead 
> of just killing it on the spot, we will stop sending it new tasks, replicate 
> all of its RDD blocks onto other executors, and then kill it. If there is an 
> issue while we replicate the data (an error, it takes too long, or there 
> isn't enough space), we will fall back to the original behavior: drop the 
> data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases.






[jira] [Issue Comment Deleted] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-15 Thread donglin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

donglin updated SPARK-21097:

Comment: was deleted

(was: Hi, [~bradkaiser]

 

Could you please let me know what is the meaning of row processing time delay 
in your benchmark?

 

Thanks)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> we are either prevented from de-allocating any of their executors, or we are 
> forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> Spark config. When an executor reaches its configured idle timeout, instead 
> of just killing it on the spot, we will stop sending it new tasks, replicate 
> all of its RDD blocks onto other executors, and then kill it. If there is an 
> issue while we replicate the data (an error, it takes too long, or there 
> isn't enough space), we will fall back to the original behavior: drop the 
> data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases.






[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-15 Thread donglin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544815#comment-16544815
 ] 

donglin commented on SPARK-21097:
-

Hi, [~bradkaiser]

 

Could you please let me know what "row processing time delay" means in your 
benchmark?

 

Thanks

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> we are either prevented from de-allocating any of their executors, or we are 
> forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> Spark config. When an executor reaches its configured idle timeout, instead 
> of just killing it on the spot, we will stop sending it new tasks, replicate 
> all of its RDD blocks onto other executors, and then kill it. If there is an 
> issue while we replicate the data (an error, it takes too long, or there 
> isn't enough space), we will fall back to the original behavior: drop the 
> data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases.






[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Description: 
Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
have tested this feature with a big table (data size: 1.1 TB, row count: 
282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
SQL interface support {{repartitionByRange}} to improvement data pushdown. I 
have test this feather with a big table(data size: 1.1 T, row count: 
282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|


> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. 
> I have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
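
For reference, the Dataset API this issue proposes to expose through SQL 
already exists; a brief sketch of preparing range-partitioned data with it 
(paths and sizes are illustrative, and the eventual SQL syntax is not settled 
here):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("range-partition-demo").getOrCreate()

// Range-partition by id so output files are clustered on the filter column;
// min/max statistics can then skip most files for point lookups on id.
spark.range(1000000L).toDF("id")
  .repartitionByRange(200, col("id"))
  .write.mode("overwrite").parquet("/tmp/range_partitioned")
{code}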






[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Description: 
Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
have tested this feature with a big table (data size: 1.1 TB, row count: 
282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
SQL interface support {{repartitionByRange}} to improvement data pushdown .I 
have test this feather with a big table(data size: 1.1 T, row count: 
282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|


> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. 
> I have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|






[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Description: 
Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
have tested this feature with a big table (data size: 1.1 TB, row count: 
282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
I have test this feather with a big table(data size: 1.1 T, row count: 
282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|


> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. 
> I have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|






[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Description: 
Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
have tested this feature with a big table (data size: 1.1 TB, row count: 
282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
SQL interface support Improvement data pushdown by .I have test this feather 
with a big table(data size: 1.1 T, row count: 282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|


> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. 
> I have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|






[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Summary: SQL interface support repartitionByRange  (was: Improvement data 
pushdown by repartitionByRange)

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> I have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|






[jira] [Commented] (SPARK-24816) SQL interface support repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544813#comment-16544813
 ] 

Yuming Wang commented on SPARK-24816:
-

I'm working on it.

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> I have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|






[jira] [Created] (SPARK-24816) Improvement data pushdown by repartitionByRange

2018-07-15 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-24816:
---

 Summary: Improvement data pushdown by repartitionByRange
 Key: SPARK-24816
 URL: https://issues.apache.org/jira/browse/SPARK-24816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


I have tested this feature with a big table (data size: 1.1 TB, row count: 
282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|






[jira] [Resolved] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24813.
--
Resolution: Fixed

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.
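
The shape of the fallback is simple; a hedged sketch (the helper is invented, 
and the URL patterns are illustrative, not the suite's exact code):
{code}
import scala.util.Try

// Invented helper: true if a HEAD request to the URL returns HTTP 200.
def reachable(url: String): Boolean = Try {
  val conn = new java.net.URL(url).openConnection()
    .asInstanceOf[java.net.HttpURLConnection]
  conn.setRequestMethod("HEAD")
  conn.getResponseCode == 200
}.getOrElse(false)

val version = "2.2.2"
val path = s"spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
val candidates = Seq(
  s"https://www.apache.org/dyn/closer.lua/$path?action=download", // a mirror
  s"https://archive.apache.org/dist/$path" // permanent Apache archive
)
// Take the first location that responds; the archive keeps every release.
val usable = candidates.find(reachable)
{code}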






[jira] [Updated] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24813:
-
Fix Version/s: 2.2.3

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.






[jira] [Resolved] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled

2018-07-15 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24676.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Project required data from parsed data when csvColumnPruning disabled
> -
>
> Key: SPARK-24676
> URL: https://issues.apache.org/jira/browse/SPARK-24676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> I hit the bug below when parsing CSV data:
> {code}
> ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
> scala> val dir = "/tmp/spark-csv/csv"
> scala> spark.range(10).selectExpr("id % 2 AS p", 
> "id").write.mode("overwrite").partitionBy("p").csv(dir)
> scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
> 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
> ...
> {code}
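
Until the fix lands, the straightforward mitigation is simply not to disable 
pruning; a short sketch (the repro above turns it off explicitly, and the 
default is on):
{code}
// Keep CSV column pruning enabled (the default), so only the required
// columns are projected from the parsed rows.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "true")
spark.read.csv("/tmp/spark-csv/csv").selectExpr("sum(p)").collect()
{code}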






[jira] [Commented] (SPARK-23259) Clean up legacy code around hive external catalog

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544754#comment-16544754
 ] 

Apache Spark commented on SPARK-23259:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21780

> Clean up legacy code around hive external catalog
> -
>
> Key: SPARK-23259
> URL: https://issues.apache.org/jira/browse/SPARK-23259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Feng Liu
>Priority: Major
>
> Some legacy code around the Hive metastore catalog needs to be removed for 
> further code improvement:
>  # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the 
> private method `getRawTable`. 
>  # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the 
> `addJar` method, after the jar is added to the single class loader.
>  # in HiveClientImpl: There is some redundant code in both the `tableExists` 
> and `getTableOption` methods.






[jira] [Commented] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544734#comment-16544734
 ] 

Apache Spark commented on SPARK-24813:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/21779

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.






[jira] [Commented] (SPARK-20220) Add thrift scheduling pool config in scheduling docs

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544729#comment-16544729
 ] 

Apache Spark commented on SPARK-20220:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21778

> Add thrift scheduling pool config in scheduling docs
> 
>
> Key: SPARK-20220
> URL: https://issues.apache.org/jira/browse/SPARK-20220
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Miklos Christine
>Priority: Trivial
>
> Spark 1.2 docs document the thrift job scheduling pool. 
> https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
> This configuration is no longer documented in the 2.x documentation. 
> Adding this back to the job scheduling docs. 






[jira] [Updated] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24813:
-
Fix Version/s: 2.4.0
   2.3.2

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.






[jira] [Created] (SPARK-24815) Structured Streaming should support dynamic allocation

2018-07-15 Thread Karthik Palaniappan (JIRA)
Karthik Palaniappan created SPARK-24815:
---

 Summary: Structured Streaming should support dynamic allocation
 Key: SPARK-24815
 URL: https://issues.apache.org/jira/browse/SPARK-24815
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Structured Streaming
Affects Versions: 2.3.1
Reporter: Karthik Palaniappan


Dynamic allocation is very useful for adding and removing containers to match 
the actual workload. On multi-tenant clusters, it ensures that a Spark job is 
taking no more resources than necessary. In cloud environments, it enables 
autoscaling.

However, if you set spark.dynamicAllocation.enabled=true and run a structured 
streaming job, Core's dynamic allocation algorithm kicks in. It requests 
executors if the task backlog reaches a certain size, and removes executors if 
they have been idle for a certain period of time.

This does not make sense for streaming jobs, as outlined in 
https://issues.apache.org/jira/browse/SPARK-12133, which introduced dynamic 
allocation for the old streaming API.

First, Spark should print a warning if you run a structured streaming job when 
Core's dynamic allocation is enabled.

Second, structured streaming should have support for dynamic allocation. It 
would be convenient if it were the same set of properties as Core's dynamic 
allocation, but I don't have a strong opinion on that.

If somebody can give me pointers on how to add dynamic allocation support, I'd 
be happy to take a stab.
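
For reference, a minimal sketch of the combination described above: a 
structured streaming query launched with Core dynamic allocation turned on, 
which today silently gets the batch-oriented scaling policy:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-with-dynalloc")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // required by dyn. allocation
  .getOrCreate()

// Built-in rate source; executors will be scaled on micro-batch task backlog
// and idle time, not on any streaming-specific signal such as batch latency.
val counts = spark.readStream.format("rate").load()
  .groupBy("value").count()

counts.writeStream
  .format("console")
  .outputMode("complete")
  .start()
  .awaitTermination()
{code}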






[jira] [Created] (SPARK-24814) Relationship between catalog and datasources

2018-07-15 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-24814:
-

 Summary: Relationship between catalog and datasources
 Key: SPARK-24814
 URL: https://issues.apache.org/jira/browse/SPARK-24814
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Bruce Robbins


This is somewhat related, though not identical, to Ryan Blue's SPIP on 
datasources and catalogs.

Here are the requirements (IMO) for fully implementing V2 datasources and their 
relationships to catalogs:
 # The global catalog should be configurable (the default can be HMS, but it 
should be overridable).
 # The default catalog (or an explicitly specified catalog in a query, once 
multiple catalogs are supported) can determine the V2 datasource to use for 
reading and writing the data.
 # Conversely, a V2 datasource can determine which catalog to use for 
resolution (e.g., if the user issues 
{{spark.read.format("acmex").table("mytable")}}, the acmex datasource would 
decide which catalog to use for resolving “mytable”).
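
To make requirement 3 concrete, an illustrative sketch only (none of these 
names are committed Spark APIs; "acmex" is the hypothetical source from the 
example above):
{code}
import org.apache.spark.sql.types.StructType

// Invented shapes for illustration, not Spark interfaces.
trait Table {
  def name: String
  def schema: StructType
}

trait TableCatalog {
  // A V2 source such as "acmex" could delegate here when asked to resolve
  // and read "mytable".
  def loadTable(name: String): Table
}
{code}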






[jira] [Assigned] (SPARK-24498) Add JDK compiler for runtime codegen

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24498:


Assignee: Apache Spark

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time to compile than Janino; in other cases, Janino is better. We should 
> support both for our runtime codegen. Janino will still be our default 
> runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544714#comment-16544714
 ] 

Apache Spark commented on SPARK-24498:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21777

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time to compile than Janino; in other cases, Janino is better. We should 
> support both for our runtime codegen. Janino will still be our default 
> runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Assigned] (SPARK-24498) Add JDK compiler for runtime codegen

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24498:


Assignee: (was: Apache Spark)

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time to compile than Janino; in other cases, Janino is better. We should 
> support both for our runtime codegen. Janino will still be our default 
> runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-07-15 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544703#comment-16544703
 ] 

Bryan Cutler commented on SPARK-24632:
--

Hi [~josephkb], would you mind clarifying why there needs to be an additional 
trait in Scala to point to Python class paths, instead of something to override 
the line
{code:java}
stage_name = java_stage.getClass().getName().replace("org.apache.spark", 
"pyspark")
{code}
in wrapper.py? Ideally the Scala classes should not be aware of the Python 
side, and when loading, the Python estimators/models should be able to create 
the Java object and wrap it, as long as the line above has the correct class 
prefix. Thanks!

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a proposal in the 
> doc linked below.  Summary of proposal:
> Require that 3rd-party libraries with Java classes with Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> class MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.
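
For illustration, a hedged sketch of how a third-party stage might implement 
the proposed trait; the "acme" names are invented:
{code}
// Sketch only: mirrors the proposal above, not a committed Spark API.
trait PythonWrappable {
  // Fully qualified class path of the Python wrapper for this stage.
  def pythonClassPath: String
}

class AcmeTokenizer extends PythonWrappable {
  override def pythonClassPath: String = "acme.ml.AcmeTokenizer"
}
{code}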






[jira] [Commented] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544690#comment-16544690
 ] 

Apache Spark commented on SPARK-24813:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/21776

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.






[jira] [Assigned] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24813:


Assignee: Sean Owen  (was: Apache Spark)

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.






[jira] [Assigned] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24813:


Assignee: Apache Spark  (was: Sean Owen)

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Major
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.






[jira] [Created] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-15 Thread Sean Owen (JIRA)
Sean Owen created SPARK-24813:
-

 Summary: HiveExternalCatalogVersionsSuite still flaky; fall back 
to Apache archive
 Key: SPARK-24813
 URL: https://issues.apache.org/jira/browse/SPARK-24813
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.1, 2.2.2, 2.1.3
Reporter: Sean Owen
Assignee: Sean Owen


HiveExternalCatalogVersionsSuite is still failing periodically with errors from 
mirror sites. In fact, the test depends on the Spark versions it needs being 
available on the mirrors, but older versions will eventually be removed.

The test should fall back to downloading from archive.apache.org if mirrors 
don't have the Spark release, or aren't responding.

This has become urgent as I helpfully already purged many old Spark releases 
from mirrors, as requested by the ASF, before realizing it would probably make 
this test fail deterministically.






[jira] [Updated] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2018-07-15 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24624:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Can not mix vectorized and non-vectorized UDFs
> --
>
> Key: SPARK-24624
> URL: https://issues.apache.org/jira/browse/SPARK-24624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> In the current implementation, we have a limitation: users are unable to mix 
> vectorized and non-vectorized UDFs in the same Project. This becomes worse 
> since our optimizer can combine consecutive Projects into a single one. For 
> example, 
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 
> 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> This returns the following error:
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized 
> UDFs
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
>  ...
> {code}
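
In plan terms, the two chained withColumn calls above collapse (via the 
CollapseProject optimizer rule) into one projection that evaluates both UDF 
flavors, which is what ExtractPythonUDFs then rejects. A hypothetical sketch of 
the equivalent single projection, reusing the example's UDF names:

{code:python}
# Hypothetical illustration only: after CollapseProject merges the two
# consecutive Projects, the plan is equivalent to evaluating both UDFs
# in one projection, and this form fails with the same error.
applied_df = df.select(
    '*',
    my_regular_udf('total', 'qty').alias('regular'),
    my_pandas_udf('total', 'qty').alias('pandas'),
)
{code}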



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-15 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 
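
An untested workaround sketch for the repro above, under the assumption that 
the literal argument passed to randomize_label is what trips 
ExtractPythonUDFs: materializing the literal as a column first makes the UDF 
input a plain attribute.

{code:python}
# Untested workaround sketch (assumption: a UDF applied directly to a
# literal is the triggering shape). 'ratio' is a hypothetical column
# name; babydf and randomize_label come from the repro above.
from pyspark.sql.functions import col, lit

data_modified_label = (
    babydf
    .withColumn('ratio', lit(1 - 0.1))
    .withColumn('random_label', randomize_label(col('ratio')))
)
{code}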

[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-15 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544656#comment-16544656
 ] 

Li Jin commented on SPARK-24721:


I am currently traveling, but I will try to take a look when I get back.

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot

2018-07-15 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24796:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Support GROUPED_AGG_PANDAS_UDF in Pivot
> ---
>
> Key: SPARK-24796
> URL: https://issues.apache.org/jira/browse/SPARK-24796
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, GROUPED_AGG pandas UDFs are not supported in pivot. It would be 
> nice to support them. 
> {code}
> # create input dataframe
> from pyspark.sql import Row
> data = [
>   Row(id=123, total=200.0, qty=3, name='item1'),
>   Row(id=124, total=1500.0, qty=1, name='item2'),
>   Row(id=125, total=203.5, qty=2, name='item3'),
>   Row(id=126, total=200.0, qty=500, name='item1'),
> ]
> df = spark.createDataFrame(data)
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def pandas_avg(v):
>     return v.mean()
> from pyspark.sql.functions import col, sum
>   
> applied_df = 
> df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean'))
> applied_df.show()
> {code}
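
Until GROUPED_AGG pandas UDFs are supported in pivot, this particular example 
can be expressed with the builtin avg aggregate; a minimal sketch:

{code:python}
# Equivalent result for this example via the builtin aggregate, since
# pandas_avg only computes a mean; this avoids the unsupported
# GROUPED_AGG-in-pivot path.
from pyspark.sql.functions import avg

applied_df = df.groupby('id').pivot('name').agg(avg('total').alias('mean'))
applied_df.show()
{code}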



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot

2018-07-15 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544655#comment-16544655
 ] 

Li Jin commented on SPARK-24796:


Sorry, I am traveling now, but I will try to take a look when I get back.

> Support GROUPED_AGG_PANDAS_UDF in Pivot
> ---
>
> Key: SPARK-24796
> URL: https://issues.apache.org/jira/browse/SPARK-24796
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, GROUPED_AGG pandas UDFs are not supported in pivot. It would be 
> nice to support them. 
> {code}
> # create input dataframe
> from pyspark.sql import Row
> data = [
>   Row(id=123, total=200.0, qty=3, name='item1'),
>   Row(id=124, total=1500.0, qty=1, name='item2'),
>   Row(id=125, total=203.5, qty=2, name='item3'),
>   Row(id=126, total=200.0, qty=500, name='item1'),
> ]
> df = spark.createDataFrame(data)
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def pandas_avg(v):
>     return v.mean()
> from pyspark.sql.functions import col, sum
>   
> applied_df = 
> df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean'))
> applied_df.show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24812:


Assignee: Apache Spark

> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Assignee: Apache Spark
>Priority: Minor
>
> Last Access Time in the table description is not valid.
> Test steps:
> Step 1 - Create a table.
> Step 2 - Run the command "DESC FORMATTED table".
> Last Access Time is always displayed as the wrong date,
> Wed Dec 31 15:59:59 PST 1969.
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying 
> a wrong date.
> This seems to be a limitation as of now; it would be better to follow the 
> Hive behavior in this scenario.
>  
>  
>  
>  
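
For reference, the bogus date is consistent with an unset lastAccessTime of 
-1: formatted as a timestamp in the PST timezone, it renders just before the 
epoch as Wed Dec 31 15:59:59 PST 1969. A minimal PySpark repro sketch; the 
table name t1 and the 'Last Access' row label are assumptions:

{code:python}
# Minimal repro sketch; t1 is a hypothetical table name, and the
# 'Last Access' row label is assumed from DESC FORMATTED output.
spark.sql("CREATE TABLE t1 (id INT) USING parquet")
spark.sql("DESC FORMATTED t1") \
    .where("col_name = 'Last Access'") \
    .show(truncate=False)
{code}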



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544651#comment-16544651
 ] 

Apache Spark commented on SPARK-24812:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/21775

> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Priority: Minor
>
> Last Access Time in the table description is not valid.
> Test steps:
> Step 1 - Create a table.
> Step 2 - Run the command "DESC FORMATTED table".
> Last Access Time is always displayed as the wrong date,
> Wed Dec 31 15:59:59 PST 1969.
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying 
> a wrong date.
> This seems to be a limitation as of now; it would be better to follow the 
> Hive behavior in this scenario.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24812:


Assignee: (was: Apache Spark)

> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Priority: Minor
>
> Last Access Time in the table description is not valid.
> Test steps:
> Step 1 - Create a table.
> Step 2 - Run the command "DESC FORMATTED table".
> Last Access Time is always displayed as the wrong date,
> Wed Dec 31 15:59:59 PST 1969.
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying 
> a wrong date.
> This seems to be a limitation as of now; it would be better to follow the 
> Hive behavior in this scenario.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24769) Support for parsing AVRO binary column

2018-07-15 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-24769.

Resolution: Duplicate

> Support for parsing AVRO binary column
> --
>
> Key: SPARK-24769
> URL: https://issues.apache.org/jira/browse/SPARK-24769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and 
> converting it into its corresponding catalyst value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24770) Supporting to convert a column into binary of AVRO format

2018-07-15 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-24770:
---
Comment: was deleted

(was: The functions `from_avro` and `to_avro` can be added in one PR:
 # The code is similar to `from_json` and `to_json`.
 # Putting them together makes the unit test implementation easier.

So I decided to close this issue and use 
https://issues.apache.org/jira/browse/SPARK-24811 instead)

> Supporting to convert a column into binary of AVRO format
> -
>
> Key: SPARK-24770
> URL: https://issues.apache.org/jira/browse/SPARK-24770
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24770) Supporting to convert a column into binary of AVRO format

2018-07-15 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-24770.

Resolution: Duplicate

The functions `from_avro` and `to_avro` can be added in one PR:
 # The code is similar to `from_json` and `to_json`.
 # Putting them together makes the unit test implementation easier.

So I decided to close this issue and use 
https://issues.apache.org/jira/browse/SPARK-24811 instead.

> Supporting to convert a column into binary of AVRO format
> -
>
> Key: SPARK-24770
> URL: https://issues.apache.org/jira/browse/SPARK-24770
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24769) Support for parsing AVRO binary column

2018-07-15 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544640#comment-16544640
 ] 

Gengliang Wang commented on SPARK-24769:


[~felipesmmelo] Thank you, but I have already created a PR: 
https://github.com/apache/spark/pull/21774

> Support for parsing AVRO binary column
> --
>
> Key: SPARK-24769
> URL: https://issues.apache.org/jira/browse/SPARK-24769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and 
> converting it into its corresponding catalyst value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24770) Supporting to convert a column into binary of AVRO format

2018-07-15 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544639#comment-16544639
 ] 

Gengliang Wang commented on SPARK-24770:


[~felipesmmelo] Thank you, but I have already created a PR: 
https://github.com/apache/spark/pull/21774

> Supporting to convert a column into binary of AVRO format
> -
>
> Key: SPARK-24770
> URL: https://issues.apache.org/jira/browse/SPARK-24770
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24769) Support for parsing AVRO binary column

2018-07-15 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544636#comment-16544636
 ] 

Gengliang Wang commented on SPARK-24769:


The functions `from_avro` and `to_avro` can be added in one PR:
 # The code is similar to `from_json` and `to_json`.
 # Putting them together makes the unit test implementation easier.

So I decided to close this issue and use 
https://issues.apache.org/jira/browse/SPARK-24811 instead.

> Support for parsing AVRO binary column
> --
>
> Key: SPARK-24769
> URL: https://issues.apache.org/jira/browse/SPARK-24769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and 
> converting it into its corresponding catalyst value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24770) Supporting to convert a column into binary of AVRO format

2018-07-15 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544637#comment-16544637
 ] 

Gengliang Wang commented on SPARK-24770:


The functions `from_avro` and `to_avro` can be added in one PR:
 # The code is similar to `from_json` and `to_json`.
 # Putting them together makes the unit test implementation easier.

So I decided to close this issue and use 
https://issues.apache.org/jira/browse/SPARK-24811 instead.

> Supporting to convert a column into binary of AVRO format
> -
>
> Key: SPARK-24770
> URL: https://issues.apache.org/jira/browse/SPARK-24770
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-15 Thread Sujith (JIRA)
Sujith created SPARK-24812:
--

 Summary: Last Access Time in the table description is not valid
 Key: SPARK-24812
 URL: https://issues.apache.org/jira/browse/SPARK-24812
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.2.1
Reporter: Sujith


Last Access Time in the table description is not valid.

Test steps:

Step 1 - Create a table.

Step 2 - Run the command "DESC FORMATTED table".

Last Access Time is always displayed as the wrong date,

Wed Dec 31 15:59:59 PST 1969.

In Hive it is displayed as "UNKNOWN", which makes more sense than displaying a 
wrong date.

This seems to be a limitation as of now; it would be better to follow the Hive 
behavior in this scenario.

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24811) Add function `from_avro` and `to_avro`

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544634#comment-16544634
 ] 

Apache Spark commented on SPARK-24811:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21774

> Add function `from_avro` and `to_avro`
> --
>
> Key: SPARK-24811
> URL: https://issues.apache.org/jira/browse/SPARK-24811
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and 
> converting it into its corresponding catalyst value.
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24811) Add function `from_avro` and `to_avro`

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24811:


Assignee: (was: Apache Spark)

> Add function `from_avro` and `to_avro`
> --
>
> Key: SPARK-24811
> URL: https://issues.apache.org/jira/browse/SPARK-24811
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and 
> converting it into its corresponding catalyst value.
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24811) Add function `from_avro` and `to_avro`

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24811:


Assignee: Apache Spark

> Add function `from_avro` and `to_avro`
> --
>
> Key: SPARK-24811
> URL: https://issues.apache.org/jira/browse/SPARK-24811
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and 
> converting it into its corresponding catalyst value.
> Add a new function to_avro for converting a column into binary of avro format 
> with the specified schema.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24811) Add function `from_avro` and `to_avro`

2018-07-15 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-24811:
--

 Summary: Add function `from_avro` and `to_avro`
 Key: SPARK-24811
 URL: https://issues.apache.org/jira/browse/SPARK-24811
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


Add a new function from_avro for parsing a binary column of avro format and 
converting it into its corresponding catalyst value.

Add a new function to_avro for converting a column into binary of avro format 
with the specified schema.
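
A usage sketch of the proposed pair, assuming signatures analogous to 
from_json and to_json. The PySpark bindings shown here are an assumption (the 
functions ship with the external Avro module and were proposed for the Scala 
API first); df and avro_schema_json are placeholders.

{code:python}
# Hypothetical usage sketch; assumes a Spark build where from_avro and
# to_avro are exposed to Python, and that avro_schema_json holds the
# Avro schema as a JSON string.
from pyspark.sql.avro.functions import from_avro, to_avro
from pyspark.sql.functions import col, struct

binary_df = df.select(to_avro(struct(col("id"), col("name"))).alias("value"))
decoded_df = binary_df.select(
    from_avro(col("value"), avro_schema_json).alias("rec"))
{code}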

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24457) Performance improvement while converting stringToTimestamp in DateTimeUtils

2018-07-15 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24457:
--
Priority: Minor  (was: Major)

> Performance improvement while converting stringToTimestamp in DateTimeUtils
> ---
>
> Key: SPARK-24457
> URL: https://issues.apache.org/jira/browse/SPARK-24457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sharad Sonker
>Priority: Minor
>
> stringToTimestamp in DateTimeUtils creates a Calendar instance for each 
> input row even if the input timezone is the same. This can be improved by 
> caching the Calendar instance for each input timezone.
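
The actual change would live in Scala's DateTimeUtils, but the caching pattern 
itself is simple; a language-agnostic sketch in Python (names here are 
illustrative, not the actual patch):

{code:python}
# Sketch of the proposed pattern: build one timezone object per distinct
# timezone id instead of one per row. Not the real DateTimeUtils code.
from datetime import datetime
from zoneinfo import ZoneInfo

_tz_cache = {}

def parse_ts(s, tz_id):
    tz = _tz_cache.get(tz_id)
    if tz is None:
        tz = ZoneInfo(tz_id)  # constructed once per timezone, then reused
        _tz_cache[tz_id] = tz
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
{code}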



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544558#comment-16544558
 ] 

Apache Spark commented on SPARK-24810:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21773

> Fix paths to resource files in AvroSuite
> 
>
> Key: SPARK-24810
> URL: https://issues.apache.org/jira/browse/SPARK-24810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: Screen Shot 2018-07-15 at 15.28.13.png
>
>
> Currently, paths to test files from the resource folder are relative in 
> AvroSuite. This causes problems such as the inability to run tests from an 
> IDE. Test file paths need to be wrapped with:
> {code:scala}
> def testFile(fileName: String): String = {
> 
> Thread.currentThread().getContextClassLoader.getResource(fileName).toString
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24810:


Assignee: Apache Spark

> Fix paths to resource files in AvroSuite
> 
>
> Key: SPARK-24810
> URL: https://issues.apache.org/jira/browse/SPARK-24810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Screen Shot 2018-07-15 at 15.28.13.png
>
>
> Currently, paths to test files from the resource folder are relative in 
> AvroSuite. This causes problems such as the inability to run tests from an 
> IDE. Test file paths need to be wrapped with:
> {code:scala}
> def testFile(fileName: String): String = {
> 
> Thread.currentThread().getContextClassLoader.getResource(fileName).toString
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24810:


Assignee: (was: Apache Spark)

> Fix paths to resource files in AvroSuite
> 
>
> Key: SPARK-24810
> URL: https://issues.apache.org/jira/browse/SPARK-24810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: Screen Shot 2018-07-15 at 15.28.13.png
>
>
> Currently, paths to test files from the resource folder are relative in 
> AvroSuite. This causes problems such as the inability to run tests from an 
> IDE. Test file paths need to be wrapped with:
> {code:scala}
> def testFile(fileName: String): String = {
> 
> Thread.currentThread().getContextClassLoader.getResource(fileName).toString
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24810:
---
Attachment: Screen Shot 2018-07-15 at 15.28.13.png

> Fix paths to resource files in AvroSuite
> 
>
> Key: SPARK-24810
> URL: https://issues.apache.org/jira/browse/SPARK-24810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: Screen Shot 2018-07-15 at 15.28.13.png
>
>
> Currently, paths to test files from the resource folder are relative in 
> AvroSuite. This causes problems such as the inability to run tests from an 
> IDE. Test file paths need to be wrapped with:
> {code:scala}
> def testFile(fileName: String): String = {
> 
> Thread.currentThread().getContextClassLoader.getResource(fileName).toString
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24810:
--

 Summary: Fix paths to resource files in AvroSuite
 Key: SPARK-24810
 URL: https://issues.apache.org/jira/browse/SPARK-24810
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently, paths to test files from the resource folder are relative in 
AvroSuite. This causes problems such as the inability to run tests from an 
IDE. Test file paths need to be wrapped with:
{code:scala}
def testFile(fileName: String): String = {
Thread.currentThread().getContextClassLoader.getResource(fileName).toString
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24800) Refactor Avro Serializer and Deserializer

2018-07-15 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24800.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21762
[https://github.com/apache/spark/pull/21762]

> Refactor Avro Serializer and Deserializer
> -
>
> Key: SPARK-24800
> URL: https://issues.apache.org/jira/browse/SPARK-24800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24800) Refactor Avro Serializer and Deserializer

2018-07-15 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24800:
---

Assignee: Gengliang Wang

> Refactor Avro Serializer and Deserializer
> -
>
> Key: SPARK-24800
> URL: https://issues.apache.org/jira/browse/SPARK-24800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24809:


Assignee: Apache Spark

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Assignee: Apache Spark
>Priority: Critical
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If 
> the broadcast value is abnormally big, the executor will serialize it to 
> disk, and data will be lost during serialization.
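
For context, a broadcast hash join keyed on a long or int column is the shape 
that builds a LongHashedRelation; a toy sketch of that shape (the sizes here 
are far too small to force the disk spill where the reported loss occurs):

{code:python}
# Toy sketch of the plan shape only: a broadcast hash join on a long key
# builds a LongHashedRelation. The reported corruption needs a relation
# big enough to be serialized to disk, which these sizes will not trigger.
from pyspark.sql.functions import broadcast

big = spark.range(1000000)
small = spark.range(100).withColumnRenamed("id", "key")
joined = big.join(broadcast(small), big["id"] == small["key"])
joined.count()
{code}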



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544534#comment-16544534
 ] 

Apache Spark commented on SPARK-24809:
--

User 'liutang123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21772

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If 
> the broadcast value is abnormally big, the executor will serialize it to 
> disk, and data will be lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24809:


Assignee: (was: Apache Spark)

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If 
> the broadcast value is abnormally big, the executor will serialize it to 
> disk, and data will be lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24798) sortWithinPartitions(xx) will failed in java.lang.NullPointerException

2018-07-15 Thread Daniel Mateus Pires (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544528#comment-16544528
 ] 

Daniel Mateus Pires commented on SPARK-24798:
-

+1, it's solved by using "Option": constructing the field as 
Option(row.getAs[String]("name")) maps a null value to None, whereas 
Some(row.getAs[String]("name")) wraps it as Some(null).

> sortWithinPartitions(xx) will failed in java.lang.NullPointerException
> --
>
> Key: SPARK-24798
> URL: https://issues.apache.org/jira/browse/SPARK-24798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: shengyao piao
>Priority: Minor
>
> I hit an issue in Spark 2.3 when I run the code below in spark-shell or 
> spark-submit. 
> I already figured out that the reason for the error is that the name field 
> contains Some(null),
> but I believe this code runs successfully in Spark 2.2.
> Is this expected behavior in Spark 2.3?
>  
> ・Spark code
> {code}
> case class Hoge (id : Int,name : Option[String])
>  val ds = 
> spark.createDataFrame(Array((1,"John"),(2,null))).withColumnRenamed("_1", 
> "id").withColumnRenamed("_2", "name").map(row => 
> Hoge(row.getAs[Int]("id"), Some(row.getAs[String]("name"))))
>  
> ds.sortWithinPartitions("id").foreachPartition(iter => println(iter.isEmpty))
> {code}
> ・Error
> {code}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.isEmpty(Iterator.scala:330)
> at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1336)
> at 
> $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
> at 
> $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-15 Thread Lijia Liu (JIRA)
Lijia Liu created SPARK-24809:
-

 Summary: Serializing LongHashedRelation in executor may result in 
data error
 Key: SPARK-24809
 URL: https://issues.apache.org/jira/browse/SPARK-24809
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.0
 Environment: Spark 2.2.1

hadoop 2.7.1
Reporter: Lijia Liu


When the join key is a long or int in a broadcast join, Spark will use 
LongHashedRelation as the broadcast value (see SPARK-14419 for details). If the 
broadcast value is abnormally big, the executor will serialize it to disk, and 
data will be lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2018-07-15 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544453#comment-16544453
 ] 

Li Yuanjian commented on SPARK-24295:
-

Could you give more detailed information about how the compact file size grows 
up to 10 GB in your scenario? In the implementation of FileStreamSinkLog, the 
batches within compactInterval (default value 10) are merged into a single 
file, and all of the content in this file is serialized SinkFileStatus, so it 
seems it can hardly grow to 10 GB.

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after the defined compact interval.
> For long-running jobs, the compact file size can grow up to tens of GBs, 
> causing slowness while reading the data from the FileStreamSinkLog dir, as 
> Spark defaults to the "__spark__metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org