[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544825#comment-16544825 ] Xiao Li commented on SPARK-24498: - [~kiszk] Based on my initial understanding, the code generated by the JDK compiler can be better optimized by JIT in many cases. Is my understanding right? > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
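For readers unfamiliar with the JDK compiler mentioned in this issue, the following is a minimal sketch of compiling and loading a Java class at runtime through the standard `javax.tools` API. It is not taken from the issue or from any Spark patch: the object name `JdkCompilerSketch`, the helper `compileAndLoad`, and the sample class `GeneratedAdder` are hypothetical, and writing the source to a temporary directory is a simplification chosen for brevity.

```scala
import java.net.URLClassLoader
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import javax.tools.ToolProvider

object JdkCompilerSketch {
  /** Compiles one public Java class with the JDK compiler and loads it (illustrative helper). */
  def compileAndLoad(className: String, source: String): Class[_] = {
    // getSystemJavaCompiler returns null when no compiler is available (e.g. on a plain JRE).
    val compiler = Option(ToolProvider.getSystemJavaCompiler)
      .getOrElse(sys.error("No JDK compiler available; a full JDK is required"))
    val dir = Files.createTempDirectory("jdk-codegen")
    val file = dir.resolve(s"$className.java")
    Files.write(file, source.getBytes(StandardCharsets.UTF_8))
    // run returns 0 on success; null streams default to System.in/out/err.
    val status = compiler.run(null, null, null, "-d", dir.toString, file.toString)
    require(status == 0, s"compilation failed with status $status")
    // Load the freshly written .class file from the temporary directory.
    val loader = new URLClassLoader(Array(dir.toUri.toURL), getClass.getClassLoader)
    loader.loadClass(className)
  }

  def main(args: Array[String]): Unit = {
    val source =
      """public class GeneratedAdder implements java.util.function.IntBinaryOperator {
        |  public int applyAsInt(int a, int b) { return a + b; }
        |}""".stripMargin
    val adder = compileAndLoad("GeneratedAdder", source)
      .getDeclaredConstructor().newInstance()
      .asInstanceOf[java.util.function.IntBinaryOperator]
    println(adder.applyAsInt(2, 3))
  }
}
```

Whether bytecode produced this way is smaller or friendlier to the JIT than Janino's output is exactly the question raised in the comment above; the sketch only shows the compilation path, not a benchmark.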
[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544820#comment-16544820 ] James commented on SPARK-21097: --- Hi, [~bradkaiser] Could you please let me know what is the meaning of row processing time delay in your benchmark? I want to know why when you set the processing time to 0 us, dynamic allocation without recovery is much worse than static allocation? Thanks > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also since it will be completely opt-in it will > unlikely to cause problems for other use cases. 
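The decommissioning flow described in SPARK-21097 (stop sending tasks, replicate cached blocks, kill the executor, and fall back to dropping the data if replication fails) can be modelled roughly as below. This is a toy model under stated assumptions, not Spark code: every type and function name (`Block`, `Executor`, `decommission`, the `deadlineMs` parameter) is hypothetical, and real block replication is replaced by moving entries between in-memory collections.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

object PreserveCacheSketch {
  // Hypothetical model types; none of these names come from Spark.
  final case class Block(id: String, sizeBytes: Long)
  final case class Executor(id: String, var freeBytes: Long, var blocks: Vector[Block] = Vector.empty)

  sealed trait Outcome
  case object Replicated extends Outcome  // cached blocks were copied before the kill
  case object DroppedData extends Outcome // fallback: the original drop-and-kill behaviour

  /** Copy all cached blocks of `idle` onto `others`, or drop them if anything goes wrong. */
  def decommission(idle: Executor, others: Seq[Executor], deadlineMs: Long,
                   now: () => Long = () => System.currentTimeMillis()): Outcome = {
    // Plan every move first so that a failure leaves the surviving executors untouched.
    val planned = Try {
      val free = mutable.Map(others.map(e => e.id -> e.freeBytes): _*)
      idle.blocks.map { block =>
        if (now() > deadlineMs) throw new IllegalStateException("replication took too long")
        val target = others.find(e => free(e.id) >= block.sizeBytes)
          .getOrElse(throw new IllegalStateException(s"not enough space for ${block.id}"))
        free(target.id) -= block.sizeBytes
        target -> block
      }
    }
    val outcome: Outcome = planned match {
      case Success(moves) =>
        moves.foreach { case (target, block) =>
          target.blocks :+= block
          target.freeBytes -= block.sizeBytes
        }
        Replicated
      case Failure(_) => DroppedData // error, timeout, or insufficient space
    }
    idle.blocks = Vector.empty // the idle executor is killed in both cases
    outcome
  }
}
```

Planning before committing is a design choice of this sketch; the proposal itself does not say whether partially replicated data is kept when the fallback is triggered.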
[jira] [Issue Comment Deleted] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James updated SPARK-21097: -- Comment: was deleted (was: Hi, [~bradkaiser] Could you please let me know what is the meaning of row processing time delay in your benchmark? Thanks) > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also since it will be completely opt-in it will > unlikely to cause problems for other use cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544816#comment-16544816 ] James commented on SPARK-21097: --- Hi, [~bradkaiser] Could you please let me know what is the meaning of row processing time delay in your benchmark? Thanks > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also since it will be completely opt-in it will > unlikely to cause problems for other use cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] donglin updated SPARK-21097: Comment: was deleted (was: Hi, [~bradkaiser] Could you please let me know what is the meaning of row processing time delay in your benchmark? Thanks) > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also since it will be completely opt-in it will > unlikely to cause problems for other use cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544815#comment-16544815 ] donglin commented on SPARK-21097: - Hi, [~bradkaiser] Could you please let me know what is the meaning of row processing time delay in your benchmark? Thanks > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also since it will be completely opt-in it will > unlikely to cause problems for other use cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-24816:
    Description:
SQL interface support {{repartitionByRange}} to improvement data pushdown. I have test this feature with a big table(data size: 1.1 T, row count: 282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
SQL interface support {{repartitionByRange}} to improvement data pushdown. I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

> SQL interface support repartitionByRange
>
>                 Key: SPARK-24816
>                 URL: https://issues.apache.org/jira/browse/SPARK-24816
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> SQL interface support {{repartitionByRange}} to improvement data pushdown. I have test this feature with a big table(data size: 1.1 T, row count: 282,001,954,428) .
> The test sql is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
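One plausible reading of the table in SPARK-24816 is that range partitioning on `id` gives each output partition a narrow, non-overlapping id range, so a point filter such as `id=401564838907` can skip most of the data. The toy model below illustrates that idea with per-partition min/max statistics. It is an assumption-laden sketch, not Spark's implementation: the issue does not state the storage format or how the filter is pushed down, and all names here (`RangePruningSketch`, `byHash`, `byRange`, `rowsScanned`) are hypothetical.

```scala
object RangePruningSketch {
  /** Min/max summary of one partition, standing in for file-level statistics. */
  final case class Part(min: Long, max: Long, rows: Int)

  def stats(parts: Seq[Seq[Long]]): Seq[Part] =
    parts.filter(_.nonEmpty).map(p => Part(p.min, p.max, p.size))

  /** Hash-style layout: ids are scattered, so every partition spans nearly the whole id range. */
  def byHash(ids: Seq[Long], n: Int): Seq[Seq[Long]] =
    ids.groupBy(id => (id % n).toInt).values.toSeq

  /** Range layout: sort the ids, then cut them into n contiguous chunks. */
  def byRange(ids: Seq[Long], n: Int): Seq[Seq[Long]] = {
    val chunk = math.max(1, math.ceil(ids.size.toDouble / n).toInt)
    ids.sorted.grouped(chunk).toSeq
  }

  /** Rows read for `id = key` when partitions whose [min, max] excludes the key are skipped. */
  def rowsScanned(parts: Seq[Part], key: Long): Int =
    parts.filter(p => p.min <= key && key <= p.max).map(_.rows).sum

  def main(args: Array[String]): Unit = {
    val ids = 0L until 1000L // non-negative ids only, to keep the modulo simple
    println(s"hash layout:  ${rowsScanned(stats(byHash(ids, 10)), 401L)} rows scanned")
    println(s"range layout: ${rowsScanned(stats(byRange(ids, 10)), 401L)} rows scanned")
  }
}
```

In this model the hash layout cannot skip anything because every partition's range contains the key, while the range layout reads a single chunk. The real reduction in the table above (959.2 GB down to 38.5 GB of input) depends on data and format details that the issue does not describe.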
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Description: SQL interface support {{repartitionByRange}} to improvement data pushdown. I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| was: SQL interface support {{repartitionByRange}} to improvement data pushdown .I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > SQL interface support {{repartitionByRange}} to improvement data pushdown. 
I > have test this feather with a big table(data size: 1.1 T, row count: > 282,001,954,428) . > The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Description: SQL interface support Improvement data pushdown by .I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| was: I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > SQL interface support Improvement data pushdown by .I have test this feather > with a big table(data size: 1.1 T, row count: 282,001,954,428) . 
> The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Description: SQL interface support {{repartitionByRange}} to improvement data pushdown .I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| was: SQL interface support Improvement data pushdown by .I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > SQL interface support {{repartitionByRange}} to improvement data pushdown .I > have test this feather with a big table(data size: 1.1 T, row count: > 282,001,954,428) . 
> The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Summary: SQL interface support repartitionByRange (was: Improvement data pushdown by repartitionByRange) > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > I have test this feather with a big table(data size: 1.1 T, row count: > 282,001,954,428) . > The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544813#comment-16544813 ] Yuming Wang commented on SPARK-24816: - I'm working on. > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > I have test this feather with a big table(data size: 1.1 T, row count: > 282,001,954,428) . > The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24816) Improvement data pushdown by repartitionByRange
Yuming Wang created SPARK-24816:

             Summary: Improvement data pushdown by repartitionByRange
                 Key: SPARK-24816
                 URL: https://issues.apache.org/jira/browse/SPARK-24816
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Yuming Wang

I have test this feather with a big table(data size: 1.1 T, row count: 282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
[jira] [Resolved] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24813. -- Resolution: Fixed > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
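The fallback behaviour requested in SPARK-24813 (try the mirrors, then download from archive.apache.org) amounts to an ordered "first source that works" loop. Here is a minimal sketch of that pattern; it is not the code of `HiveExternalCatalogVersionsSuite`, and the names `MirrorFallbackSketch` and `firstSuccessful` as well as the mirror URL are hypothetical. The `fetch` function is passed in so that the example runs without network access.

```scala
import scala.util.{Failure, Try}

object MirrorFallbackSketch {
  /** Try each source in order; stop at the first success, otherwise return the last failure. */
  def firstSuccessful[A](sources: Seq[String])(fetch: String => A): Try[A] = {
    val nothingTried: Try[A] = Failure(new java.io.IOException("no download sources given"))
    sources.foldLeft(nothingTried) { (result, source) =>
      // Once a download has succeeded, later sources are not contacted.
      if (result.isSuccess) result else Try(fetch(source))
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical mirror first, the Apache archive last as the fallback of last resort.
    val sources = Seq("https://mirror.example.org/spark", "https://archive.apache.org/dist/spark")
    // Stand-in for a real download: the mirror "fails", the archive "succeeds".
    val result = firstSuccessful(sources) { base =>
      if (base.contains("archive.apache.org")) s"downloaded from $base"
      else throw new java.io.IOException(s"$base does not have the release")
    }
    println(result)
  }
}
```

Putting the archive last keeps the load on archive.apache.org low while still letting the test pass after old releases are removed from the mirrors.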
[jira] [Updated] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24813: - Fix Version/s: 2.2.3 > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled
[ https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24676.
       Resolution: Fixed
         Assignee: Takeshi Yamamuro
    Fix Version/s: 2.4.0

> Project required data from parsed data when csvColumnPruning disabled
>
>                 Key: SPARK-24676
>                 URL: https://issues.apache.org/jira/browse/SPARK-24676
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Minor
>             Fix For: 2.4.0
>
> I hit a bug below when parsing csv data;
> {code}
> ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
> scala> val dir = "/tmp/spark-csv/csv"
> scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
> scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
> 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
>         at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
>         at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
>         ...
> {code}
[jira] [Commented] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544754#comment-16544754 ] Apache Spark commented on SPARK-23259: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21780 > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Priority: Major > > Some legacy code around the hive metastore catalog need to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar being added to the single class loader. > # in HiveClientImpl: There are some redundant code in both the `tableExists` > and `getTableOption` method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544734#comment-16544734 ] Apache Spark commented on SPARK-24813: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21779 > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
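The fallback described in this issue (try the mirrors first, then archive.apache.org) can be sketched roughly as below. This is only an illustration in Python, not the code from the pull request: the function names, the injected `fetch` callable, the mirror URL layout, and the package file name are all assumptions; only the archive host comes from the issue text.

```python
from typing import Callable, Iterable, List, Tuple

# archive.apache.org is the fallback host named in the issue; the path layout
# and the package file name below are assumptions made for illustration.
ARCHIVE_BASE = "https://archive.apache.org/dist/spark"


def candidate_urls(version: str, mirrors: Iterable[str]) -> List[str]:
    """Return mirror URLs first, with the archive URL as the last resort."""
    rel_path = f"spark-{version}/spark-{version}-bin-hadoop2.7.tgz"
    urls = [f"{m.rstrip('/')}/spark/{rel_path}" for m in mirrors]
    urls.append(f"{ARCHIVE_BASE}/{rel_path}")
    return urls


def download_with_fallback(
    version: str,
    mirrors: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each candidate URL in order and return the first that succeeds.

    `fetch` is injected (hypothetical helper) so the logic can be exercised
    without a network; in real use it could wrap urllib.request.urlopen.
    """
    last_error = None
    for url in candidate_urls(version, mirrors):
        try:
            return url, fetch(url)
        except OSError as err:  # urllib's URLError and HTTPError subclass OSError
            last_error = err
    raise RuntimeError(f"could not download Spark {version}") from last_error
```

Putting the archive last keeps load off archive.apache.org while the release is still on the mirrors, and still lets the download succeed once a mirror has dropped an old release or stops responding.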
[jira] [Commented] (SPARK-20220) Add thrift scheduling pool config in scheduling docs
[ https://issues.apache.org/jira/browse/SPARK-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544729#comment-16544729 ] Apache Spark commented on SPARK-20220: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21778 > Add thrift scheduling pool config in scheduling docs > > > Key: SPARK-20220 > URL: https://issues.apache.org/jira/browse/SPARK-20220 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Miklos Christine >Priority: Trivial > > Spark 1.2 docs document the thrift job scheduling pool. > https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md > This configuration is no longer documented in the 2.x documentation. > Adding this back to the job scheduling docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24813: - Fix Version/s: 2.4.0 2.3.2 > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24815) Structured Streaming should support dynamic allocation
Karthik Palaniappan created SPARK-24815: --- Summary: Structured Streaming should support dynamic allocation Key: SPARK-24815 URL: https://issues.apache.org/jira/browse/SPARK-24815 Project: Spark Issue Type: Improvement Components: Scheduler, Structured Streaming Affects Versions: 2.3.1 Reporter: Karthik Palaniappan Dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling. However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, Core's dynamic allocation algorithm kicks in. It requests executors if the task backlog is a certain size, and removes executors if they are idle for a certain period of time. This does not make sense for streaming jobs, as outlined in https://issues.apache.org/jira/browse/SPARK-12133, which introduced dynamic allocation for the old streaming API. First, Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled. Second, structured streaming should have support for dynamic allocation. It would be convenient if it were the same set of properties as Core's dynamic allocation, but I don't have a strong opinion on that. If somebody can give me pointers on how to add dynamic allocation support, I'd be happy to take a stab at it.
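To make the behaviour described above concrete, here is a toy model in Python of the two rules (request executors on a task backlog, remove executors that have been idle). It is not Spark's implementation: `Executor`, `plan_allocation`, and the threshold values are hypothetical, and the two timeout parameters only loosely mirror the spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.executorIdleTimeout properties.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Executor:
    executor_id: str
    idle_seconds: float  # how long this executor has had no running task


def plan_allocation(
    pending_tasks: int,
    backlog_seconds: float,
    executors: List[Executor],
    backlog_timeout: float = 1.0,   # illustrative threshold, in seconds
    idle_timeout: float = 60.0,     # illustrative threshold, in seconds
) -> Tuple[bool, List[str]]:
    """Return (request_more, ids_to_remove) for one scheduling tick."""
    # Rule 1: ask for more executors once tasks have been waiting long enough.
    request_more = pending_tasks > 0 and backlog_seconds >= backlog_timeout
    # Rule 2: release every executor that has been idle past the timeout.
    to_remove = [e.executor_id for e in executors if e.idle_seconds >= idle_timeout]
    return request_more, to_remove
```

Under these assumed thresholds, a streaming query that fires a micro-batch every 10 seconds would never leave an executor idle for 60 seconds, so `to_remove` would stay empty even when each batch is tiny. That is one way to read the complaint that the batch-oriented rules do not fit streaming jobs.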
[jira] [Created] (SPARK-24814) Relationship between catalog and datasources
Bruce Robbins created SPARK-24814: - Summary: Relationship between catalog and datasources Key: SPARK-24814 URL: https://issues.apache.org/jira/browse/SPARK-24814 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Bruce Robbins This is somewhat related, though not identical to, Ryan Blue's SPIP on datasources and catalogs. Here are the requirements (IMO) for fully implementing V2 datasources and their relationships to catalogs: # The global catalog should be configurable (the default can be HMS, but it should be overridable). # The default catalog (or an explicitly specified catalog in a query, once multiple catalogs are supported) can determine the V2 datasource to use for reading and writing the data. # Conversely, a V2 datasource can determine which catalog to use for resolution (e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would decide which catalog to use for resolving “mytable”). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24498: Assignee: Apache Spark > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544714#comment-16544714 ] Apache Spark commented on SPARK-24498: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21777 > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24498: Assignee: (was: Apache Spark) > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544703#comment-16544703 ] Bryan Cutler commented on SPARK-24632: -- Hi [~josephkb], would you mind clarifying why there needs to be an additional trait in Scala to point to Python class paths, instead of something to override the line {code:java} stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark") {code} in wrapper.py? Ideally the Scala classes should not be aware of Python, and when loading, the Python estimators/models should be able to create the Java object and wrap it as long as the line above has the correct class prefix. Thanks! > Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. This is currently > possible, except that users hit bugs around persistence. > I spent a bit of time thinking about this and wrote up thoughts and a proposal in the > doc linked below. Summary of proposal: > Require that 3rd-party libraries with Java classes with Python wrappers > implement a trait which provides the corresponding Python classpath in some > field: > {code} > trait PythonWrappable { > def pythonClassPath: String = … > } > MyJavaType extends PythonWrappable > {code} > This will not be required for MLlib wrappers, which we can handle specially. 
> One issue for this task will be that we may have trouble writing unit tests. > They would ideally test a Java class + Python wrapper class pair sitting > outside of pyspark.
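The alternative raised in the comment (overriding the prefix substitution rather than adding a Scala trait) might look something like the sketch below. It is a guess at the shape of such a hook, not pyspark code: `python_class_path`, `DEFAULT_PREFIXES`, and the `com.acme.ml` / `acme_ml` names are hypothetical, and unlike the quoted `replace(...)` call it only rewrites a leading package prefix.

```python
from typing import Dict, Optional

# Default mapping taken from the line quoted in the comment:
# Java package "org.apache.spark" corresponds to Python package "pyspark".
DEFAULT_PREFIXES = {"org.apache.spark": "pyspark"}


def python_class_path(
    java_class_name: str,
    prefixes: Optional[Dict[str, str]] = None,
) -> str:
    """Map a fully qualified Java class name to a Python class path.

    `prefixes` lets a 3rd-party library register its own Java-to-Python
    package mapping (a hypothetical extension point).
    """
    mapping = dict(DEFAULT_PREFIXES)
    mapping.update(prefixes or {})
    # Check longer prefixes first so a more specific override wins.
    for java_prefix in sorted(mapping, key=len, reverse=True):
        if java_class_name == java_prefix or java_class_name.startswith(java_prefix + "."):
            return mapping[java_prefix] + java_class_name[len(java_prefix):]
    raise ValueError(f"no Python package registered for {java_class_name}")
```

With such a mapping kept on the Python side, the Java or Scala classes would not need to know anything about their wrappers, which is the property the comment asks for. Whether that covers the persistence bugs mentioned in the issue is not something this sketch can show.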
[jira] [Commented] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544690#comment-16544690 ] Apache Spark commented on SPARK-24813: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21776 > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24813: Assignee: Sean Owen (was: Apache Spark) > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24813: Assignee: Apache Spark (was: Sean Owen) > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
Sean Owen created SPARK-24813: - Summary: HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive Key: SPARK-24813 URL: https://issues.apache.org/jira/browse/SPARK-24813 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.1, 2.2.2, 2.1.3 Reporter: Sean Owen Assignee: Sean Owen HiveExternalCatalogVersionsSuite is still failing periodically with errors from mirror sites. In fact, the test depends on the Spark versions it needs being available on the mirrors, but older versions will eventually be removed. The test should fall back to downloading from archive.apache.org if mirrors don't have the Spark release, or aren't responding. This has become urgent as I helpfully already purged many old Spark releases from mirrors, as requested by the ASF, before realizing it would probably make this test fail deterministically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs
[ https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-24624: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-22216 > Can not mix vectorized and non-vectorized UDFs > -- > > Key: SPARK-24624 > URL: https://issues.apache.org/jira/browse/SPARK-24624 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > In the current impl, we have the limitation: users are unable to mix > vectorized and non-vectorized UDFs in same Project. This becomes worse since > our optimizer could combine continuous Projects into a single one. For > example, > {code} > applied_df = df.withColumn('regular', my_regular_udf('total', > 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty')) > {code} > Returns the following error. > {code} > IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs > java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized > UDFs > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118) > at > 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113) > at > 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312) > at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-24721: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-22216 > Failed to call PythonUDF whose input is the output of another PythonUDF > --- > > Key: SPARK-24721 > URL: https://issues.apache.org/jira/browse/SPARK-24721 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > import random > from pyspark.sql.functions import * > from pyspark.sql.types import * > def random_probability(label): > if label == 1.0: > return random.uniform(0.5, 1.0) > else: > return random.uniform(0.0, 0.4999) > def randomize_label(ratio): > > if random.random() >= ratio: > return 1.0 > else: > return 0.0 > random_probability = udf(random_probability, DoubleType()) > randomize_label = udf(randomize_label, DoubleType()) > spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3") > babydf = spark.read.csv("/tmp/tab3") > data_modified_label = babydf.withColumn( > 'random_label', randomize_label(lit(1 - 0.1)) > ) > data_modified_random = data_modified_label.withColumn( > 'random_probability', > random_probability(col('random_label')) > ) > data_modified_label.filter(col('random_label') == 0).show() > {code} > The above code will generate the following exception: > {code} > Py4JJavaError: An error occurred while calling o446.showString. > : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), > requires attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at >
[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544656#comment-16544656 ] Li Jin commented on SPARK-24721: I am currently traveling but will try to take a look when I get back > Failed to call PythonUDF whose input is the output of another PythonUDF > --- > > Key: SPARK-24721 > URL: https://issues.apache.org/jira/browse/SPARK-24721 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > import random > from pyspark.sql.functions import * > from pyspark.sql.types import * > def random_probability(label): > if label == 1.0: > return random.uniform(0.5, 1.0) > else: > return random.uniform(0.0, 0.4999) > def randomize_label(ratio): > > if random.random() >= ratio: > return 1.0 > else: > return 0.0 > random_probability = udf(random_probability, DoubleType()) > randomize_label = udf(randomize_label, DoubleType()) > spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3") > babydf = spark.read.csv("/tmp/tab3") > data_modified_label = babydf.withColumn( > 'random_label', randomize_label(lit(1 - 0.1)) > ) > data_modified_random = data_modified_label.withColumn( > 'random_probability', > random_probability(col('random_label')) > ) > data_modified_label.filter(col('random_label') == 0).show() > {code} > The above code will generate the following exception: > {code} > Py4JJavaError: An error occurred while calling o446.showString. > : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), > requires attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at >
[jira] [Updated] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Jin updated SPARK-24796:
---
    Issue Type: Sub-task  (was: Improvement)
    Parent: SPARK-22216

> Support GROUPED_AGG_PANDAS_UDF in Pivot
> ---
>
> Key: SPARK-24796
> URL: https://issues.apache.org/jira/browse/SPARK-24796
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Priority: Major
>
> Currently, Grouped AGG PandasUDF is not supported in Pivot. It is nice to
> support it.
> {code}
> # create input dataframe
> from pyspark.sql import Row
> data = [
>     Row(id=123, total=200.0, qty=3, name='item1'),
>     Row(id=124, total=1500.0, qty=1, name='item2'),
>     Row(id=125, total=203.5, qty=2, name='item3'),
>     Row(id=126, total=200.0, qty=500, name='item1'),
> ]
> df = spark.createDataFrame(data)
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def pandas_avg(v):
>     return v.mean()
> from pyspark.sql.functions import col, sum
>
> applied_df = df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean'))
> applied_df.show()
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
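For reference, the table that a group-by, pivot, and mean over the sample rows in SPARK-24796 would be expected to contain can be sketched without Spark. This is plain Python, not the PySpark implementation: `pivot_mean` is a hypothetical helper introduced only for illustration, and the rows are the issue's sample data rewritten as dicts.

```python
from collections import defaultdict
from statistics import mean

# Same sample rows as in the issue description, as plain dicts.
rows = [
    {"id": 123, "total": 200.0, "qty": 3, "name": "item1"},
    {"id": 124, "total": 1500.0, "qty": 1, "name": "item2"},
    {"id": 125, "total": 203.5, "qty": 2, "name": "item3"},
    {"id": 126, "total": 200.0, "qty": 500, "name": "item1"},
]


def pivot_mean(rows, group_key, pivot_key, value_key):
    """Hypothetical helper: group by group_key, pivot on pivot_key, average value_key."""
    pivot_values = sorted({r[pivot_key] for r in rows})
    cells = defaultdict(list)
    for r in rows:
        cells[(r[group_key], r[pivot_key])].append(r[value_key])
    result = {}
    for group in sorted({r[group_key] for r in rows}):
        # A (group, pivot value) pair with no rows becomes None, like a null cell.
        result[group] = {
            p: (mean(cells[(group, p)]) if (group, p) in cells else None)
            for p in pivot_values
        }
    return result


table = pivot_mean(rows, "id", "name", "total")
for group, cols in table.items():
    print(group, cols)
```

Each `id` becomes one row and each distinct `name` becomes one column; cells with no matching input rows are left empty.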
[jira] [Commented] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544655#comment-16544655 ] Li Jin commented on SPARK-24796: Sorry I am traveling now but I will try to take a look when I get back > Support GROUPED_AGG_PANDAS_UDF in Pivot > --- > > Key: SPARK-24796 > URL: https://issues.apache.org/jira/browse/SPARK-24796 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > Currently, Grouped AGG PandasUDF is not supported in Pivot. It is nice to > support it. > {code} > # create input dataframe > from pyspark.sql import Row > data = [ > Row(id=123, total=200.0, qty=3, name='item1'), > Row(id=124, total=1500.0, qty=1, name='item2'), > Row(id=125, total=203.5, qty=2, name='item3'), > Row(id=126, total=200.0, qty=500, name='item1'), > ] > df = spark.createDataFrame(data) > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def pandas_avg(v): >return v.mean() > from pyspark.sql.functions import col, sum > > applied_df = > df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean')) > applied_df.show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24812: Assignee: Apache Spark > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Assignee: Apache Spark >Priority: Minor > > Last Access Time in the table description is not valid, > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time will always displayed wrong date > Wed Dec 31 15:59:59 PST 1969 - which is wrong. > In hive its displayed as "UNKNOWN" which makes more sense than displaying > wrong date. > Seems to be a limitation as of now, better we can follow the hive behavior in > this scenario. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544651#comment-16544651 ] Apache Spark commented on SPARK-24812: -- User 'sujith71955' has created a pull request for this issue: https://github.com/apache/spark/pull/21775 > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Priority: Minor > > Last Access Time in the table description is not valid, > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time will always displayed wrong date > Wed Dec 31 15:59:59 PST 1969 - which is wrong. > In hive its displayed as "UNKNOWN" which makes more sense than displaying > wrong date. > Seems to be a limitation as of now, better we can follow the hive behavior in > this scenario. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24812: Assignee: (was: Apache Spark) > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Priority: Minor > > Last Access Time in the table description is not valid, > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time will always displayed wrong date > Wed Dec 31 15:59:59 PST 1969 - which is wrong. > In hive its displayed as "UNKNOWN" which makes more sense than displaying > wrong date. > Seems to be a limitation as of now, better we can follow the hive behavior in > this scenario. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24769) Support for parsing AVRO binary column
[ https://issues.apache.org/jira/browse/SPARK-24769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-24769. Resolution: Duplicate > Support for parsing AVRO binary column > -- > > Key: SPARK-24769 > URL: https://issues.apache.org/jira/browse/SPARK-24769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24770) Supporting to convert a column into binary of AVRO format
[ https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-24770: --- Comment: was deleted (was: The function `from_avro` and `to_avro` can be added in one PR: # The code is similar as `from_json` and `to_json` # Putting them together makes the unit test implementation easier. So I decide to close this issue and use https://issues.apache.org/jira/browse/SPARK-24811 instead) > Supporting to convert a column into binary of AVRO format > - > > Key: SPARK-24770 > URL: https://issues.apache.org/jira/browse/SPARK-24770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24770) Supporting to convert a column into binary of AVRO format
[ https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-24770. Resolution: Duplicate The function `from_avro` and `to_avro` can be added in one PR: # The code is similar as `from_json` and `to_json` # Putting them together makes the unit test implementation easier. So I decide to close this issue and use https://issues.apache.org/jira/browse/SPARK-24811 instead > Supporting to convert a column into binary of AVRO format > - > > Key: SPARK-24770 > URL: https://issues.apache.org/jira/browse/SPARK-24770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24769) Support for parsing AVRO binary column
[ https://issues.apache.org/jira/browse/SPARK-24769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544640#comment-16544640 ] Gengliang Wang commented on SPARK-24769: [~felipesmmelo] Thank you. But I have created a PR: https://github.com/apache/spark/pull/21774 > Support for parsing AVRO binary column > -- > > Key: SPARK-24769 > URL: https://issues.apache.org/jira/browse/SPARK-24769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24770) Supporting to convert a column into binary of AVRO format
[ https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544639#comment-16544639 ] Gengliang Wang commented on SPARK-24770: [~felipesmmelo] Thank you. But I have created a PR: https://github.com/apache/spark/pull/21774 > Supporting to convert a column into binary of AVRO format > - > > Key: SPARK-24770 > URL: https://issues.apache.org/jira/browse/SPARK-24770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24769) Support for parsing AVRO binary column
[ https://issues.apache.org/jira/browse/SPARK-24769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544636#comment-16544636 ]

Gengliang Wang commented on SPARK-24769:

The functions `from_avro` and `to_avro` can be added in one PR:
# The code is similar to that of `from_json` and `to_json`.
# Putting them together makes the unit test implementation easier.

So I have decided to close this issue and use https://issues.apache.org/jira/browse/SPARK-24811 instead.

> Support for parsing AVRO binary column
> --
>
> Key: SPARK-24769
> URL: https://issues.apache.org/jira/browse/SPARK-24769
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> Add a new function from_avro for parsing a binary column of avro format and
> converting it into its corresponding catalyst value.
[jira] [Commented] (SPARK-24770) Supporting to convert a column into binary of AVRO format
[ https://issues.apache.org/jira/browse/SPARK-24770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544637#comment-16544637 ]

Gengliang Wang commented on SPARK-24770:

The functions `from_avro` and `to_avro` can be added in one PR:
# The code is similar to that of `from_json` and `to_json`.
# Putting them together makes the unit test implementation easier.

So I have decided to close this issue and use https://issues.apache.org/jira/browse/SPARK-24811 instead.

> Supporting to convert a column into binary of AVRO format
> -
>
> Key: SPARK-24770
> URL: https://issues.apache.org/jira/browse/SPARK-24770
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> Add a new function to_avro for converting a column into binary of avro format
> with the specified schema.
[jira] [Created] (SPARK-24812) Last Access Time in the table description is not valid
Sujith created SPARK-24812:
--

 Summary: Last Access Time in the table description is not valid
 Key: SPARK-24812
 URL: https://issues.apache.org/jira/browse/SPARK-24812
 Project: Spark
 Issue Type: Bug
 Components: SQL
 Affects Versions: 2.3.1, 2.2.1
 Reporter: Sujith

The Last Access Time shown in the table description is not valid.

Test steps:
Step 1 - Create a table.
Step 2 - Run the command "DESC FORMATTED table".

Last Access Time always displays the date Wed Dec 31 15:59:59 PST 1969, which is wrong. Hive displays "UNKNOWN" in this case, which makes more sense than displaying a wrong date. This seems to be a limitation as of now; it would be better to follow the Hive behavior in this scenario.
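The date in the SPARK-24812 report is what an unset timestamp looks like once it is formatted: a moment just before the Unix epoch falls at 15:59:59 on Dec 31, 1969 in PST. A minimal sketch, assuming the unset value is stored as -1 milliseconds (the issue does not say which sentinel is used); `naive_format` and `format_last_access` are hypothetical helpers, not Spark or Hive functions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
PST = timezone(timedelta(hours=-8), "PST")  # fixed offset, enough for this example


def naive_format(millis):
    """Format epoch milliseconds as a date, with no check for an unset value."""
    moment = (EPOCH + timedelta(milliseconds=millis)).astimezone(PST)
    return moment.strftime("%a %b %d %H:%M:%S %Z %Y")


def format_last_access(millis):
    """Hypothetical helper: report UNKNOWN for non-positive (assumed unset) values."""
    return "UNKNOWN" if millis <= 0 else naive_format(millis)


print(naive_format(-1))
print(format_last_access(-1))
```

Under that assumption the first call reproduces the date string from the report, and the second shows the Hive-style "UNKNOWN" that the issue proposes instead.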
[jira] [Commented] (SPARK-24811) Add function `from_avro` and `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544634#comment-16544634 ] Apache Spark commented on SPARK-24811: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21774 > Add function `from_avro` and `to_avro` > -- > > Key: SPARK-24811 > URL: https://issues.apache.org/jira/browse/SPARK-24811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24811) Add function `from_avro` and `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24811: Assignee: (was: Apache Spark) > Add function `from_avro` and `to_avro` > -- > > Key: SPARK-24811 > URL: https://issues.apache.org/jira/browse/SPARK-24811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24811) Add function `from_avro` and `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24811: Assignee: Apache Spark > Add function `from_avro` and `to_avro` > -- > > Key: SPARK-24811 > URL: https://issues.apache.org/jira/browse/SPARK-24811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24811) Add function `from_avro` and `to_avro`
Gengliang Wang created SPARK-24811: -- Summary: Add function `from_avro` and `to_avro` Key: SPARK-24811 URL: https://issues.apache.org/jira/browse/SPARK-24811 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang Add a new function from_avro for parsing a binary column of avro format and converting it into its corresponding catalyst value. Add a new function to_avro for converting a column into binary of avro format with the specified schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24457) Performance improvement while converting stringToTimestamp in DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-24457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24457: -- Priority: Minor (was: Major) > Performance improvement while converting stringToTimestamp in DateTimeUtils > --- > > Key: SPARK-24457 > URL: https://issues.apache.org/jira/browse/SPARK-24457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sharad Sonker >Priority: Minor > > stringToTimestamp in DateTimeUtils creates Calendar instance for each input > row even if the input timezone is same. This can be improved by caching the > calendar instance for each input timezone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
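The caching idea in SPARK-24457 can be sketched in a few lines. This is a Python illustration of the general technique only, not the DateTimeUtils code: the fixed-offset timezone stands in for the per-timezone Calendar, and `tz_for_offset`, `string_to_timestamp`, and the `stats` counter are hypothetical names. In Java, `Calendar` is mutable, so a real cache would likely need to be per thread.

```python
from datetime import datetime, timedelta, timezone

_tz_cache = {}                 # offset in hours -> timezone object
stats = {"constructions": 0}   # counts how often a timezone object is actually built


def tz_for_offset(offset_hours):
    """Return a cached timezone for the offset, building it on first use only."""
    tz = _tz_cache.get(offset_hours)
    if tz is None:
        stats["constructions"] += 1
        tz = timezone(timedelta(hours=offset_hours))
        _tz_cache[offset_hours] = tz
    return tz


def string_to_timestamp(text, offset_hours):
    """Parse 'YYYY-MM-DD HH:MM:SS' and attach the cached timezone."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz_for_offset(offset_hours))


rows = ["2018-07-15 10:00:00", "2018-07-15 10:00:01", "2018-07-15 10:00:02"]
parsed = [string_to_timestamp(r, -8) for r in rows]
print(stats["constructions"])
```

All three rows share one timezone object, so the construction cost is paid once per distinct timezone rather than once per row.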
[jira] [Commented] (SPARK-24810) Fix paths to resource files in AvroSuite
[ https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544558#comment-16544558 ] Apache Spark commented on SPARK-24810: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21773 > Fix paths to resource files in AvroSuite > > > Key: SPARK-24810 > URL: https://issues.apache.org/jira/browse/SPARK-24810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > Attachments: Screen Shot 2018-07-15 at 15.28.13.png > > > Currently paths to tests files from resource folder are relative in > AvroSuite. It causes problems like impossibility for running tests from IDE. > Need to wrap test files by: > {code:scala} > def testFile(fileName: String): String = { > > Thread.currentThread().getContextClassLoader.getResource(fileName).toString > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24810) Fix paths to resource files in AvroSuite
[ https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24810: Assignee: Apache Spark > Fix paths to resource files in AvroSuite > > > Key: SPARK-24810 > URL: https://issues.apache.org/jira/browse/SPARK-24810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > Attachments: Screen Shot 2018-07-15 at 15.28.13.png > > > Currently paths to tests files from resource folder are relative in > AvroSuite. It causes problems like impossibility for running tests from IDE. > Need to wrap test files by: > {code:scala} > def testFile(fileName: String): String = { > > Thread.currentThread().getContextClassLoader.getResource(fileName).toString > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24810) Fix paths to resource files in AvroSuite
[ https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24810: Assignee: (was: Apache Spark) > Fix paths to resource files in AvroSuite > > > Key: SPARK-24810 > URL: https://issues.apache.org/jira/browse/SPARK-24810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > Attachments: Screen Shot 2018-07-15 at 15.28.13.png > > > Currently paths to tests files from resource folder are relative in > AvroSuite. It causes problems like impossibility for running tests from IDE. > Need to wrap test files by: > {code:scala} > def testFile(fileName: String): String = { > > Thread.currentThread().getContextClassLoader.getResource(fileName).toString > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24810) Fix paths to resource files in AvroSuite
[ https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24810: --- Attachment: Screen Shot 2018-07-15 at 15.28.13.png > Fix paths to resource files in AvroSuite > > > Key: SPARK-24810 > URL: https://issues.apache.org/jira/browse/SPARK-24810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > Attachments: Screen Shot 2018-07-15 at 15.28.13.png > > > Currently paths to tests files from resource folder are relative in > AvroSuite. It causes problems like impossibility for running tests from IDE. > Need to wrap test files by: > {code:scala} > def testFile(fileName: String): String = { > > Thread.currentThread().getContextClassLoader.getResource(fileName).toString > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24810) Fix paths to resource files in AvroSuite
Maxim Gekk created SPARK-24810:
--

 Summary: Fix paths to resource files in AvroSuite
 Key: SPARK-24810
 URL: https://issues.apache.org/jira/browse/SPARK-24810
 Project: Spark
 Issue Type: Sub-task
 Components: SQL
 Affects Versions: 2.3.1
 Reporter: Maxim Gekk

Currently, paths to test files in the resource folder are relative in AvroSuite. This causes problems, such as being unable to run the tests from an IDE. The test file names need to be wrapped by:
{code:scala}
def testFile(fileName: String): String = {
  Thread.currentThread().getContextClassLoader.getResource(fileName).toString
}
{code}
[jira] [Resolved] (SPARK-24800) Refactor Avro Serializer and Deserializer
[ https://issues.apache.org/jira/browse/SPARK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24800. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21762 [https://github.com/apache/spark/pull/21762] > Refactor Avro Serializer and Deserializer > - > > Key: SPARK-24800 > URL: https://issues.apache.org/jira/browse/SPARK-24800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24800) Refactor Avro Serializer and Deserializer
[ https://issues.apache.org/jira/browse/SPARK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24800: --- Assignee: Gengliang Wang > Refactor Avro Serializer and Deserializer > - > > Key: SPARK-24800 > URL: https://issues.apache.org/jira/browse/SPARK-24800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24809: Assignee: Apache Spark > Serializing LongHashedRelation in executor may result in data error > --- > > Key: SPARK-24809 > URL: https://issues.apache.org/jira/browse/SPARK-24809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 > Environment: Spark 2.2.1 > hadoop 2.7.1 >Reporter: Lijia Liu >Assignee: Apache Spark >Priority: Critical > > When join key is long or int in broadcast join, Spark will use > LongHashedRelation as the broadcast value. Details see SPARK-14419. But, if > the broadcast value is abnormal big, executor will serialize it to disk. But, > data will lost when serializing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544534#comment-16544534 ] Apache Spark commented on SPARK-24809: -- User 'liutang123' has created a pull request for this issue: https://github.com/apache/spark/pull/21772 > Serializing LongHashedRelation in executor may result in data error > --- > > Key: SPARK-24809 > URL: https://issues.apache.org/jira/browse/SPARK-24809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 > Environment: Spark 2.2.1 > hadoop 2.7.1 >Reporter: Lijia Liu >Priority: Critical > > When join key is long or int in broadcast join, Spark will use > LongHashedRelation as the broadcast value. Details see SPARK-14419. But, if > the broadcast value is abnormal big, executor will serialize it to disk. But, > data will lost when serializing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24809: Assignee: (was: Apache Spark) > Serializing LongHashedRelation in executor may result in data error > --- > > Key: SPARK-24809 > URL: https://issues.apache.org/jira/browse/SPARK-24809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 > Environment: Spark 2.2.1 > hadoop 2.7.1 >Reporter: Lijia Liu >Priority: Critical > > When join key is long or int in broadcast join, Spark will use > LongHashedRelation as the broadcast value. Details see SPARK-14419. But, if > the broadcast value is abnormal big, executor will serialize it to disk. But, > data will lost when serializing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24798) sortWithinPartitions(xx) will failed in java.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544528#comment-16544528 ] Daniel Mateus Pires commented on SPARK-24798: - +1 it's solved by using "Option" > sortWithinPartitions(xx) will failed in java.lang.NullPointerException > -- > > Key: SPARK-24798 > URL: https://issues.apache.org/jira/browse/SPARK-24798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: shengyao piao >Priority: Minor > > I have some issue in Spark 2.3 when I run bellow code in spark-shell or > spark-submit > I already figured out the reason of error is the name field contains > Some(null), > But I believe this code will run successfully in Spark 2.2 > Is it an expected behavior in Spark 2.3 ? > > ・Spark code > {code} > case class Hoge (id : Int,name : Option[String]) > val ds = > spark.createDataFrame(Array((1,"John"),(2,null))).withColumnRenamed("_1", > "id").withColumnRenamed("_2", "name").map(row => > Hoge(row.getAs[Int]("id"),Some(row.getAs[String]("name" > > ds.sortWithinPartitions("id").foreachPartition(iter => println(iter.isEmpty)) > {code} > ・Error > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown 
> Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.isEmpty(Iterator.scala:330) > at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1336) > at > $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26) > at > $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
Lijia Liu created SPARK-24809:
---------------------------------

          Summary: Serializing LongHashedRelation in executor may result in data error
              Key: SPARK-24809
              URL: https://issues.apache.org/jira/browse/SPARK-24809
          Project: Spark
       Issue Type: Bug
       Components: SQL
 Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.0
      Environment: Spark 2.2.1 hadoop 2.7.1
         Reporter: Lijia Liu

When the join key is a long or an int in a broadcast join, Spark uses LongHashedRelation as the broadcast value; see SPARK-14419 for details. However, if the broadcast value is abnormally large, the executor serializes it to disk, and data is lost during that serialization.
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544453#comment-16544453 ]

Li Yuanjian commented on SPARK-24295:
-------------------------------------

Could you give more detailed information about how the compact file grows to 10 GB in your scenario? In the FileStreamSinkLog implementation, the batches within a compactInterval (default value 10) are merged into a single file, and all the content of this file is serialized SinkFileStatus entries, so it seems unlikely to grow to 10 GB.

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> -------------------------------------------------------------------------
>
>              Key: SPARK-24295
>              URL: https://issues.apache.org/jira/browse/SPARK-24295
>          Project: Spark
>       Issue Type: Bug
>       Components: Structured Streaming
> Affects Versions: 2.3.0
>         Reporter: Iqbal Singh
>         Priority: Major
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file after the defined compact interval.
> For long-running jobs, the compact file can grow to tens of GB, which slows down reading data from the FileStreamSinkLog dir, because Spark defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file.
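To make the size question above concrete, here is a toy model of compact-file growth. It is not Spark code and is only a sketch under two stated assumptions: a batch is a compaction batch when (batchId + 1) is a multiple of compactInterval, and every compaction carries forward all entries of the previous compact file with nothing purged. The object name CompactGrowthSketch, the function compactEntries, and the filesPerBatch workload figure are hypothetical.

```scala
// Toy model of compact-file growth; not Spark code.
// Assumption: each compaction carries forward every entry of the previous
// compact file and nothing is ever purged.
object CompactGrowthSketch {
  // Number of entries in the compact file written at `batchId` (0-based),
  // or None if `batchId` is not a compaction batch under the assumed rule.
  def compactEntries(batchId: Long, compactInterval: Int, filesPerBatch: Int): Option[Long] =
    if ((batchId + 1) % compactInterval == 0) Some((batchId + 1) * filesPerBatch)
    else None

  def main(args: Array[String]): Unit = {
    val compactInterval = 10 // default value mentioned in the comment above
    val filesPerBatch = 100  // hypothetical workload, chosen only for illustration

    Seq(9L, 99L, 999L, 9999L).foreach { id =>
      println(s"batch $id -> ${compactEntries(id, compactInterval, filesPerBatch)} entries")
    }
  }
}
```

Under these assumptions the entry count of the compact file grows linearly with the total number of batches, not with compactInterval, which would be one way to reconcile the reporter's observation with the interval-based reasoning in the comment. Whether that matches the actual job depends on details not given in the thread.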