[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631398#comment-16631398 ] Li Yuanjian commented on SPARK-10816: - Many thanks to [~kabhwan] for notifying me. I have just linked SPARK-22565 as a duplicate of this issue; sorry, I had only searched for "session window" before and missed this one. I will keep looking for other duplicate JIRAs. As discussed in SPARK-22565, we also hit this problem while migrating streaming applications running on another system to Structured Streaming. We solved it by implementing the session window as a built-in function and shipped an internal beta version based on Apache Spark 2.3.0 just a week ago. Now that it has been running stably in a real production environment, we are cleaning up the code and translating the docs. As discussed with Jungtaek, we would also like to join the discussion here and will post a PR and design doc today. The preview PR I'll submit also contains others' patches. cc [~liulinhong] [~ivoson] [~yanlin-Lynn] [~LiangchangZ], please watch this issue. > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf
[jira] [Commented] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631370#comment-16631370 ] Li Yuanjian commented on SPARK-22565: - [~kabhwan] Many thanks for notifying me; sorry, I had only searched for "session window" and missed SPARK-10816. {quote} It would be nice if you could also share the SPIP, as well as some PR or design doc, so that we can find ways to work together and end up with a better result. {quote} No problem, I'll cherry-pick all the related patches from our internal fork. We have actually been translating the internal doc for a few days and will also post a design doc today; let's continue the discussion in SPARK-10816. Thanks again for your reply. > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. for example, > user activity comes in from kafka, we want to create window per user session > (if the time gap of activity from the same user exceeds the predefined value, > a new window will be created). > I noticed that Flink does support this kind of support, any plan/schedule for > spark for this?
[jira] [Comment Edited] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631295#comment-16631295 ] Jungtaek Lim edited comment on SPARK-25380 at 9/28/18 3:41 AM: --- IMHO it depends on how we see the issue and how we would like to tackle it. If we think a 200 MB plan string is normal and usual, you're right that the issue lies in the UI and the UI should deal with it well. (Even a single 200 MB plan would be unexpected for end users, and they might not think of allocating enough driver-side memory for the UI, so purging old plans would work for some cases but not for others.) If we don't think a 200 MB plan string is normal, we need to look at an actual case and investigate which physical nodes occupy so much space in their string representation, and whether that detail is really needed or just too verbose. If the huge string comes from the representation of a physical node itself, which doesn't change between batches, we may be able to store a message template per physical node and the variables separately, and render them only when the page is requested. If we knew more, we could come up with a better solution. Since we are unlikely to get a reproducer, I don't want to block anyone from working on this. Anyone is welcome to tackle the UI side. EDIT: I might have misunderstood your previous comment, so I removed the lines where I mentioned it. was (Author: kabhwan): IMHO it depends on how we see the issue and how we would like to tackle it. If we think a 200 MB plan string is normal and usual, you're right that the issue lies in the UI and the UI should deal with it well. (Even a single 200 MB plan would be unexpected for end users, and they might not think of allocating enough driver-side memory for the UI, so purging old plans would work for some cases but not for others.) If we don't think a 200 MB plan string is normal, we need to look at an actual case and investigate which physical nodes occupy so much space in their string representation, and whether that detail is really needed or just too verbose. If the huge string comes from the representation of a physical node itself, which doesn't change between batches, we may be able to store a message template per physical node and the variables separately, and render them only when the page is requested. If we knew more, we could come up with a better solution; according to your previous comment, I guess we're on the same page: {quote}They seem to hold a lot more memory than just the plan graph structures do, it would be nice to know what exactly is holding on to that memory. {quote} Since we are unlikely to get a reproducer, I don't want to block anyone from working on this. Anyone is welcome to tackle the UI side. > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM.
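One way to read the "template plus variables" idea above is to keep, per physical node, a reusable format string and only the per-batch values, and to build the full description only when the SQL UI page is actually requested. The sketch below is illustrative only; it is not based on Spark's actual SQL UI store, and all names are made up.

{code:scala}
// Illustrative only: keep a small per-node template plus the per-batch
// arguments, and build the (potentially huge) description text lazily.
case class PlanNodeDescription(template: String, args: Seq[String]) {
  // Rendered only when the SQL UI page actually needs the text.
  lazy val rendered: String = args.zipWithIndex.foldLeft(template) {
    case (text, (arg, i)) => text.replace(s"{$i}", arg)
  }
}

// Hypothetical usage: the template is shared across batches, the args are small.
val scanNode = PlanNodeDescription(
  template = "FileScan parquet [{0}] Batched: {1}, PushedFilters: {2}",
  args = Seq("id#0L,value#1", "true", "[IsNotNull(id)]"))
// scanNode.rendered is computed on demand, e.g. when the details page is opened.
{code}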
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631313#comment-16631313 ] Jungtaek Lim commented on SPARK-25380: -- Btw, a reproducer still helps even if we tackle this only on the UI side. There are a few options to avoid the memory issue: # Remove the feature (or add an option to "opt out" of) showing the physical plan description in the UI. # Find a way to dramatically reduce the memory needed to store physical plan descriptions. # Purge old physical plans (count- or memory-based). 2 and 3 can be applied individually or together. If we are interested in 2, we would still want actual strings to see how well we can shrink them (e.g. via compression/decompression). > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM.
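For option 2, one low-effort direction (purely a sketch, not an actual Spark patch) is to keep plan descriptions gzip-compressed in the store and decompress them only when a page asks for them; long plan strings tend to be highly repetitive, so the ratio is usually favorable.

{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Compress a physical plan description before keeping it in the UI store.
def compressPlan(planDescription: String): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val gzip = new GZIPOutputStream(bytes)
  gzip.write(planDescription.getBytes(StandardCharsets.UTF_8))
  gzip.close()
  bytes.toByteArray
}

// Decompress only when the SQL page actually renders the description.
def decompressPlan(compressed: Array[Byte]): String = {
  val gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try scala.io.Source.fromInputStream(gzip, "UTF-8").mkString
  finally gzip.close()
}
{code}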
[jira] [Updated] (SPARK-25563) Spark application hangs If container allocate on lost Nodemanager
[ https://issues.apache.org/jira/browse/SPARK-25563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] devinduan updated SPARK-25563: -- Summary: Spark application hangs If container allocate on lost Nodemanager (was: Spark application hangs) > Spark application hangs If container allocate on lost Nodemanager > - > > Key: SPARK-25563 > URL: https://issues.apache.org/jira/browse/SPARK-25563 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: devinduan >Priority: Minor > > I met a issue that if I start a spark application use yarn client mode, > application sometimes hang. > I check the application logs, container allocate on a lost NodeManager, > but AM don't retry to start another executor. > My spark version is 2.3.1 > Here is my ApplicationMaster log. > > 2018-09-26 05:21:15 INFO YarnRMClient:54 - Registering the ApplicationMaster > 2018-09-26 05:21:15 INFO ConfiguredRMFailoverProxyProvider:100 - Failing over > to rm2 > 2018-09-26 05:21:15 WARN Utils:66 - spark.executor.instances less than > spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please > update your configs. > 2018-09-26 05:21:15 INFO Utils:54 - Using initial executors = 1, max of > spark.dynamicAllocation.initialExecutors, > spark.dynamicAllocation.minExecutors and spark.executor.instances > 2018-09-26 05:21:15 INFO YarnAllocator:54 - Will request 1 executor > container(s), each with 24 core(s) and 20275 MB memory (including 1843 MB of > overhead) > 2018-09-26 05:21:15 INFO YarnAllocator:54 - Submitted 1 unlocalized container > requests. > 2018-09-26 05:21:15 INFO ApplicationMaster:54 - Started progress reporter > thread with (heartbeat : 3000, initial allocation : 200) intervals > 2018-09-26 05:21:27 WARN YarnAllocator:66 - Cannot find executorId for > container: container_1532951609168_4721728_01_02 > 2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container > container_1532951609168_4721728_01_02 (state: COMPLETE, exit status: -100) > 2018-09-26 05:21:27 WARN YarnAllocator:66 - Container marked as failed: > container_1532951609168_4721728_01_02. Exit status: -100. Diagnostics: > Container released on a *lost* node -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25563) Spark application hangs
devinduan created SPARK-25563: - Summary: Spark application hangs Key: SPARK-25563 URL: https://issues.apache.org/jira/browse/SPARK-25563 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: devinduan I hit an issue where, if I start a Spark application in YARN client mode, the application sometimes hangs. Checking the application logs, the container was allocated on a lost NodeManager, but the AM does not retry starting another executor. My Spark version is 2.3.1. Here is my ApplicationMaster log. 2018-09-26 05:21:15 INFO YarnRMClient:54 - Registering the ApplicationMaster 2018-09-26 05:21:15 INFO ConfiguredRMFailoverProxyProvider:100 - Failing over to rm2 2018-09-26 05:21:15 WARN Utils:66 - spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs. 2018-09-26 05:21:15 INFO Utils:54 - Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances 2018-09-26 05:21:15 INFO YarnAllocator:54 - Will request 1 executor container(s), each with 24 core(s) and 20275 MB memory (including 1843 MB of overhead) 2018-09-26 05:21:15 INFO YarnAllocator:54 - Submitted 1 unlocalized container requests. 2018-09-26 05:21:15 INFO ApplicationMaster:54 - Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals 2018-09-26 05:21:27 WARN YarnAllocator:66 - Cannot find executorId for container: container_1532951609168_4721728_01_02 2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container container_1532951609168_4721728_01_02 (state: COMPLETE, exit status: -100) 2018-09-26 05:21:27 WARN YarnAllocator:66 - Container marked as failed: container_1532951609168_4721728_01_02. Exit status: -100. Diagnostics: Container released on a *lost* node
[jira] [Commented] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631301#comment-16631301 ] Jungtaek Lim commented on SPARK-22565: -- [~XuanYuan] Hello, I started working on this feature about a week ago in a different JIRA issue. Please refer to https://issues.apache.org/jira/browse/SPARK-10816 as well as the SPIP discussion thread on the dev@ mailing list. A WIP version of the PR is also available: [https://github.com/apache/spark/pull/22482] It would be nice if you could also share the SPIP, as well as some PR or design doc, so that we can find ways to work together and end up with a better result. Thanks in advance! > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. for example, > user activity comes in from kafka, we want to create window per user session > (if the time gap of activity from the same user exceeds the predefined value, > a new window will be created). > I noticed that Flink does support this kind of support, any plan/schedule for > spark for this?
[jira] [Comment Edited] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631298#comment-16631298 ] Li Yuanjian edited comment on SPARK-22565 at 9/28/18 3:23 AM: -- Thanks for reporting this. We actually also hit this problem in our usage and have an implementation of session windows in our internal fork to resolve it. Now that it has been running stably in a real production environment, we want to contribute it to the community within the next few days. We implemented this as a built-in function named session_window, together with the corresponding support for window merging in Structured Streaming. The DataFrame API and SQL usage can be seen quickly in this test: !screenshot-1.png! was (Author: xuanyuan): Thanks for reporting this. We actually also hit this problem in our usage and have an implementation of session windows in our internal fork to resolve it. Now that it has been running stably in a real production environment, we want to contribute it to the community within the next few days. We implemented this as a built-in function named session_window. The DataFrame API and SQL usage can be seen quickly in this test: !screenshot-1.png! > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. for example, > user activity comes in from kafka, we want to create window per user session > (if the time gap of activity from the same user exceeds the predefined value, > a new window will be created). > I noticed that Flink does support this kind of support, any plan/schedule for > spark for this?
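For readers wondering what such an API could look like: the comment above describes a built-in function named session_window, so usage might resemble the existing window function, roughly as below. This is a hypothetical sketch based only on the description and the screenshot, not the actual API from the patch; the session_window signature and the "events" DataFrame are assumptions.

{code:scala}
import org.apache.spark.sql.functions.col

// Hypothetical sketch only: session_window is the proposed built-in, assumed to
// take (timeColumn, gapDuration) by analogy with window(timeColumn, duration).
// "events" stands for a streaming DataFrame with userId and eventTime columns.
val sessionized = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(col("userId"), session_window(col("eventTime"), "5 minutes"))
  .count()
{code}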
[jira] [Commented] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631300#comment-16631300 ] Li Yuanjian commented on SPARK-22565: - Also cc [~zsxwing] [~tdas], we are translating the design doc and will post a SPIP in the next few days; we hope you can take a look when you have time, thanks :) > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. for example, > user activity comes in from kafka, we want to create window per user session > (if the time gap of activity from the same user exceeds the predefined value, > a new window will be created). > I noticed that Flink does support this kind of support, any plan/schedule for > spark for this?
[jira] [Commented] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631298#comment-16631298 ] Li Yuanjian commented on SPARK-22565: - Thanks for reporting this. We actually also hit this problem in our usage and have an implementation of session windows in our internal fork to resolve it. Now that it has been running stably in a real production environment, we want to contribute it to the community within the next few days. We implemented this as a built-in function named session_window. The DataFrame API and SQL usage can be seen quickly in this test: !screenshot-1.png! > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. for example, > user activity comes in from kafka, we want to create window per user session > (if the time gap of activity from the same user exceeds the predefined value, > a new window will be created). > I noticed that Flink does support this kind of support, any plan/schedule for > spark for this?
[jira] [Updated] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-22565: Attachment: screenshot-1.png > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. for example, > user activity comes in from kafka, we want to create window per user session > (if the time gap of activity from the same user exceeds the predefined value, > a new window will be created). > I noticed that Flink does support this kind of support, any plan/schedule for > spark for this? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631295#comment-16631295 ] Jungtaek Lim commented on SPARK-25380: -- IMHO it depends on how we see the issue and how we would like to tackle it. If we think a 200 MB plan string is normal and usual, you're right that the issue lies in the UI and the UI should deal with it well. (Even a single 200 MB plan would be unexpected for end users, and they might not think of allocating enough driver-side memory for the UI, so purging old plans would work for some cases but not for others.) If we don't think a 200 MB plan string is normal, we need to look at an actual case and investigate which physical nodes occupy so much space in their string representation, and whether that detail is really needed or just too verbose. If the huge string comes from the representation of a physical node itself, which doesn't change between batches, we may be able to store a message template per physical node and the variables separately, and render them only when the page is requested. If we knew more, we could come up with a better solution; according to your previous comment, I guess we're on the same page: {quote}They seem to hold a lot more memory than just the plan graph structures do, it would be nice to know what exactly is holding on to that memory. {quote} Since we are unlikely to get a reproducer, I don't want to block anyone from working on this. Anyone is welcome to tackle the UI side. > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM.
[jira] [Created] (SPARK-25562) The Spark add audit log
yinghua_zh created SPARK-25562: -- Summary: The Spark add audit log Key: SPARK-25562 URL: https://issues.apache.org/jira/browse/SPARK-25562 Project: Spark Issue Type: New Feature Components: Spark Submit Affects Versions: 2.1.0 Reporter: yinghua_zh At present, Spark does not record audit logs; audit logging could be added for security reasons.
[jira] [Assigned] (SPARK-25560) Allow Function Injection in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25560: Assignee: Apache Spark > Allow Function Injection in SparkSessionExtensions > -- > > Key: SPARK-25560 > URL: https://issues.apache.org/jira/browse/SPARK-25560 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Russell Spitzer >Assignee: Apache Spark >Priority: Major > > Currently there is no way to add a set of external functions to all sessions > made by users. We could add a small extension to SparkSessionExtensions which > would allow this to be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25560) Allow Function Injection in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25560: Assignee: (was: Apache Spark) > Allow Function Injection in SparkSessionExtensions > -- > > Key: SPARK-25560 > URL: https://issues.apache.org/jira/browse/SPARK-25560 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Russell Spitzer >Priority: Major > > Currently there is no way to add a set of external functions to all sessions > made by users. We could add a small extension to SparkSessionExtensions which > would allow this to be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25560) Allow Function Injection in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631257#comment-16631257 ] Apache Spark commented on SPARK-25560: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/22576 > Allow Function Injection in SparkSessionExtensions > -- > > Key: SPARK-25560 > URL: https://issues.apache.org/jira/browse/SPARK-25560 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Russell Spitzer >Priority: Major > > Currently there is no way to add a set of external functions to all sessions > made by users. We could add a small extension to SparkSessionExtensions which > would allow this to be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24630: Assignee: Apache Spark > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Assignee: Apache Spark >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24630: Assignee: (was: Apache Spark) > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631253#comment-16631253 ] Apache Spark commented on SPARK-24630: -- User 'stczwd' has created a pull request for this issue: https://github.com/apache/spark/pull/22575 > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631236#comment-16631236 ] Marcelo Vanzin commented on SPARK-25380: If all you want is see this live, why do you need a query that generates a large plan? Just hack whatever method creates the plan string to return a really large string full of garbage. Same result. The goal here is not to reproduce the original problem, but to provide a solution for when that happens; there's no problem with his query other than the plan being large and the UI not dealing with that well. > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
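For anyone following that suggestion, producing the oversized description itself is trivial; the only Spark-specific part is wiring it into whichever method builds the plan string, which is not shown here because it depends on the code under test.

{code:scala}
// Build a ~200 MB garbage string to stand in for a huge plan description
// when exercising the SQL UI code path. Purely a test aid, not Spark code.
def garbagePlanDescription(targetBytes: Int = 200 * 1024 * 1024): String = {
  val chunk = scala.util.Random.alphanumeric.take(1024).mkString
  val sb = new StringBuilder(targetBytes)
  while (sb.length < targetBytes) sb.append(chunk)
  sb.toString
}
{code}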
[jira] [Commented] (SPARK-25459) Add viewOriginalText back to CatalogTable
[ https://issues.apache.org/jira/browse/SPARK-25459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631232#comment-16631232 ] Dongjoon Hyun commented on SPARK-25459: --- [~chrisz28]. It seems that JIRA is unable to accept your id. !error_message.png! > Add viewOriginalText back to CatalogTable > - > > Key: SPARK-25459 > URL: https://issues.apache.org/jira/browse/SPARK-25459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Zheyuan Zhao >Priority: Major > Attachments: error_message.png > > > The {{show create table}} will show a lot of generated attributes for views > that created by older Spark version. See this test suite > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala#L115. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25459) Add viewOriginalText back to CatalogTable
[ https://issues.apache.org/jira/browse/SPARK-25459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25459: -- Attachment: error_message.png > Add viewOriginalText back to CatalogTable > - > > Key: SPARK-25459 > URL: https://issues.apache.org/jira/browse/SPARK-25459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Zheyuan Zhao >Priority: Major > Attachments: error_message.png > > > The {{show create table}} will show a lot of generated attributes for views > that created by older Spark version. See this test suite > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala#L115. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631206#comment-16631206 ] Jungtaek Lim commented on SPARK-10816: -- To avoid any concerns or confusion: I believe my proposal and map/flatMapGroupsWithState can co-exist. map/flatMapGroupsWithState targets general (arbitrary, not window-specific) cases, but by the nature of that generalization it can't be fully optimized for any specific case. The edge case also comes out of generalization, and if we tackle it with map/flatMapGroupsWithState by supporting multiple values per key, that would add non-trivial overhead for the cases which don't need multiple values per key, and the state function may become more complicated or come in a couple of forms. The point is whether the simple gap-based session window is worth treating as a first-class use case. Spark supports tumbling/sliding windows natively because we see that they are worth it. I think session windows are similarly worthwhile, since we already ship an example on sessionization, and I expect supporting them natively would give much more benefit than the added complexity costs. The same would apply if we added some other API (a function, or a DSL) to support custom windows in a follow-up issue (SPARK-2 as Arun stated). If we find it convenient enough and see the worth of supporting it natively instead of letting end users play with map/flatMapGroupsWithState, that can be the way to go. > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf
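For context on the trade-off discussed above, this is roughly what gap-based sessionization looks like today with the general-purpose API, loosely following the pattern of Spark's StructuredSessionization example; the event and state classes and the 10-minute gap are illustrative, not part of the proposal. The native session window proposal aims to replace this hand-written state handling for the common case.

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(userId: String, eventTime: Timestamp)
case class SessionInfo(start: Long, end: Long, numEvents: Long)
case class SessionUpdate(userId: String, durationMs: Long, numEvents: Long, expired: Boolean)

// Hand-written gap sessionization with the general-purpose API.
def sessionize(spark: SparkSession, events: Dataset[Event]): Dataset[SessionUpdate] = {
  import spark.implicits._
  events
    .groupByKey(_.userId)
    .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.ProcessingTimeTimeout) {
      case (userId, eventsInBatch, state) =>
        if (state.hasTimedOut) {
          // No events arrived within the gap: close the session.
          val s = state.get
          state.remove()
          SessionUpdate(userId, s.end - s.start, s.numEvents, expired = true)
        } else {
          // Merge the new events into the running session and reset the gap timer.
          val times = eventsInBatch.map(_.eventTime.getTime).toSeq
          val old = state.getOption.getOrElse(SessionInfo(times.min, times.max, 0L))
          val merged = SessionInfo(
            math.min(old.start, times.min), math.max(old.end, times.max),
            old.numEvents + times.size)
          state.update(merged)
          state.setTimeoutDuration("10 minutes") // the session gap
          SessionUpdate(userId, merged.end - merged.start, merged.numEvents, expired = false)
        }
    }
}
{code}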
[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631193#comment-16631193 ] Karthik Manamcheri commented on SPARK-25561: The root cause was from SPARK-17992 ping [~michael] what are your thoughts on this? > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark > should handle that exception correctly if Hive falls back to ORM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
Karthik Manamcheri created SPARK-25561: -- Summary: HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql Key: SPARK-25561 URL: https://issues.apache.org/jira/browse/SPARK-25561 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Karthik Manamcheri In HiveShim.scala, the current behavior is that if hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter call to succeed. If it fails, we'll throw a RuntimeException. However, this might not always be the case. Hive's direct SQL functionality is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark should handle that exception correctly if Hive falls back to ORM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
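To make the proposed behavior concrete, the idea is roughly the following. This is a schematic sketch only, not the real HiveShim code or its exception types; the two by-name parameters stand in for the metastore calls.

{code:scala}
// Schematic sketch of the proposed handling: treat the pruned listing as
// best-effort and degrade gracefully instead of failing the whole query.
def partitionsWithFallback[T](
    prunedPartitions: => Seq[T],   // e.g. the metastore getPartitionsByFilter call
    allPartitions: => Seq[T]): Seq[T] = {
  try {
    prunedPartitions
  } catch {
    case e: Exception =>
      // Hive's direct SQL path is best-effort, so a failure here need not be fatal.
      System.err.println(s"Partition pruning in the metastore failed (${e.getMessage}); " +
        "falling back to fetching all partitions and pruning on the Spark side.")
      allPartitions
  }
}
{code}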
[jira] [Comment Edited] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631186#comment-16631186 ] Hyukjin Kwon edited comment on SPARK-18112 at 9/27/18 11:29 PM: To avoid Hive fork 1.2.1 in Spark itself at all, please provide some input at https://issues.apache.org/jira/browse/SPARK-20202 to upgrade it to Hive 2.3.2. Now it's somehow blocked and I need some input there. was (Author: hyukjin.kwon): To avoid Hive fork 1.2.1 in Spark itself at all, please provide some input at https://issues.apache.org/jira/browse/SPARK-20202 to upgrade it to Hive 3.0.0. Now it's somehow blocked and I need some input there. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631186#comment-16631186 ] Hyukjin Kwon commented on SPARK-18112: -- To avoid Hive fork 1.2.1 in Spark itself at all, please provide some input at https://issues.apache.org/jira/browse/SPARK-20202 to upgrade it to Hive 3.0.0. Now it's somehow blocked and I need some input there. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
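Separate from the fork-removal work itself, readers hitting the original error may find it useful that Spark 2.x can already talk to a newer Hive metastore through the isolated metastore client configs shown below. The supported version range depends on the Spark release, and the jar paths are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

// Point the isolated Hive client at a newer metastore without touching the
// built-in Hive 1.2.1 fork. Paths below are examples only.
val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars",
    "/opt/hive-2.1.1/lib/*:/opt/hadoop/share/hadoop/common/*")
  .enableHiveSupport()
  .getOrCreate()
{code}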
[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631184#comment-16631184 ] Apache Spark commented on SPARK-25559: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22574 > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > Currently, in *ParquetFilters*, if one of the children predicates is not > supported by Parquet, the entire predicates will be thrown away. In fact, if > the unsupported predicate is in the top level *And* condition or in the child > before hitting *Not* or *Or* condition, it's safe to just remove the > unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631183#comment-16631183 ] Apache Spark commented on SPARK-25559: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22574 > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > Currently, in *ParquetFilters*, if one of the children predicates is not > supported by Parquet, the entire predicates will be thrown away. In fact, if > the unsupported predicate is in the top level *And* condition or in the child > before hitting *Not* or *Or* condition, it's safe to just remove the > unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631182#comment-16631182 ] DB Tsai commented on SPARK-25559: - https://github.com/apache/spark/pull/22574 > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > Currently, in *ParquetFilters*, if one of the children predicates is not > supported by Parquet, the entire predicates will be thrown away. In fact, if > the unsupported predicate is in the top level *And* condition or in the child > before hitting *Not* or *Or* condition, it's safe to just remove the > unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-25556: Comment: was deleted (was: User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22574) > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-25556: Comment: was deleted (was: User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22574) > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631159#comment-16631159 ] Hyukjin Kwon commented on SPARK-18112: -- {quote} If I use that 1.2.1 fork I was getting some query errors due to me using bloom filters on multiple columns of the table {quote} Can you file another JIRA then? Some tests were added for that (SPARK-25427). If that's not respected, it sounds another problem. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1663#comment-1663 ] Apache Spark commented on SPARK-25556: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22574 > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25556: Assignee: Apache Spark (was: DB Tsai) > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631108#comment-16631108 ] Apache Spark commented on SPARK-25556: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22574 > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25556: Assignee: DB Tsai (was: Apache Spark) > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25560) Allow Function Injection in SparkSessionExtensions
Russell Spitzer created SPARK-25560: --- Summary: Allow Function Injection in SparkSessionExtensions Key: SPARK-25560 URL: https://issues.apache.org/jira/browse/SPARK-25560 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Affects Versions: 2.4.0 Reporter: Russell Spitzer Currently there is no way to add a set of external functions to all sessions made by users. We could add a small extension to SparkSessionExtensions which would allow this to be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
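To make the proposal concrete, registration could look roughly like the sketch below. The injectFunction hook and its (FunctionIdentifier, ExpressionInfo, builder) shape are assumptions about the proposed extension (see the linked pull request for the real API), not an existing Spark 2.x API.

{code:scala}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo, Upper}

// Assumed API sketch: an injectFunction hook on SparkSessionExtensions that
// registers a SQL function for every session created with these extensions.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectFunction(                      // hypothetical hook
      (FunctionIdentifier("my_upper"),
        new ExpressionInfo(classOf[Upper].getName, "my_upper"),
        (children: Seq[Expression]) => Upper(children.head)))
  }
}

// Usage sketch: enable the extensions when building the session.
val spark = SparkSession.builder()
  .config("spark.sql.extensions", classOf[MyExtensions].getName)
  .getOrCreate()
// spark.sql("SELECT my_upper('hello')") would then resolve the injected function.
{code}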
[jira] [Updated] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-25559: Description: Currently, in *ParquetFilters*, if one of the children predicates is not supported by Parquet, the entire predicates will be thrown away. In fact, if the unsupported predicate is in the top level *And* condition or in the child before hitting *Not* or *Or* condition, it's safe to just remove the unsupported one as unhandled filters. (was: Currently, in *ParquetFilters*, if one of the children predicate is not supported by Parquet, the entire predicates will be thrown away. In fact, if the unsupported predicate is in the top level *And* condition or in the child before hitting *Not* or *Or* condition, it's safe to just remove the unsupported one as unhandled filters.) > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > Currently, in *ParquetFilters*, if one of the children predicates is not > supported by Parquet, the entire predicates will be thrown away. In fact, if > the unsupported predicate is in the top level *And* condition or in the child > before hitting *Not* or *Or* condition, it's safe to just remove the > unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
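The reason dropping a child is only safe under And: scanning with just one conjunct returns a superset of the rows matching And(a, b), and Spark re-applies the full predicate afterwards, whereas dropping a child under Or or Not could wrongly change which rows survive. Below is a small self-contained sketch of that rule over the public sources.Filter API; it is not the actual ParquetFilters code, and isSupported is assumed to answer whether a whole predicate tree can be converted.

{code:scala}
import org.apache.spark.sql.sources._

// Sketch, not the real ParquetFilters implementation: keep the pushable part
// of a predicate. Under And we may drop an unsupported child; under Or/Not we
// must not, so an unsupported subtree there disqualifies the whole subtree.
def pushablePart(pred: Filter, isSupported: Filter => Boolean): Option[Filter] = pred match {
  case And(left, right) =>
    (pushablePart(left, isSupported), pushablePart(right, isSupported)) match {
      case (Some(l), Some(r)) => Some(And(l, r))
      case (Some(l), None)    => Some(l)
      case (None, Some(r))    => Some(r)
      case (None, None)       => None
    }
  case other => if (isSupported(other)) Some(other) else None
}

// Toy support oracle that rejects StringContains anywhere in the tree.
def toySupport(f: Filter): Boolean = f match {
  case _: StringContains => false
  case And(l, r)         => toySupport(l) && toySupport(r)
  case Or(l, r)          => toySupport(l) && toySupport(r)
  case Not(c)            => toySupport(c)
  case _                 => true
}

// pushablePart(And(EqualTo("id", 1), StringContains("name", "a")), toySupport)
//   == Some(EqualTo("id", 1))
{code}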
[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630947#comment-16630947 ] Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 8:12 PM: -- Nice work. Since I cant comment on the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis for the single DF machine support. It supports numpy arrays as java object, of course there are downsides. Are there any tasks for this defined? was (Author: skonto): Nice work. Since I cant comment on the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis for the single DF machine support. Are there any tasks for this defined? > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630947#comment-16630947 ] Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 7:37 PM: -- Nice work. Since I cant comment on the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis for the single DF machine support. Are there any tasks for this defined? was (Author: skonto): Nice work. Since I cant comment on the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis. Are there any tasks for this defined? > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630947#comment-16630947 ] Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 7:36 PM: -- Nice work. Since I cant comment on the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis. Are there any tasks for this defined? was (Author: skonto): Nice work. Since I cant comment in the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis. Are there any tasks for this defined? > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630947#comment-16630947 ] Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 7:35 PM: -- Nice work. Since I cant comment in the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis. Are there any tasks for this defined? was (Author: skonto): Nice work. Since I cant comment in the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis. > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630947#comment-16630947 ] Stavros Kontopoulos commented on SPARK-24579: - Nice work. Since I cant comment in the design doc, have you checked [https://github.com/ninia/jep/wiki/How-Jep-Works] it could be useful for the scala/java apis. > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630912#comment-16630912 ] Stavros Kontopoulos edited comment on SPARK-24724 at 9/27/18 7:06 PM: -- AFAIK in standalone mode passwordless communication (ssh) is needed between workers or in general between master-workers. Right now the design proposal uses netty rpc for task coordination for the barriers via the default rpc Env (env.rpcEnv.setupEndpoint call). For that part I dont see any issue, but when mpi tasks are run they may need extra net configuration I suppose (like ports?). was (Author: skonto): AFAIK in standalone mode passwordless communication is needed between workers or in general between master-workers. Right now the design proposal uses netty rpc for task coordination for the barriers via the default rpc Env (env.rpcEnv.setupEndpoint call). For that part I dont see any issue, but when mpi tasks are run they may need extra net configuration? > Discuss necessary info and access in barrier mode + Kubernetes > -- > > Key: SPARK-24724 > URL: https://issues.apache.org/jira/browse/SPARK-24724 > Project: Spark > Issue Type: Story > Components: Kubernetes, ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Yinan Li >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + Kubernetes. There were > some past and on-going attempts from the Kubenetes community. So we should > find someone with good knowledge to lead the discussion here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
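For context on the coordination being discussed, a sketch of the barrier execution API in Spark 2.4: each task can learn its peers' addresses and reach a global sync point before handing off to an external MPI launcher; the launch step itself is a hypothetical placeholder.

{code:scala}
// Local sketch; the MPI hand-off is a hypothetical placeholder.
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("barrier-sketch").getOrCreate()
val sc = spark.sparkContext

sc.parallelize(1 to 4, 4)
  .barrier()                                       // all 4 tasks are scheduled together, or not at all
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()
    val peers = ctx.getTaskInfos().map(_.address)  // peer addresses, e.g. to build an MPI hostfile
    ctx.barrier()                                  // global synchronization across the barrier stage
    // launchMpiProcess(peers)                     // hypothetical: ssh/mpirun hand-off would go here
    iter
  }
  .collect()
{code}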
[jira] [Commented] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630912#comment-16630912 ] Stavros Kontopoulos commented on SPARK-24724: - AFAIK in standalone mode passwordless communication is needed between workers or in general between master-workers. Right now the design proposal uses netty rpc for task coordination for the barriers via the default rpc Env (env.rpcEnv.setupEndpoint call). For that part I dont see any issue, but when mpi tasks are run they may need extra net configuration? > Discuss necessary info and access in barrier mode + Kubernetes > -- > > Key: SPARK-24724 > URL: https://issues.apache.org/jira/browse/SPARK-24724 > Project: Spark > Issue Type: Story > Components: Kubernetes, ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Yinan Li >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + Kubernetes. There were > some past and on-going attempts from the Kubenetes community. So we should > find someone with good knowledge to lead the discussion here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25334) Default SessionCatalog should support UDFs
[ https://issues.apache.org/jira/browse/SPARK-25334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630890#comment-16630890 ] Swapnil Chougule commented on SPARK-25334: -- +1 My work has also been impacted by the same issue. SessionCatalog calls registerFunction to add a function to the function registry. However, makeFunctionExpression supports only UserDefinedAggregateFunction. Can we have handlers for Scala UDFs, Java UDFs and Java UDAFs as well in SessionCatalog? > Default SessionCatalog should support UDFs > -- > > Key: SPARK-25334 > URL: https://issues.apache.org/jira/browse/SPARK-25334 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Tomasz Gawęda >Priority: Major > > SessionCatalog calls registerFunction to add a function to the function registry. > However, makeFunctionExpression supports only UserDefinedAggregateFunction. > We should make makeFunctionExpression support UserDefinedFunction, as it's > one of the function types. > Currently we can use persistent functions only with the Hive metastore, but > the "create function" command also works with the default SessionCatalog. It > sometimes causes user confusion, like in > https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
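A hedged sketch of the two code paths being contrasted; com.example.MyUpper is a hypothetical UDF class, and the exact failure is the analysis error described in the linked Stack Overflow question, not asserted here:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Per-session registration of a plain Scala UDF works today.
spark.udf.register("my_upper", (s: String) => s.toUpperCase)
spark.sql("SELECT my_upper('ok')").show()

// Persistent function through the default SessionCatalog: CREATE FUNCTION is accepted, but
// makeFunctionExpression only knows how to build a UserDefinedAggregateFunction, so invoking
// a non-UDAF class registered this way fails at analysis time (per the linked question).
spark.sql("CREATE FUNCTION my_upper2 AS 'com.example.MyUpper'")   // hypothetical class name
// spark.sql("SELECT my_upper2('ok')").show()                     // expected to fail here
{code}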
[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630879#comment-16630879 ] Dongjoon Hyun commented on SPARK-25557: --- Got it, [~dbtsai] . > ORC predicate pushdown for nested fields > > > Key: SPARK-25557 > URL: https://issues.apache.org/jira/browse/SPARK-25557 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630871#comment-16630871 ] Apache Spark commented on SPARK-25558: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22573 > Pushdown predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630870#comment-16630870 ] Apache Spark commented on SPARK-25558: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22573 > Pushdown predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25558: Assignee: DB Tsai (was: Apache Spark) > Pushdown predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25558: Assignee: Apache Spark (was: DB Tsai) > Pushdown predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-25558: Summary: Pushdown predicates for nested fields in DataSource Strategy (was: Creating predicates for nested fields in DataSource Strategy ) > Pushdown predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25559) Just remove the unsupported predicates in Parquet
DB Tsai created SPARK-25559: --- Summary: Just remove the unsupported predicates in Parquet Key: SPARK-25559 URL: https://issues.apache.org/jira/browse/SPARK-25559 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: DB Tsai Assignee: DB Tsai Currently, in *ParquetFilters*, if one of the children predicate is not supported by Parquet, the entire predicates will be thrown away. In fact, if the unsupported predicate is in the top level *And* condition or in the child before hitting *Not* or *Or* condition, it's safe to just remove the unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25558) Creating predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-25558: Issue Type: Sub-task (was: New Feature) Parent: SPARK-25556 > Creating predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25558) Creating predicates for nested fields in DataSource Strategy
[ https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25558: --- Assignee: DB Tsai > Creating predicates for nested fields in DataSource Strategy > - > > Key: SPARK-25558 > URL: https://issues.apache.org/jira/browse/SPARK-25558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25558) Creating predicates for nested fields in DataSource Strategy
DB Tsai created SPARK-25558: --- Summary: Creating predicates for nested fields in DataSource Strategy Key: SPARK-25558 URL: https://issues.apache.org/jira/browse/SPARK-25558 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: DB Tsai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25557) ORC predicate pushdown for nested fields
DB Tsai created SPARK-25557: --- Summary: ORC predicate pushdown for nested fields Key: SPARK-25557 URL: https://issues.apache.org/jira/browse/SPARK-25557 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: DB Tsai Assignee: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25556: --- Assignee: DB Tsai > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17636) Parquet predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-17636: Summary: Parquet predicate pushdown for nested fields (was: Parquet filter push down doesn't handle struct fields) > Parquet predicate pushdown for nested fields > > > Key: SPARK-17636 > URL: https://issues.apache.org/jira/browse/SPARK-17636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 1.6.2, 1.6.3, 2.0.2 >Reporter: Mitesh >Assignee: DB Tsai >Priority: Minor > Fix For: 2.5.0 > > > There's a *PushedFilters* for a simple numeric field, but not for a numeric > field inside a struct. Not sure if this is a Spark limitation because of > Parquet, or only a Spark limitation. > {noformat} > scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", > "sale_id") > res5: org.apache.spark.sql.DataFrame = [day_timestamp: > struct, sale_id: bigint] > scala> res5.filter("sale_id > 4").queryExecution.executedPlan > res9: org.apache.spark.sql.execution.SparkPlan = > Filter[23814] [args=(sale_id#86324L > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)] > scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan > res10: org.apache.spark.sql.execution.SparkPlan = > Filter[23815] [args=(day_timestamp#86302.timestamp > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25556) Predicate Pushdown for Nested fields
DB Tsai created SPARK-25556: --- Summary: Predicate Pushdown for Nested fields Key: SPARK-25556 URL: https://issues.apache.org/jira/browse/SPARK-25556 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: DB Tsai This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630830#comment-16630830 ] Christopher Said commented on SPARK-25461: -- For what it's worth, converting all floats to False goes against my expectations. > PySpark Pandas UDF outputs incorrect results when input columns contain None > > > Key: SPARK-25461 > URL: https://issues.apache.org/jira/browse/SPARK-25461 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: I reproduced this issue by running pyspark locally on > mac: > Spark version: 2.3.1 pre-built with Hadoop 2.7 > Python library versions: pyarrow==0.10.0, pandas==0.20.2 >Reporter: Chongyuan Xiang >Priority: Major > > The following PySpark script uses a simple pandas UDF to calculate a column > given column 'A'. When column 'A' contains None, the results look incorrect. > Script: > > {code:java} > import pandas as pd > import random > import pyspark > from pyspark.sql.functions import col, lit, pandas_udf > values = [None] * 3 + [1.0] * 17 + [2.0] * 600 > random.shuffle(values) > pdf = pd.DataFrame({'A': values}) > df = spark.createDataFrame(pdf) > @pandas_udf(returnType=pyspark.sql.types.BooleanType()) > def gt_2(column): > return (column >= 2).where(column.notnull()) > calculated_df = (df.select(['A']) > .withColumn('potential_bad_col', gt_2('A')) > ) > calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) > | (col("A").isNull())) > calculated_df.show() > {code} > > Output: > {code:java} > +---+-+---+ > | A|potential_bad_col|correct_col| > +---+-+---+ > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |1.0| false| false| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > |2.0| false| true| > +---+-+---+ > only showing top 20 rows > {code} > This problem disappears when the number of rows is small or when the input > column does not contain None. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-17636: Issue Type: Sub-task (was: Bug) Parent: SPARK-25556 > Parquet filter push down doesn't handle struct fields > - > > Key: SPARK-17636 > URL: https://issues.apache.org/jira/browse/SPARK-17636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 1.6.2, 1.6.3, 2.0.2 >Reporter: Mitesh >Assignee: DB Tsai >Priority: Minor > Fix For: 2.5.0 > > > There's a *PushedFilters* for a simple numeric field, but not for a numeric > field inside a struct. Not sure if this is a Spark limitation because of > Parquet, or only a Spark limitation. > {noformat} > scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", > "sale_id") > res5: org.apache.spark.sql.DataFrame = [day_timestamp: > struct, sale_id: bigint] > scala> res5.filter("sale_id > 4").queryExecution.executedPlan > res9: org.apache.spark.sql.execution.SparkPlan = > Filter[23814] [args=(sale_id#86324L > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)] > scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan > res10: org.apache.spark.sql.execution.SparkPlan = > Filter[23815] [args=(day_timestamp#86302.timestamp > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25521) Job id showing null when insert into command Job is finished.
[ https://issues.apache.org/jira/browse/SPARK-25521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630827#comment-16630827 ] Apache Spark commented on SPARK-25521: -- User 'sujith71955' has created a pull request for this issue: https://github.com/apache/spark/pull/22572 > Job id showing null when insert into command Job is finished. > - > > Key: SPARK-25521 > URL: https://issues.apache.org/jira/browse/SPARK-25521 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Babulal >Priority: Minor > Attachments: image-2018-09-25-12-01-31-871.png > > > scala> spark.sql("create table x1(name string,age int) stored as parquet") > scala> spark.sql("insert into x1 select 'a',29") > check logs > 2018-08-19 12:45:36 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 > (TID 0) in 874 ms on localhost (executor > driver) (1/1) > 2018-08-19 12:45:36 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose > tasks have all completed, from pool > 2018-08-19 12:45:36 INFO DAGScheduler:54 - ResultStage 0 (sql at > :24) finished in 1.131 s > 2018-08-19 12:45:36 INFO DAGScheduler:54 - Job 0 finished: sql at > :24, took 1.233329 s > 2018-08-19 12:45:36 INFO FileFormatWriter:54 - Job > {color:#d04437}null{color} committed. > 2018-08-19 12:45:36 INFO FileFormatWriter:54 - Finished processing stats for > job null. > res4: org.apache.spark.sql.DataFrame = [] > > !image-2018-09-25-12-01-31-871.png! > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25521) Job id showing null when insert into command Job is finished.
[ https://issues.apache.org/jira/browse/SPARK-25521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25521: Assignee: Apache Spark > Job id showing null when insert into command Job is finished. > - > > Key: SPARK-25521 > URL: https://issues.apache.org/jira/browse/SPARK-25521 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Babulal >Assignee: Apache Spark >Priority: Minor > Attachments: image-2018-09-25-12-01-31-871.png > > > scala> spark.sql("create table x1(name string,age int) stored as parquet") > scala> spark.sql("insert into x1 select 'a',29") > check logs > 2018-08-19 12:45:36 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 > (TID 0) in 874 ms on localhost (executor > driver) (1/1) > 2018-08-19 12:45:36 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose > tasks have all completed, from pool > 2018-08-19 12:45:36 INFO DAGScheduler:54 - ResultStage 0 (sql at > :24) finished in 1.131 s > 2018-08-19 12:45:36 INFO DAGScheduler:54 - Job 0 finished: sql at > :24, took 1.233329 s > 2018-08-19 12:45:36 INFO FileFormatWriter:54 - Job > {color:#d04437}null{color} committed. > 2018-08-19 12:45:36 INFO FileFormatWriter:54 - Finished processing stats for > job null. > res4: org.apache.spark.sql.DataFrame = [] > > !image-2018-09-25-12-01-31-871.png! > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25521) Job id showing null when insert into command Job is finished.
[ https://issues.apache.org/jira/browse/SPARK-25521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25521: Assignee: (was: Apache Spark) > Job id showing null when insert into command Job is finished. > - > > Key: SPARK-25521 > URL: https://issues.apache.org/jira/browse/SPARK-25521 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Babulal >Priority: Minor > Attachments: image-2018-09-25-12-01-31-871.png > > > scala> spark.sql("create table x1(name string,age int) stored as parquet") > scala> spark.sql("insert into x1 select 'a',29") > check logs > 2018-08-19 12:45:36 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 > (TID 0) in 874 ms on localhost (executor > driver) (1/1) > 2018-08-19 12:45:36 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose > tasks have all completed, from pool > 2018-08-19 12:45:36 INFO DAGScheduler:54 - ResultStage 0 (sql at > :24) finished in 1.131 s > 2018-08-19 12:45:36 INFO DAGScheduler:54 - Job 0 finished: sql at > :24, took 1.233329 s > 2018-08-19 12:45:36 INFO FileFormatWriter:54 - Job > {color:#d04437}null{color} committed. > 2018-08-19 12:45:36 INFO FileFormatWriter:54 - Finished processing stats for > job null. > res4: org.apache.spark.sql.DataFrame = [] > > !image-2018-09-25-12-01-31-871.png! > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25533) Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2
[ https://issues.apache.org/jira/browse/SPARK-25533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25533. Resolution: Fixed Fix Version/s: 2.4.0 2.3.3 > Inconsistent message for Completed Jobs in the JobUI, when there are failed > jobs, compared to spark2.2 > --- > > Key: SPARK-25533 > URL: https://issues.apache.org/jira/browse/SPARK-25533 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: shahid >Assignee: shahid >Priority: Major > Fix For: 2.3.3, 2.4.0 > > Attachments: Screenshot from 2018-09-26 00-42-00.png, Screenshot from > 2018-09-26 00-46-35.png > > > Test steps: > 1) bin/spark-shell > {code:java} > sc.parallelize(1 to 5, 5).collect() > sc.parallelize(1 to 5, 2).map{ x => throw new RuntimeException("Fail > Job")}.collect() > {code} > *Output in spark - 2.3.1:* > !Screenshot from 2018-09-26 00-42-00.png! > *Output in spark - 2.2.1:* > !Screenshot from 2018-09-26 00-46-35.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page
[ https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630781#comment-16630781 ] Apache Spark commented on SPARK-25392: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/22571 > [Spark Job History]Inconsistent behaviour for pool details in spark web UI > and history server page > --- > > Key: SPARK-25392 > URL: https://issues.apache.org/jira/browse/SPARK-25392 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Steps: > 1.Enable spark.scheduler.mode = FAIR > 2.Submitted beeline jobs > create database JH; > use JH; > create table one12( id int ); > insert into one12 values(12); > insert into one12 values(13); > Select * from one12; > 3.Click on JDBC Incompleted Application ID in Job History Page > 4. Go to Job Tab in staged Web UI page > 5. Click on run at AccessController.java:0 under Desription column > 6 . Click default under Pool Name column of Completed Stages table > URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default > 7. It throws below error > HTTP ERROR 400 > Problem accessing /history/application_1536399199015_0006/stages/pool/. > Reason: > Unknown pool: default > Powered by Jetty:// x.y.z > But under > Yarn resource page it display the summary under Fair Scheduler Pool: default > URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default > Summary > Pool Name Minimum Share Pool Weight Active Stages Running Tasks > SchedulingMode > default 0 1 0 0 FIFO -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page
[ https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25392: Assignee: Apache Spark > [Spark Job History]Inconsistent behaviour for pool details in spark web UI > and history server page > --- > > Key: SPARK-25392 > URL: https://issues.apache.org/jira/browse/SPARK-25392 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Apache Spark >Priority: Minor > > Steps: > 1.Enable spark.scheduler.mode = FAIR > 2.Submitted beeline jobs > create database JH; > use JH; > create table one12( id int ); > insert into one12 values(12); > insert into one12 values(13); > Select * from one12; > 3.Click on JDBC Incompleted Application ID in Job History Page > 4. Go to Job Tab in staged Web UI page > 5. Click on run at AccessController.java:0 under Desription column > 6 . Click default under Pool Name column of Completed Stages table > URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default > 7. It throws below error > HTTP ERROR 400 > Problem accessing /history/application_1536399199015_0006/stages/pool/. > Reason: > Unknown pool: default > Powered by Jetty:// x.y.z > But under > Yarn resource page it display the summary under Fair Scheduler Pool: default > URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default > Summary > Pool Name Minimum Share Pool Weight Active Stages Running Tasks > SchedulingMode > default 0 1 0 0 FIFO -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page
[ https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630779#comment-16630779 ] Apache Spark commented on SPARK-25392: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/22571 > [Spark Job History]Inconsistent behaviour for pool details in spark web UI > and history server page > --- > > Key: SPARK-25392 > URL: https://issues.apache.org/jira/browse/SPARK-25392 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Steps: > 1.Enable spark.scheduler.mode = FAIR > 2.Submitted beeline jobs > create database JH; > use JH; > create table one12( id int ); > insert into one12 values(12); > insert into one12 values(13); > Select * from one12; > 3.Click on JDBC Incompleted Application ID in Job History Page > 4. Go to Job Tab in staged Web UI page > 5. Click on run at AccessController.java:0 under Desription column > 6 . Click default under Pool Name column of Completed Stages table > URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default > 7. It throws below error > HTTP ERROR 400 > Problem accessing /history/application_1536399199015_0006/stages/pool/. > Reason: > Unknown pool: default > Powered by Jetty:// x.y.z > But under > Yarn resource page it display the summary under Fair Scheduler Pool: default > URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default > Summary > Pool Name Minimum Share Pool Weight Active Stages Running Tasks > SchedulingMode > default 0 1 0 0 FIFO -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page
[ https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25392: Assignee: (was: Apache Spark) > [Spark Job History]Inconsistent behaviour for pool details in spark web UI > and history server page > --- > > Key: SPARK-25392 > URL: https://issues.apache.org/jira/browse/SPARK-25392 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Steps: > 1.Enable spark.scheduler.mode = FAIR > 2.Submitted beeline jobs > create database JH; > use JH; > create table one12( id int ); > insert into one12 values(12); > insert into one12 values(13); > Select * from one12; > 3.Click on JDBC Incompleted Application ID in Job History Page > 4. Go to Job Tab in staged Web UI page > 5. Click on run at AccessController.java:0 under Desription column > 6 . Click default under Pool Name column of Completed Stages table > URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default > 7. It throws below error > HTTP ERROR 400 > Problem accessing /history/application_1536399199015_0006/stages/pool/. > Reason: > Unknown pool: default > Powered by Jetty:// x.y.z > But under > Yarn resource page it display the summary under Fair Scheduler Pool: default > URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default > Summary > Pool Name Minimum Share Pool Weight Active Stages Running Tasks > SchedulingMode > default 0 1 0 0 FIFO -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25454: Assignee: Apache Spark > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Major > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consist in a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer where decimals > with negative scales are not allowed. We might also consider enforcing this > too in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25454: Assignee: (was: Apache Spark) > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consist in a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer where decimals > with negative scales are not allowed. We might also consider enforcing this > too in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
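A sketch for surfacing the scenario described, using the divisor literal named in the description; no particular output is asserted here, the point is to inspect the inferred result type and value:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// BigDecimal(100e6) is the literal called out in the description as carrying a negative scale.
val df = Seq(Tuple1(BigDecimal("1"))).toDF("x")
val divided = df.select(($"x" / lit(BigDecimal(100e6))).as("quotient"))

divided.printSchema()            // inspect the DecimalType(precision, scale) chosen for the result
divided.show(truncate = false)   // compare against the exact value, 0.00000001
{code}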
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630697#comment-16630697 ] Arun Mahadevan commented on SPARK-10816: Reiterating an earlier point: "map/flatMapGroupsWithState" is GBK plus mapping the grouped values to state on the reduce side. The proposed native session window implements sessions as a GBK operation (similar to tumbling and sliding windows) and makes it easy for users to express gap-based sessions. IMO, we could consider refactoring windowing operations into generic "assignWindows" and "mergeWindows" constructs. (https://issues.apache.org/jira/browse/SPARK-2) > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
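To make the contrast concrete, a hedged sketch of both shapes: the existing flatMapGroupsWithState route (with the session-merging body elided) and the proposed declarative session-window grouping, which was a hypothetical API shape at the time of this comment and is therefore shown commented out:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Event(userId: String, eventTime: java.sql.Timestamp)
case class SessionState(start: Long, end: Long)
case class SessionCount(userId: String, count: Long)

val events = spark.readStream.format("rate").load()
  .selectExpr("cast(value as string) as userId", "timestamp as eventTime")
  .as[Event]

// Existing route: group by key, then hand-roll gap-based sessions with per-key state and an
// event-time timeout. The actual session assignment/merging logic is elided in this sketch.
val viaState = events
  .withWatermark("eventTime", "10 minutes")
  .groupByKey(_.userId)
  .flatMapGroupsWithState[SessionState, SessionCount](
    OutputMode.Append, GroupStateTimeout.EventTimeTimeout) {
      (user: String, rows: Iterator[Event], state: GroupState[SessionState]) =>
        Iterator.empty  // merge events into sessions, emit closed sessions on timeout, etc.
    }

// Proposed route (hypothetical API shape from the attached design doc): sessions as a
// first-class, GBK-style grouping expression next to tumbling and sliding windows.
// val viaWindow = events
//   .withWatermark("eventTime", "10 minutes")
//   .groupBy(session_window($"eventTime", "5 minutes"), $"userId")
//   .count()
{code}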
[jira] [Assigned] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25546: - Assignee: Marcelo Vanzin > RDDInfo uses SparkEnv before it may have been initialized > - > > Key: SPARK-25546 > URL: https://issues.apache.org/jira/browse/SPARK-25546 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > This code: > {code} > private[spark] object RDDInfo { > private val callsiteLongForm = > SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM) > {code} > Has two problems: > - it keeps that value across different SparkEnv instances. So e.g. if you > have two tests that rely on different values for that config, one of them > will break. > - it assumes tests always initialize a SparkEnv. e.g. if you run > "core/testOnly *.AppStatusListenerSuite", it will fail because > {{SparkEnv.get}} returns null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25546. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22558 [https://github.com/apache/spark/pull/22558] > RDDInfo uses SparkEnv before it may have been initialized > - > > Key: SPARK-25546 > URL: https://issues.apache.org/jira/browse/SPARK-25546 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > This code: > {code} > private[spark] object RDDInfo { > private val callsiteLongForm = > SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM) > {code} > Has two problems: > - it keeps that value across different SparkEnv instances. So e.g. if you > have two tests that rely on different values for that config, one of them > will break. > - it assumes tests always initialize a SparkEnv. e.g. if you run > "core/testOnly *.AppStatusListenerSuite", it will fail because > {{SparkEnv.get}} returns null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
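For illustration only, one way to address both problems quoted in the SPARK-25546 description is to stop caching the value in a val and to fall back to the config's default when no SparkEnv exists. This is a hypothetical patch sketch against Spark internals, not necessarily the change merged in pull request 22558.
{code:scala}
// Hypothetical sketch (not the merged fix): read the config on demand so it tracks the
// current SparkEnv, and tolerate test suites that never initialize a SparkEnv at all.
package org.apache.spark.storage

import org.apache.spark.SparkEnv
import org.apache.spark.internal.config._

private[spark] object RDDInfo {
  private def callsiteLongForm: Boolean =
    Option(SparkEnv.get)                                  // null when no SparkEnv exists
      .map(_.conf.get(EVENT_LOG_CALLSITE_LONG_FORM))
      .getOrElse(false)                                   // false mirrors the config's default
  // ... rest of the object unchanged ...
}
{code}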
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630689#comment-16630689 ] Eugeniu commented on SPARK-18112: - _"The problem here looks, you guys completely replaced the jars into higher Hive jars. Therefore, it throws {{NoSuchFieldError}}_" - yes you are right. That was my intent. I wanted to be able to connect to metastore database created by a Hive client 2.x. If I use that 1.2.1 fork I was getting some query errors due to me using bloom filters on multiple columns of the table. My understanding is that Hive client 1.2.1 is not seeing that information that is why I was trying to replace the jars for a higher version. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25555) Generic constructs for windowing and support for custom windows
Arun Mahadevan created SPARK-25555: -- Summary: Generic constructs for windowing and support for custom windows Key: SPARK-25555 URL: https://issues.apache.org/jira/browse/SPARK-25555 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Arun Mahadevan Refactor the windowing logic with generic "assignWindows" and "mergeWindows" constructs. The existing windows (tumbling and sliding) and generic session windows can be built on top of this. It could be extended to support different types of custom windowing. K,Values -> AssignWindows (produces [k, v, timestamp, window]) -> GroupByKey (shuffle) -> MergeWindows (optional step) -> GroupWindows -> aggregate values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
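To make the AssignWindows / MergeWindows pipeline above concrete, here is a small, self-contained Scala toy of the two constructs. The names and signatures are illustrative only and are not the API proposed in the ticket.
{code:scala}
// Toy sketch of the assign/merge abstraction described in SPARK-25555.
case class Window(start: Long, end: Long)

trait WindowFn {
  def assignWindows(timestamp: Long): Seq[Window]               // AssignWindows step
  def mergeWindows(windows: Seq[Window]): Seq[Window] = windows // optional MergeWindows step
}

// Tumbling windows: every event lands in exactly one fixed bucket; no merging needed.
class TumblingWindows(sizeMs: Long) extends WindowFn {
  def assignWindows(ts: Long): Seq[Window] = {
    val start = ts - (ts % sizeMs)
    Seq(Window(start, start + sizeMs))
  }
}

// Gap-based session windows: each event opens a candidate window of length gapMs,
// and overlapping candidates for the same key are merged after the shuffle.
class SessionWindows(gapMs: Long) extends WindowFn {
  def assignWindows(ts: Long): Seq[Window] = Seq(Window(ts, ts + gapMs))
  override def mergeWindows(windows: Seq[Window]): Seq[Window] =
    windows.sortBy(_.start).foldLeft(List.empty[Window]) {
      case (merged :: rest, w) if w.start <= merged.end =>
        Window(merged.start, math.max(merged.end, w.end)) :: rest
      case (acc, w) => w :: acc
    }.reverse
}

object WindowFnDemo extends App {
  val session = new SessionWindows(gapMs = 30000)
  val assigned = Seq(1000L, 20000L, 90000L).flatMap(session.assignWindows)
  // Two sessions remain after merging: [1000, 50000] and [90000, 120000].
  println(session.mergeWindows(assigned))
}
{code}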
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630672#comment-16630672 ] Mingjie Tang commented on SPARK-25501: -- [~gsomogyi] Thanks so much for your hard work. Let me look at your SPIP and give some comments. > Kafka delegation token support > -- > > Key: SPARK-25501 > URL: https://issues.apache.org/jira/browse/SPARK-25501 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > > Delegation token support was released in Kafka version 1.1. As Spark has updated > its Kafka client to 2.0.0, it is now possible to implement delegation token > support. Please see the description: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25415: - Fix Version/s: (was: 3.0.0) 2.5.0 > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Major > Fix For: 2.5.0 > > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
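Assuming the two configurations land as proposed in SPARK-25415, usage could look roughly like the sketch below. The rule name is just one example of a Catalyst optimizer rule, and the exact conf names and accepted values are whatever the eventual patch settles on.
{code:scala}
import org.apache.spark.sql.SparkSession

object PlanChangeLogDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical usage of the proposed confs: log plan changes at WARN, but only for
    // one optimizer rule, without touching log4j settings.
    val spark = SparkSession.builder()
      .appName("plan-change-log-demo")
      .master("local[*]")
      .config("spark.sql.optimizer.planChangeLog.level", "WARN")
      .config("spark.sql.optimizer.planChangeLog.rules",
        "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
      .getOrCreate()

    // Any query that exercises the rule; plan-change messages would show up in the driver log.
    spark.range(100).filter("id > 10").selectExpr("id * 2 AS doubled").explain(true)

    spark.stop()
  }
}
{code}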
[jira] [Created] (SPARK-25554) Avro logical types get ignored in SchemaConverters.toSqlType
Yanan Li created SPARK-25554: Summary: Avro logical types get ignored in SchemaConverters.toSqlType Key: SPARK-25554 URL: https://issues.apache.org/jira/browse/SPARK-25554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: Below are the Maven dependencies: {code:java} org.apache.avro:avro:1.8.2, com.databricks:spark-avro_2.11:4.0.0, org.apache.spark:spark-core_2.11:2.3.0, org.apache.spark:spark-sql_2.11:2.3.0 {code} Reporter: Yanan Li Given an Avro schema defined as follows: {code:java} { "namespace": "com.xxx.avro", "name": "Book", "type": "record", "fields": [{ "name": "name", "type": ["null", "string"], "default": null }, { "name": "author", "type": ["null", "string"], "default": null }, { "name": "published_date", "type": ["null", {"type": "int", "logicalType": "date"}], "default": null } ] } {code} In the Spark schema converted from the above Avro schema, the logical type "date" gets ignored: {code:java} StructType(StructField(name,StringType,true),StructField(author,StringType,true),StructField(published_date,IntegerType,true)) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
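A hedged reproduction sketch of the report above, assuming the com.databricks:spark-avro_2.11:4.0.0 artifact from the environment and that SchemaConverters.toSqlType is callable as named in the ticket title. It only parses the schema from the report and prints the converted Spark type, where published_date reportedly surfaces as IntegerType rather than a date type.
{code:scala}
import org.apache.avro.Schema
import com.databricks.spark.avro.SchemaConverters

object AvroLogicalTypeCheck {
  def main(args: Array[String]): Unit = {
    // The schema from the report, verbatim.
    val avroSchema = new Schema.Parser().parse(
      """{
        |  "namespace": "com.xxx.avro",
        |  "name": "Book",
        |  "type": "record",
        |  "fields": [
        |    {"name": "name", "type": ["null", "string"], "default": null},
        |    {"name": "author", "type": ["null", "string"], "default": null},
        |    {"name": "published_date",
        |     "type": ["null", {"type": "int", "logicalType": "date"}], "default": null}
        |  ]
        |}""".stripMargin)

    // Reported behavior: the "date" logical type is dropped, so published_date maps to
    // IntegerType instead of a date type.
    val converted = SchemaConverters.toSqlType(avroSchema)
    println(converted.dataType)
  }
}
{code}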
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630543#comment-16630543 ] Hyukjin Kwon commented on SPARK-18112: -- Also, take a look for the codes, open JIRAs and PRs before complaining next time. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630542#comment-16630542 ] Hyukjin Kwon commented on SPARK-18112: -- To be more specific, the code you guys pointed out is executed by Spark's Hive fork 1.2.1 which contains that configuration (https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1290) That's meant to be executed with Spark's Hive fork. So you should leave the jar as is. And then, the higher jars for Hive to create Hive client should be provided to {{spark.sql.hive.metastore.jars}} and {{spark.sql.hive.metastore.version}} should be set accordingly. The problem here looks, you guys completely replaced the jars into higher Hive jars. Therefore, it throws {{NoSuchFieldError}} I recently manually tested 1.2.1, 2.3.0 and 3.0.0 (against https://github.com/apache/spark/pull/21404) in few months ago against Apache Spark. I am pretty sure that it works for now. If I am mistaken or misunderstood at some points, please provide a reproducible step, or at least why it fails. Let me take a look. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
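To make the configuration described in the comment above concrete, here is a sketch of how the isolated Hive client is usually pointed at newer jars while Spark's built-in Hive 1.2.1 fork stays in place. The metastore version and jar path are placeholders for whatever matches the actual Hive deployment.
{code:scala}
import org.apache.spark.sql.SparkSession

object Hive2MetastoreSession {
  def main(args: Array[String]): Unit = {
    // Leave Spark's bundled Hive fork alone and let Spark build an isolated client for the
    // external metastore from the jars below. Version and path are illustrative.
    val spark = SparkSession.builder()
      .appName("hive-2x-metastore")
      .config("spark.sql.hive.metastore.version", "2.1.1")
      .config("spark.sql.hive.metastore.jars", "/opt/hive-2.1.1/lib/*")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()
  }
}
{code}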
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630485#comment-16630485 ] Hyukjin Kwon commented on SPARK-18112: -- At that code you pointed out, Spark's Hive fork should be used. Different jar provided in {{spark.sql.hive.metastore.jars}} is used to create the correspending client to access different version of Hive via isolated classloader. The problem here is, you guys removed hive-1.2.1 in the jars and didn't provide newer Hive jars in {{spark.sql.hive.metastore.jars}} properly. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630487#comment-16630487 ] Hyukjin Kwon commented on SPARK-18112: -- I asked questions because i'm pretty sure it's misconfiguration. That's why I am asking reproducible steps. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25553: -- Priority: Minor (was: Major) > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630472#comment-16630472 ] Tavis Barr commented on SPARK-18112: Why are you even asking these questions? I have already pointed to the offending lines of code in /src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala that are causing this error and explained why the error is happening (see my comments from April 23rd). All you have to do is remove those two parameters and push. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25523) Multi thread execute sparkSession.read().jdbc(url, table, properties) problem
[ https://issues.apache.org/jira/browse/SPARK-25523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630421#comment-16630421 ] huanghuai commented on SPARK-25523: --- Sorry, I know this question is hard to understand, but my project is right on windows 10 , and when a http request coming, it will create a sparkSession and run a spark application, I am also helpless. I will try my best to solve this problem , If it's not solved in a few days later, I will close this issue. I know ,this problem is hard to you to reproduce. > Multi thread execute sparkSession.read().jdbc(url, table, properties) problem > - > > Key: SPARK-25523 > URL: https://issues.apache.org/jira/browse/SPARK-25523 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: h3. [IntelliJ > _IDEA_|http://www.baidu.com/link?url=7ZLtsOfyqR1YxLqcTU0Q-hqXWV_PsY6IzIzZoKhiXZZ4AcLrpQ4DoTG30yIN-Gs8] > > local mode > >Reporter: huanghuai >Priority: Major > > public static void test2() throws Exception{ > String ckUrlPrefix="jdbc:clickhouse://"; > String quote = "`"; > JdbcDialects.registerDialect(new JdbcDialect() { > @Override > public boolean canHandle(String url) > { return url.startsWith(ckUrlPrefix); } > @Override > public String quoteIdentifier(String colName) > { return quote + colName + quote; } > }); > SparkSession spark = initSpark(); > String ckUrl = "jdbc:clickhouse://192.168.2.148:8123/default"; > Properties ckProp = new Properties(); > ckProp.put("user", "default"); > ckProp.put("password", ""); > String prestoUrl = "jdbc:presto://192.168.2.148:9002/mysql-xxx/xxx"; > Properties prestoUrlProp = new Properties(); > prestoUrlProp.put("user", "root"); > prestoUrlProp.put("password", ""); > // new Thread(()->{ > // spark.read() > // .jdbc(ckUrl, "ontime", ckProp).show(); > // }).start(); > System.out.println("--"); > new Thread(()->{ > spark.read() > .jdbc(prestoUrl, "tx_user", prestoUrlProp).show(); > }).start(); > System.out.println("--"); > new Thread(()->{ > Dataset load = spark.read() > .format("com.vertica.spark.datasource.DefaultSource") > .option("host", "192.168.1.102") > .option("port", 5433) > .option("user", "dbadmin") > .option("password", "manager") > .option("db", "test") > .option("dbschema", "public") > .option("table", "customers") > .load(); > load.printSchema(); > load.show(); > }).start(); > System.out.println("--"); > } > public static SparkSession initSpark() throws Exception > { return SparkSession.builder() .master("spark://dsjkfb1:7077") > //spark://dsjkfb1:7077 .appName("Test") .config("spark.executor.instances",3) > .config("spark.executor.cores",2) .config("spark.cores.max",6) > //.config("spark.default.parallelism",1) > .config("spark.submit.deployMode","client") > .config("spark.driver.memory","2G") .config("spark.executor.memory","3G") > .config("spark.driver.maxResultSize", "2G") .config("spark.local.dir", > "d:\\tmp") .config("spark.driver.host", "192.168.2.148") > .config("spark.scheduler.mode", "FAIR") .config("spark.jars", > "F:\\project\\xxx\\vertica-jdbc-7.0.1-0.jar," + > "F:\\project\\xxx\\clickhouse-jdbc-0.1.40.jar," + > "F:\\project\\xxx\\vertica-spark-connector-9.1-2.1.jar," + > "F:\\project\\xxx\\presto-jdbc-0.189-mining.jar") .getOrCreate(); } > > > {color:#ff}* The above is code > --*{color} > {color:#ff}*question: If i open vertica jdbc , thread will pending > forever.*{color} > {color:#ff}*And driver loging like this:*{color} > > 2018-09-26 10:32:51 INFO SharedState:54 - Setting > 
hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir > ('file:/C:/Users/admin/Desktop/test-project/sparktest/spark-warehouse/'). > 2018-09-26 10:32:51 INFO SharedState:54 - Warehouse path is > 'file:/C:/Users/admin/Desktop/test-project/sparktest/spark-warehouse/'. > 2018-09-26 10:32:51 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@2f70d6e2\{/SQL,null,AVAILABLE,@Spark} > 2018-09-26 10:32:51 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@1d66833d\{/SQL/json,null,AVAILABLE,@Spark} > 2018-09-26 10:32:51 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@65af6f3a\{/SQL/execution,null,AVAILABLE,@Spark} > 2018-09-26 10:32:51 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@55012968\{/SQL/execution/json,null,AVAILABLE,@Spark} > 2018-09-26 10:32:51 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@59e3f5aa\{/static/sql,null,AVAILABLE,@Spark} > 2018
[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API
[ https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630410#comment-16630410 ] Felix Cheung commented on SPARK-21291: -- Wait. I don’t think saveAsTable is the same thing? > R bucketBy partitionBy API > -- > > Key: SPARK-21291 > URL: https://issues.apache.org/jira/browse/SPARK-21291 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > Fix For: 2.5.0 > > > partitionBy exists but it's for windowspec only -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25553: Assignee: (was: Apache Spark) > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630401#comment-16630401 ] Apache Spark commented on SPARK-25553: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22570 > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25553: Assignee: Apache Spark > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630400#comment-16630400 ] Apache Spark commented on SPARK-25553: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22570 > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21743) top-most limit should not cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21743. - Resolution: Fixed Fix Version/s: 2.4.0 > top-most limit should not cause memory leak > --- > > Key: SPARK-21743 > URL: https://issues.apache.org/jira/browse/SPARK-21743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21436. - Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22010 [https://github.com/apache/spark/pull/22010] > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 2.5.0 > > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
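A toy illustration of the idea in the SPARK-21436 description (not the code merged in pull request 22010): when a pair RDD is already hash-partitioned by key, equal elements necessarily sit in the same partition, so duplicates can be dropped per partition without another shuffle. The helper name is made up, and the in-memory Set is a simplification of whatever spill-safe structure a real implementation would use.
{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerAwareDistinct {
  // If the RDD carries a partitioner, equal elements are co-located, so per-partition
  // de-duplication is enough; otherwise fall back to the usual shuffling distinct().
  def distinctUsingPartitioner[T: ClassTag](rdd: RDD[T]): RDD[T] = rdd.partitioner match {
    case Some(_) =>
      rdd.mapPartitions(iter => iter.toSet.iterator, preservesPartitioning = true)
    case None =>
      rdd.distinct()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitioner-aware-distinct").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(1 -> "a", 1 -> "a", 2 -> "b", 2 -> "b"))
      .partitionBy(new HashPartitioner(4))
    distinctUsingPartitioner(pairs).collect().foreach(println)
    sc.stop()
  }
}
{code}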
[jira] [Assigned] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21436: --- Assignee: holdenk > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Assignee: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
Yuming Wang created SPARK-25553: --- Summary: Add EmptyInterpolatedStringChecker to scalastyle-config.xml Key: SPARK-25553 URL: https://issues.apache.org/jira/browse/SPARK-25553 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.5.0 Reporter: Yuming Wang h4. Justification Empty interpolated strings are harder to read and not necessary. More details: http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
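For anyone unfamiliar with the scalastyle rule, a tiny example of what the checker would flag; the variable names are arbitrary.
{code:scala}
object InterpolationExamples {
  // Flagged by EmptyInterpolatedStringChecker: an s-interpolator with no placeholders.
  val flagged = s"loading configuration"

  // The equivalent plain literal, which the rule prefers.
  val plain = "loading configuration"

  // Interpolation with a real placeholder is fine.
  val path = "/etc/spark"
  val withArg = s"loading configuration from $path"
}
{code}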
[jira] [Updated] (SPARK-25522) Improve type promotion for input arguments of elementAt function
[ https://issues.apache.org/jira/browse/SPARK-25522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25522: Fix Version/s: (was: 2.5.0) 2.4.0 > Improve type promotion for input arguments of elementAt function > - > > Key: SPARK-25522 > URL: https://issues.apache.org/jira/browse/SPARK-25522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.4.0 > > > In ElementAt, when the first argument is MapType, we should coerce the key type > and the second argument based on findTightestCommonType. This is not > happening currently. > Also, when the first argument is ArrayType, the second argument should be an > integer type or a smaller integral type that can be safely cast to an > integer type. Currently we may do an unsafe cast. > {code:java} > spark-sql> select element_at(array(1,2), 1.24); > 1{code} > {code:java} > spark-sql> select element_at(map(1,"one", 2, "two"), 2.2); > two{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
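As a hedged aside, until the coercion is tightened, writing the index or key with an explicit cast sidesteps the surprising implicit casts shown above. The snippet is illustrative only and assumes a Spark version where element_at exists (2.4.0 per the ticket).
{code:scala}
import org.apache.spark.sql.SparkSession

object ElementAtExplicitTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("element-at-explicit-types")
      .master("local[*]")
      .getOrCreate()

    // Make the second argument's type explicit instead of relying on implicit coercion.
    spark.sql("SELECT element_at(array(1, 2), CAST(1 AS INT)) AS first_elem").show()
    spark.sql("SELECT element_at(map(1, 'one', 2, 'two'), CAST(2 AS INT)) AS second_val").show()

    spark.stop()
  }
}
{code}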
[jira] [Resolved] (SPARK-25551) Remove unused InSubquery expression
[ https://issues.apache.org/jira/browse/SPARK-25551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25551. - Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22556 [https://github.com/apache/spark/pull/22556] > Remove unused InSubquery expression > --- > > Key: SPARK-25551 > URL: https://issues.apache.org/jira/browse/SPARK-25551 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.5.0 > > > SPARK-16958 introduced a {{InSubquery}} expression. Its only usage was > removed in SPARK-18874. Hence now it is not used anymore and it can be > removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25551) Remove unused InSubquery expression
[ https://issues.apache.org/jira/browse/SPARK-25551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25551: --- Assignee: Marco Gaido > Remove unused InSubquery expression > --- > > Key: SPARK-25551 > URL: https://issues.apache.org/jira/browse/SPARK-25551 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.5.0 > > > SPARK-16958 introduced a {{InSubquery}} expression. Its only usage was > removed in SPARK-18874. Hence now it is not used anymore and it can be > removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org