[jira] [Created] (SPARK-7993) Improve DataFrame.show() output
Reynold Xin created SPARK-7993: -- Summary: Improve DataFrame.show() output Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-7799: Summary: Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext (was: Move StreamingContext.actorStream to a separate project) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext -- Key: SPARK-7799 URL: https://issues.apache.org/jira/browse/SPARK-7799 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Shixiong Zhu Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7966) add Spreading Activation algorithm to GraphX
[ https://issues.apache.org/jira/browse/SPARK-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566967#comment-14566967 ] Apache Spark commented on SPARK-7966: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6549 add Spreading Activation algorithm to GraphX Key: SPARK-7966 URL: https://issues.apache.org/jira/browse/SPARK-7966 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Tarek Auel Priority: Minor I'm wondering if you would like to add the Spreading Activation algorithm to GraphX. I have implemented it using the Pregel API and would love to share it with the community. Spreading activation is an algorithm that was invented to search in associative networks. The basic idea is that you have one (or multiple) starting nodes. The activation spreads out from these nodes to the neighbours and the neighbours of the neighbours. The activation decreases after every hop. Nodes that were reached by many activations will have a higher total activation level. Spreading Activation is useful for many use cases. Imagine you have the social network of two people. If you apply spreading activation to this social graph with the two people as starting nodes, you will get the nodes that are most important to both. Some resources: http://www.websci11.org/fileadmin/websci/posters/105_paper.pdf https://webfiles.uci.edu/eloftus/CollinsLoftus_PsychReview_75.pdf?uniq=20ou4w -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7966) add Spreading Activation algorithm to GraphX
[ https://issues.apache.org/jira/browse/SPARK-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7966: --- Assignee: (was: Apache Spark) add Spreading Activation algorithm to GraphX Key: SPARK-7966 URL: https://issues.apache.org/jira/browse/SPARK-7966 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Tarek Auel Priority: Minor I'm wondering if you would like to add the Spreading Activation algorithm to GraphX. I have implemented it using the Pregel API and would love to share it with the community. Spreading activation is an algorithm that was invented to search in associative networks. The basic idea is that you have one (or multiple) starting nodes. The activation spreads out from these nodes to the neighbours and the neighbours of the neighbours. The activation decreases after every hop. Nodes that were reached by many activations will have a higher total activation level. Spreading Activation is useful for many use cases. Imagine you have the social network of two people. If you apply spreading activation to this social graph with the two people as starting nodes, you will get the nodes that are most important to both. Some resources: http://www.websci11.org/fileadmin/websci/posters/105_paper.pdf https://webfiles.uci.edu/eloftus/CollinsLoftus_PsychReview_75.pdf?uniq=20ou4w -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7966) add Spreading Activation algorithm to GraphX
[ https://issues.apache.org/jira/browse/SPARK-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7966: --- Assignee: Apache Spark add Spreading Activation algorithm to GraphX Key: SPARK-7966 URL: https://issues.apache.org/jira/browse/SPARK-7966 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Tarek Auel Assignee: Apache Spark Priority: Minor I'm wondering if you would like to add the Spreading Activation algorithm to GraphX. I have implemented it using the Pregel API and would love to share it with the community. Spreading activation is an algorithm that was invented to search in associative networks. The basic idea is that you have one (or multiple) starting nodes. The activation spreads out from these nodes to the neighbours and the neighbours of the neighbours. The activation decreases after every hop. Nodes that were reached by many activations will have a higher total activation level. Spreading Activation is useful for many use cases. Imagine you have the social network of two people. If you apply spreading activation to this social graph with the two people as starting nodes, you will get the nodes that are most important to both. Some resources: http://www.websci11.org/fileadmin/websci/posters/105_paper.pdf https://webfiles.uci.edu/eloftus/CollinsLoftus_PsychReview_75.pdf?uniq=20ou4w -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7994) Remove StreamingContext.actorStream
Shixiong Zhu created SPARK-7994: --- Summary: Remove StreamingContext.actorStream Key: SPARK-7994 URL: https://issues.apache.org/jira/browse/SPARK-7994 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-7799: Target Version/s: 1.5.0 (was: 1.6.0) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext -- Key: SPARK-7799 URL: https://issues.apache.org/jira/browse/SPARK-7799 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Shixiong Zhu Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7995) Move AkkaRpcEnv to a separate project and remove Akka from the dependencies of Core
Shixiong Zhu created SPARK-7995: --- Summary: Move AkkaRpcEnv to a separate project and remove Akka from the dependencies of Core Key: SPARK-7995 URL: https://issues.apache.org/jira/browse/SPARK-7995 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7996) Deprecate the developer api SparkEnv.actorSystem
Shixiong Zhu created SPARK-7996: --- Summary: Deprecate the developer api SparkEnv.actorSystem Key: SPARK-7996 URL: https://issues.apache.org/jira/browse/SPARK-7996 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7997) Remove the developer api SparkEnv.actorSystem
Shixiong Zhu created SPARK-7997: --- Summary: Remove the developer api SparkEnv.actorSystem Key: SPARK-7997 URL: https://issues.apache.org/jira/browse/SPARK-7997 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7998) A better frequent item API
Reynold Xin created SPARK-7998: -- Summary: A better frequent item API Key: SPARK-7998 URL: https://issues.apache.org/jira/browse/SPARK-7998 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The current freqItems API is really awkward to use. It returns a DataFrame with a single row, in which each value is an array of frequent items. This design doesn't work well for exploratory data analysis (running show() -- when there are more than 2 or 3 frequent values, the values get cut off): {code} In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show() +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} It also doesn't work well for serious engineering, since it is hard to get the values out. We should just create a new function (so we maintain source/binary compatibility) that returns a list of lists of values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7998) A better frequent item API
[ https://issues.apache.org/jira/browse/SPARK-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7998: --- Description: The current freqItems API is really awkward to use. It returns a DataFrame with a single row, in which each value is an array of frequent items. This design doesn't work well for exploratory data analysis (running show() -- when there are more than 2 or 3 frequent values, the values get cut off): {code} In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show() +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} It also doesn't work well for serious engineering, since it is hard to get the values out. We should create a new function (so we maintain source/binary compatibility) that returns a list of lists of values. was: The current freqItems API is really awkward to use. It returns a DataFrame with a single row, in which each value is an array of frequent items. This design doesn't work well for exploratory data analysis (running show() -- when there are more than 2 or 3 frequent values, the values get cut off): {code} In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show() +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} It also doesn't work well for serious engineering, since it is hard to get the values out. We should just create a new function (so we maintain source/binary compatibility) that returns a list of lists of values. A better frequent item API -- Key: SPARK-7998 URL: https://issues.apache.org/jira/browse/SPARK-7998 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The current freqItems API is really awkward to use. It returns a DataFrame with a single row, in which each value is an array of frequent items. This design doesn't work well for exploratory data analysis (running show() -- when there are more than 2 or 3 frequent values, the values get cut off): {code} In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show() +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} It also doesn't work well for serious engineering, since it is hard to get the values out. We should create a new function (so we maintain source/binary compatibility) that returns a list of lists of values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
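For illustration, a hedged sketch (the helper name and return shape are assumed, not part of the proposal) of how the frequent items could be surfaced as plain Scala collections on top of the existing API:
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: collects the single-row freqItems result and exposes it
// as a Map from column name to the sequence of frequent values for that column.
def freqItemsAsMap(df: DataFrame, cols: Seq[String], support: Double): Map[String, Seq[Any]] = {
  val row = df.stat.freqItems(cols.toArray, support).collect().head
  cols.zipWithIndex.map { case (c, i) => c -> row.getAs[Seq[Any]](i) }.toMap
}

// e.g. freqItemsAsMap(df, Seq("a", "b", "c"), 0.4) might return
// Map("a" -> Seq(11, 1), "b" -> Seq(2, 22), "c" -> Seq(1, 3))
{code}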
[jira] [Updated] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7993: --- Labels: starter (was: ) Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows <- add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7993: --- Description: 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows <- add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} was: 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ {code} Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows <- add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
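As a rough illustration of item 1, a minimal sketch (helper name assumed, not the actual show implementation) of the minimum-width rule:
{code}
// Hypothetical helper: a column is padded to at least `minWidth` characters,
// even if its widest cell (including the header) is narrower.
def colWidth(header: String, cells: Seq[String], minWidth: Int = 3): Int =
  math.max(minWidth, (header +: cells).map(_.length).max)

// e.g. colWidth("c", Seq("3", "1")) == 3, so the column prints as "|  c|" instead of "|c|"
{code}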
[jira] [Created] (SPARK-7999) Graph complement
Tarek Auel created SPARK-7999: - Summary: Graph complement Key: SPARK-7999 URL: https://issues.apache.org/jira/browse/SPARK-7999 Project: Spark Issue Type: Improvement Reporter: Tarek Auel Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7999) Graph complement
[ https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarek Auel updated SPARK-7999: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-7893 Graph complement Key: SPARK-7999 URL: https://issues.apache.org/jira/browse/SPARK-7999 Project: Spark Issue Type: Sub-task Reporter: Tarek Auel Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7999) Graph complement
[ https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarek Auel updated SPARK-7999: -- Description: This task is for implementing the complement operation (compare to parent task). http://techieme.in/complex-graph-operations/ Graph complement Key: SPARK-7999 URL: https://issues.apache.org/jira/browse/SPARK-7999 Project: Spark Issue Type: Sub-task Reporter: Tarek Auel Priority: Minor This task is for implementing the complement operation (compare to parent task). http://techieme.in/complex-graph-operations/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7999) Graph complement
[ https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566999#comment-14566999 ] Tarek Auel commented on SPARK-7999: --- I would propose def complement(attr: ED): Graph[VD, ED] as interface Graph complement Key: SPARK-7999 URL: https://issues.apache.org/jira/browse/SPARK-7999 Project: Spark Issue Type: Sub-task Reporter: Tarek Auel Priority: Minor This task is for implementing the complement operation (compare to parent task). http://techieme.in/complex-graph-operations/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7999) Graph complement
[ https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566999#comment-14566999 ] Tarek Auel edited comment on SPARK-7999 at 6/1/15 7:04 AM: --- I would propose def complement(attr: ED, selfLoops: Boolean = false): Graph[VD, ED] as interface. The self-loop parameter defines whether self loops (A--A) should be created or not. was (Author: tarekauel): I would propose def complement(attr: ED): Graph[VD, ED] as interface Graph complement Key: SPARK-7999 URL: https://issues.apache.org/jira/browse/SPARK-7999 Project: Spark Issue Type: Sub-task Reporter: Tarek Auel Priority: Minor This task is for implementing the complement operation (compare to parent task). http://techieme.in/complex-graph-operations/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
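For reference, a hedged, naive sketch of the proposed operation (a standalone helper with assumed semantics, not the eventual GraphX implementation; it materialises the full cartesian product of vertex ids, so it is only illustrative for small graphs):
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper: every missing edge gets the supplied attribute `attr`;
// self loops (A--A) are only generated when selfLoops is true.
def complement[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED], attr: ED, selfLoops: Boolean = false): Graph[VD, ED] = {
  val ids = graph.vertices.keys
  val existing = graph.edges.map(e => (e.srcId, e.dstId))
  val candidates = ids.cartesian(ids).filter { case (src, dst) => selfLoops || src != dst }
  val newEdges = candidates.subtract(existing).map { case (src, dst) => Edge(src, dst, attr) }
  Graph(graph.vertices, newEdges)
}
{code}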
[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567207#comment-14567207 ] Apache Spark commented on SPARK-7980: - User 'animeshbaranawal' has created a pull request for this issue: https://github.com/apache/spark/pull/6553 Support SQLContext.range(end) - Key: SPARK-7980 URL: https://issues.apache.org/jira/browse/SPARK-7980 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin SQLContext.range should also allow only specifying the end position, similar to Python's own range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
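A minimal sketch of what the requested overload might look like (assuming the existing two-argument range(start, end) on SQLContext; not necessarily the signature the PR uses):
{code}
// Hypothetical addition to SQLContext (method body sketch): only the exclusive
// end is given, mirroring Python's range(end), and it delegates with start = 0.
def range(end: Long): DataFrame = range(0, end)

// Intended usage, e.g.:
//   sqlContext.range(3).show()   // rows with id = 0, 1, 2
{code}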
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567221#comment-14567221 ] Sean Owen commented on SPARK-8008: -- Isn't this what connection pooling is for? Is that an option? sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4782: --- Assignee: (was: Apache Spark) Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to JSON String first and use JsonRDD.inferSchema on it. It's very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for Schemaless data as adding Map like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new serialization format we want to support, we just need to add a Map interface wrapper to it* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4782: --- Assignee: Apache Spark Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Assignee: Apache Spark Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to JSON String first and use JsonRDD.inferSchema on it. It's very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for Schemaless data as adding Map like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new serialization format we want to support, we just need to add a Map interface wrapper to it* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8011) DecimalType is not a datatype
Bipin Roshan Nag created SPARK-8011: --- Summary: DecimalType is not a datatype Key: SPARK-8011 URL: https://issues.apache.org/jira/browse/SPARK-8011 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.3.1 Reporter: Bipin Roshan Nag When I run the following in spark-shell: StructType(StructField("ID",IntegerType,true), StructField("Value",DecimalType,true)) I get <console>:50: error: type mismatch; found : org.apache.spark.sql.types.DecimalType.type required: org.apache.spark.sql.types.DataType StructType(StructField("ID",IntegerType,true), StructField("Value",DecimalType,true)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
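For context, a hedged illustration of why the shell rejects the snippet: DecimalType there refers to the companion object rather than a DataType instance, so a concrete decimal type has to be constructed (Spark 1.3.x API assumed):
{code}
import org.apache.spark.sql.types._

// A parameterised decimal (or DecimalType.Unlimited in 1.3.x) is a DataType,
// while the bare DecimalType companion object is not.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("Value", DecimalType(10, 2), nullable = true)))
{code}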
[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567267#comment-14567267 ] Steven W edited comment on SPARK-5389 at 6/1/15 12:56 PM: -- I started seeing this when I installed JDK 6 on top of JDK 8. I re-installed JDK 8 and it worked after that. So, I think "else was unexpected at this time." just shows up anytime Java can't run. (Spark 1.3.1, Java 6u45) was (Author: sjwoodard): I started seeing this when I installed JDK 6 on top of JDK 8. I re-installed JDK 8 and it worked after that. So, I think "else was unexpected at this time." just shows up anytime Java can't run. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in a DOS prompt on Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. Marking as trivial since calling spark-shell2.cmd also works fine. Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4> bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567230#comment-14567230 ] Rene Treffer commented on SPARK-8008: - At the moment each partition uses its own connection as far as I can tell; I have to double-check how this works on a cluster where multiple servers might fetch data. I'm currently loading year+month wise, due to the DB schema (index on actual days, locality based on year/month). I don't think larger batches would be a solution: 3 months may require 160 million rows, and I don't think batching that into one partition is a good idea. sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7890) Document that Spark 2.11 now supports Kafka
[ https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567167#comment-14567167 ] Iulian Dragos commented on SPARK-7890: -- [~srowen] thanks for fixing it, and sorry for being unresponsive. I've been traveling a few days without a good internet connection. Document that Spark 2.11 now supports Kafka --- Key: SPARK-7890 URL: https://issues.apache.org/jira/browse/SPARK-7890 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical Fix For: 1.4.1, 1.5.0 The building-spark.html page needs to be updated. It's a simple fix, just remove the caveat about Kafka. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567279#comment-14567279 ] Jianshi Huang commented on SPARK-4782: -- Thanks Luca for the clever fix! I also noticed that the schema inference in JsonRDD is too JSON-specific, as JSON's datatypes are quite limited. Jianshi Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to JSON String first and use JsonRDD.inferSchema on it. It's very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for Schemaless data as adding Map like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new serialization format we want to support, we just need to add a Map interface wrapper to it* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567254#comment-14567254 ] Luca Rosellini commented on SPARK-4782: --- Hi Jianshi, I've just hit the same problem as you, it seems quite inefficient to have to serialize to JSON when you already have a {{Map\[String,Any\]}}. I've opened a PR in github that adds this feature in a generic way, check it out at: [https://github.com/apache/spark/pull/6554]. Hopefully it will be merged in master. The patch extends {{inferSchema}} functionality to any RDD of type T for which you can provide a function mapping from {{RDD\[T\]}} to {{RDD\[Map\[String,Any\]\]}}. In your case, you already have an {{RDD\[Map\[String,Any\]\]}}, so you can simply pass the identity function, something like this: {{JsonRDD.inferSchema(json, 1.0, conf.columnNameOfCorruptRecord, \{ (a:RDD\[Map\[String,Any\]\],b:String) = a \}))}} Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to JSON String first and use JsonRDD.inferSchema on it. It's very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for Schemaless data as adding Map like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new serialization format we want to support, we just need to add a Map interface wrapper to it* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567187#comment-14567187 ] Rick Moritz commented on SPARK-6816: One current drawback of SparkR's configuration options is the inability to set driver VM options. These are crucial when attempting to run SparkR on a Hortonworks HDP, as both the driver and the application master need to be aware of the hdp.version variable in order to resolve the classpath. While it is possible to pass this variable to the executors, there's no way to pass this option to the driver, except for the following exploit/work-around: the SPARK_MEM variable can be abused to pass the required parameters to the driver's VM by using string concatenation. Setting the variable to (e.g.) 512m -Dhdp.version=NNN appends the -D option to the -X option which is currently read from this environment variable. Adding a secondary variable to the system environment which gets parsed for JVM options would be far more obvious and less hacky, as would adding a separate environment list for the driver, extending what's currently available for executors. I'm adding this as a comment to this issue, since I believe it is sufficiently closely related not to warrant a separate issue. Add SparkConf API to configure SparkR - Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567219#comment-14567219 ] Michael Armbrust commented on SPARK-8008: - I'm okay adding documentation about this behavior where ever you think it would help, but I would say this is by design. I'd suggest that if you want lower concurrency use fewer partitions to extract the data and then {{repartition}} if you need higher concurrency for subsequent operations. sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
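For illustration, a hedged sketch of the workaround described above (the connection URL, table name, bounds and partition counts are made up; the jdbc overload is the Spark 1.3-era SQLContext API):
{code}
// Extract with few partitions so only a few concurrent DB connections are
// opened, then repartition for downstream parallelism.
val rows = sqlContext.jdbc(
  "jdbc:mysql://dbhost/db", "events",
  "id", 0L, 10000000L, 4)            // 4 partitions => at most 4 concurrent connections
val parallel = rows.repartition(64)  // more parallelism for subsequent operations
{code}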
[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567237#comment-14567237 ] Apache Spark commented on SPARK-4782: - User 'lucarosellini' has created a pull request for this issue: https://github.com/apache/spark/pull/6554 Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to JSON String first and use JsonRDD.inferSchema on it. It's very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for Schemaless data as adding Map like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new serialization format we want to support, we just need to add a Map interface wrapper to it* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567227#comment-14567227 ] Michael Armbrust commented on SPARK-8008: - I think connection pooling is used primarily to avoid the overhead of making a new connection for each operation. In the case of extracting large amounts of data, the user may actually want multiple concurrent connections from the same machine. sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567016#comment-14567016 ] Saisai Shao commented on SPARK-4352: Hi [~sandyr], thanks a lot for your suggestion. IIUC, the algorithm you describe tries to make the executor requests proportional to the node preferences: say the desired tasks on the cluster are 3 : 3 : 2 : 1, so you try to allocate the executors following that ratio. But I'm curious about the algorithm in the 7-executor and 18-executor situations. What you describe is: requests for 5 executors with nodes = a, b, c, d; requests for 2 executors with nodes = a, b, c; that is 7 : 7 : 7 : 5. Would it be better like this: requests for 2 executors with nodes = a, b, c, d; requests for 2 executors with nodes = a, b, c; requests for 3 executors with nodes = a, b; here it is 7 : 7 : 4 : 2. Also for the 18-executor situation, why not: requests for 6 executors with nodes = a, b, c, d; requests for 6 executors with nodes = a, b, c; requests for 6 executors with nodes = a, b? Would you please help to explain it? Maybe I missed something :). Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567020#comment-14567020 ] Reynold Xin commented on SPARK-7993: Thanks. Note that once you change the show output, you might need to update some Python unit tests since some of the functions use show's output. Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows <- add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567234#comment-14567234 ] Michael Armbrust commented on SPARK-8008: - What is the problem with large partitions (as long as you aren't caching them, where there is a 2GB limit)? sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567240#comment-14567240 ] Rene Treffer commented on SPARK-8008: - I've seen very poor performance when streaming it as one partition for example (WHERE 1=1). I'll retry with different partition counts. But I still think there should be a warning about the behavior, as I didn't naturally understand that partition count == parallelism in this case (although it's logical after some thinking). sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8011) DecimalType is not a datatype
[ https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bipin Roshan Nag updated SPARK-8011: Description: When I run the following in spark-shell : StructType(StructField(ID,IntegerType,true), StructField(Value,DecimalType,true)) I get console:50: error: type mismatch; found : org.apache.spark.sql.types.DecimalType.type required: org.apache.spark.sql.types.DataType StructType(StructField(ID,IntegerType,true), StructField(Value,DecimalType,true)) was: When I run the following in spark-shell : StructType(StructField(ID,IntegerType,true), StructField(Value,DecimalType,true)) I get console:50: error: type mismatch; found : org.apache.spark.sql.types.DecimalType.type required: org.apache.spark.sql.types.DataType StructType(StructField(ID,IntegerType,true), StructField(Value,DecimalType,true)) DecimalType is not a datatype - Key: SPARK-8011 URL: https://issues.apache.org/jira/browse/SPARK-8011 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.3.1 Reporter: Bipin Roshan Nag When I run the following in spark-shell : StructType(StructField(ID,IntegerType,true), StructField(Value,DecimalType,true)) I get console:50: error: type mismatch; found : org.apache.spark.sql.types.DecimalType.type required: org.apache.spark.sql.types.DataType StructType(StructField(ID,IntegerType,true), StructField(Value,DecimalType,true)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567267#comment-14567267 ] Steven W commented on SPARK-5389: - I started seeing this when I installed JDK 6 on top of JDK 8. I re-installed JDK 8 and it worked after that. So, I think "else was unexpected at this time." just shows up anytime Java can't run. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in a DOS prompt on Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. Marking as trivial since calling spark-shell2.cmd also works fine. Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4> bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8001) Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
Shixiong Zhu created SPARK-8001: --- Summary: Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout Key: SPARK-8001 URL: https://issues.apache.org/jira/browse/SPARK-8001 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor TimeoutException is a more explicit failure. In addition, the caller may forget to call {{assert}} to check the return value of {{AsynchronousListenerBus.waitUntilEmpty}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8005) Support INPUT__FILE__NAME virtual column
Reynold Xin created SPARK-8005: -- Summary: Support INPUT__FILE__NAME virtual column Key: SPARK-8005 URL: https://issues.apache.org/jira/browse/SPARK-8005 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin INPUT__FILE__NAME: input file name. One way to do this is to do it through a thread local variable in the SqlNewHadoopRDD.scala, and read that thread local variable in an expression. (similar to SparkPartitionID expression) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
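A minimal sketch of the thread-local approach described above (the object and method names are assumed, not Spark internals):
{code}
// The per-partition reader publishes the current input file before emitting
// rows; the INPUT__FILE__NAME expression would read it back on the same thread.
object InputFileName {
  private val current = new ThreadLocal[String] {
    override def initialValue(): String = ""
  }
  def set(file: String): Unit = current.set(file)  // called by SqlNewHadoopRDD per split
  def get(): String = current.get()                // read when the expression is evaluated
}
{code}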
[jira] [Created] (SPARK-8006) Support BLOCK__OFFSET__INSIDE__FILE virtual column
Reynold Xin created SPARK-8006: -- Summary: Support BLOCK__OFFSET__INSIDE__FILE virtual column Key: SPARK-8006 URL: https://issues.apache.org/jira/browse/SPARK-8006 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin See Hive's semantics: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567135#comment-14567135 ] Akhil Thatipamula commented on SPARK-7993: -- [~rxin] Does the 3rd modification affect 'List' as well? For instance, ++ |modules| ++ |List(mllib, sql, ...| ++ should it be ++ | modules| ++ | [mllib, sql, ...| ++ ? Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at least 3 characters wide. Right now, if the widest value is 1 character, the column is just 1 character wide, which looks ugly. Example below. 2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows". {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows <- add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7980: --- Assignee: (was: Apache Spark) Support SQLContext.range(end) - Key: SPARK-7980 URL: https://issues.apache.org/jira/browse/SPARK-7980 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin SQLContext.range should also allow only specifying the end position, similar to Python's own range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7246) Rank for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567023#comment-14567023 ] Reynold Xin commented on SPARK-7246: This is done now with window functions right? Rank for DataFrames --- Key: SPARK-7246 URL: https://issues.apache.org/jira/browse/SPARK-7246 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Xiangrui Meng `rank` maps a numeric column to a long column with rankings. `rank` should be an expression. Where it lives is TBD. One suggestion is `funcs.stat`. {code} df.select(name, rank(time)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8002) Support virtual columns in SQL and DataFrames
Reynold Xin created SPARK-8002: -- Summary: Support virtual columns in SQL and DataFrames Key: SPARK-8002 URL: https://issues.apache.org/jira/browse/SPARK-8002 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8003) Support SPARK__PARTITION__ID in SQL
[ https://issues.apache.org/jira/browse/SPARK-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8003: --- Description: SPARK__PARTITION__ID column should return the partition index of the Spark partition. Note that we already have a DataFrame function for it: https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705 was: PARTITION__ID column should return the partition index of the Spark partition. Note that we already have a DataFrame function for it: https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705 Support SPARK__PARTITION__ID in SQL --- Key: SPARK-8003 URL: https://issues.apache.org/jira/browse/SPARK-8003 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin SPARK__PARTITION__ID column should return the partition index of the Spark partition. Note that we already have a DataFrame function for it: https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8007: --- Description: Create the infrastructure so we can resolve df(SPARK__PARTITION__ID) to SparkPartitionID expression. A cool use case is to understand physical data skew: {code} df.groupBy(SPARK__PARTITION__ID).count() {code} was:Create the infrastructure so we can resolve df(SPARK_PARTITION__ID) to SparkPartitionID expression. Support resolving virtual columns in DataFrames --- Key: SPARK-8007 URL: https://issues.apache.org/jira/browse/SPARK-8007 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Create the infrastructure so we can resolve df(SPARK__PARTITION__ID) to SparkPartitionID expression. A cool use case is to understand physical data skew: {code} df.groupBy(SPARK__PARTITION__ID).count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
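Until the virtual column resolves, the existing DataFrame function referenced in SPARK-8003 can already be used for the skew check described above; a minimal sketch, with the DataFrame name and column alias purely illustrative:
{code}
import org.apache.spark.sql.functions.sparkPartitionId

// Count rows per physical Spark partition to spot data skew.
df.groupBy(sparkPartitionId().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()
{code}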
[jira] [Created] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
Rene Treffer created SPARK-8008: --- Summary: sqlContext.jdbc can kill your database due to high concurrency Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
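As a stop-gap until a concurrency limit exists, a hedged sketch of two ways to keep the number of simultaneous JDBC reads down; the URL, table, bounds, and partition counts are illustrative, and the jdbc call is the partitioned variant of the 1.3/1.4 SQLContext API:
{code}
// (a) Ask for fewer partitions in the first place.
val dfFew = sqlContext.jdbc("jdbc:mysql://db-host/db", "clicks", "id", 0L, 1000000L, 16)

// (b) Keep many partitions for downstream work, but coalesce (no shuffle) so each
//     task reads several JDBC partitions sequentially and only ~16 connections
//     are open at any one time.
val dfMany = sqlContext
  .jdbc("jdbc:mysql://db-host/db", "clicks", "id", 0L, 1000000L, 2048)
  .coalesce(16)
{code}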
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567017#comment-14567017 ] Akhil Thatipamula commented on SPARK-7993: -- [~rxin] I will work on this. Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at the minimum 3 characters wide. Right now if the widest value is 1, it is just 1 char wide, which looks ugly. Example below: 2. If a DataFrame have more than N number of rows (N = 20 by default for show), at the end we should display a message like only showing the top 20 rows. {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8009) [Mesos] Allow provisioning of executor logging configuration
Gerard Maas created SPARK-8009: -- Summary: [Mesos] Allow provisioning of executor logging configuration Key: SPARK-8009 URL: https://issues.apache.org/jira/browse/SPARK-8009 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 1.3.1 Environment: Mesos executor Reporter: Gerard Maas It's currently not possible to provide a custom logging configuration for the Mesos executors. Upon startup of the executor JVM, it loads a default config file from the Spark assembly, visible by this line in stderr: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties That line comes from Logging.scala [1] where a default config is loaded if none is found in the classpath upon the startup of the Spark Mesos executor in the Mesos sandbox. At that point in time, none of the application-specific resources have been shipped yet, as the executor JVM is just starting up. To load a custom configuration file we should have it already on the sandbox before the executor JVM starts and add it to the classpath on the startup command. For the classpath customization, It looks like it should be possible to pass a -Dlog4j.configuration property by using the 'spark.executor.extraClassPath' that will be picked up at [2] and that should be added to the command that starts the executor JVM, but the resource must be already on the host before we can do that. Therefore we need some means of 'shipping' the log4j.configuration file to the allocated executor. This all boils down to the need of shipping extra files to the sandbox. There's a workaround: open up the Spark assembly, replace the log4j-default.properties and pack it up again. That would work, although kind of rudimentary as people may use the same assembly for many jobs. Probably, accessing the log4j API programmatically should also work (we didn't try that yet) [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Logging.scala#L128 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L77 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
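For concreteness, a hedged sketch of the classpath-based approach outlined above, expressed as driver-side configuration; it assumes the custom log4j.properties has already been placed at the given sandbox path (the shipping step is exactly what this issue requests), and all paths are illustrative:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Directory on the executor host that will contain the custom log4j.properties.
  .set("spark.executor.extraClassPath", "/var/lib/mesos/sandbox/conf")
  // Point log4j explicitly at the custom file when the executor JVM starts.
  .set("spark.executor.extraJavaOptions",
    "-Dlog4j.configuration=file:/var/lib/mesos/sandbox/conf/log4j.properties")
{code}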
[jira] [Commented] (SPARK-7798) Move AkkaRpcEnv to a separate project
[ https://issues.apache.org/jira/browse/SPARK-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567096#comment-14567096 ] Sean Owen commented on SPARK-7798: -- What do you mean by separate project? I don't think this warrants its own module. Can this please be combined with the other move, deprecate, and remove JIRAs? We don't need three of them. Move AkkaRpcEnv to a separate project --- Key: SPARK-7798 URL: https://issues.apache.org/jira/browse/SPARK-7798 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567105#comment-14567105 ] Apache Spark commented on SPARK-8010: - User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/6551 Implict promote Numeric type to String type in HiveTypeCoercion --- Key: SPARK-8010 URL: https://issues.apache.org/jira/browse/SPARK-8010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.3.1 Original Estimate: 48h Remaining Estimate: 48h 1. Given a query `select coalesce(null, 1, '1') from dual` will cause exception: java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType 2. Given a query: `select case when true then 1 else '1' end from dual` will cause exception: java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType I checked the code, the main cause is the HiveTypeCoercion doesn't do implicit convert when there is a IntegerType and StringType. Numeric types can be promoted to string type in case throw exceptions. Since Hive will always do this. It need to be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
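Until the coercion rule is fixed, an explicit cast avoids both exceptions from the description; a hedged workaround sketch (the table name `dual` is taken from the issue text):
{code}
// Cast the numeric literal to string so no implicit Int -> String coercion is needed.
sqlContext.sql("SELECT coalesce(null, CAST(1 AS STRING), '1') FROM dual")
sqlContext.sql("SELECT CASE WHEN true THEN CAST(1 AS STRING) ELSE '1' END FROM dual")
{code}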
[jira] [Assigned] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8010: --- Assignee: (was: Apache Spark) Implict promote Numeric type to String type in HiveTypeCoercion --- Key: SPARK-8010 URL: https://issues.apache.org/jira/browse/SPARK-8010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.3.1 Original Estimate: 48h Remaining Estimate: 48h 1. Given a query `select coalesce(null, 1, '1') from dual` will cause exception: java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType 2. Given a query: `select case when true then 1 else '1' end from dual` will cause exception: java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType I checked the code, the main cause is the HiveTypeCoercion doesn't do implicit convert when there is a IntegerType and StringType. Numeric types can be promoted to string type in case throw exceptions. Since Hive will always do this. It need to be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567135#comment-14567135 ] Akhil Thatipamula edited comment on SPARK-7993 at 6/1/15 10:25 AM: --- [~rxin] Does the 3rd modification effect 'List' as well. For instance, ++ |modules| ++ |List(mllib, sql, ...| ++ should it be? ++ | modules| ++ | [mllib, sql, ...| ++ was (Author: 6133d): [~rxin] Does the 3rd modification effect 'List' as well. For instance, ++ |modules| ++ |List(mllib, sql, ...| ++ should it be ++ | modules| ++ | [mllib, sql, ...| ++ ? Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at the minimum 3 characters wide. Right now if the widest value is 1, it is just 1 char wide, which looks ugly. Example below: 2. If a DataFrame have more than N number of rows (N = 20 by default for show), at the end we should display a message like only showing the top 20 rows. {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567135#comment-14567135 ] Akhil Thatipamula edited comment on SPARK-7993 at 6/1/15 10:26 AM: --- [~rxin] Does the 3rd modification effect 'List' as well. For instance, |List(mllib, sql, ...| should it be? | [mllib, sql, ...| was (Author: 6133d): [~rxin] Does the 3rd modification effect 'List' as well. For instance, ++ |modules| ++ |List(mllib, sql, ...| ++ should it be? ++ | modules| ++ | [mllib, sql, ...| ++ Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at the minimum 3 characters wide. Right now if the widest value is 1, it is just 1 char wide, which looks ugly. Example below: 2. If a DataFrame have more than N number of rows (N = 20 by default for show), at the end we should display a message like only showing the top 20 rows. {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5302) Add support for SQLContext partition columns
[ https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567061#comment-14567061 ] Reynold Xin commented on SPARK-5302: [~btiernay] is this resolved now that SPARK-5182 is resolved? Add support for SQLContext partition columns -- Key: SPARK-5302 URL: https://issues.apache.org/jira/browse/SPARK-5302 Project: Spark Issue Type: New Feature Components: SQL Reporter: Bob Tiernay For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} where {{dt}} is a column of type {{TEXT}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions would be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
Rene Treffer created SPARK-8004: --- Summary: Spark does not enclose column names when fetching from jdbc sources Key: SPARK-8004 URL: https://issues.apache.org/jira/browse/SPARK-8004 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark fails to load tables that have a keyword as a column name. Sample error: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]' {code} A correct query would have been {code} SELECT `key`, `value` FROM {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8003) Support PARTITION__ID in SQL
Reynold Xin created SPARK-8003: -- Summary: Support PARTITION__ID in SQL Key: SPARK-8003 URL: https://issues.apache.org/jira/browse/SPARK-8003 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin PARTITION__ID column should return the partition index of the Spark partition. Note that we already have a DataFrame function for it: https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8010: --- Assignee: Apache Spark Implict promote Numeric type to String type in HiveTypeCoercion --- Key: SPARK-8010 URL: https://issues.apache.org/jira/browse/SPARK-8010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Assignee: Apache Spark Fix For: 1.3.1 Original Estimate: 48h Remaining Estimate: 48h 1. Given a query `select coalesce(null, 1, '1') from dual` will cause exception: java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType 2. Given a query: `select case when true then 1 else '1' end from dual` will cause exception: java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType I checked the code, the main cause is the HiveTypeCoercion doesn't do implicit convert when there is a IntegerType and StringType. Numeric types can be promoted to string type in case throw exceptions. Since Hive will always do this. It need to be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567142#comment-14567142 ] Apache Spark commented on SPARK-7980: - User 'animeshbaranawal' has created a pull request for this issue: https://github.com/apache/spark/pull/6552 Support SQLContext.range(end) - Key: SPARK-7980 URL: https://issues.apache.org/jira/browse/SPARK-7980 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin SQLContext.range should also allow only specifying the end position, similar to Python's own range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7980: --- Assignee: Apache Spark Support SQLContext.range(end) - Key: SPARK-7980 URL: https://issues.apache.org/jira/browse/SPARK-7980 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark SQLContext.range should also allow only specifying the end position, similar to Python's own range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567040#comment-14567040 ] Akhil Thatipamula commented on SPARK-7993: -- Thanks for mentioning, I will take of care of that. Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at the minimum 3 characters wide. Right now if the widest value is 1, it is just 1 char wide, which looks ugly. Example below: 2. If a DataFrame have more than N number of rows (N = 20 by default for show), at the end we should display a message like only showing the top 20 rows. {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8007) Support resolving virtual columns in DataFrames
Reynold Xin created SPARK-8007: -- Summary: Support resolving virtual columns in DataFrames Key: SPARK-8007 URL: https://issues.apache.org/jira/browse/SPARK-8007 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Create the infrastructure so we can resolve df(SPARK_PARTITION__ID) to SparkPartitionID expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7798) Move AkkaRpcEnv to a separate project
[ https://issues.apache.org/jira/browse/SPARK-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7798. -- Resolution: Duplicate Target Version/s: (was: 1.6.0) You've got some duplication here. I think this is a lot of noise for this one task: lots of JIRAs. Can this not just be a couple of steps? Move AkkaRpcEnv to a separate project --- Key: SPARK-7798 URL: https://issues.apache.org/jira/browse/SPARK-7798 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7798) Move AkkaRpcEnv to a separate project
[ https://issues.apache.org/jira/browse/SPARK-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567103#comment-14567103 ] Shixiong Zhu commented on SPARK-7798: - I want to propose one move and deprecate JIRA for 1.5, and one remove JIRA for 1.6. Thank you for pointing out this duplicate JIRA. Move AkkaRpcEnv to a separate project --- Key: SPARK-7798 URL: https://issues.apache.org/jira/browse/SPARK-7798 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion
Li Sheng created SPARK-8010: --- Summary: Implict promote Numeric type to String type in HiveTypeCoercion Key: SPARK-8010 URL: https://issues.apache.org/jira/browse/SPARK-8010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.3.1 1. The query `select coalesce(null, 1, '1') from dual` will cause an exception: java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType 2. The query `select case when true then 1 else '1' end from dual` will cause an exception: java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType I checked the code; the main cause is that HiveTypeCoercion does not do an implicit conversion when an IntegerType and a StringType meet. Numeric types can be promoted to string type in cases that would otherwise throw exceptions, since Hive always does this. It needs to be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8001) Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
[ https://issues.apache.org/jira/browse/SPARK-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8001: --- Assignee: (was: Apache Spark) Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout - Key: SPARK-8001 URL: https://issues.apache.org/jira/browse/SPARK-8001 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor TimeoutException is a more explicit failure. In addition, the caller may forget to call {{assert}} to check the return value of {{AsynchronousListenerBus.waitUntilEmpty}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8003) Support SPARK__PARTITION__ID in SQL
[ https://issues.apache.org/jira/browse/SPARK-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8003: --- Summary: Support SPARK__PARTITION__ID in SQL (was: Support PARTITION__ID in SQL) Support SPARK__PARTITION__ID in SQL --- Key: SPARK-8003 URL: https://issues.apache.org/jira/browse/SPARK-8003 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin PARTITION__ID column should return the partition index of the Spark partition. Note that we already have a DataFrame function for it: https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7893. -- Resolution: Duplicate I'd prefer to close this kind of overview JIRA, as it doesn't seem to contain enough to tie together sub-JIRAs. They're all graph operations, yes, but aren't part of a larger piece of work. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and return a new graph. In many complex case,such as _*streaming graph, small graph merge into huge graph*_, higher level operators of graphs can help users to focus and think in graph. Performance optimization can be done internally and be transparent to them. Complex graph operator list is here:[complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567036#comment-14567036 ] Reynold Xin commented on SPARK-7993: Please cc me on your pull request (my github id is @rxin) Improve DataFrame.show() output --- Key: SPARK-7993 URL: https://issues.apache.org/jira/browse/SPARK-7993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker Labels: starter 1. Each column should be at the minimum 3 characters wide. Right now if the widest value is 1, it is just 1 char wide, which looks ugly. Example below: 2. If a DataFrame have more than N number of rows (N = 20 by default for show), at the end we should display a message like only showing the top 20 rows. {code} +--+--+-+ | a| b|c| +--+--+-+ | 1| 2|3| | 1| 2|1| | 1| 2|3| | 3| 6|3| | 1| 2|3| | 5|10|1| | 1| 2|3| | 7|14|3| | 1| 2|3| | 9|18|1| | 1| 2|3| |11|22|3| | 1| 2|3| |13|26|1| | 1| 2|3| |15|30|3| | 1| 2|3| |17|34|1| | 1| 2|3| |19|38|3| +--+--+-+ only showing top 20 rows add this at the end {code} 3. For array values, instead of printing ArrayBuffer, we should just print square brackets: {code} +--+--+-+ | a_freqItems| b_freqItems| c_freqItems| +--+--+-+ |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)| +--+--+-+ {code} should be {code} +---+---+---+ |a_freqItems|b_freqItems|c_freqItems| +---+---+---+ |[11, 1]|[2, 22]| [1, 3]| +---+---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7999) Graph complement
[ https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567098#comment-14567098 ] Sean Owen commented on SPARK-7999: -- So I'm not sure it's clear the parent issue is even something that would be accepted, as it's a big umbrella JIRA. I would start by reviewing https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark (the MLlib part applies) here to argue whether this should be included in GraphX on the mailing list, rather than start with a JIRA or PR. Graph complement Key: SPARK-7999 URL: https://issues.apache.org/jira/browse/SPARK-7999 Project: Spark Issue Type: Sub-task Reporter: Tarek Auel Priority: Minor This task is for implementing the complement operation (compare to parent task). http://techieme.in/complex-graph-operations/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8012) ArrayIndexOutOfBoundsException in SerializationDebugger
Jianshi Huang created SPARK-8012: Summary: ArrayIndexOutOfBoundsException in SerializationDebugger Key: SPARK-8012 URL: https://issues.apache.org/jira/browse/SPARK-8012 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Jianshi Huang It makes NonSerializable exception less obvious. {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107) at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682) at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:40) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:159) at org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:131) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.sources.DataSourceStrategy$.buildPartitionedTableScan(DataSourceStrategy.scala:131) at org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:80) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:109) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at
[jira] [Commented] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD
[ https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567478#comment-14567478 ] Erik Erlandson commented on SPARK-2315: --- The 'drop' RDD methods have been made available on the 'silex' project (beginning with release 0.0.6): https://github.com/willb/silex Documentation: http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.drop.DropRDDFunctions drop, dropRight and dropWhile which take RDD input and return RDD - Key: SPARK-2315 URL: https://issues.apache.org/jira/browse/SPARK-2315 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Erik Erlandson Assignee: Erik Erlandson Labels: features Last time I loaded in a text file, I found myself wanting to just skip the first element as it was a header. I wrote candidate methods drop, dropRight and dropWhile to satisfy this kind of need: val txt = sc.textFile(text_with_header.txt) val data = txt.drop(1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
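For readers without the external library, a hedged sketch of the usual header-skipping idiom using only core RDD operations; it assumes, as in the example above, that the header is the first record of the first partition:
{code}
val txt = sc.textFile("text_with_header.txt")

// Drop the first record of partition 0 only; this is Iterator.drop, not an RDD method.
val data = txt.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
{code}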
[jira] [Created] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder
Jianshi Huang created SPARK-8014: Summary: DataFrame.write.mode(error).save(...) should not scan the output folder Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Priority: Minor I had code that set the wrong output location, and it failed with strange errors: it scanned my ~/.Trash folder... It turned out that save will scan the output folder before mode(error) does the existence check. Scanning is unnecessary when mode = error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8013) Get JDBC server working with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567561#comment-14567561 ] Iulian Dragos commented on SPARK-8013: -- There's a Scala 2.11.7 milestone due in July, hopefully we can get a solution in by then. Get JDBC server working with Scala 2.11 --- Key: SPARK-8013 URL: https://issues.apache.org/jira/browse/SPARK-8013 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell Assignee: Iulian Dragos Priority: Critical It's worth some investigation here, but I believe the simplest solution is to see if we can get Scala to shade it's use of JLine to avoid JLine conflicts between Hive and the Spark repl. It's also possible that there is a simpler internal solution to the conflict (I haven't looked at it in a long time). So doing some investigation of that would be good. IIRC, there is use of Jline in our own repl code, in addition to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 build I couldn't harmonize all the versions in a nice way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567467#comment-14567467 ] Liang-Chi Hsieh commented on SPARK-8004: I think backticks only work for MySQL? Spark does not enclose column names when fetching from jdbc sources Key: SPARK-8004 URL: https://issues.apache.org/jira/browse/SPARK-8004 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark fails to load tables that have a keyword as a column name. Sample error: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]' {code} A correct query would have been {code} SELECT `key`, `value` FROM {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
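One portable option, sketched here under the assumption that the quoting happens where the JDBC SELECT is built: ask the driver for its identifier quote string instead of hard-coding backticks. The URL, table, and column names below are illustrative:
{code}
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:mysql://db-host/db")
// "`" for MySQL, "\"" for most ANSI-compliant databases.
val q = conn.getMetaData.getIdentifierQuoteString
val cols = Seq("key", "value").map(c => s"$q$c$q").mkString(", ")
val sql = s"SELECT $cols FROM clicks"
conn.close()
{code}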
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567468#comment-14567468 ] Mark Smiley commented on SPARK-5389: I have tried several settings for JAVA_HOME (C:\jdk1.8.0\bin, C:\jdk1.8.0\bin\, C:\jdk1.8.0, C:\jdk1.8.0\, even C:\jdk1.8.0\jre). None fixed the issue. I use Java a lot, and other apps (e.g., NetBeans) seem to have no issue with the JAVA_HOME setting. Note there are no spaces in the JAVA_HOME path. There is a space in the path to Scala, but that's the default installation path for Scala. Also verified the same issue on Windows 8.1. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567468#comment-14567468 ] Mark Smiley edited comment on SPARK-5389 at 6/1/15 3:54 PM: I have tried several settings for JAVA_HOME (C:\jdk1.8.0\bin, C:\jdk1.8.0\bin\, C:\jdk1.8.0, C:\jdk1.8.0\, even C:\jdk1.8.0\jre). None fixed the issue. I use Java a lot, and other apps (e.g., NetBeans) seem to have no issue with the JAVA_HOME setting. Note there are no spaces in the JAVA_HOME path. There is a space in the path to Scala, but that's the default installation path for Scala. There is no Java 6 on either of these systems. Also verified the same issue on Windows 8.1. was (Author: drfractal): I have tried several settings for JAVA_HOME (C:\jdk1.8.0\bin, C:\jdk1.8.0\bin\, C:\jdk1.8.0, C:\jdk1.8.0\, even C:\jdk1.8.0\jre). None fixed the issue. I use Java a lot, and other apps (e.g., NetBeans) seem to have no issue with the JAVA_HOME setting. Note there are no spaces in the JAVA_HOME path. There is a space in the path to Scala, but that's the default installation path for Scala. Also verified the same issue on Windows 8.1. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8013) Get JDBC server working with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8013: --- Target Version/s: 1.5.0 Get JDBC server working with Scala 2.11 --- Key: SPARK-8013 URL: https://issues.apache.org/jira/browse/SPARK-8013 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell Assignee: Iulian Dragos Priority: Critical It's worth some investigation here, but I believe the simplest solution is to see if we can get Scala to shade it's use of JLine to avoid JLine conflicts between Hive and the Spark repl. It's also possible that there is a simpler internal solution to the conflict (I haven't looked at it in a long time). So doing some investigation of that would be good. IIRC, there is use of Jline in our own repl code, in addition to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 build I couldn't harmonize all the versions in a nice way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8013) Get JDBC server working with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8013: --- Priority: Critical (was: Major) Get JDBC server working with Scala 2.11 --- Key: SPARK-8013 URL: https://issues.apache.org/jira/browse/SPARK-8013 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell Assignee: Iulian Dragos Priority: Critical It's worth some investigation here, but I believe the simplest solution is to see if we can get Scala to shade it's use of JLine to avoid JLine conflicts between Hive and the Spark repl. It's also possible that there is a simpler internal solution to the conflict (I haven't looked at it in a long time). So doing some investigation of that would be good. IIRC, there is use of Jline in our own repl code, in addition to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 build I couldn't harmonize all the versions in a nice way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8013) Get JDBC server working with Scala 2.11
Patrick Wendell created SPARK-8013: -- Summary: Get JDBC server working with Scala 2.11 Key: SPARK-8013 URL: https://issues.apache.org/jira/browse/SPARK-8013 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell Assignee: Iulian Dragos It's worth some investigation here, but I believe the simplest solution is to see if we can get Scala to shade its use of JLine to avoid JLine conflicts between Hive and the Spark repl. It's also possible that there is a simpler internal solution to the conflict (I haven't looked at it in a long time). So doing some investigation of that would be good. IIRC, there is use of JLine in our own repl code, in addition to Hive and the Scala 2.11 repl. Back when we created the 2.11 build I couldn't harmonize all the versions in a nice way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7857) IDF w/ minDocFreq on SparseVectors results in literal zeros
[ https://issues.apache.org/jira/browse/SPARK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567520#comment-14567520 ] Karl Higley commented on SPARK-7857: This is addressed by the addition of numNonZeros in SPARK-6756. IDF w/ minDocFreq on SparseVectors results in literal zeros --- Key: SPARK-7857 URL: https://issues.apache.org/jira/browse/SPARK-7857 Project: Spark Issue Type: Bug Components: MLlib Reporter: Karl Higley Priority: Minor When the IDF model's minDocFreq parameter is set to a non-zero threshold, the IDF for any feature below that threshold is set to zero. When the model is used to transform a set of SparseVectors containing that feature, the resulting SparseVectors contain entries whose values are zero. The zero entries should be omitted in order to simplify downstream processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7987) TransportContext.createServer(int port) is missing in Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567556#comment-14567556 ] Marcelo Vanzin commented on SPARK-7987: --- [~joshrosen] that annotation is nice but it cannot live in {{core/}} if this module is to use it. Actually, it would be really nice to have a new top-level module for these annotations and other very generic helper code (such as JavaUtils.java, which is used in more than the network module). TransportContext.createServer(int port) is missing in Spark 1.4 --- Key: SPARK-7987 URL: https://issues.apache.org/jira/browse/SPARK-7987 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.4.0 Reporter: Patrick Wendell Priority: Blocker From what I can tell the SPARK-6229 patch removed this API: https://github.com/apache/spark/commit/38d4e9e446b425ca6a8fe8d8080f387b08683842#diff-d9d4b8d8e82b7d96d5e779353e4b2f4eL85 I think adding it back should be easy enough, but I cannot figure out why this didn't trigger MIMA errors. I am wondering if MIMA was not enabled properly for some of the new modules: /cc [~vanzin] [~rxin] and [~adav] I put this as a blocker level issue because I'm wondering if we just aren't enforcing checks for some reason in some of our API's. So I think we need to block the 1.4 release on at least making sure no other serious API's were broken. If it turns out only this API was affected, or I'm just missing something, we can downgrade it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567345#comment-14567345 ] Steve Loughran commented on SPARK-4352: --- As usual, when YARN-1042 is done, life gets easier: the AM asks YARN for the anti-affine placement. If you look at how other YARN clients have implemented anti-affinity (TWILL-82), the blacklist is used to block off all nodes in use, with a request-at-a-time ramp-up to avoid 1 outstanding request being granted on the same node. As well as anti-affinity, life would be even better with dynamic container resize: if a single executor could expand/relax CPU capacity on demand, you'd only need one per node and then handle multiple tasks by running more work there. (This does nothing for RAM consumption though) now, for some other fun, # you may want to consider which surplus containers to release, both outstanding requests and actually granted. In particular, if you want to cancel 1 outstanding request, which to choose? Any of them? The newest? The oldest? The node with the worst reliability statistics? Killing the newest works if you assume that the older containers have generated more host-local data that you wish to reuse. # history may also be a factor in placement. If you are starting a session which continues/extends previous work, the previous location of the executors may be the first locality clue. Ask for containers on those nodes and there's a high likelihood that all the output data from the previous session will be stored locally on one of the nodes a container is assigned. # Testing. There aren't any, are there? It's possible to simulate some of the basic operations, you just need to isolate the code which examines the application state and generates container request/release events from the actual interaction with the RM. I've done this before with the request to allocate/cancel [generating a list of operations to be submitted or simulated|https://github.com/apache/incubator-slider/blob/develop/slider-core/src/main/java/org/apache/slider/server/appmaster/state/AppState.java#L1908]. When combined with a [mock YARN engine|https://github.com/apache/incubator-slider/tree/develop/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/mock], let us do things like [test historical placement logic|https://github.com/apache/incubator-slider/tree/develop/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/history] as well as whether to re-request containers on nodes where containers have just recently failed. While that mock stuff isn't that realistic, it can be used to test basic placement and failure handling logic. More succinctly: you can write tests for this stuff by splitting request generation from the API calls testing the request/release logic standalone Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. 
When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567389#comment-14567389 ] Deepak Kumar V commented on SPARK-4105: --- I see this issue when reading sequence file stored in Sequence File format (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v? ) All i do is sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new org.apache.spark.HashPartitioner(2053)) .set(spark.serializer, org.apache.spark.serializer.KryoSerializer) .set(spark.kryoserializer.buffer.mb, arguments.get(buffersize).get) .set(spark.kryoserializer.buffer.max.mb, arguments.get(maxbuffersize).get) .set(spark.driver.maxResultSize, arguments.get(maxResultSize).get) .set(spark.yarn.maxAppAttempts, 0) //.set(spark.akka.askTimeout, arguments.get(askTimeout).get) //.set(spark.akka.timeout, arguments.get(akkaTimeout).get) //.set(spark.worker.timeout, arguments.get(workerTimeout).get) .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum])) and values are buffersize=128 maxbuffersize=1068 maxResultSize=200G FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: JavaObjectToSerialize.java, SparkFailedToUncompressGenerator.scala We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at
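For what it's worth, the snippet quoted in the comment above only compiles if the {{.set}} calls go on a SparkConf rather than being chained onto the RDD; here is a sketch of that reading (the table path is a placeholder, the buffer values are the ones quoted above, and the Kryo-registered class is the reporter's own):
{code}
import org.apache.hadoop.io.Text
import org.apache.spark.SparkContext._
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Configuration goes on SparkConf; the sequence file read goes on the RDD.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "128")
  .set("spark.kryoserializer.buffer.max.mb", "1068")
  .set("spark.driver.maxResultSize", "200G")
  .set("spark.yarn.maxAppAttempts", "0")
  // the reporter additionally registers an application-specific class via
  // conf.registerKryoClasses(...)

val sc = new SparkContext(conf)
val dwTable = "hdfs:///path/to/table"   // placeholder path
val pairs = sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
  .partitionBy(new HashPartitioner(2053))
{code}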
[jira] [Created] (SPARK-8015) flume-sink should not depend on Guava.
Marcelo Vanzin created SPARK-8015: - Summary: flume-sink should not depend on Guava. Key: SPARK-8015 URL: https://issues.apache.org/jira/browse/SPARK-8015 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor The flume-sink module, due to the shared shading code in our build, ends up depending on the {{org.spark-project}} Guava classes. That means users who deploy the sink in Flume will also need to provide those classes somehow, generally by also adding the Spark assembly, which means adding a whole bunch of other libraries to Flume, which may or may not cause other unforeseen problems. It's better not to have that dependency in the flume-sink module at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder
[ https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8014: -- Description: When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do metadata discovery if the destination folder exists. To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then execute the following code in Spark shell: {code} import sqlContext._ import sqlContext.implicits._ Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo") {code} From the exception stack trace we can see that metadata discovery code path is executed: {noformat} java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) ... Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408) at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228) at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} was: When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do metadata discovery if the destination folder exists. To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then DataFrame.write.mode(error).save(...) should not scan the output folder - Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Priority: Minor When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do metadata discovery if the destination folder exists. 
To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then execute the following code in Spark shell: {code} import sqlContext._ import sqlContext.implicits._ Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo") {code} From the exception stack trace we can see that metadata discovery code path is executed: {noformat} java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) ... Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is
[jira] [Updated] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder
[ https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8014: -- Priority: Major (was: Minor) DataFrame.write.mode(error).save(...) should not scan the output folder - Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do metadata discovery if the destination folder exists. To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then execute the following code in Spark shell: {code} import sqlContext._ import sqlContext.implicits._ Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo") {code} From the exception stack trace we can see that metadata discovery code path is executed: {noformat} java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) ... Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408) at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228) at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
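A rough sketch of the fast path being asked for here (illustrative only, not the actual patch; the helper name and the plain RuntimeException are placeholders): with {{ErrorIfExists}}, test the destination before any footer reading or schema discovery is attempted.
{code}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{SQLContext, SaveMode}

// Hypothetical early check: when the save mode is ErrorIfExists, refuse an
// existing destination up front instead of scanning it for Parquet metadata.
def assertTargetAbsent(sqlContext: SQLContext, mode: SaveMode, dest: String): Unit = {
  if (mode == SaveMode.ErrorIfExists) {
    val path = new Path(dest)
    val fs = path.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
    if (fs.exists(path)) {
      throw new RuntimeException(s"path $dest already exists")
    }
  }
}
{code}
Presumably the real fix would run an equivalent check inside the write path itself, before the relation is resolved; the point is the ordering: existence check first, metadata discovery only if the write is actually going to proceed.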
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567601#comment-14567601 ] Yana Kadiyska commented on SPARK-5389: -- FWIW I just tried the 1.4-rc3 build (http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/) cdh4 binary and it runs without issues. From the exact same command prompt I can run the 1.4 script but not the 1.2 script. So if we can't figure out a consistent repro, maybe other folks can confirm if the new cmd files work... spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4>bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder
[ https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-8014: - Assignee: Cheng Lian DataFrame.write.mode(error).save(...) should not scan the output folder - Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Assignee: Cheng Lian When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do metadata discovery if the destination folder exists. To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then execute the following code in Spark shell: {code} import sqlContext._ import sqlContext.implicits._ Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo") {code} From the exception stack trace we can see that metadata discovery code path is executed: {noformat} java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) ... Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small) at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408) at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228) at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8016) YARN cluster / client modes have different app names for python
Andrew Or created SPARK-8016: Summary: YARN cluster / client modes have different app names for python Key: SPARK-8016 URL: https://issues.apache.org/jira/browse/SPARK-8016 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Reporter: Andrew Or See screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7909) spark-ec2 and associated tools not py3 ready
[ https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567735#comment-14567735 ] Shivaram Venkataraman commented on SPARK-7909: -- Yeah feel free to open a PR for the `print` fixes. spark-ec2 and associated tools not py3 ready Key: SPARK-7909 URL: https://issues.apache.org/jira/browse/SPARK-7909 Project: Spark Issue Type: Improvement Components: EC2 Environment: ec2 python3 Reporter: Matthew Goodman At present there is no combination of tools that supports Python 3 on both the launching computer and the running cluster. There are a couple of problems involved: - There is no prebuilt Spark binary with Python 3 support. - spark-ec2/spark/init.sh contains inline py3-unfriendly print statements - Config files for cluster processes don't seem to make it to all nodes in a working format. I have fixes for some of this, but the config and running-context debugging remains elusive to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567755#comment-14567755 ] Sandy Ryza commented on SPARK-4352: --- [~jerryshao] I wouldn't say that the goal is necessarily to get as close as possible to the ratio of requests (3 : 3 : 2 : 1 in the example). My idea was to get as close as possible to sum(cores from all executor requests with that node on their preferred list) = the number of tasks that prefer that node. Why? Let's look at the situation where we're requesting 18 executors. Let's say we request 6 executors with a preference for a, b, c, d like you suggested. YARN would be perfectly happy giving us 6 executors on node d. But we only have 10 tasks (with executors that have 2 cores, this means 5 executors) that need to run on node d. So we'd really prefer that the 6th executor be scheduled on a, b, or c, because placing it on d confers no additional advantage. For the situation where we're requesting 7 executors, I have less of an argument for why my 5 : 2 is better than your 2 : 2 : 3. Thinking about it more now, it seems like your approach could be closer to optimal because getting executors on a or b means more of our tasks get to run on local data. So I would certainly be open to something that tries to preserve the ratio when the number of executors we're allowed to request is under the maximum number of tasks targeted for any particular node. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
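To make the arithmetic in that comment concrete, here is a toy sketch of the cap being described (illustrative names only, not Spark's allocation code): executors with a node on their preferred list stop being useful once their combined cores cover the tasks that prefer that node.
{code}
// Cap the useful executor requests per node at
// ceil(tasks preferring the node / cores per executor).
def usefulExecutorsPerNode(
    tasksPreferringNode: Map[String, Int],
    coresPerExecutor: Int): Map[String, Int] = {
  tasksPreferringNode.mapValues { tasks =>
    math.ceil(tasks.toDouble / coresPerExecutor).toInt
  }.toMap
}

// Example from the comment: 10 tasks prefer node d and each executor has
// 2 cores, so only 5 executors are useful on d; a 6th one adds nothing there.
// usefulExecutorsPerNode(Map("d" -> 10), 2) == Map("d" -> 5)
{code}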
[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567757#comment-14567757 ] Sean Owen commented on SPARK-8008: -- I suppose I meant you can block waiting on a new connection after the max is hit instead of opening far too many. sqlContext.jdbc can kill your database due to high concurrency -- Key: SPARK-8008 URL: https://issues.apache.org/jira/browse/SPARK-8008 Project: Spark Issue Type: Bug Reporter: Rene Treffer Spark tries to load as many partitions as possible in parallel, which can in turn overload the database although it would be possible to load all partitions given a lower concurrency. It would be nice to either limit the maximum concurrency or to at least warn about this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
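One way to read that suggestion (a sketch under assumptions, not an existing Spark or JDBC data source option; the object name and the limit are made up): gate connection creation behind a bounded semaphore so that, once the cap is hit, further partitions block waiting for a connection instead of piling more load onto the database.
{code}
import java.sql.{Connection, DriverManager}
import java.util.concurrent.Semaphore

// Hypothetical per-JVM gate: at most `maxConcurrent` JDBC connections are
// open at once; additional requests block until one is released.
object ThrottledJdbc {
  private val maxConcurrent = 8            // assumed cap, tune per database
  private val gate = new Semaphore(maxConcurrent)

  def withConnection[T](url: String)(body: Connection => T): T = {
    gate.acquire()                         // blocks once the cap is reached
    try {
      val conn = DriverManager.getConnection(url)
      try body(conn) finally conn.close()
    } finally {
      gate.release()
    }
  }
}
{code}
Since every executor is its own JVM, a gate like this only caps connections per executor; a real limit or warning, as the issue asks for, would have to live in the JDBC data source itself.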
[jira] [Commented] (SPARK-7909) spark-ec2 and associated tools not py3 ready
[ https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567685#comment-14567685 ] Matthew Goodman commented on SPARK-7909: Awesome, thanks for all the help on this. There is one (possibly unrelated) issue remaining, which is that httpd seems to fail to start up, giving the following traceback: {code:title=HTTPD Failure Traceback|borderStyle=solid} Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory {code} Should I send in a PR [for this change|https://github.com/3Scan/spark-ec2/commit/3416dd07c492b0cddcc98c4fa83f9e4284ed8fc9]? spark-ec2 and associated tools not py3 ready Key: SPARK-7909 URL: https://issues.apache.org/jira/browse/SPARK-7909 Project: Spark Issue Type: Improvement Components: EC2 Environment: ec2 python3 Reporter: Matthew Goodman At present there is no combination of tools that supports Python 3 on both the launching computer and the running cluster. There are a couple of problems involved: - There is no prebuilt Spark binary with Python 3 support. - spark-ec2/spark/init.sh contains inline py3-unfriendly print statements - Config files for cluster processes don't seem to make it to all nodes in a working format. I have fixes for some of this, but the config and running-context debugging remains elusive to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8016) YARN cluster / client modes have different app names for python
[ https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8016: - Component/s: PySpark YARN cluster / client modes have different app names for python --- Key: SPARK-8016 URL: https://issues.apache.org/jira/browse/SPARK-8016 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Reporter: Andrew Or Attachments: python.png See screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4048) Enhance and extend hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567752#comment-14567752 ] Marcelo Vanzin commented on SPARK-4048: --- That is not a regression. The whole point of hadoop-provided is that *you* have to provide the needed jars. So if a jar is missing, you are failing to provide them. Enhance and extend hadoop-provided profile -- Key: SPARK-4048 URL: https://issues.apache.org/jira/browse/SPARK-4048 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.3.0 The hadoop-provided profile is used to not package Hadoop dependencies inside the Spark assembly. It works, sort of, but it could use some enhancements. A quick list: - It doesn't include all things that could be removed from the assembly - It doesn't work well when you're publishing artifacts based on it (SPARK-3812 fixes this) - There are other dependencies that could use similar treatment: Hive, HBase (for the examples), Flume, Parquet, maybe others I'm missing at the moment. - Unit tests, more specifically, those that use local-cluster mode, do not work when the assembly is built with this profile enabled. - The scripts to launch Spark jobs do not add needed provided jars to the classpath when this profile is enabled, leaving it for people to figure that out for themselves. - The examples assembly duplicates a lot of things in the main assembly. Part of this task is selfish since we build internally with this profile and we'd like to make it easier for us to merge changes without having to keep too many patches on top of upstream. But those feel like good improvements to me, regardless. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8017) YARN cluster python --py-files does not work
Andrew Or created SPARK-8017: Summary: YARN cluster python --py-files does not work Key: SPARK-8017 URL: https://issues.apache.org/jira/browse/SPARK-8017 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Reporter: Andrew Or When I run the following, it works in client mode but not in cluster mode {code} bin/spark-submit --master yarn --deploy-mode X --py-files secondary.py app.py {code} where app.py depends on secondary.py. Python YARN cluster mode was added recently, so this is not a blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8015) flume-sink should not depend on Guava.
[ https://issues.apache.org/jira/browse/SPARK-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8015: --- Assignee: (was: Apache Spark) flume-sink should not depend on Guava. -- Key: SPARK-8015 URL: https://issues.apache.org/jira/browse/SPARK-8015 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor The flume-sink module, due to the shared shading code in our build, ends up depending on the {{org.spark-project}} Guava classes. That means users who deploy the sink in Flume will also need to provide those classes somehow, generally by also adding the Spark assembly, which means adding a whole bunch of other libraries to Flume, which may or may not cause other unforeseen problems. It's better not to have that dependency in the flume-sink module at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org