[jira] [Commented] (SPARK-12838) fix a bug in PythonRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-12838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103012#comment-15103012 ]

Apache Spark commented on SPARK-12838:
--------------------------------------

User 'zhagnlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10785

> fix a bug in PythonRDD.scala
> ----------------------------
>
>                 Key: SPARK-12838
>                 URL: https://issues.apache.org/jira/browse/SPARK-12838
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: zhanglu
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12856) speed up hashCode of unsafe array
[ https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12856:
------------------------------------

    Assignee: Apache Spark

> speed up hashCode of unsafe array
> ---------------------------------
>
>                 Key: SPARK-12856
>                 URL: https://issues.apache.org/jira/browse/SPARK-12856
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>
[jira] [Assigned] (SPARK-12856) speed up hashCode of unsafe array
[ https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12856:
------------------------------------

    Assignee: (was: Apache Spark)

> speed up hashCode of unsafe array
> ---------------------------------
>
>                 Key: SPARK-12856
>                 URL: https://issues.apache.org/jira/browse/SPARK-12856
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>
[jira] [Commented] (SPARK-12856) speed up hashCode of unsafe array
[ https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103011#comment-15103011 ]

Apache Spark commented on SPARK-12856:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10784

> speed up hashCode of unsafe array
> ---------------------------------
>
>                 Key: SPARK-12856
>                 URL: https://issues.apache.org/jira/browse/SPARK-12856
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>
[jira] [Updated] (SPARK-12838) fix a bug in PythonRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-12838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhanglu updated SPARK-12838:
----------------------------
    Component/s: Spark Core
        Summary: fix a bug in PythonRDD.scala  (was: fix a problem in PythonRDD.scala)

> fix a bug in PythonRDD.scala
> ----------------------------
>
>                 Key: SPARK-12838
>                 URL: https://issues.apache.org/jira/browse/SPARK-12838
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: zhanglu
>
[jira] [Created] (SPARK-12856) speed up hashCode of unsafe array
Wenchen Fan created SPARK-12856:
-------------------------------

             Summary: speed up hashCode of unsafe array
                 Key: SPARK-12856
                 URL: https://issues.apache.org/jira/browse/SPARK-12856
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Wenchen Fan
[jira] [Commented] (SPARK-12841) UnresolvedException with cast
[ https://issues.apache.org/jira/browse/SPARK-12841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102996#comment-15102996 ]

Apache Spark commented on SPARK-12841:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10781

> UnresolvedException with cast
> -----------------------------
>
>                 Key: SPARK-12841
>                 URL: https://issues.apache.org/jira/browse/SPARK-12841
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Michael Armbrust
>            Assignee: Wenchen Fan
>            Priority: Blocker
>
> {code}
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.where(df1.col("single").cast("string").equalTo("1"))
> {code}
[jira] [Assigned] (SPARK-12841) UnresolvedException with cast
[ https://issues.apache.org/jira/browse/SPARK-12841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12841:
------------------------------------

    Assignee: Apache Spark  (was: Wenchen Fan)

> UnresolvedException with cast
> -----------------------------
>
>                 Key: SPARK-12841
>                 URL: https://issues.apache.org/jira/browse/SPARK-12841
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Michael Armbrust
>            Assignee: Apache Spark
>            Priority: Blocker
>
> {code}
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.where(df1.col("single").cast("string").equalTo("1"))
> {code}
[jira] [Assigned] (SPARK-12841) UnresolvedException with cast
[ https://issues.apache.org/jira/browse/SPARK-12841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12841:
------------------------------------

    Assignee: Wenchen Fan  (was: Apache Spark)

> UnresolvedException with cast
> -----------------------------
>
>                 Key: SPARK-12841
>                 URL: https://issues.apache.org/jira/browse/SPARK-12841
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Michael Armbrust
>            Assignee: Wenchen Fan
>            Priority: Blocker
>
> {code}
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.where(df1.col("single").cast("string").equalTo("1"))
> {code}
[jira] [Resolved] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-12840.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 10777
[https://github.com/apache/spark/pull/10777]

> Support passing arbitrary objects (not just expressions) into code generated
> classes
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-12840
>                 URL: https://issues.apache.org/jira/browse/SPARK-12840
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 2.0.0
>
> As of now, our code generator only allows passing Expression objects into the
> generated class as arguments. In order to support whole-stage codegen (e.g.
> for broadcast joins), the generated classes need to accept other types of
> objects such as hash tables.
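The idea behind SPARK-12840 can be sketched in plain Scala. This is a hypothetical illustration, not Spark's actual codegen interface: the generated class takes an opaque `references` array at construction time, so callers can hand it non-Expression objects such as a prebuilt hash table for a broadcast join.

```scala
// Hypothetical sketch (names invented, not Spark's real codegen API):
// instead of baking only Expression trees into the generated source, the
// generated class receives an Array[Any] of "references" at construction
// time, letting callers pass arbitrary runtime objects into generated code.
class GeneratedPredicate(references: Array[Any]) {
  // reference 0 is assumed to hold a broadcast hash table: key -> row payload
  private val hashTable = references(0).asInstanceOf[Map[Int, String]]
  def lookup(key: Int): Option[String] = hashTable.get(key)
}

val broadcastTable = Map(1 -> "a", 2 -> "b")
val pred = new GeneratedPredicate(Array[Any](broadcastTable))
```

The cast inside the class is the price of the opaque array: the generated source and the caller must agree on the position and type of each reference.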
[jira] [Created] (SPARK-12855) Remove parser pluggability
Reynold Xin created SPARK-12855:
-------------------------------

             Summary: Remove parser pluggability
                 Key: SPARK-12855
                 URL: https://issues.apache.org/jira/browse/SPARK-12855
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Reynold Xin

The number of applications that are using this feature is small (as far as I
know it came down from two to one as of Jan 2016). No other database systems
support this feature, and it actually encourages 3rd party projects to not
contribute their improvements back to Spark. We should just remove this
functionality to simplify our own code base.
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102920#comment-15102920 ]

Steve Loughran commented on SPARK-12807:
----------------------------------------

One thing to think about here is ramping up a notch and shading all the
downstream dependencies in the YARN shuffle JAR. This is a JAR designed to be
used in one specific place: the classpath. It now includes netty, leveldb,
some bits of com.google (in 1.6), and some javax.annotation. What it also has,
for extra fun, is a leveldb JNI .so under native, as well as a netty one. This
is going to be a problem: unless you can somehow isolate and shade that, this
shuffle JAR is going to force a specific leveldb version on every bit of code
picking up the JAR.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12807
>                 URL: https://issues.apache.org/jira/browse/SPARK-12807
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 1.6.0
>         Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with
> dynamic allocation enabled
>            Reporter: Steve Loughran
>            Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you
> get to see a stack trace in the NM logs indicating a Jackson 2.x version
> mismatch.
> (Reported on the spark dev list.)
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102907#comment-15102907 ]

Apache Spark commented on SPARK-12807:
--------------------------------------

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/10782

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12807
>                 URL: https://issues.apache.org/jira/browse/SPARK-12807
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 1.6.0
>         Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with
> dynamic allocation enabled
>            Reporter: Steve Loughran
>            Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you
> get to see a stack trace in the NM logs indicating a Jackson 2.x version
> mismatch.
> (Reported on the spark dev list.)
[jira] [Resolved] (SPARK-12644) Basic support for vectorize/batch Parquet decoding
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-12644.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Basic support for vectorize/batch Parquet decoding
> --------------------------------------------------
>
>                 Key: SPARK-12644
>                 URL: https://issues.apache.org/jira/browse/SPARK-12644
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Nong Li
>            Assignee: Nong Li
>             Fix For: 2.0.0
>
> The Parquet encodings are largely designed to decode faster in batches,
> column by column. This can speed up the decoding considerably.
[jira] [Updated] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-12644:
--------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-12854

> Vectorize/Batch decode parquet
> ------------------------------
>
>                 Key: SPARK-12644
>                 URL: https://issues.apache.org/jira/browse/SPARK-12644
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Nong Li
>            Assignee: Nong Li
>
> The Parquet encodings are largely designed to decode faster in batches,
> column by column. This can speed up the decoding considerably.
[jira] [Updated] (SPARK-12644) Basic support for vectorize/batch Parquet decoding
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-12644:
--------------------------------
    Summary: Basic support for vectorize/batch Parquet decoding  (was: Vectorize/Batch decode parquet)

> Basic support for vectorize/batch Parquet decoding
> --------------------------------------------------
>
>                 Key: SPARK-12644
>                 URL: https://issues.apache.org/jira/browse/SPARK-12644
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Nong Li
>            Assignee: Nong Li
>
> The Parquet encodings are largely designed to decode faster in batches,
> column by column. This can speed up the decoding considerably.
[jira] [Created] (SPARK-12854) Vectorize Parquet reader
Reynold Xin created SPARK-12854:
-------------------------------

             Summary: Vectorize Parquet reader
                 Key: SPARK-12854
                 URL: https://issues.apache.org/jira/browse/SPARK-12854
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Reynold Xin

The Parquet encodings are largely designed to decode faster in batches, column
by column. This can speed up the decoding considerably.
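The batch-decoding claim above can be illustrated with a toy run-length decoder in plain Scala (hypothetical names; this is not Spark's or Parquet's reader API): an RLE-encoded column is expanded into a whole batch array in one tight loop, instead of materializing one row at a time through an iterator.

```scala
// Illustrative sketch of column-at-a-time batch decoding: each (count, value)
// run of a run-length-encoded column is expanded directly into a preallocated
// output array, which is the access pattern batch decoders are built around.
def decodeRleBatch(runs: Seq[(Int, Int)]): Array[Int] = {
  val out = new Array[Int](runs.map(_._1).sum) // total row count of the batch
  var pos = 0
  for ((count, value) <- runs) {
    var i = 0
    while (i < count) { out(pos) = value; pos += 1; i += 1 }
  }
  out
}
```

The inner while-loop touches only one array and one value per run, which is what makes the batched form amenable to JIT optimization compared with per-row virtual calls.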
[jira] [Created] (SPARK-12853) Update query planner to use only bucketed reads if it is useful
Reynold Xin created SPARK-12853:
-------------------------------

             Summary: Update query planner to use only bucketed reads if it is useful
                 Key: SPARK-12853
                 URL: https://issues.apache.org/jira/browse/SPARK-12853
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Reynold Xin
            Assignee: Wenchen Fan
[jira] [Created] (SPARK-12852) Support create table DDL with bucketing
Reynold Xin created SPARK-12852:
-------------------------------

             Summary: Support create table DDL with bucketing
                 Key: SPARK-12852
                 URL: https://issues.apache.org/jira/browse/SPARK-12852
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Reynold Xin
[jira] [Created] (SPARK-12851) Add the ability to understand tables bucketed by Hive
Reynold Xin created SPARK-12851:
-------------------------------

             Summary: Add the ability to understand tables bucketed by Hive
                 Key: SPARK-12851
                 URL: https://issues.apache.org/jira/browse/SPARK-12851
             Project: Spark
          Issue Type: Sub-task
            Reporter: Reynold Xin

We added bucketing functionality, but we currently do not understand the
bucketing properties of a table generated by Hive.
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102883#comment-15102883 ]

Steve Loughran commented on SPARK-12807:
----------------------------------------

There's a PR to shade in trunk; I'm going to do a 1.6 PR too, which should be
identical (initially, for ease of testing that the 1.6 branch is fixed).

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12807
>                 URL: https://issues.apache.org/jira/browse/SPARK-12807
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 1.6.0
>         Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with
> dynamic allocation enabled
>            Reporter: Steve Loughran
>            Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you
> get to see a stack trace in the NM logs indicating a Jackson 2.x version
> mismatch.
> (Reported on the spark dev list.)
[jira] [Closed] (SPARK-12704) we may repartition a relation even if it's not needed
[ https://issues.apache.org/jira/browse/SPARK-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-12704.
-------------------------------
    Resolution: Later

Closing as Later. We will revisit this when the time comes.

> we may repartition a relation even if it's not needed
> -----------------------------------------------------
>
>                 Key: SPARK-12704
>                 URL: https://issues.apache.org/jira/browse/SPARK-12704
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>
> The implementation of {{HashPartitioning.compatibleWith}} has been
> sub-optimal for a while. Think of the following case:
> if {{table_a}} is hash partitioned by int column `i`, and {{table_b}} is also
> partitioned by int column `i`, logically these two partitionings are
> compatible. However, {{HashPartitioning.compatibleWith}} will return false
> for this case, because the {{AttributeReference}}s of column `i` in the two
> tables have different expr ids.
> With this wrong result from {{HashPartitioning.compatibleWith}}, we will go
> into [this branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390]
> and may add an unnecessary shuffle.
> This won't impact correctness if the join keys are exactly the same as the
> hash partitioning keys, as there's still an opportunity to not partition that
> child in that branch:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428
> However, if the join keys are a super-set of the hash partitioning keys, for
> example, {{table_a}} and {{table_b}} are both hash partitioned by column `i`
> and we want to join them on columns `i, j`, logically we don't need a
> shuffle, but in fact the two tables start out partitioned only by `i` and are
> redundantly repartitioned by `i, j`.
>
> A quick fix is to set the expr id of {{AttributeReference}} to 0 before we
> call {{this.semanticEquals(o)}} in {{HashPartitioning.compatibleWith}}, but
> in the long term I think we need a better design than the `compatibleWith`,
> `guarantees`, and `satisfies` mechanism, as it's quite complex.
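The expr-id problem described above can be modeled in a few lines of plain Scala (hypothetical names; this is a toy model, not Spark's Partitioning API): two hash partitionings over the same column compare as unequal only because each `AttributeReference` carries a per-plan expression id, so the quick fix is to zero out the ids before comparing.

```scala
// Toy model of the issue: structural equality fails across plans because of
// expr ids, while a semantic comparison that ignores the ids succeeds.
case class AttributeReference(name: String, exprId: Long)

case class HashPartitioning(keys: Seq[AttributeReference], numPartitions: Int) {
  // naive comparison: returns false when only the expr ids differ
  def compatibleWith(other: HashPartitioning): Boolean = this == other

  // the quick fix sketched in the ticket: compare with expr ids zeroed out
  def semanticallyCompatibleWith(other: HashPartitioning): Boolean =
    numPartitions == other.numPartitions &&
      keys.map(_.copy(exprId = 0)) == other.keys.map(_.copy(exprId = 0))
}

// the same int column `i` seen from two different tables/plans
val a = HashPartitioning(Seq(AttributeReference("i", exprId = 1)), 200)
val b = HashPartitioning(Seq(AttributeReference("i", exprId = 2)), 200)
```

With the naive comparison, `a` and `b` look incompatible and the planner would insert a shuffle; the semantic comparison recognizes them as the same partitioning.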
[jira] [Commented] (SPARK-12848) Parse number as decimal
[ https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102881#comment-15102881 ]

Reynold Xin commented on SPARK-12848:
-------------------------------------

We discussed this more offline. Let's just switch to decimal.

> Parse number as decimal
> -----------------------
>
>                 Key: SPARK-12848
>                 URL: https://issues.apache.org/jira/browse/SPARK-12848
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Davies Liu
>
> Right now, the Hive parser parses 1.23 as a double; when it's used with
> decimal columns, the decimal is turned into a double, losing precision.
> We should do what most databases do: parse 1.23 as a decimal, and convert it
> to a double only when it's used together with a double.
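The precision loss behind this ticket is easy to demonstrate in plain Scala (this is ordinary language arithmetic, not the Hive/Spark parser): a literal parsed as a binary double cannot represent most decimal fractions exactly, while `BigDecimal` keeps the exact decimal value.

```scala
// A double-parsed literal accumulates binary rounding error; a decimal-parsed
// one stays exact. This is why parsing numeric literals as decimal matters
// when they meet decimal columns.
val doubleSum  = 0.1 + 0.2                               // not exactly 0.3
val decimalSum = BigDecimal("0.1") + BigDecimal("0.2")   // exactly 0.3
```

The same effect applies to a literal like 1.23: as a double it is silently a nearby binary approximation, so comparisons and arithmetic against a decimal column lose precision.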
[jira] [Created] (SPARK-12850) Support bucket pruning (predicate pushdown for bucketed tables)
Reynold Xin created SPARK-12850:
-------------------------------

             Summary: Support bucket pruning (predicate pushdown for bucketed tables)
                 Key: SPARK-12850
                 URL: https://issues.apache.org/jira/browse/SPARK-12850
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Reynold Xin

We now support bucketing. One optimization opportunity is to push some
predicates into the scan to skip scanning files that definitely won't match
the values.
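The optimization described above can be sketched in plain Scala (hypothetical names and layout, not Spark's scan planner): with data files laid out by bucket id, an equality predicate on the bucketing column only needs the single bucket whose id matches the literal, instead of scanning every file.

```scala
// Toy bucket-pruning sketch: map a key to its bucket, then prune the scan for
// `col = value` down to just that bucket's file.
def bucketOf(key: Int, numBuckets: Int): Int = {
  val h = key.hashCode % numBuckets
  if (h < 0) h + numBuckets else h // keep bucket ids non-negative
}

// files are assumed to be indexed by bucket id; an equality predicate on the
// bucketing column prunes the scan to a single bucket
def bucketsToScan(predicateValue: Int, numBuckets: Int): Seq[Int] =
  Seq(bucketOf(predicateValue, numBuckets))
```

For 8 buckets, a predicate like `col = 42` would touch only bucket `42 % 8`, skipping the other seven files entirely.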
[jira] [Created] (SPARK-12849) Bucketing improvements follow-up
Reynold Xin created SPARK-12849:
-------------------------------

             Summary: Bucketing improvements follow-up
                 Key: SPARK-12849
                 URL: https://issues.apache.org/jira/browse/SPARK-12849
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Reynold Xin

This is a follow-up ticket for SPARK-12538 to improve bucketing support.
[jira] [Resolved] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-12394.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Support writing out pre-hash-partitioned data and exploit that in join
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-12394
>                 URL: https://issues.apache.org/jira/browse/SPARK-12394
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Nong Li
>             Fix For: 2.0.0
>
>         Attachments: BucketedTables.pdf
>
> In many cases users know ahead of time the columns that they will be joining
> or aggregating on. Ideally they should be able to leverage this information
> and pre-shuffle the data so that subsequent queries do not require a shuffle.
> Hive supports this functionality by allowing the user to define buckets,
> which are hash partitionings of the data based on some key.
> - Allow the user to specify a set of columns when caching or writing out data
> - Allow the user to specify some parallelism
> - Shuffle the data when writing / caching such that it is distributed by
>   these columns
> - When planning/executing a query, use this distribution to avoid another
>   shuffle when reading, assuming the join or aggregation is compatible with
>   the columns specified
> - Should work with existing save modes: append, overwrite, etc.
> - Should work at least with all Hadoop FS data sources
> - Should work with any data source when caching
[jira] [Updated] (SPARK-5292) optimize join for table that are already sharded/support for hive bucket
[ https://issues.apache.org/jira/browse/SPARK-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-5292:
-------------------------------
    Assignee: Wenchen Fan

> optimize join for table that are already sharded/support for hive bucket
> ------------------------------------------------------------------------
>
>                 Key: SPARK-5292
>                 URL: https://issues.apache.org/jira/browse/SPARK-5292
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: gagan taneja
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
> Currently joins do not consider the locality of the data and perform the
> shuffle anyway.
> If the user takes the responsibility of distributing the data based on some
> hash, or shards the data, Spark joins should be able to leverage the sharding
> to optimize join calculation and eliminate the shuffle.
[jira] [Resolved] (SPARK-12538) bucketed table support
[ https://issues.apache.org/jira/browse/SPARK-12538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-12538.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> bucketed table support
> ----------------------
>
>                 Key: SPARK-12538
>                 URL: https://issues.apache.org/jira/browse/SPARK-12538
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
> cc [~nongli], please attach the design doc.
[jira] [Resolved] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-12649.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> support reading bucketed table
> ------------------------------
>
>                 Key: SPARK-12649
>                 URL: https://issues.apache.org/jira/browse/SPARK-12649
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
[jira] [Updated] (SPARK-5292) optimize join for table that are already sharded/support for hive bucket
[ https://issues.apache.org/jira/browse/SPARK-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-5292:
-------------------------------
    Fix Version/s: 2.0.0

> optimize join for table that are already sharded/support for hive bucket
> ------------------------------------------------------------------------
>
>                 Key: SPARK-5292
>                 URL: https://issues.apache.org/jira/browse/SPARK-5292
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: gagan taneja
>             Fix For: 2.0.0
>
> Currently joins do not consider the locality of the data and perform the
> shuffle anyway.
> If the user takes the responsibility of distributing the data based on some
> hash, or shards the data, Spark joins should be able to leverage the sharding
> to optimize join calculation and eliminate the shuffle.
[jira] [Updated] (SPARK-11512) Bucket Join
[ https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-11512:
--------------------------------
    Fix Version/s: 2.0.0

> Bucket Join
> -----------
>
>                 Key: SPARK-11512
>                 URL: https://issues.apache.org/jira/browse/SPARK-11512
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Cheng Hao
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
> Sort merge join on two datasets on the file system that have already been
> partitioned the same way, with the same number of partitions, and sorted
> within each partition; we don't need to sort again when joining on the
> sorted/partitioned keys.
> This functionality exists in:
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)
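The merge phase this ticket relies on can be sketched in plain Scala (a toy illustration, not Spark's SortMergeJoin operator): two inputs that are already partitioned and sorted on the join key can be joined with a single linear merge pass, with no extra shuffle or sort.

```scala
// Merge join over two key-sorted inputs: advance whichever side has the
// smaller key; on a match, emit the left row paired with every right row
// sharing that key.
def mergeJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, A, B)] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(Int, A, B)]
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val (lk, lv) = left(i)
    val rk = right(j)._1
    if (lk < rk) i += 1
    else if (lk > rk) j += 1
    else {
      // emit all right rows with this key, then advance the left side
      var jj = j
      while (jj < right.length && right(jj)._1 == lk) {
        out += ((lk, lv, right(jj)._2)); jj += 1
      }
      i += 1
    }
  }
  out.toSeq
}

val leftRows  = Seq((1, "a"), (2, "b"), (4, "c"))
val rightRows = Seq((2, "x"), (2, "y"), (3, "z"))
val joined    = mergeJoin(leftRows, rightRows)
```

Each input is read exactly once, which is why skipping the redundant sort (and shuffle) on pre-bucketed, pre-sorted tables pays off.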
[jira] [Resolved] (SPARK-12842) Add Hadoop 2.7 build profile
[ https://issues.apache.org/jira/browse/SPARK-12842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-12842.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Add Hadoop 2.7 build profile
> ----------------------------
>
>                 Key: SPARK-12842
>                 URL: https://issues.apache.org/jira/browse/SPARK-12842
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.0.0
>
> We should add a Hadoop 2.7 build profile so that we can automate tests
> against Hadoop 2.7.
[jira] [Assigned] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12783:
------------------------------------

    Assignee: Apache Spark  (was: Wenchen Fan)

> Dataset map serialization error
> -------------------------------
>
>                 Key: SPARK-12783
>                 URL: https://issues.apache.org/jira/browse/SPARK-12783
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Muthu Jayakumar
>            Assignee: Apache Spark
>            Priority: Critical
>
> When the Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
>     MyMap(Map(a->b))
>   }
>   def toStr: String = {
>     a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"),
>   TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException:
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, type: class scala.reflect.api.Types$TypeApi)
>   - object (class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;)))
>   - writeObject data (class: scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;)), invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(class org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType
[jira] [Commented] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102817#comment-15102817 ] Apache Spark commented on SPARK-12783: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10781 > Dataset map serialization error > --- > > Key: SPARK-12783 > URL: https://issues.apache.org/jira/browse/SPARK-12783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Muthu Jayakumar >Assignee: Wenchen Fan >Priority: Critical >
[jira] [Assigned] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12783: Assignee: Wenchen Fan (was: Apache Spark) > Dataset map serialization error > --- > > Key: SPARK-12783 > URL: https://issues.apache.org/jira/browse/SPARK-12783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Muthu Jayakumar >Assignee: Wenchen Fan >Priority: Critical >
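The stack trace in this issue shows a common failure shape: a generated expression object (the catalyst {{MapObjects}} closure) holds a field, here a Scala runtime-reflection symbol, that is not serializable, so serializing the whole closure graph throws {{NotSerializableException}}. The same failure mode can be sketched with Python's pickle as an analogy; this is not Spark's serializer, and the {{Mapper}} class and lock field below are illustrative stand-ins:

```python
import pickle
import threading

class Mapper:
    """Stand-in for a generated expression/closure object."""
    def __init__(self):
        # This field drags an inherently non-serializable object into the
        # object graph (threading.Lock plays the role of the reflection
        # symbol "SynchronizedSymbol" in the Spark stack trace above).
        self.lock = threading.Lock()

    def __call__(self, x):
        return x

try:
    pickle.dumps(Mapper())
except TypeError as exc:
    # pickle walks the object graph and fails on the captured field,
    # just as Spark's serializer fails on the captured TypeApi field.
    print("not serializable:", exc)
```

The fix in such cases is usually to stop capturing the offending object, e.g. by extracting only the serializable data it carries before the closure is built.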
[jira] [Commented] (SPARK-12848) Parse number as decimal
[ https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102812#comment-15102812 ] Herman van Hovell commented on SPARK-12848: --- [~Davies] We discussed the regression in the PR (https://github.com/apache/spark/pull/10745). I removed the functionality you currently ask for today (https://github.com/hvanhovell/spark/commit/7e31ee8a8ac36a600e0965ceefd297c33ffe0edc). We can revert this, the only thing is that we need to disable some Hive tests (which expect a Double). > Parse number as decimal > --- > > Key: SPARK-12848 > URL: https://issues.apache.org/jira/browse/SPARK-12848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > > Right now, Hive parser will parse 1.23 as double, when it's used with decimal > columns, you will turn the decimal into double, lose the precision. > We should follow most database had done, parse 1.23 as double, it will be > converted into double when used with double. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
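The precision concern behind this ticket is the standard binary-double one. A minimal Python illustration using the standard decimal module (this is not Spark's parser, just the underlying numeric behavior that motivates parsing literals as decimal):

```python
from decimal import Decimal

# A literal parsed as a binary double only approximates 1.23 ...
print(Decimal(1.23))      # exposes the stored binary approximation
print(Decimal("1.23"))    # the exact decimal value

# ... and the error compounds under arithmetic, while decimals stay exact.
double_sum = sum(0.1 for _ in range(10))
decimal_sum = sum(Decimal("0.1") for _ in range(10))
print(double_sum == 1.0)              # False
print(decimal_sum == Decimal("1.0"))  # True
```

This is why a decimal column compared or combined with a double-parsed literal silently loses precision.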
[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib
[ https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102802#comment-15102802 ] Davies Liu commented on SPARK-11219: It's nice to have; it's useful when you use the online help in the console. > Make Parameter Description Format Consistent in PySpark.MLlib > - > > Key: SPARK-11219 > URL: https://issues.apache.org/jira/browse/SPARK-11219 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > There are several different formats for describing params in PySpark.MLlib, > making it unclear what the preferred way to document is, i.e. vertical > alignment vs single line. > This is to agree on a format and make it consistent across PySpark.MLlib. > Following the discussion in SPARK-10560, using 2 lines with an indentation is > both readable and doesn't lead to changing many lines when adding/removing > parameters. If the parameter uses a default value, put this in parentheses > on a new line under the description. > Example: > {noformat} > :param stepSize: > Step size for each iteration of gradient descent. > (default: 0.1) > :param numIterations: > Number of iterations run for each batch of data. > (default: 50) > {noformat} > h2. Current State of Parameter Description Formatting > h4. Classification > * LogisticRegressionModel - single line descriptions, fix indentations > * LogisticRegressionWithSGD - vertical alignment, sporadic default values > * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values > * SVMModel - single line > * SVMWithSGD - vertical alignment, sporadic default values > * NaiveBayesModel - single line > * NaiveBayes - single line > h4. 
Clustering > * KMeansModel - missing param description > * KMeans - missing param description and defaults > * GaussianMixture - vertical align, incorrect default formatting > * PowerIterationClustering - single line with wrapped indentation, missing > defaults > * StreamingKMeansModel - single line wrapped > * StreamingKMeans - single line wrapped, missing defaults > * LDAModel - single line > * LDA - vertical align, missing some defaults > h4. FPM > * FPGrowth - single line > * PrefixSpan - single line, default values in backticks > h4. Recommendation > * ALS - does not have param descriptions > h4. Regression > * LabeledPoint - single line > * LinearModel - single line > * LinearRegressionWithSGD - vertical alignment > * RidgeRegressionWithSGD - vertical align > * IsotonicRegressionModel - single line > * IsotonicRegression - single line, missing default > h4. Tree > * DecisionTree - single line with vertical indentation, missing defaults > * RandomForest - single line with wrapped indent, missing some defaults > * GradientBoostedTrees - single line with wrapped indent > NOTE > This issue will just focus on model/algorithm descriptions, which are the > largest source of inconsistent formatting > evaluation.py, feature.py, random.py, utils.py - these supporting classes > have param descriptions as single line, but are consistent so don't need to > be changed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
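The two-line format agreed above can be shown as a docstring sketch (the function and parameter names below are hypothetical, chosen to mirror the example in the issue):

```python
def train_with_sgd(data, step_size=0.1, num_iterations=50):
    """Train a hypothetical model using stochastic gradient descent.

    :param data:
      The training data, an RDD of LabeledPoint.
    :param step_size:
      Step size for each iteration of gradient descent.
      (default: 0.1)
    :param num_iterations:
      Number of iterations run for each batch of data.
      (default: 50)
    """
    return {"step_size": step_size, "num_iterations": num_iterations}

# The two-line style keeps diffs small: adding or removing a parameter
# touches only its own ":param:" block, never the alignment of neighbors.
print(":param step_size:" in train_with_sgd.__doc__)  # True
```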
[jira] [Commented] (SPARK-12848) Parse number as decimal
[ https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102797#comment-15102797 ] Davies Liu commented on SPARK-12848: [~hvanhovell] The `BD` tag only works in Hive; other databases (MySQL, PostgreSQL, Impala, etc.) do not need this tag for decimals to work correctly. The reason I created this JIRA as a sub-task is that the previous SQL parser could handle this, but the new parser can't (kind of a regression). > Parse number as decimal > --- > > Key: SPARK-12848 > URL: https://issues.apache.org/jira/browse/SPARK-12848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12848) Parse number as decimal
[ https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102763#comment-15102763 ] Herman van Hovell commented on SPARK-12848: --- Assuming that we are talking about literals here: it is quite easy to change the parse defaults for that. Currently, when we find a decimal number, {{1.23}} for example, we always convert it into a Double. When a user needs a Decimal, they can use a BigDecimal literal by tagging the number with {{BD}}. [~davies] I might not be getting the point you are making, but I think we have covered this with BigDecimal literals. If not, could you provide an example? > Parse number as decimal > --- > > Key: SPARK-12848 > URL: https://issues.apache.org/jira/browse/SPARK-12848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime
[ https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102749#comment-15102749 ] Jason C Lee commented on SPARK-12683: - Looks like collect() eventually calls py4j's collectToPython, which then returns the port of the socket that contains the wrong answer. I am not all that familiar with how py4j works, so any py4j expert is welcome here! > SQL timestamp is wrong when accessed as Python datetime > --- > > Key: SPARK-12683 > URL: https://issues.apache.org/jira/browse/SPARK-12683 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.1, 1.5.2, 1.6.0 > Environment: Windows 7 Pro x64 > Python 3.4.3 > py4j 0.9 >Reporter: Gerhard Fiedler > Attachments: spark_bug_date.py > > > When accessing SQL timestamp data through {{.show()}}, it looks correct, but > when accessing it (as Python {{datetime}}) through {{.collect()}}, it is > wrong. > {code} > from datetime import datetime > from pyspark import SparkContext > from pyspark.sql import SQLContext > if __name__ == "__main__": > spark_context = SparkContext(appName='SparkBugTimestampHour') > sql_context = SQLContext(spark_context) > sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts""" > data_frame = sql_context.sql(sql_text) > data_frame.show(truncate=False) > # Result from .show() (as expected, looks correct): > # +----------------------+ > # |ts                    | > # +----------------------+ > # |2100-09-09 12:11:10.09| > # +----------------------+ > rows = data_frame.collect() > row = rows[0] > ts = row[0] > print('ts={ts}'.format(ts=ts)) > # Expected result from this print statement: > # ts=2100-09-09 12:11:10.09 > # > # Actual, wrong result (note the hours being 18 instead of 12): > # ts=2100-09-09 18:11:10.09 > # > # This error seems to be dependent on some characteristic of the system. > We couldn't reproduce > # this on all of our systems, but it is not clear what the differences > are. One difference is > # the processor: it failed on Intel Xeon E5-2687W v2. 
> assert isinstance(ts, datetime) > assert ts.year == 2100 and ts.month == 9 and ts.day == 9 > assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 9 > if ts.hour != 12: > print('hour is not correct; should be 12, is actually > {hour}'.format(hour=ts.hour)) > spark_context.stop() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
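The six-hour shift in the report (12 becoming 18) is consistent with a timestamp's wall-clock fields being reinterpreted through a UTC/local-time conversion somewhere on the JVM-to-Python path. A minimal sketch of that failure mode follows; the UTC-6 offset and the conversion direction are assumptions for illustration, not Spark's actual code path:

```python
from datetime import datetime, timedelta, timezone

# The reported value: 2100-09-09 12:11:10.09
wall = datetime(2100, 9, 9, 12, 11, 10, 90000)

# Suppose one side serializes microseconds-since-epoch interpreting the
# wall clock in a local zone of UTC-6 (e.g. US Central Standard Time)...
local = timezone(timedelta(hours=-6))
epoch_us = int(wall.replace(tzinfo=local).timestamp() * 1_000_000)

# ...while the other side decodes those microseconds as UTC wall-clock
# time. The round trip shifts the hour field by the zone offset.
decoded = datetime.fromtimestamp(epoch_us / 1_000_000, tz=timezone.utc)
print(decoded.hour)  # 18, not 12
```

If this is the mechanism, the machine-dependence in the report would simply reflect the local timezone configured on the failing systems.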
[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
[ https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102740#comment-15102740 ] Shixiong Zhu commented on SPARK-12847: -- Ah, I think this one should be a sub-task. Let me change it. > Remove StreamingListenerBus and post all Streaming events to the same thread > as Spark events > > > Key: SPARK-12847 > URL: https://issues.apache.org/jira/browse/SPARK-12847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > SparkListener.onOtherEvent was added in > https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch > SQL-specific events instead of creating a new separate listener bus. > Streaming can also use a similar approach to eliminate the > StreamingListenerBus. Right now, the nondeterministic message order across the two > listener buses is really tricky when someone implements both SparkListener > and StreamingListener. If we use only one listener bus in Spark, the > nondeterministic message order will be eliminated and we can also remove a > lot of code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
[ https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12847: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-12140 > Remove StreamingListenerBus and post all Streaming events to the same thread > as Spark events > > > Key: SPARK-12847 > URL: https://issues.apache.org/jira/browse/SPARK-12847 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
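The unification proposed in SPARK-12847, routing extra event types through a single bus's generic onOtherEvent hook rather than through a second bus, can be sketched in Python. The class and event names below are illustrative only, not Spark's actual API:

```python
class SparkListener:
    """Known events get dedicated hooks; anything else falls through to
    a single generic hook (mirroring SparkListener.onOtherEvent)."""
    def on_job_start(self, event):
        pass
    def on_other_event(self, event):
        pass

class ListenerBus:
    """One bus for all event types, so delivery order is deterministic
    even for listeners that care about several event domains."""
    def __init__(self):
        self.listeners = []
    def add_listener(self, listener):
        self.listeners.append(listener)
    def post(self, event):
        for listener in self.listeners:
            if event.get("type") == "job_start":
                listener.on_job_start(event)
            else:
                # Streaming (or SQL) events ride the same bus via the
                # generic hook instead of a second StreamingListenerBus.
                listener.on_other_event(event)

class StreamingAwareListener(SparkListener):
    def __init__(self):
        self.seen = []
    def on_job_start(self, event):
        self.seen.append(("job", event["id"]))
    def on_other_event(self, event):
        if event.get("type") == "batch_completed":
            self.seen.append(("batch", event["id"]))

bus = ListenerBus()
listener = StreamingAwareListener()
bus.add_listener(listener)
bus.post({"type": "job_start", "id": 1})
bus.post({"type": "batch_completed", "id": 2})
print(listener.seen)  # [('job', 1), ('batch', 2)]
```

With two independent buses, the relative order of the job event and the batch event would depend on thread scheduling; with one bus, a listener observes them in posting order.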
[jira] [Created] (SPARK-12848) Parse number as decimal
Davies Liu created SPARK-12848: -- Summary: Parse number as decimal Key: SPARK-12848 URL: https://issues.apache.org/jira/browse/SPARK-12848 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Right now, the Hive parser parses 1.23 as a double; when it is used with decimal columns, the decimal is turned into a double and precision is lost. We should follow what most databases do and parse 1.23 as a decimal; it will be converted into a double when used with a double. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102729#comment-15102729 ] Apache Spark commented on SPARK-12807: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/10780 > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you see a stack trace in the NM logs indicating a Jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
[ https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102730#comment-15102730 ] Marcelo Vanzin commented on SPARK-12847: Kinda the same as SPARK-12140. > Remove StreamingListenerBus and post all Streaming events to the same thread > as Spark events > > > Key: SPARK-12847 > URL: https://issues.apache.org/jira/browse/SPARK-12847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12807: Assignee: Apache Spark > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
[ https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12847: Assignee: Apache Spark (was: Shixiong Zhu) > Remove StreamingListenerBus and post all Streaming events to the same thread > as Spark events > > > Key: SPARK-12847 > URL: https://issues.apache.org/jira/browse/SPARK-12847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12807: Assignee: (was: Apache Spark) > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
[ https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102728#comment-15102728 ] Apache Spark commented on SPARK-12847: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10779 > Remove StreamingListenerBus and post all Streaming events to the same thread > as Spark events > > > Key: SPARK-12847 > URL: https://issues.apache.org/jira/browse/SPARK-12847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
[ https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12847: Assignee: Shixiong Zhu (was: Apache Spark) > Remove StreamingListenerBus and post all Streaming events to the same thread > as Spark events > > > Key: SPARK-12847 > URL: https://issues.apache.org/jira/browse/SPARK-12847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12833) Initial import of databricks/spark-csv
[ https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102726#comment-15102726 ] Apache Spark commented on SPARK-12833: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/10778 > Initial import of databricks/spark-csv > -- > > Key: SPARK-12833 > URL: https://issues.apache.org/jira/browse/SPARK-12833 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Hossein Falaki > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
Shixiong Zhu created SPARK-12847: Summary: Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events Key: SPARK-12847 URL: https://issues.apache.org/jira/browse/SPARK-12847 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu SparkListener.onOtherEvent was added in https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch SQL-specific events instead of creating a new separate listener bus. Streaming can also use a similar approach to eliminate the StreamingListenerBus. Right now, the nondeterministic message order across the two listener buses is really tricky when someone implements both SparkListener and StreamingListener. And if we can use only one listener bus in Spark, the nondeterministic message order will be eliminated and we can also remove a lot of code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11925. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 9908 [https://github.com/apache/spark/pull/9908] > Add PySpark missing methods for ml.feature during Spark 1.6 QA > -- > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Add PySpark missing methods and params for ml.feature > * RegexTokenizer should support setting toLowercase. > * MinMaxScalerModel should support output originalMin and originalMax. > * PCAModel should support output pc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102702#comment-15102702 ] Luciano Resende commented on SPARK-5159: [~ilovesoup] As I mentioned before, most if not all your changes have been applied via SPARK-6910 @All, I understand there is a bigger issue here, regarding data that is stored out of hive, but I would treat that as a different epic for Spark Data Security, while for this current issue, I would like us to concentrate on the remaining issue related to doAs when Kerberos is enabled. > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
[ https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-12846: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-11806 > Follow up SPARK-12707, Update documentation and other related code > -- > > Key: SPARK-12846 > URL: https://issues.apache.org/jira/browse/SPARK-12846 > Project: Spark > Issue Type: Sub-task >Reporter: Jeff Zhang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
Jeff Zhang created SPARK-12846: -- Summary: Follow up SPARK-12707, Update documentation and other related code Key: SPARK-12846 URL: https://issues.apache.org/jira/browse/SPARK-12846 Project: Spark Issue Type: Improvement Reporter: Jeff Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12840: Description: As of now, our code generator only allows passing Expression objects into the generated class as arguments. In order to support whole-stage codegen (e.g. for broadcast joins), the generated classes need to accept other types of objects such as hash tables. (was: Right now, we only support expression.) > Support passing arbitrary objects (not just expressions) into code generated > classes > > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > As of now, our code generator only allows passing Expression objects into the > generated class as arguments. In order to support whole-stage codegen (e.g. > for broadcast joins), the generated classes need to accept other types of > objects such as hash tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12575) Grammar parity with existing SQL parser
[ https://issues.apache.org/jira/browse/SPARK-12575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12575. - Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.0.0 > Grammar parity with existing SQL parser > --- > > Key: SPARK-12575 > URL: https://issues.apache.org/jira/browse/SPARK-12575 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > The new parser should be compatible with our existing SQL parser built using > Scala parser combinator. One thing that is different is how we parse time > intervals. There might be more. > Once we reach parity, we should just switch and remove the old SQL parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102662#comment-15102662 ] Steve Loughran commented on SPARK-12807: Work on YARN isolation will address this in Hadoop 2.8+, but that does nothing for Hadoop < 2.8. Shading will do this. > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a Jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102654#comment-15102654 ] Sean Owen commented on SPARK-12807: --- I see, it's only the shuffle and only 1.6, and only happens to affect the shuffle service on YARN. Spark has otherwise been using later Jackson for a while. Shading is indeed probably the best thing for all of Spark's usages. > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a Jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10538) java.lang.NegativeArraySizeException during join
[ https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102653#comment-15102653 ] Davies Liu commented on SPARK-10538: @mayxine The problem you posted is not related to this JIRA; it could be that rdd1.partitions.length * rdd2.partitions.length overflows if the numbers of partitions of the two RDDs are too large. > java.lang.NegativeArraySizeException during join > > > Key: SPARK-10538 > URL: https://issues.apache.org/jira/browse/SPARK-10538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Assignee: Davies Liu > Attachments: java.lang.NegativeArraySizeException.png, > screenshot-1.png > > > Hi, > I've got a problem when joining tables in PySpark (in my example, 20 of > them). > I can observe that during the calculation of the first partition (on one of the > consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) > vs the other partitions (approx. 272.5 KB / 113 records). > I can also observe that just before the crash the python process goes up to a few > GB of RAM.
> After some time there is an exception: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I'm running this on 2 nodes cluster (12 cores, 64 GB RAM) > Config: > {code} > spark.driver.memory 10g > spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC > -Dfile.encoding=UTF8 > spark.executor.memory 60g > spark.storage.memoryFraction0.05 > spark.shuffle.memoryFraction0.75 > spark.driver.maxResultSize 10g > spark.cores.max 24 > spark.kryoserializer.buffer.max 1g > spark.default.parallelism 200 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
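The overflow Davies describes can be sketched outside Spark. The following is a toy Python model, not Spark code: `java_int_multiply` is a hypothetical helper that mimics 32-bit signed Java Int arithmetic, to show how the product of two partition counts can wrap around to a negative value and later surface as a `NegativeArraySizeException`:

```python
# Toy illustration (not Spark internals): if the product of two partition
# counts is computed as a 32-bit signed Java Int, it can wrap around to a
# negative value once it exceeds Int.MaxValue (2**31 - 1).

INT_MAX = 2**31 - 1

def java_int_multiply(a: int, b: int) -> int:
    """Multiply as a 32-bit signed Java Int would, with wraparound."""
    result = (a * b) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed value.
    return result - 2**32 if result >= 2**31 else result

# 50,000 partitions on each side is already enough to wrap negative.
p1, p2 = 50_000, 50_000
product = java_int_multiply(p1, p2)
print(product)  # -1794967296: the true product 2.5e9 exceeds Int.MaxValue
assert p1 * p2 > INT_MAX
assert product < 0
```

Allocating an array sized by such a negative product is one plausible route to the exception in the stack trace above.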
[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12840: Summary: Support passing arbitrary objects (not just expressions) into code generated classes (was: Support pass any object into codegen as reference) > Support passing arbitrary objects (not just expressions) into code generated > classes > > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we only support expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102647#comment-15102647 ] Steve Loughran commented on SPARK-12807: FWIW, I'm working on shading Jackson in the shuffle JAR > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a Jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102645#comment-15102645 ] Steve Loughran commented on SPARK-12807: The problem is there are no guarantees that the Spark versions are backwards compatible with the older version. If they come first, the NM itself may fail. > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, Spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a Jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12840) Support pass any object into codegen as reference
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102629#comment-15102629 ] Apache Spark commented on SPARK-12840: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/10777 > Support pass any object into codegen as reference > - > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we only support expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12840) Support pass any object into codegen as reference
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12840: Assignee: Apache Spark (was: Davies Liu) > Support pass any object into codegen as reference > - > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Apache Spark > > Right now, we only support expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12840) Support pass any object into codegen as reference
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12840: Assignee: Davies Liu (was: Apache Spark) > Support pass any object into codegen as reference > - > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we only support expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12149) Executor UI improvement suggestions - Color UI
[ https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-12149: -- Assignee: Alex Bozarth > Executor UI improvement suggestions - Color UI > -- > > Key: SPARK-12149 > URL: https://issues.apache.org/jira/browse/SPARK-12149 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alex Bozarth >Assignee: Alex Bozarth > > Splitting off the Color UI portion of the parent UI improvements task, > description copied below: > Fill some of the cells with color in order to make it easier to absorb the > info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed)) > If dark blue, then write the value in white (same for the RED and GREEN above) > Merging another idea from SPARK-2132: > Color GC time red when over a percentage of task time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12716) Executor UI improvement suggestions - Totals
[ https://issues.apache.org/jira/browse/SPARK-12716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-12716. --- Resolution: Fixed Assignee: Alex Bozarth Fix Version/s: 2.0.0 > Executor UI improvement suggestions - Totals > > > Key: SPARK-12716 > URL: https://issues.apache.org/jira/browse/SPARK-12716 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alex Bozarth >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > Splitting off the Totals portion of the parent UI improvements task, > description copied below: > I received some suggestions from a user for the /executors UI page to make it > more helpful. This gets more important when you have a really large number of > executors. > ... > Report the TOTALS in each column (do this at the TOP so no need to scroll to > the bottom, or print both at top and bottom). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function
[ https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102532#comment-15102532 ] Herman van Hovell commented on SPARK-12835: --- Thanks for that. The {{df.groupby(key).agg(avg_diff)}} is problematic. The Lag window function doesn't have any partitioning defined so it will move all data to a single thread on a single node. The {{diff}} value can also be based on dates with different keys. > StackOverflowError when aggregating over column from window function > > > Key: SPARK-12835 > URL: https://issues.apache.org/jira/browse/SPARK-12835 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Kalle Jepsen > > I am encountering a StackoverflowError with a very long traceback, when I try > to directly aggregate on a column created by a window function. > E.g. I am trying to determine the average timespan between dates in a > Dataframe column by using a window-function: > {code} > from pyspark import SparkContext > from pyspark.sql import HiveContext, Window, functions > from datetime import datetime > sc = SparkContext() > sq = HiveContext(sc) > data = [ > [datetime(2014,1,1)], > [datetime(2014,2,1)], > [datetime(2014,3,1)], > [datetime(2014,3,6)], > [datetime(2014,8,23)], > [datetime(2014,10,1)], > ] > df = sq.createDataFrame(data, schema=['ts']) > ts = functions.col('ts') > > w = Window.orderBy(ts) > diff = functions.datediff( > ts, > functions.lag(ts, count=1).over(w) > ) > avg_diff = functions.avg(diff) > {code} > While {{df.select(diff.alias('diff')).show()}} correctly renders as > {noformat} > ++ > |diff| > ++ > |null| > | 31| > | 28| > | 5| > | 170| > | 39| > ++ > {noformat} > doing {code} > df.select(avg_diff).show() > {code} throws a {{java.lang.StackOverflowError}}. > When I say > {code} > df2 = df.select(diff.alias('diff')) > df2.select(functions.avg('diff')) > {code} > however, there's no error. 
> Am I wrong to assume that the above should work? > I've already described the same issue in [this question on > stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
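As a sanity check (not a workaround), the window result in the report above can be reproduced outside Spark with plain Python; this also gives the value the failing aggregate should return, since SQL `avg()` ignores the leading null:

```python
from datetime import datetime

# The timestamps from the report above.
ts = [
    datetime(2014, 1, 1),
    datetime(2014, 2, 1),
    datetime(2014, 3, 1),
    datetime(2014, 3, 6),
    datetime(2014, 8, 23),
    datetime(2014, 10, 1),
]

# datediff(ts, lag(ts, count=1)): days since the previous row; the first
# row has no predecessor, matching the leading null in the diff column.
diffs = [(b - a).days for a, b in zip(ts, ts[1:])]
print(diffs)  # [31, 28, 5, 170, 39], matching the diff column above

# avg() skips nulls, so the expected result of the failing query:
avg_diff = sum(diffs) / len(diffs)
print(avg_diff)  # 54.6
```

Any fix for the StackOverflowError should therefore make `df.select(avg_diff)` return 54.6 on this data, the same value the two-step `df2` variant already produces.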
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-12030: --- Attachment: (was: spark.jpg) > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Assignee: Nong Li >Priority: Blocker > Fix For: 1.5.3, 1.6.0 > > > I have the following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both tables are cached, so results should be the same on every > query. > Then I did some counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here the magic begins - I counted distinct id1 from the joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results vary *(they are different on every run)* between 5899000 and > 590 but are never equal to 5900729. > In addition, I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query returns *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-12030: --- Attachment: (was: t2.tar.gz) > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Assignee: Nong Li >Priority: Blocker > Fix For: 1.5.3, 1.6.0 > > > I have the following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both tables are cached, so results should be the same on every > query. > Then I did some counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here the magic begins - I counted distinct id1 from the joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results vary *(they are different on every run)* between 5899000 and > 590 but are never equal to 5900729. > In addition, I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query returns *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-12030: --- Attachment: (was: t1.tar.gz) > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Assignee: Nong Li >Priority: Blocker > Fix For: 1.5.3, 1.6.0 > > Attachments: spark.jpg, t2.tar.gz > > > I have the following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both tables are cached, so results should be the same on every > query. > Then I did some counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here the magic begins - I counted distinct id1 from the joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results vary *(they are different on every run)* between 5899000 and > 590 but are never equal to 5900729. > In addition, I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query returns *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12845) During join Spark should pushdown predicates to both tables
[ https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-12845: --- Description: I have the following issue. I'm connecting two tables with a where condition {code} select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 {code} In this query the predicate is only pushed down to t1. To have predicates on both tables I have to run the following query: {code} select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 1234 {code} Spark should present the same behaviour for both queries. was: I have following issue. I'm connecting two tables with where condition {code} select * from t1 join t2 in t1.id1 = t2.id2 where t1.id = 1234 {code} In this code predicate is only push down to t1. To have predicates on both table I should run following query which have no sense {code} select * from t1 join t2 in t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234 {code} Spark should present same behaviour for both queries. > During join Spark should pushdown predicates to both tables > --- > > Key: SPARK-12845 > URL: https://issues.apache.org/jira/browse/SPARK-12845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > I have the following issue. > I'm connecting two tables with a where condition > {code} > select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 > {code} > In this query the predicate is only pushed down to t1. > To have predicates on both tables I have to run the following query: > {code} > select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = > 1234 > {code} > Spark should present the same behaviour for both queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
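Why the requested inference is safe can be shown with a toy (non-Spark) model of the equi-join: on an inner equi-join, filtering t2 by the same constant before the join cannot change the result, it only reduces the rows scanned. The table contents below are made up for illustration:

```python
# Toy model of "select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234".
# Rows of t2 with id2 != 1234 can never match a surviving t1 row, so the
# predicate can safely be pushed to t2 as well.

t1 = [{"id1": 1234, "a": "x"}, {"id1": 1234, "a": "y"}, {"id1": 99, "a": "z"}]
t2 = [{"id2": 1234, "b": "p"}, {"id2": 57, "b": "q"}]

def join(left, right):
    # Inner equi-join on id1 = id2.
    return [(l, r) for l in left for r in right if l["id1"] == r["id2"]]

# Plan 1: filter only t1 (what the reporter observes Spark doing).
plan1 = join([r for r in t1 if r["id1"] == 1234], t2)

# Plan 2: filter both sides (the manually rewritten query).
plan2 = join([r for r in t1 if r["id1"] == 1234],
             [r for r in t2 if r["id2"] == 1234])

assert plan1 == plan2  # same result, but plan 2 scans less of t2
```

The same reasoning does not automatically extend to outer joins, where the unmatched side must still be preserved, which is presumably why an optimizer has to be careful about where it infers such predicates.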
[jira] [Updated] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-12843: --- Issue Type: Bug (was: Improvement) > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data > has been collected. > Is it related to [SPARK-9850]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
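The behaviour the issue asks for is easy to state as pseudocode: pull rows partition by partition and stop as soon as `limit` rows have been collected. A sketch (this is not Spark's actual scan code):

```python
# Sketch of early-terminating limit: stop pulling partitions once `limit` rows
# are collected, instead of scanning every partition up front.

def limited_scan(partitions, limit):
    """partitions: list of lists of rows. Returns (rows, partitions_scanned)."""
    out = []
    scanned = 0
    for part in partitions:
        scanned += 1
        for row in part:
            out.append(row)
            if len(out) >= limit:      # enough data collected: stop early
                return out, scanned
    return out, scanned

parts = [list(range(0, 50)), list(range(50, 100)), list(range(100, 150))]
rows, scanned = limited_scan(parts, 100)
# 100 rows fit in the first two partitions, so the third is never touched.
```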
[jira] [Created] (SPARK-12845) During join Spark should pushdown predicates to both tables
Maciej Bryński created SPARK-12845: -- Summary: During join Spark should pushdown predicates to both tables Key: SPARK-12845 URL: https://issues.apache.org/jira/browse/SPARK-12845 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Maciej Bryński I have the following issue. I'm connecting two tables with a where condition {code} select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 {code} In this code the predicate is only pushed down to t1. To have predicates on both tables I should run the following query, which makes no sense: {code} select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234 {code} Spark should present the same behaviour for both queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
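The rewrite being requested is transitive filter inference: from an equi-join condition plus a constant filter on one join key, derive the matching filter on the other side. A sketch of that rule, assuming the filter is on the join key as the expected query implies (this mimics an optimizer's constraint-propagation rule, not Spark's actual implementation):

```python
# Sketch of transitive predicate inference: t1.id1 = t2.id2 and t1.id1 = c
# together imply t2.id2 = c, so the constant filter can be pushed to both tables.

def infer_pushdown_filters(join_keys, filters):
    """join_keys: list of (left_col, right_col) equi-join pairs.
    filters: dict of column -> constant. Returns filters augmented transitively."""
    inferred = dict(filters)
    for left, right in join_keys:
        if left in inferred and right not in inferred:
            inferred[right] = inferred[left]
        elif right in inferred and left not in inferred:
            inferred[left] = inferred[right]
    return inferred

# select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
filters = infer_pushdown_filters([("t1.id1", "t2.id2")], {"t1.id1": 1234})
# Both sides now carry the constant filter, so neither table needs a full scan.
```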
[jira] [Created] (SPARK-12844) Spark documentation should be more precise about the algebraic properties of functions in various transformations
Jimmy Lin created SPARK-12844: - Summary: Spark documentation should be more precise about the algebraic properties of functions in various transformations Key: SPARK-12844 URL: https://issues.apache.org/jira/browse/SPARK-12844 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Jimmy Lin Priority: Minor Spark documentation should be more precise about the algebraic properties of functions in various transformations. The way the current documentation is written is potentially confusing. For example, in Spark 1.6, the scaladoc for reduce in RDD says: > Reduces the elements of this RDD using the specified commutative and > associative binary operator. This is precise and accurate. In the documentation of reduceByKey in PairRDDFunctions, on the other hand, it says: > Merge the values for each key using an associative reduce function. To be more precise, this function must also be commutative in order for the computation to be correct. Writing commutative for reduce and not reduceByKey gives the false impression that the function in the latter does not need to be commutative. The same applies to aggregateByKey. To be precise, both seqOp and combOp need to be associative (mentioned) AND commutative (not mentioned) in order for the computation to be correct. It would be desirable to fix these inconsistencies throughout the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
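The point about commutativity can be demonstrated concretely: reduceByKey merges per-partition partial results in an order the user does not control, so an associative but non-commutative function yields partition-layout-dependent answers. A small simulation (plain Python standing in for the shuffle; not Spark code):

```python
# Demonstration of why reduceByKey needs commutativity, not just associativity:
# reduce within each partition, then merge partial results across partitions in
# whatever order the partitions arrive.

def reduce_by_key(partitions, f):
    """partitions: list of lists of (key, value). Returns key -> reduced value."""
    merged = {}
    for part in partitions:
        partial = {}
        for k, v in part:                     # within-partition reduce
            partial[k] = f(partial[k], v) if k in partial else v
        for k, v in partial.items():          # cross-partition merge
            merged[k] = f(merged[k], v) if k in merged else v
    return merged

concat = lambda a, b: a + b   # associative but NOT commutative on strings

# Same records, two different partition layouts:
layout1 = [[("k", "a"), ("k", "b")], [("k", "c")]]
layout2 = [[("k", "c")], [("k", "a"), ("k", "b")]]

r1 = reduce_by_key(layout1, concat)["k"]   # "abc"
r2 = reduce_by_key(layout2, concat)["k"]   # "cab"
# Same data, same associative function, different merge order -> different answers.
```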
[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function
[ https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102491#comment-15102491 ] Kalle Jepsen commented on SPARK-12835: -- The [traceback|http://pastebin.com/pRRCAben] really is ridiculously long. In my actual application I would have the window partitioned and the aggregation done in {{df.groupby(key).agg(avg_diff)}}. Would that still be problematic with regard to performance? The error is the same there though, that's why I've chosen the more concise minimal example above. > StackOverflowError when aggregating over column from window function > > > Key: SPARK-12835 > URL: https://issues.apache.org/jira/browse/SPARK-12835 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Kalle Jepsen > > I am encountering a StackOverflowError with a very long traceback, when I try > to directly aggregate on a column created by a window function. > E.g. I am trying to determine the average timespan between dates in a > DataFrame column by using a window function: > {code} > from pyspark import SparkContext > from pyspark.sql import HiveContext, Window, functions > from datetime import datetime > sc = SparkContext() > sq = HiveContext(sc) > data = [ > [datetime(2014,1,1)], > [datetime(2014,2,1)], > [datetime(2014,3,1)], > [datetime(2014,3,6)], > [datetime(2014,8,23)], > [datetime(2014,10,1)], > ] > df = sq.createDataFrame(data, schema=['ts']) > ts = functions.col('ts') > > w = Window.orderBy(ts) > diff = functions.datediff( > ts, > functions.lag(ts, count=1).over(w) > ) > avg_diff = functions.avg(diff) > {code} > While {{df.select(diff.alias('diff')).show()}} correctly renders as > {noformat} > +----+ > |diff| > +----+ > |null| > | 31| > | 28| > | 5| > | 170| > | 39| > +----+ > {noformat} > doing {code} > df.select(avg_diff).show() > {code} throws a {{java.lang.StackOverflowError}}. 
> When I say > {code} > df2 = df.select(diff.alias('diff')) > df2.select(functions.avg('diff')) > {code} > however, there's no error. > Am I wrong to assume that the above should work? > I've already described the same in [this question on > stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
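The expected values in the report's table can be re-derived without Spark. A plain-Python re-computation of what `lag`/`datediff`/`avg` should produce for those dates, useful as a sanity check for whichever formulation of the query finally works:

```python
# Plain-Python re-computation (no Spark) of datediff(ts, lag(ts, 1)) over an
# ordered window, for the dates in the report, plus the average the aggregation
# should yield.

from datetime import date

dates = [date(2014, 1, 1), date(2014, 2, 1), date(2014, 3, 1),
         date(2014, 3, 6), date(2014, 8, 23), date(2014, 10, 1)]

# Difference in days to the previous row (the first row has no predecessor,
# which is the null in the report's table).
diffs = [(b - a).days for a, b in zip(dates, dates[1:])]
# diffs == [31, 28, 5, 170, 39], matching the {noformat} table in the report.

avg_diff = sum(diffs) / len(diffs)   # avg skips the leading null
```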
[jira] [Commented] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102482#comment-15102482 ] Muthu Jayakumar commented on SPARK-12783: - Hello Kevin, Here is what I am seeing... from shell: {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.0 /_/ Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. scala> case class MyMap(map: Map[String, String]) defined class MyMap scala> :paste // Entering paste mode (ctrl-D to finish) case class TestCaseClass(a: String, b: String){ def toMyMap: MyMap = { MyMap(Map(a->b)) } def toStr: String = { a } } // Exiting paste mode, now interpreting. defined class TestCaseClass scala> TestCaseClass("a", "nn") res4: TestCaseClass = TestCaseClass(a,nn) scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), TestCaseClass("2015-05-01", "data2"))).toDF() org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class `TestCaseClass` without access to the scope that this class was defined in. 
Try moving this class out of its parent class.; at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:264) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:260) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:233) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:260) at org.apache.spark.sql.Dataset.(Dataset.scala:78) at org.apache.spark.sql.Dataset.(Dataset.scala:89) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:507) ... 52 elided {code} I do remember seeing the above error stack, if the case class was defined inside the scope of an object (For example: If defined inside MyApp like in the example below as it becomes an inner class) >From code, I added an explicit import and eventually changed to use fully >qualified class names like below... {code} import scala.collection.{Map => ImMap} case class MyMap(map: ImMap[String, String]) case class TestCaseClass(a: String, b: String){ def toMyMap: MyMap = { MyMap(ImMap(a->b)) } def toStr: String = { a } } object MyApp extends App { //Get handle to contexts... 
import sqlContext.implicits._ val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), TestCaseClass("2015-05-01", "data2"))).toDF() df1.as[TestCaseClass].map(_.toStr).show() //works fine df1.as[TestCaseClass].map(_.toMyMap).show() //error } {code} and {code} case class MyMap(map: scala.collection.Map[String, String]) case class TestCaseClass(a: String, b: String){ def toMyMap: MyMap = { MyMap(scala.collection.Map(a->b)) } def toStr: String = { a } } object MyApp extends App { //Get handle to contexts... import sqlContext.implicits._ val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), TestCaseClass("2015-05-01", "data2"))).toDF() df1.as[TestCaseClass].map(_.toStr).show() //works fine df1.as[TestCaseClass].map(_.toMyMap).show() //error } {code} Please advise on what I may be missing. I misread the earlier comment and tried to use the immutable map incorrectly :(. > Dataset map serialization error > --- > > Key: SPARK-12783 > URL: https://issues.apache.org/jira/browse/SPARK-12783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Muthu Jayakumar >Assignee: Wenchen Fan >Priority: Critical > > When the Dataset API is used to map to another case class, an error is thrown. > {code} > case class MyMap(map: Map[String, String]) > case class TestCaseClass(a: String, b: String){ > def toMyMap: MyMap = { > MyMap(Map(a->b)) > } > def toStr: String = { > a > } > } > //Main method section below > import sqlContext.implicits._ > val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), > TestCaseClass("2015-05-01", "data2"))).toDF() > df1.as[TestCaseClass].map(_.toStr).show() //works fine > df1.as[TestCaseClass].map(_.toMyMap).show() //fails > {code} > Error message: > {quote} > Caus
[jira] [Assigned] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager
[ https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10985: Assignee: (was: Apache Spark) > Avoid passing evicted blocks throughout BlockManager / CacheManager > --- > > Key: SPARK-10985 > URL: https://issues.apache.org/jira/browse/SPARK-10985 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Andrew Or >Priority: Minor > > This is a minor refactoring task. > Currently when we attempt to put a block in, we get back an array buffer of > blocks that are dropped in the process. We do this to propagate these blocks > back to our TaskContext, which will add them to its TaskMetrics so we can see > them in the SparkUI storage tab properly. > Now that we have TaskContext.get, we can just use that to propagate this > information. This simplifies a lot of the signatures and gets rid of weird > return types like the following everywhere: > {code} > ArrayBuffer[(BlockId, BlockStatus)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
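The refactoring described above swaps return-value plumbing for a thread-local lookup. A sketch of that pattern in Python for brevity (the names `TaskContext.get` and `updated_blocks` mirror the Spark concepts, but this is an illustration, not Spark's API):

```python
# Sketch of the TaskContext.get refactoring: instead of every put returning an
# ArrayBuffer[(BlockId, BlockStatus)] for callers to thread upward, the put path
# looks up a thread-local task context and records dropped blocks on it directly.

import threading

_local = threading.local()

class TaskContext:
    def __init__(self):
        self.updated_blocks = []        # analogue of TaskMetrics' updated blocks
    @staticmethod
    def get():
        return getattr(_local, "ctx", None)
    @staticmethod
    def set(ctx):
        _local.ctx = ctx

def put_block(block_id, evicted):
    """Before: returned `evicted` for the caller to propagate.
    After: reports evictions via TaskContext.get() and returns nothing."""
    ctx = TaskContext.get()
    if ctx is not None:
        ctx.updated_blocks.extend(evicted)

TaskContext.set(TaskContext())
put_block("rdd_0_0", [("rdd_1_3", "DROPPED")])
# The metrics were reported without any change to put_block's return type.
```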
[jira] [Commented] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager
[ https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102477#comment-15102477 ] Apache Spark commented on SPARK-10985: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10776 > Avoid passing evicted blocks throughout BlockManager / CacheManager > --- > > Key: SPARK-10985 > URL: https://issues.apache.org/jira/browse/SPARK-10985 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Andrew Or >Priority: Minor > > This is a minor refactoring task. > Currently when we attempt to put a block in, we get back an array buffer of > blocks that are dropped in the process. We do this to propagate these blocks > back to our TaskContext, which will add them to its TaskMetrics so we can see > them in the SparkUI storage tab properly. > Now that we have TaskContext.get, we can just use that to propagate this > information. This simplifies a lot of the signatures and gets rid of weird > return types like the following everywhere: > {code} > ArrayBuffer[(BlockId, BlockStatus)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12701) Logging FileAppender should use join to ensure thread is finished
[ https://issues.apache.org/jira/browse/SPARK-12701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12701: -- Fix Version/s: 1.6.1 > Logging FileAppender should use join to ensure thread is finished > - > > Key: SPARK-12701 > URL: https://issues.apache.org/jira/browse/SPARK-12701 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 1.6.1, 2.0.0 > > > Currently, FileAppender for logging uses wait/notifyAll to signal that the > writing thread has finished. While I was trying to write a regression test > for a fix of SPARK-9844, the writing thread was not able to fully complete > before the process was shutdown, despite calling > {{FileAppender.awaitTermination}}. Using join ensures the thread completes > and would simplify things a little more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
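The fix's core idea, shown in a minimal Python sketch rather than the Scala FileAppender itself: `Thread.join()` guarantees the writing thread has run to completion before `awaitTermination` returns, whereas a wait/notify handshake can return while the thread is still finishing its last writes.

```python
# Sketch of the join-based awaitTermination: join() returns only after the
# writer thread has fully completed, so no pending writes can be lost at shutdown.

import threading

lines = []

def writer():
    for i in range(1000):
        lines.append(i)      # stands in for flushing log lines to a file

t = threading.Thread(target=writer)
t.start()

def await_termination():
    t.join()                 # blocks until writer() has run to completion

await_termination()
# After join, every write the thread performed is guaranteed to be visible.
```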
[jira] [Assigned] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager
[ https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10985: Assignee: Apache Spark > Avoid passing evicted blocks throughout BlockManager / CacheManager > --- > > Key: SPARK-10985 > URL: https://issues.apache.org/jira/browse/SPARK-10985 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > This is a minor refactoring task. > Currently when we attempt to put a block in, we get back an array buffer of > blocks that are dropped in the process. We do this to propagate these blocks > back to our TaskContext, which will add them to its TaskMetrics so we can see > them in the SparkUI storage tab properly. > Now that we have TaskContext.get, we can just use that to propagate this > information. This simplifies a lot of the signatures and gets rid of weird > return types like the following everywhere: > {code} > ArrayBuffer[(BlockId, BlockStatus)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)
[ https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102473#comment-15102473 ] Maciej Bryński edited comment on SPARK-12624 at 1/15/16 9:17 PM: - [~davies] Isn't this related to my comment here: https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15068627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15068627 was (Author: maver1ck): [~davies] Isn't this related to my comment here: https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15074733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15074733 > When schema is specified, we should treat undeclared fields as null (in > Python) > --- > > Key: SPARK-12624 > URL: https://issues.apache.org/jira/browse/SPARK-12624 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Reynold Xin >Priority: Critical > > See https://github.com/apache/spark/pull/10564 > Basically that test case should pass without the above fix and just assume b > is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)
[ https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102473#comment-15102473 ] Maciej Bryński commented on SPARK-12624: [~davies] Isn't this related to my comment here: https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15074733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15074733 > When schema is specified, we should treat undeclared fields as null (in > Python) > --- > > Key: SPARK-12624 > URL: https://issues.apache.org/jira/browse/SPARK-12624 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Reynold Xin >Priority: Critical > > See https://github.com/apache/spark/pull/10564 > Basically that test case should pass without the above fix and just assume b > is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102351#comment-15102351 ] Muthu Jayakumar edited comment on SPARK-12783 at 1/15/16 9:09 PM: -- I tried the following, but got similar error... {code} case class MyMap(map: scala.collection.immutable.Map[String, String]) case class TestCaseClass(a: String, b: String){ def toMyMap: MyMap = { MyMap(Map(a->b)) } def toStr: String = { a } } //main thread... val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), TestCaseClass("2015-05-01", "data2"))).toDF() df1.as[TestCaseClass].map(_.toStr).show() //works fine df1.as[TestCaseClass].map(_.toMyMap).show() //error df1.as[TestCaseClass].map(each=> each.a -> each.b).show() //works fine {code} {quote} Serialization stack: - object not serializable (class: scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: package lang) - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: class scala.reflect.internal.Symbols$Symbol) - object (class scala.reflect.internal.Types$UniqueThisType, java.lang.type) - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: class scala.reflect.internal.Types$Type) - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String) - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, type: class scala.reflect.internal.Types$Type) - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String) - field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, type: class scala.reflect.api.Types$TypeApi) - object (class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, ) - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, name: function, type: interface scala.Function1) - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- 
field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType)) - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression) - object (class org.apache.spark.sql.catalyst.expressions.Invoke, invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;))) - writeObject data (class: scala.collection.immutable.List$SerializationProxy) - object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy@2660f093) - writeReplace data (class: scala.collection.immutable.List$SerializationProxy) - object (class scala.collection.immutable.$colon$colon, List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;)), invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object; - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: arguments, type: interface scala.collection.Seq) - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(class org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field (class: "scala.collection.immutable.Map", name: "map"),- root class: "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class [Ljava.lang.Object;)),true)) - writeObject data (class: scala.collection.immutable.List$SerializationProxy) - object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy@72af5ac7) - writeReplace data (class: scala.collection.immutable.List$SerializationProxy) - object
[jira] [Updated] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-12843: --- Description: SQL Query: {code} select * from table limit 100 {code} forces Spark to scan all partitions even when data is available at the beginning of the scan. This behaviour should be avoided and the scan should stop when enough data has been collected. Is it related to [SPARK-9850]? was: SQL Query: {code} select * from table limit 100 {code} forces Spark to scan all partitions even when data is available at the beginning of the scan. Is it related to [SPARK-9850]? > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data > has been collected. > Is it related to [SPARK-9850]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
Maciej Bryński created SPARK-12843: -- Summary: Spark should avoid scanning all partitions when limit is set Key: SPARK-12843 URL: https://issues.apache.org/jira/browse/SPARK-12843 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0 Reporter: Maciej Bryński SQL Query: {code} select * from table limit 100 {code} forces Spark to scan all partitions even when data is available at the beginning of the scan. Is it related to [SPARK-9850]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102441#comment-15102441 ] Maciej Bryński edited comment on SPARK-12807 at 1/15/16 8:43 PM: - I'm asking if it's possible. About running Spark shuffle: did you miss the link to https://issues.apache.org/jira/browse/SPARK-9439 ? The problem started with Spark 1.6.0, because it's the first version of Spark where Shuffle has a Jackson dependency was (Author: maver1ck): I'm asking if it's possible. About running Spark shuffle: did you miss the link to https://issues.apache.org/jira/browse/SPARK-9439 ? The problem started with Spark 1.6.0, because it's the first version of Spark where Spark Shuffle has a Jackson dependency > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102441#comment-15102441 ] Maciej Bryński commented on SPARK-12807: I'm asking if it's possible. About running Spark shuffle: did you miss the link to https://issues.apache.org/jira/browse/SPARK-9439 ? The problem started with Spark 1.6.0, because it's the first version of Spark where Spark Shuffle has a Jackson dependency > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102430#comment-15102430 ] Sean Owen commented on SPARK-12807: --- Are you asking if it's possible, a possible explanation, a workaround? I'm still not sure why it's a problem (now). For example people seem to be running Spark shuffle just fine with recent Hadoop. > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12842) Add Hadoop 2.7 build profile
[ https://issues.apache.org/jira/browse/SPARK-12842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102424#comment-15102424 ] Apache Spark commented on SPARK-12842: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10775 > Add Hadoop 2.7 build profile > > > Key: SPARK-12842 > URL: https://issues.apache.org/jira/browse/SPARK-12842 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should add a Hadoop 2.7 build profile so that we can automate tests > against Hadoop 2.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12842) Add Hadoop 2.7 build profile
Josh Rosen created SPARK-12842: -- Summary: Add Hadoop 2.7 build profile Key: SPARK-12842 URL: https://issues.apache.org/jira/browse/SPARK-12842 Project: Spark Issue Type: Bug Components: Build Reporter: Josh Rosen Assignee: Josh Rosen We should add a Hadoop 2.7 build profile so that we can automate tests against Hadoop 2.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
[ https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102416#comment-15102416 ] Maciej Bryński commented on SPARK-12807: Sean, maybe it's possible to compile YARN Shuffle with a different version of Jackson than the version used by Spark Core? > Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 > > > Key: SPARK-12807 > URL: https://issues.apache.org/jira/browse/SPARK-12807 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.0 > Environment: A Hadoop cluster with Jackson 2.2.3, spark running with > dynamic allocation enabled >Reporter: Steve Loughran >Priority: Critical > > When you try to use dynamic allocation on a Hadoop 2.6-based cluster, > you get to see a stack trace in the NM logs, indicating a jackson 2.x version > mismatch. > (reported on the spark dev list) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12833) Initial import of databricks/spark-csv
[ https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102407#comment-15102407 ] Apache Spark commented on SPARK-12833: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/10774 > Initial import of databricks/spark-csv > -- > > Key: SPARK-12833 > URL: https://issues.apache.org/jira/browse/SPARK-12833 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Hossein Falaki > Fix For: 2.0.0 > >
[jira] [Created] (SPARK-12841) UnresolvedException with cast
Michael Armbrust created SPARK-12841: Summary: UnresolvedException with cast Key: SPARK-12841 URL: https://issues.apache.org/jira/browse/SPARK-12841 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Michael Armbrust Assignee: Wenchen Fan Priority: Blocker {code} val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double") df1.where(df1.col("single").cast("string").equalTo("1")) {code}
[jira] [Resolved] (SPARK-12667) Remove block manager's internal "external block store" API
[ https://issues.apache.org/jira/browse/SPARK-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12667. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10752 [https://github.com/apache/spark/pull/10752] > Remove block manager's internal "external block store" API > -- > > Key: SPARK-12667 > URL: https://issues.apache.org/jira/browse/SPARK-12667 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > >
[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function
[ https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102386#comment-15102386 ] Herman van Hovell commented on SPARK-12835: --- I can reproduce your problem with the following Scala code: {noformat} import java.sql.Date import org.apache.spark.sql.expressions.Window val df = Seq( (Date.valueOf("2014-01-01")), (Date.valueOf("2014-02-01")), (Date.valueOf("2014-03-01")), (Date.valueOf("2014-03-06")), (Date.valueOf("2014-08-23")), (Date.valueOf("2014-10-01"))). map(Tuple1.apply). toDF("ts") // This doesn't work: df.select(avg(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))))).show // This does work: df.select(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))).as("diff")) .select(avg($"diff")) .show {noformat} It seems there is a small bug in the analyzer. > StackOverflowError when aggregating over column from window function > > > Key: SPARK-12835 > URL: https://issues.apache.org/jira/browse/SPARK-12835 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Kalle Jepsen > > I am encountering a StackOverflowError with a very long traceback when I try > to directly aggregate on a column created by a window function. > E.g.
I am trying to determine the average timespan between dates in a > Dataframe column by using a window function: > {code} > from pyspark import SparkContext > from pyspark.sql import HiveContext, Window, functions > from datetime import datetime > sc = SparkContext() > sq = HiveContext(sc) > data = [ > [datetime(2014,1,1)], > [datetime(2014,2,1)], > [datetime(2014,3,1)], > [datetime(2014,3,6)], > [datetime(2014,8,23)], > [datetime(2014,10,1)], > ] > df = sq.createDataFrame(data, schema=['ts']) > ts = functions.col('ts') > > w = Window.orderBy(ts) > diff = functions.datediff( > ts, > functions.lag(ts, count=1).over(w) > ) > avg_diff = functions.avg(diff) > {code} > While {{df.select(diff.alias('diff')).show()}} correctly renders as > {noformat} > +----+ > |diff| > +----+ > |null| > | 31| > | 28| > | 5| > | 170| > | 39| > +----+ > {noformat} > doing {code} > df.select(avg_diff).show() > {code} throws a {{java.lang.StackOverflowError}}. > When I say > {code} > df2 = df.select(diff.alias('diff')) > df2.select(functions.avg('diff')) > {code} > however, there's no error. > Am I wrong to assume that the above should work? > I've already described the same in [this question on > stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].
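Herman's two snippets boil down to a single workaround pattern. Here it is reformatted for readability (a sketch assuming the same setup as his snippet: spark-sql on the classpath with `sqlContext.implicits._` in scope and `df` holding the `ts` column):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, datediff, lag}

// Step 1: materialize the window expression as an ordinary column.
val diffs = df.select(
  datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))).as("diff"))

// Step 2: aggregate over the materialized column. Per the thread, this
// avoids the StackOverflowError that aggregating the window expression
// directly triggers in the analyzer.
diffs.select(avg($"diff")).show()
```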
[jira] [Resolved] (SPARK-12833) Initial import of databricks/spark-csv
[ https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12833. - Resolution: Fixed Fix Version/s: 2.0.0 > Initial import of databricks/spark-csv > -- > > Key: SPARK-12833 > URL: https://issues.apache.org/jira/browse/SPARK-12833 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Hossein Falaki > Fix For: 2.0.0 > >
[jira] [Commented] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102361#comment-15102361 ] kevin yu commented on SPARK-12783: -- Hello Muthu: do the import first; it seems to work. scala> import scala.collection.Map import scala.collection.Map scala> case class MyMap(map: Map[String, String]) defined class MyMap scala> scala> case class TestCaseClass(a: String, b: String) { | def toMyMap: MyMap = { | MyMap(Map(a->b)) | } | | def toStr: String = { | a | } | } defined class TestCaseClass scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), TestCaseClass("2015-05-01", "data2"))).toDF() df1: org.apache.spark.sql.DataFrame = [a: string, b: string] scala> df1.as[TestCaseClass].map(_.toMyMap).show() +--------------------+ | map| +--------------------+ |Map(2015-05-01 ->...| |Map(2015-05-01 ->...| +--------------------+ > Dataset map serialization error > --- > > Key: SPARK-12783 > URL: https://issues.apache.org/jira/browse/SPARK-12783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Muthu Jayakumar >Assignee: Wenchen Fan >Priority: Critical > > When the Dataset API is used to map to another case class, an error is thrown.
> {code} > case class MyMap(map: Map[String, String]) > case class TestCaseClass(a: String, b: String){ > def toMyMap: MyMap = { > MyMap(Map(a->b)) > } > def toStr: String = { > a > } > } > //Main method section below > import sqlContext.implicits._ > val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), > TestCaseClass("2015-05-01", "data2"))).toDF() > df1.as[TestCaseClass].map(_.toStr).show() //works fine > df1.as[TestCaseClass].map(_.toMyMap).show() //fails > {code} > Error message: > {quote} > Caused by: java.io.NotSerializableException: > scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1 > Serialization stack: > - object not serializable (class: > scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: > package lang) > - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: > class scala.reflect.internal.Symbols$Symbol) > - object (class scala.reflect.internal.Types$UniqueThisType, > java.lang.type) > - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: > class scala.reflect.internal.Types$Type) > - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String) > - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, > type: class scala.reflect.internal.Types$Type) > - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String) > - field (class: > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, > type: class scala.reflect.api.Types$TypeApi) > - object (class > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, ) > - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, > name: function, type: interface scala.Function1) > - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, > mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: > 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType)) > - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: > targetObject, type: class > org.apache.spark.sql.catalyst.expressions.Expression) > - object (class org.apache.spark.sql.catalyst.expressions.Invoke, > invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: > "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class > [Ljava.lang.Object;))) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@4c7e3aab) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: > "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class > [Ljava.lang.Object;)), > invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scal
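For reference, kevin yu's workaround from the comment above, reduced to a minimal sketch. The key point, as his REPL session suggests, is importing scala.collection.Map so the case class field is typed as the abstract Map rather than the default scala.collection.immutable.Map (a sketch of the thread's workaround, not a fix for the underlying bug):

```scala
// The import that makes the difference, per kevin yu's REPL session.
import scala.collection.Map

case class MyMap(map: Map[String, String])

case class TestCaseClass(a: String, b: String) {
  def toMyMap: MyMap = MyMap(Map(a -> b))
  def toStr: String = a
}

// In the thread's session, with the import in place, this succeeds
// instead of throwing java.io.NotSerializableException:
// df1.as[TestCaseClass].map(_.toMyMap).show()
```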