[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download
[ https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reassigned SPARK-13855: --- Assignee: Patrick Wendell (was: Michael Armbrust) > Spark 1.6.1 artifacts not found in S3 bucket / direct download > -- > > Key: SPARK-13855 > URL: https://issues.apache.org/jira/browse/SPARK-13855 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: production >Reporter: Sandesh Deshmane >Assignee: Patrick Wendell > > Getting below error while deploying spark on EC2 with version 1.6.1 > [timing] scala init: 00h 00m 12s > Initializing spark > --2016-03-14 07:05:30-- > http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz > Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12 > Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... > connected. > HTTP request sent, awaiting response... 404 Not Found > 2016-03-14 07:05:30 ERROR 404: Not Found. > ERROR: Unknown Spark version > spark/init.sh: line 137: return: -1: invalid option > return: usage: return [n] > Unpacking Spark > tar (child): spark-*.tgz: Cannot open: No such file or directory > tar (child): Error is not recoverable: exiting now > tar: Child returned status 2 > tar: Error is not recoverable: exiting now > rm: cannot remove `spark-*.tgz': No such file or directory > mv: missing destination file operand after `spark' > Try `mv --help' for more information. > Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 > present -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging
Reynold Xin created SPARK-13928: --- Summary: Move org.apache.spark.Logging into org.apache.spark.internal.Logging Key: SPARK-13928 URL: https://issues.apache.org/jira/browse/SPARK-13928 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Logging was made private in Spark 2.0. If we move it into org.apache.spark.internal, users would be able to create a Logging trait themselves under the old name to avoid changing their own code. Alternatively, we could also provide it in a compatibility package that adds logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
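For users who extend org.apache.spark.Logging today, a minimal user-side shim is sketched below, assuming the trait has moved to an internal package; the SLF4J-based implementation and the method set shown here are illustrative assumptions, not the actual Spark trait.

{code}
// Hypothetical user-side compatibility shim: a minimal Logging trait recreated under
// the old package name, assuming org.apache.spark.Logging has moved to an internal package.
package org.apache.spark

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // Lazily create a logger named after the concrete class that mixes in this trait.
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
}
{code}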
[jira] [Updated] (SPARK-13927) Add row/column iterator to local matrix
[ https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13927: -- Summary: Add row/column iterator to local matrix (was: Add row/column iterator to matrix) > Add row/column iterator to local matrix > --- > > Key: SPARK-13927 > URL: https://issues.apache.org/jira/browse/SPARK-13927 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > Add row/column iterator to local matrices to simplify tasks like BlockMatrix > => RowMatrix conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13927) Add row/column iterator to local matrices
[ https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13927: -- Summary: Add row/column iterator to local matrices (was: Add row/column iterator to local matrix) > Add row/column iterator to local matrices > - > > Key: SPARK-13927 > URL: https://issues.apache.org/jira/browse/SPARK-13927 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > Add row/column iterator to local matrices to simplify tasks like BlockMatrix > => RowMatrix conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13927) Add row/column iterator to matrix
Xiangrui Meng created SPARK-13927: - Summary: Add row/column iterator to matrix Key: SPARK-13927 URL: https://issues.apache.org/jira/browse/SPARK-13927 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
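A row iterator over a local matrix can already be sketched against the public Matrix API (numRows, numCols, and the element accessor); the helper name rowIter below is only illustrative of the proposed feature, and a real implementation would iterate the backing array directly rather than box every element.

{code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}

// Illustrative sketch of a row iterator over a local matrix using only public accessors.
def rowIter(m: Matrix): Iterator[Vector] =
  Iterator.tabulate(m.numRows) { i =>
    Vectors.dense(Array.tabulate(m.numCols)(j => m(i, j)))
  }

val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)) // column-major values
rowIter(m).foreach(println) // [1.0,3.0,5.0] then [2.0,4.0,6.0]
{code}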
[jira] [Assigned] (SPARK-13764) Parse modes in JSON data source
[ https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13764: Assignee: (was: Apache Spark) > Parse modes in JSON data source > --- > > Key: SPARK-13764 > URL: https://issues.apache.org/jira/browse/SPARK-13764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, JSON data source just fails to read if some JSON documents are > malformed. > Therefore, if there are two JSON documents below: > {noformat} > { > "request": { > "user": { > "id": 123 > } > } > } > {noformat} > {noformat} > { > "request": { > "user": [] > } > } > {noformat} > This will fail emitting the exception below : > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: > Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): > java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData > cannot be cast to org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > So, just like the parse modes in CSV data source, (See > 
https://github.com/databricks/spark-csv), it would be great if there were some > parse modes so that users do not have to filter or pre-process the data themselves. > This happens only when a custom schema is set; when the schema is inferred, > the type is inferred as {{StringType}}, which reads the data successfully > anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13764) Parse modes in JSON data source
[ https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196887#comment-15196887 ] Apache Spark commented on SPARK-13764: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/11756 > Parse modes in JSON data source > --- > > Key: SPARK-13764 > URL: https://issues.apache.org/jira/browse/SPARK-13764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, JSON data source just fails to read if some JSON documents are > malformed. > Therefore, if there are two JSON documents below: > {noformat} > { > "request": { > "user": { > "id": 123 > } > } > } > {noformat} > {noformat} > { > "request": { > "user": [] > } > } > {noformat} > This will fail emitting the exception below : > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: > Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): > java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData > cannot be cast to org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {noformat} > So, just like the parse modes in the CSV data source (see > https://github.com/databricks/spark-csv), it would be great if there were some > parse modes so that users do not have to filter or pre-process the data themselves. > This happens only when a custom schema is set; when the schema is inferred, > the type is inferred as {{StringType}}, which reads the data successfully > anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13764) Parse modes in JSON data source
[ https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13764: Assignee: Apache Spark > Parse modes in JSON data source > --- > > Key: SPARK-13764 > URL: https://issues.apache.org/jira/browse/SPARK-13764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Currently, JSON data source just fails to read if some JSON documents are > malformed. > Therefore, if there are two JSON documents below: > {noformat} > { > "request": { > "user": { > "id": 123 > } > } > } > {noformat} > {noformat} > { > "request": { > "user": [] > } > } > {noformat} > This will fail emitting the exception below : > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: > Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): > java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData > cannot be cast to org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > So, just like the parse modes in CSV data source, (See > 
https://github.com/databricks/spark-csv), it would be great if there were some > parse modes so that users do not have to filter or pre-process the data themselves. > This happens only when a custom schema is set; when the schema is inferred, > the type is inferred as {{StringType}}, which reads the data successfully > anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
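A rough sketch of what the requested behavior could look like from the user side is shown below; the option name {{mode}} and the {{DROPMALFORMED}} value mirror the spark-csv package and are assumptions here, not a committed API.

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch only: a spark-csv-style parse mode for the JSON data source.
val sqlContext: SQLContext = ??? // provided by the application

// The custom schema from the example above; setting it is what currently triggers the failure.
val userType = StructType(Seq(StructField("id", LongType)))
val requestType = StructType(Seq(StructField("user", userType)))
val schema = StructType(Seq(StructField("request", requestType)))

val df = sqlContext.read
  .schema(schema)
  .option("mode", "DROPMALFORMED") // hypothetical: drop documents that do not match the schema
  .json("/path/to/records.json")
{code}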
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696 ] Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:45 AM: - There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, 10) val topKUsersForItems = model.recommendUsers(df, 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. was (Author: mlnick): There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df) // Option 2 - requires to (re)specify the user / item input col in the input DF val topKItemsForUsers = model.recommendItems(df, "userId", 10) val topKUsersForItems = model.recommendUsers(df, "itemId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
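For reference, the existing mllib batch-recommendation API that the ML model would need to match can be exercised as follows, assuming a trained MatrixFactorizationModel:

{code}
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// The existing mllib API whose functionality should be exposed on the ML side.
val model: MatrixFactorizationModel = ??? // e.g. ALS.train(ratings, rank = 10, iterations = 10)

// Top 10 items per user and top 10 users per item, across all users/items.
val itemsForUsers: RDD[(Int, Array[Rating])] = model.recommendProductsForUsers(10)
val usersForItems: RDD[(Int, Array[Rating])] = model.recommendUsersForProducts(10)

// Top 10 items for a single user.
val topForUser42: Array[Rating] = model.recommendProducts(user = 42, num = 10)
{code}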
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195702#comment-15195702 ] Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM: - Also, what's nice in the ML API is that SPARK-10802 is essentially taken care of by passing in a DataFrame with the users of interest, e.g. {code} val users = df.filter(df("age") > 21) val topK = model.setK(10).setUserTopKCol("userTopK").transform(users) {code} was (Author: mlnick): Also, what's nice in the ML API is that SPARK-10802 is essentially taken care of by passing in a DataFrame with the users of interest, e.g. {code} val users = df.filter(df("age") > 21) val topK = model.setK(10).setTopKCol("userId").transform(users) {code} > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696 ] Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM: - There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df) // Option 2 - requires to (re)specify the user / item input col in the input DF val topKItemsForUsers = model.recommendItems(df, "userId", 10) val topKUsersForItems = model.recommendUsers(df, "itemId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. was (Author: mlnick): There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, 10) val topKUsersForItems = model.recommendUsers(df, 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696 ] Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:41 AM: - There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, 10) val topKUsersForItems = model.recommendUsers(df, 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. was (Author: mlnick): There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, "userId", 10) val topKUsersForItems = model.recommendUsers(df, "itemId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696 ] Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM: - There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding methods such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, "userId", 10) val topKUsersForItems = model.recommendUsers(df, "itemId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. was (Author: mlnick): There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding a method such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, "userId", 10) val topKUsersForItems = model.recommendUsers(df, "itemId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696 ] Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM: - There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding a method such as {{recommendItems}} and {{recommendUsers}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and export the resulting predictions DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 - requires 3 extra params val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df) val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df) // Option 2 val topKItemsForUsers = model.recommendItems(df, "userId", 10) val topKUsersForItems = model.recommendUsers(df, "itemId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the {{Transformer}} API, even though it's a little more clunky. was (Author: mlnick): There are two broad options for adding this, in terms of ML API: # Extending {{transform}} to work with additional param(s) to specify whether to recommend top-k. # Adding a method such as {{predictTopK}}. I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However this seems to fall more naturally into #1, so that it can be part of a Pipeline. Having said that, this is likely to be the final stage of a pipeline - use model to batch-predict recommendations, and save the resulting DF - so perhaps not that important. e.g. {code} val model = ALS.fit(df) // model has userCol and itemCol set, so calling transform makes predictions for each user, item combination val predictions = model.transform(df) // Option 1 val topKItemsForUsers = model.setK(10).setTopKCol("userId").transform(df) // Option 2 val topKItemsForUsers = model.predictTopK("userId", 10) {code} [~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit more neatly into the {{Transformer}} API, even though it's a little more clunky. > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13899) Produce InternalRow instead of external Row
[ https://issues.apache.org/jira/browse/SPARK-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13899. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.0.0 > Produce InternalRow instead of external Row > --- > > Key: SPARK-13899 > URL: https://issues.apache.org/jira/browse/SPARK-13899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.0.0 > > > Currently CSVRelation.parseCsv produces external {{Row}} objects. > As noted in an existing TODO, to avoid the extra encoding step it would be great if this produced > {{InternalRow}} instead of external {{Row}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types
[ https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13926: Assignee: Apache Spark (was: Josh Rosen) > Automatically use Kryo serializer when shuffling RDDs with simple types > --- > > Key: SPARK-13926 > URL: https://issues.apache.org/jira/browse/SPARK-13926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Josh Rosen >Assignee: Apache Spark > > Because ClassTags are available when constructing ShuffledRDD we can use > them to automatically use Kryo for shuffle serialization when the RDD's types > are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or > combiner types are primitives, arrays of primitives, or strings). This is > likely to result in a large performance gain for many RDD API workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types
[ https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13926: Assignee: Josh Rosen (was: Apache Spark) > Automatically use Kryo serializer when shuffling RDDs with simple types > --- > > Key: SPARK-13926 > URL: https://issues.apache.org/jira/browse/SPARK-13926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > Because ClassTags are available when constructing ShuffledRDD we can use > them to automatically use Kryo for shuffle serialization when the RDD's types > are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or > combiner types are primitives, arrays of primitives, or strings). This is > likely to result in a large performance gain for many RDD API workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types
[ https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196857#comment-15196857 ] Apache Spark commented on SPARK-13926: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11755 > Automatically use Kryo serializer when shuffling RDDs with simple types > --- > > Key: SPARK-13926 > URL: https://issues.apache.org/jira/browse/SPARK-13926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > Because ClassTags are available when constructing ShuffledRDD we can use > them to automatically use Kryo for shuffle serialization when the RDD's types > are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or > combiner types are primitives, arrays of primitives, or strings). This is > likely to result in a large performance gain for many RDD API workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13920. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
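The exclusions mentioned above would follow the existing MimaExcludes pattern; the sketch below uses placeholder class and method names rather than actual 2.0 breaks.

{code}
import com.typesafe.tools.mima.core._

// Sketch of entries for project/MimaExcludes.scala covering intentional 2.0 breaks.
// The class and method names are placeholders for illustration only.
val sampleExcludes = Seq(
  // An @Experimental class that was removed on purpose.
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.example.SomeExperimentalClass"),
  // A @DeveloperApi method whose signature changed.
  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.example.SomeClass.someMethod")
)
{code}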
[jira] [Created] (SPARK-13926) Automatically use Kryo serializer when it is known to be safe
Josh Rosen created SPARK-13926: -- Summary: Automatically use Kryo serializer when it is known to be safe Key: SPARK-13926 URL: https://issues.apache.org/jira/browse/SPARK-13926 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Josh Rosen Assignee: Josh Rosen Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings). This is likely to result in a large performance gain for many RDD API workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types
[ https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13926: --- Summary: Automatically use Kryo serializer when shuffling RDDs with simple types (was: Automatically use Kryo serializer when it is known to be safe) > Automatically use Kryo serializer when shuffling RDDs with simple types > --- > > Key: SPARK-13926 > URL: https://issues.apache.org/jira/browse/SPARK-13926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Josh Rosen >Assignee: Josh Rosen > > Because ClassTags are available when constructing ShuffledRDD we can use > them to automatically use Kryo for shuffle serialization when the RDD's types > are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or > combiner types are primitives, arrays of primitives, or strings). This is > likely to result in a large performance gain for many RDD API workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
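The ClassTag-based check could look roughly like the sketch below; the helper name and the exact set of Kryo-safe types are assumptions, not the implemented logic.

{code}
import scala.reflect.ClassTag

// Sketch: decide whether a shuffle's key/value/combiner types are simple enough that
// Kryo is known to handle them safely. The type set here is illustrative only.
val kryoSafeClasses: Set[Class[_]] = Set(
  classOf[Boolean], classOf[Byte], classOf[Char], classOf[Short],
  classOf[Int], classOf[Long], classOf[Float], classOf[Double],
  classOf[String], classOf[Array[Byte]], classOf[Array[Int]], classOf[Array[Long]])

def canUseKryo(tags: Seq[ClassTag[_]]): Boolean =
  tags.forall(tag => kryoSafeClasses.contains(tag.runtimeClass))

// e.g. for shuffling an RDD[(Int, String)]:
val useKryo = canUseKryo(Seq(implicitly[ClassTag[Int]], implicitly[ClassTag[String]]))
{code}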
[jira] [Comment Edited] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196805#comment-15196805 ] Hari Shreedharan edited comment on SPARK-13877 at 3/16/16 5:49 AM: --- [~c...@koeninger.org] - Sure. I agree with having one or more repos - each building against a set of compatible APIs. My point is whatever the case be - it is more flexible to do that outside Spark than have multiple codebases/modules inside. was (Author: hshreedharan): [~c...@koeninger.org] - Sure. I agree with having one or more repos - each building against a set of compatible APIs. My point is whatever the case be - it is more flexible to do that outside Spark than have multiple codebases inside. > Consider removing Kafka modules from Spark / Spark Streaming > > > Key: SPARK-13877 > URL: https://issues.apache.org/jira/browse/SPARK-13877 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Streaming >Affects Versions: 1.6.1 >Reporter: Hari Shreedharan > > Based on the discussion the PR for SPARK-13843 > ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), > we should consider moving the Kafka modules out of Spark as well. > Providing newer functionality (like security) has become painful while > maintaining compatibility with older versions of Kafka. Moving this out > allows more flexibility, allowing users to mix and match Kafka and Spark > versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196805#comment-15196805 ] Hari Shreedharan commented on SPARK-13877: -- [~c...@koeninger.org] - Sure. I agree with having one or more repos - each building against a set of compatible APIs. My point is whatever the case be - it is more flexible to do that outside Spark than have multiple codebases inside. > Consider removing Kafka modules from Spark / Spark Streaming > > > Key: SPARK-13877 > URL: https://issues.apache.org/jira/browse/SPARK-13877 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Streaming >Affects Versions: 1.6.1 >Reporter: Hari Shreedharan > > Based on the discussion the PR for SPARK-13843 > ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), > we should consider moving the Kafka modules out of Spark as well. > Providing newer functionality (like security) has become painful while > maintaining compatibility with older versions of Kafka. Moving this out > allows more flexibility, allowing users to mix and match Kafka and Spark > versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13925) Expose R-like summary statistics in SparkR::glm for more family and link functions
Xiangrui Meng created SPARK-13925: - Summary: Expose R-like summary statistics in SparkR::glm for more family and link functions Key: SPARK-13925 URL: https://issues.apache.org/jira/browse/SPARK-13925 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Xiangrui Meng This continues the work of SPARK-11494, SPARK-9837, and SPARK-12566 to expose R-like model summary in more family and link functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13925) Expose R-like summary statistics in SparkR::glm for more family and link functions
[ https://issues.apache.org/jira/browse/SPARK-13925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13925: -- Priority: Critical (was: Major) > Expose R-like summary statistics in SparkR::glm for more family and link > functions > -- > > Key: SPARK-13925 > URL: https://issues.apache.org/jira/browse/SPARK-13925 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Priority: Critical > > This continues the work of SPARK-11494, SPARK-9837, and SPARK-12566 to expose > R-like model summary in more family and link functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9837) Provide R-like summary statistics for GLMs via iteratively reweighted least squares
[ https://issues.apache.org/jira/browse/SPARK-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9837. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11694 [https://github.com/apache/spark/pull/11694] > Provide R-like summary statistics for GLMs via iteratively reweighted least > squares > --- > > Key: SPARK-9837 > URL: https://issues.apache.org/jira/browse/SPARK-9837 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Yanbo Liang >Priority: Critical > Fix For: 2.0.0 > > > This is similar to SPARK-9836 but for GLMs other than ordinary least squares. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13917) Generate code for broadcast left semi join
[ https://issues.apache.org/jira/browse/SPARK-13917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13917. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11742 [https://github.com/apache/spark/pull/11742] > Generate code for broadcast left semi join > -- > > Key: SPARK-13917 > URL: https://issues.apache.org/jira/browse/SPARK-13917 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13903) Modify output nullability with constraints for Join and Filter operators
[ https://issues.apache.org/jira/browse/SPARK-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-13903: Summary: Modify output nullability with constraints for Join and Filter operators (was: Modify output nullability with constraints for Join operator) > Modify output nullability with constraints for Join and Filter operators > > > Key: SPARK-13903 > URL: https://issues.apache.org/jira/browse/SPARK-13903 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > With constraints and optimization, we can make sure some outputs of a Join > operator are not nulls. We should modify output nullability accordingly. We > can use this information in later execution to avoid null checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13903) Modify output nullability with constraints for Join and Filter operators
[ https://issues.apache.org/jira/browse/SPARK-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-13903: Description: With constraints and optimization, we can make sure some outputs of a Join (or Filter) operator are not nulls. We should modify output nullability accordingly. We can use this information in later execution to avoid null checking. Another reason to modify plan output is that we will use the output to determine schema. We should keep correct nullability in the schema. was:With constraints and optimization, we can make sure some outputs of a Join operator are not nulls. We should modify output nullability accordingly. We can use this information in later execution to avoid null checking. > Modify output nullability with constraints for Join and Filter operators > > > Key: SPARK-13903 > URL: https://issues.apache.org/jira/browse/SPARK-13903 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > With constraints and optimization, we can make sure some outputs of a Join > (or Filter) operator are not nulls. We should modify output nullability > accordingly. We can use this information in later execution to avoid null > checking. > Another reason to modify plan output is that we will use the output to > determine schema. We should keep correct nullability in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
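The motivation is visible from the user side: after a null-filtering predicate, the schema still reports the column as nullable. A minimal sketch, assuming a hypothetical DataFrame {{df}} with a nullable column "a":

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch of the current behavior this issue wants to improve.
val df: DataFrame = ??? // hypothetical DataFrame with a nullable column "a"

val filtered = df.filter(col("a").isNotNull)

// Currently prints `true` even though the filter guarantees no nulls; with the
// proposed change the output nullability would become `false`.
println(filtered.schema("a").nullable)
{code}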
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196765#comment-15196765 ] Xusen Yin commented on SPARK-13641: --- [~muralidh] I am going to close this JIRA since I found that this behavior is intended: the one-hot encoder indexes discrete features this way. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets features' names from the features column. Usually, the > features column is generated by RFormula. The latter has a VectorAssembler in > it, which causes the output attributes to differ from the original ones. > E.g., we want to get HouseVotes84's feature names "V1, V2, ..., V16". > But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > appends suffixes to the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196765#comment-15196765 ] Xusen Yin edited comment on SPARK-13641 at 3/16/16 5:00 AM: I gonna close this JIRA since I find that it is intended to do so by One-hot encoder to index discrete feature. was (Author: yinxusen): [~muralidh] I gonna close this JIRA since I find that it is intended to do so by One-hot encoder to index discrete feature. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets features' names from the features column. Usually, the > features column is generated by RFormula. The latter has a VectorAssembler in > it, which leads the output attributes not equal with the original ones. > E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". > But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > adds salts of the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
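For readers unfamiliar with the behavior discussed in this thread, the following hedged sketch (toy data, made-up column values, Spark 2.x SparkSession API assumed) shows how the feature attribute names end up with category suffixes: RFormula one-hot encodes string columns before assembling the feature vector, so the attributes are named like "V1_y" rather than "V1".

{code}
// Hedged illustration with invented data; only the general RFormula behavior is assumed.
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

object RFormulaAttrNames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rformula-attrs").getOrCreate()
    import spark.implicits._

    val df = Seq(("y", "n", 1.0), ("n", "y", 0.0), ("y", "y", 1.0)).toDF("V1", "V2", "label")
    val model = new RFormula().setFormula("label ~ V1 + V2").fit(df)
    val transformed = model.transform(df)

    // The attribute names carry one-hot category suffixes ("V1_y"-style), not the original "V1", "V2".
    val attrs = AttributeGroup.fromStructField(transformed.schema(model.getFeaturesCol))
    attrs.attributes.foreach(_.foreach(a => println(a.name.getOrElse("<unnamed>"))))

    spark.stop()
  }
}
{code}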
[jira] [Assigned] (SPARK-13924) officially support multi-insert
[ https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13924: Assignee: Apache Spark > officially support multi-insert > --- > > Key: SPARK-13924 > URL: https://issues.apache.org/jira/browse/SPARK-13924 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13924) officially support multi-insert
[ https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13924: Assignee: (was: Apache Spark) > officially support multi-insert > --- > > Key: SPARK-13924 > URL: https://issues.apache.org/jira/browse/SPARK-13924 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13924) officially support multi-insert
[ https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196760#comment-15196760 ] Apache Spark commented on SPARK-13924: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11754 > officially support multi-insert > --- > > Key: SPARK-13924 > URL: https://issues.apache.org/jira/browse/SPARK-13924 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards
[ https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196758#comment-15196758 ] Apache Spark commented on SPARK-13316: -- User 'mwws' has created a pull request for this issue: https://github.com/apache/spark/pull/11753 > "SparkException: DStream has not been initialized" when restoring > StreamingContext from checkpoint and the dstream is created afterwards > > > Key: SPARK-13316 > URL: https://issues.apache.org/jira/browse/SPARK-13316 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Jacek Laskowski >Priority: Minor > > I faced the issue today but [it was already reported on > SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the > reason is that a dstream is registered after a StreamingContext has been > recreated from checkpoint. > It _appears_ that...no dstreams must be registered after a StreamingContext > has been recreated from checkpoint. It is *not* obvious at first. > The code: > {code} > def createStreamingContext(): StreamingContext = { > val ssc = new StreamingContext(sparkConf, Duration(1000)) > ssc.checkpoint(checkpointDir) > ssc > } > val ssc = StreamingContext.getOrCreate(checkpointDir), createStreamingContext) > val socketStream = ssc.socketTextStream(...) > socketStream.checkpoint(Seconds(1)) > socketStream.foreachRDD(...) > {code} > It should be described in docs at the very least and/or checked in the code > when the streaming computation starts. > The exception is as follows: > {code} > org.apache.spark.SparkException: > org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been > initialized > at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311) > at > org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228) > at > 
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97) > at > org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at ... run in separate thread using org.apache.spark.util.ThreadUtils ... () > at > org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579) > ... 43 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
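For reference, the pattern the report implies is sketched below (hedged: placeholder checkpoint path and socket address, not the reporter's actual application). Every DStream must be created and registered inside the factory function passed to StreamingContext.getOrCreate, so that a context restored from the checkpoint already knows about all of its streams.

{code}
// Hedged sketch; the checkpoint directory and socket endpoint are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  val checkpointDir = "/tmp/checkpoint"

  def createStreamingContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-app")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)

    // All stream setup lives here, not after getOrCreate returns.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.checkpoint(Seconds(10))
    lines.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))

    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}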
[jira] [Assigned] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards
[ https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13316: Assignee: Apache Spark > "SparkException: DStream has not been initialized" when restoring > StreamingContext from checkpoint and the dstream is created afterwards > > > Key: SPARK-13316 > URL: https://issues.apache.org/jira/browse/SPARK-13316 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > > I faced the issue today but [it was already reported on > SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the > reason is that a dstream is registered after a StreamingContext has been > recreated from checkpoint. > It _appears_ that...no dstreams must be registered after a StreamingContext > has been recreated from checkpoint. It is *not* obvious at first. > The code: > {code} > def createStreamingContext(): StreamingContext = { > val ssc = new StreamingContext(sparkConf, Duration(1000)) > ssc.checkpoint(checkpointDir) > ssc > } > val ssc = StreamingContext.getOrCreate(checkpointDir), createStreamingContext) > val socketStream = ssc.socketTextStream(...) > socketStream.checkpoint(Seconds(1)) > socketStream.foreachRDD(...) > {code} > It should be described in docs at the very least and/or checked in the code > when the streaming computation starts. > The exception is as follows: > {code} > org.apache.spark.SparkException: > org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been > initialized > at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311) > at > org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228) > at > org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97) > at > 
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at ... run in separate thread using org.apache.spark.util.ThreadUtils ... () > at > org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579) > ... 43 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards
[ https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13316: Assignee: (was: Apache Spark) > "SparkException: DStream has not been initialized" when restoring > StreamingContext from checkpoint and the dstream is created afterwards > > > Key: SPARK-13316 > URL: https://issues.apache.org/jira/browse/SPARK-13316 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Jacek Laskowski >Priority: Minor > > I faced the issue today but [it was already reported on > SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the > reason is that a dstream is registered after a StreamingContext has been > recreated from checkpoint. > It _appears_ that...no dstreams must be registered after a StreamingContext > has been recreated from checkpoint. It is *not* obvious at first. > The code: > {code} > def createStreamingContext(): StreamingContext = { > val ssc = new StreamingContext(sparkConf, Duration(1000)) > ssc.checkpoint(checkpointDir) > ssc > } > val ssc = StreamingContext.getOrCreate(checkpointDir), createStreamingContext) > val socketStream = ssc.socketTextStream(...) > socketStream.checkpoint(Seconds(1)) > socketStream.foreachRDD(...) > {code} > It should be described in docs at the very least and/or checked in the code > when the streaming computation starts. > The exception is as follows: > {code} > org.apache.spark.SparkException: > org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been > initialized > at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311) > at > org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228) > at > org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97) > at > org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83) > at > 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585) > at ... run in separate thread using org.apache.spark.util.ThreadUtils ... () > at > org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585) > at > org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579) > ... 43 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly
[ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196756#comment-15196756 ] Ashok kumar Rajendran commented on SPARK-13900: --- Along with dimensions condition, the above queries have some timestamp based filer as well. But it does not seem to affect the query plan much between these 2 types of query. I would highly appreciate any help on optimizing this query as this is a critical query in our application. > Spark SQL queries with OR condition is not optimized properly > - > > Key: SPARK-13900 > URL: https://issues.apache.org/jira/browse/SPARK-13900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ashok kumar Rajendran > > I have a large table with few billions of rows and have a very small table > with 4 dimensional values. All the data is stored in parquet format. I would > like to get rows that match any of these dimensions. For example, > Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR > A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = > B.dimension4. > The query plan takes this as BroadcastNestedLoopJoin and executes for very > long time. > If I execute this as Union queries, it takes around 1.5mins for each > dimension. Each query internally does BroadcastHashJoin. > Select field1, field2 from A, B where A.dimension1 = B.dimension1 > UNION ALL > Select field1, field2 from A, B where A.dimension2 = B.dimension2 > UNION ALL > Select field1, field2 from A, B where A.dimension3 = B.dimension3 > UNION ALL > Select field1, field2 from A, B where A.dimension4 = B.dimension4. > This is obviously not an optimal solution as it makes multiple scanning at > same table but it gives result much better than OR condition. > Seems the SQL optimizer is not working properly which causes huge performance > impact on this type of OR query. > Please correct me if I miss anything here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
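A hedged DataFrame-API sketch of the UNION ALL workaround described in the report is below (Spark 2.x API assumed; table and column names follow the report, and the duplicate handling a real rewrite would need when a row matches more than one dimension is left out for brevity).

{code}
// Hedged sketch of the workaround, not a general-purpose rewrite.
import org.apache.spark.sql.SparkSession

object OrJoinRewrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("or-join-rewrite").getOrCreate()

    val a = spark.table("A")
    val b = spark.table("B")

    // One equi-join per dimension, unioned together; each branch can use a broadcast hash join.
    val joined = (1 to 4).map { i =>
      a.join(b, a(s"dimension$i") === b(s"dimension$i"))
        .select(a("field1"), a("field2"))
    }.reduce(_ union _)

    joined.explain()
    spark.stop()
  }
}
{code}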
[jira] [Commented] (SPARK-3308) Ability to read JSON Arrays as tables
[ https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196754#comment-15196754 ] Apache Spark commented on SPARK-3308: - User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/11752 > Ability to read JSON Arrays as tables > - > > Key: SPARK-3308 > URL: https://issues.apache.org/jira/browse/SPARK-3308 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Yin Huai >Priority: Critical > Fix For: 1.2.0 > > > Right now we can only read json where each object is on its own line. It > would be nice to be able to read top level json arrays where each element > maps to a row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
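To make the request concrete, the hedged sketch below (Spark 2.x read API, placeholder paths) contrasts the format the JSON source handles today, one object per line, with the top-level array layout this issue asks to support.

{code}
// Hedged sketch; paths are placeholders.
import org.apache.spark.sql.SparkSession

object JsonLinesVsArray {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-lines").getOrCreate()

    // Works today: each line of the file is an independent JSON object, e.g.
    //   {"id": 1, "name": "a"}
    //   {"id": 2, "name": "b"}
    val perLine = spark.read.json("/tmp/objects-per-line.json")
    perLine.printSchema()

    // What the issue asks for: a file whose entire content is
    //   [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    // being read as two rows instead of failing or producing a corrupt record.
    spark.stop()
  }
}
{code}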
[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly
[ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196753#comment-15196753 ] Ashok kumar Rajendran commented on SPARK-13900: --- Explain plan for query with OR condition is as below. Explain execution 16/03/15 21:00:22 INFO datasources.DataSourceStrategy: Selected 24 partitions out of 24, pruned 0.0% partitions. 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 73.3 KB, free 511.5 KB) 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 4.9 KB, free 516.5 KB) 16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 8 from explain at JavaSparkSQL.java:682 16/03/15 21:00:23 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 73.3 KB, free 589.8 KB) 16/03/15 21:00:23 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 4.9 KB, free 594.7 KB) 16/03/15 21:00:23 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:23 INFO spark.SparkContext: Created broadcast 9 from explain at JavaSparkSQL.java:682 == Parsed Logical Plan == 'Project [unresolvedalias('TableA_Dimension1),unresolvedalias('TableA_Dimension2),unresolvedalias('TableA_Dimension3),unresolvedalias('TableA_timestamp_millis),unresolvedalias('TableA_field40),unresolvedalias('TableB_dimension1 AS TableB_dimension1#248),unresolvedalias('TableB_dimension2 AS inv_ua#249),unresolvedalias('TableB_dimension3 AS TableB_dimension3#250),unresolvedalias('TableB_timestamp_mills AS TableB_timestamp_mills#251)] +- 'Filter (('TableA_Dimension1 = 'TableB_dimension1) || ('TableA_Dimension3 = 'TableB_dimension3)) || ('TableA_Dimension2 = 'TableB_dimension2)) && ('TableA_timestamp_millis >= 'TableB_timestamp_mills)) && ('TableA_timestamp_millis <= ('TableB_timestamp_mills + 360))) && ('TableA_partition_hour_bucket >= 'TableB_partition_hour_bucket)) +- 'Join Inner, None :- 'UnresolvedRelation `TableA`, None +- 'UnresolvedRelation `TableB`, None == Analyzed Logical Plan == TableA_Dimension1: string, TableA_Dimension2: string, TableA_Dimension3: string, TableA_timestamp_millis: string, TableA_field40: string, TableB_dimension1: string, inv_ua: string, TableB_dimension3: string, TableB_timestamp_mills: string Project [TableA_Dimension1#74,TableA_Dimension2#94,TableA_Dimension3#68,TableA_timestamp_millis#38,TableA_field40#40,TableB_dimension1#162 AS TableB_dimension1#248,TableB_dimension2#155 AS inv_ua#249,TableB_dimension3#156 AS TableB_dimension3#250,TableB_timestamp_mills#157 AS TableB_timestamp_mills#251] +- Filter ((TableA_Dimension1#74 = TableB_dimension1#162) || (TableA_Dimension3#68 = TableB_dimension3#156)) || (TableA_Dimension2#94 = TableB_dimension2#155)) && (TableA_timestamp_millis#38 >= TableB_timestamp_mills#157)) && (cast(TableA_timestamp_millis#38 as double) <= (cast(TableB_timestamp_mills#157 as double) + cast(360 as double && (TableA_partition_hour_bucket#153 >= TableB_partition_hour_bucket#159)) +- Join Inner, None :- Subquery TableA : +- 
Relation[TableA_field0#0,TableA_field1#1,TableA_field2#2,TableA_field3#3,TableA_field4#4,TableA_field5#5,TableA_field6#6,TableA_field7#7,TableA_field8#8,TableA_field9#9,TableA_field10#10,TableA_field11#11,TableA_field12#12,TableA_field13#13,TableA_field14#14,TableA_field15#15,,TableA_field150#150] ParquetRelation +- Subquery TableB +- Relation[timeBucket#154,TableB_dimension2#155,TableB_dimension3#156,TableB_timestamp_mills#157,endTime#158,TableB_partition_hour_bucket#159,endTimeBucket#160,count#161L,TableB_dimension1#162] ParquetRelation == Optimized Logical Plan == Project [TableA_Dimension1#74,TableA_Dimension2#94,TableA_Dimension3#68,TableA_timestamp_millis#38,TableA_field40#40,TableB_dimension1#162 AS TableB_dimension1#248,TableB_dimension2#155 AS inv_ua#249,TableB_dimension3#156 AS TableB_dimension3#250,TableB_timestamp_mills#157 AS TableB_timestamp_mills#251] +- Join Inner, Some(((TableA_Dimension1#74 = TableB_dimension1#162) || (TableA_Dimension3#68 = TableB_dimension3#156)) || (TableA_Dimension2#94 = TableB_dimension2#155)) && (TableA_timestamp_millis#38 >= TableB_timestamp_mills#157)) && (cast(TableA_timestamp_millis#38 as double) <= (cast(TableB_timestamp_mills#157 as double) + 360.0))) && (TableA_partition_hour_bucket#153 >= TableB_partition_hour_bucket#159))) :- Project [TableA_timestamp_millis#38,TableA_Dimension1#74,TableA_field40#40,TableA_Dimension3#68,TableA_partition_hour_bucket#153,TableA_Dimension2#94] : +- Relation[Ta
[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly
[ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196752#comment-15196752 ] Ashok kumar Rajendran commented on SPARK-13900: --- Hi Xio Li, Thanks for looking into this. Here is the explain plan for these 2 queries. (TableA is a big table with 150 fields, I just shortened here to reduce text size) Execution plan for Union Query. Explain execution 16/03/15 21:00:21 INFO datasources.DataSourceStrategy: Selected 24 partitions out of 24, pruned 0.0% partitions. 16/03/15 21:00:21 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 24.1 KB, free 41.9 KB) 16/03/15 21:00:21 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.9 KB, free 46.9 KB) 16/03/15 21:00:21 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:21 INFO spark.SparkContext: Created broadcast 2 from explain at JavaSparkSQL.java:676 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 73.3 KB, free 120.2 KB) 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 4.9 KB, free 125.1 KB) 16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 3 from explain at JavaSparkSQL.java:676 16/03/15 21:00:22 INFO datasources.DataSourceStrategy: Selected 24 partitions out of 24, pruned 0.0% partitions. 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 73.3 KB, free 198.5 KB) 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.9 KB, free 203.4 KB) 16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 4 from explain at JavaSparkSQL.java:676 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 73.3 KB, free 276.7 KB) 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.9 KB, free 281.7 KB) 16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 5 from explain at JavaSparkSQL.java:676 16/03/15 21:00:22 INFO datasources.DataSourceStrategy: Selected 24 partitions out of 24, pruned 0.0% partitions. 
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 73.3 KB, free 355.0 KB) 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.9 KB, free 359.9 KB) 16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 6 from explain at JavaSparkSQL.java:676 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 73.3 KB, free 433.3 KB) 16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 4.9 KB, free 438.2 KB) 16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB) 16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 7 from explain at JavaSparkSQL.java:676 == Parsed Logical Plan == Union :- Union : :- Project [TableA_Dimension1#74,TableA_Dimension2#94,TableA_Dimension3#68,TableA_timestamp_millis#38,TableA_field40#40,TableB_dimension1#162 AS TableB_dimension1#163,TableB_dimension2#155 AS inv_ua#164,TableB_dimension3#156 AS TableB_dimension3#165,TableB_timestamp_mills#157 AS TableB_timestamp_mills#166] : : +- Filter (((TableA_timestamp_millis#38 >= TableB_timestamp_mills#157) && (cast(TableA_timestamp_millis#38 as double) <= (cast(TableB_timestamp_mills#157 as double) + cast(360 as double && (TableA_partition_hour_bucket#153 >= TableB_partition_hour_bucket#159)) : : +- Join Inner, Some((TableA_Dimension1#74 = TableB_dimension1#162)) : ::- Subquery TableA : :: +- Relation[TableA_field0#0,TableA_field1#1,TableA_field2#2,TableA_field3#3,TableA_field4#4,TableA_field5#5,TableA_field6#6,TableA_field7#7,TableA_field8#8,TableA_field9#9,TableA_field10#10,TableA_field11#11,TableA_field12#12,TableA_field13#13,TableA_field14#14,TableA_field15#15,,TableA_field150#150] ParquetRelation : :+- Subquery TableB : : +-
[jira] [Comment Edited] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464 ] Dilip Biswal edited comment on SPARK-13821 at 3/16/16 4:41 AM: --- [~roycecil] Just tried the original query no. 20 against spark 2.0 posted at https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 . I could see the same error that is reported in the JIRA. It seems that there is an extra comma in the projection list between two columns like following. {code} select i_item_id, ,i_item_desc {code} Please note that we ran against 2.0 and not 1.6. Can you please re-run to make sure ? was (Author: dkbiswal): [~roycecil] Just tried the original query no. 20 against spark 2.0 posted at https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 . I could see the same error that is reported in the JIRA. It seems the there is an extra comma in the projection list between two columns like following. {code} select i_item_id, ,i_item_desc {code} Please note that we ran against 2.0 and not 1.6. Can you please re-run to make sure ? > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 Fails to compile with the follwing Error Message > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13924) officially support multi-insert
Wenchen Fan created SPARK-13924: --- Summary: officially support multi-insert Key: SPARK-13924 URL: https://issues.apache.org/jira/browse/SPARK-13924 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13920: Assignee: Apache Spark > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen >Assignee: Apache Spark > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
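For context, project/MimaExcludes.scala collects MiMa problem filters; once formerly ignored @Experimental and @DeveloperApi APIs are checked, intentional breaks need entries of the shape sketched below (hedged: the class and method names here are invented for illustration).

{code}
// Hedged sketch; the excluded class/method names are made up.
import com.typesafe.tools.mima.core._

object ExampleExcludes {
  val excludes = Seq(
    // Acknowledge an intentional binary break in 2.0 for an API that used to be
    // skipped by the checks because it was annotated @Experimental.
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.example.SomeExperimentalClass.someRemovedMethod")
  )
}
{code}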
[jira] [Commented] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196743#comment-15196743 ] Apache Spark commented on SPARK-13920: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11751 > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13920: Assignee: (was: Apache Spark) > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13918. - Resolution: Fixed Fix Version/s: 2.0.0 > merge SortMergeJoin and SortMergeOuterJoin > -- > > Key: SPARK-13918 > URL: https://issues.apache.org/jira/browse/SPARK-13918 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > There is a lot of duplicated code between SortMergeJoin and SortMergeOuterJoin; > we should merge them and reduce the duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13764) Parse modes in JSON data source
[ https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196665#comment-15196665 ] Hyukjin Kwon commented on SPARK-13764: -- The issue SPARK-3308 is related with supporting each row wrapped with an array. For other types, this works like the PERMISSIVE mode in CSV data source but only when actual data is an array and the given data type is {{StructType}}, it emits an exception above. This is because JSON data source converts each data to a desirable type by the combination of Jackson parser's token and given data type but the combination of {{START_OBJECT}} and {{ArrayType}} exists for the issue above. So, if the given schema is {{StructType}} and given actual data is an array, the combination of {{START_OBJECT}} and {{ArrayType}} is applied to this case, which causes the exception above. > Parse modes in JSON data source > --- > > Key: SPARK-13764 > URL: https://issues.apache.org/jira/browse/SPARK-13764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, JSON data source just fails to read if some JSON documents are > malformed. > Therefore, if there are two JSON documents below: > {noformat} > { > "request": { > "user": { > "id": 123 > } > } > } > {noformat} > {noformat} > { > "request": { > "user": [] > } > } > {noformat} > This will fail emitting the exception below : > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: > Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): > java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData > cannot be cast to org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > So, just like the parse modes in CSV data source, (See > https://github.com/databricks/spark-csv), it would be great if there are some > parse modes so that users do not have to filter or pre-process themselves. > This happens only when custom schema is set. when this uses inferred schema, > then it infers the type as {{StringType}} which reads the data successfully > anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
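To make the proposed behavior concrete, the hedged sketch below shows the option style the CSV data source already exposes (PERMISSIVE, DROPMALFORMED, FAILFAST) and what the requested JSON counterpart could look like; the JSON "mode" option is hypothetical until this issue is implemented, Spark 2.x reader API is assumed, and the paths are placeholders.

{code}
// Hedged sketch; the JSON "mode" option is the proposal, not existing behavior
// in the affected versions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ParseModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("parse-modes").getOrCreate()

    // Custom schema matching the documents in the description.
    val schema = StructType(Seq(
      StructField("request", StructType(Seq(
        StructField("user", StructType(Seq(
          StructField("id", LongType)
        )))
      )))
    ))

    // CSV already supports parse modes.
    val csv = spark.read
      .option("mode", "DROPMALFORMED")
      .csv("/tmp/input.csv")
    csv.printSchema()

    // Proposed: JSON accepts the same option and drops (or nulls out) rows that
    // do not match the given schema instead of failing with a ClassCastException.
    val json = spark.read
      .schema(schema)
      .option("mode", "DROPMALFORMED")
      .json("/tmp/input.json")
    json.show()

    spark.stop()
  }
}
{code}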
[jira] [Comment Edited] (SPARK-13843) Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
[ https://issues.apache.org/jira/browse/SPARK-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196661#comment-15196661 ] Liwei Lin edited comment on SPARK-13843 at 3/16/16 2:50 AM: hi [~zsxwing], we didn't move streaming-kinesis (which is also under external) out -- is this left out on purpose or should we also move that out? Thanks! was (Author: proflin): hi [~zsxwing], we didn't move streaming-kinesis (which is also under external) out -- why is that please? > Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, > streaming-twitter to Spark packages > --- > > Key: SPARK-13843 > URL: https://issues.apache.org/jira/browse/SPARK-13843 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Currently there are a few sub-projects, each for integrating with different > external sources for Streaming. Now that we have better ability to include > external libraries (Spark packages) and with Spark 2.0 coming up, we can move > the following projects out of Spark to https://github.com/spark-packages > - streaming-flume > - streaming-akka > - streaming-mqtt > - streaming-zeromq > - streaming-twitter > They are just some ancillary packages and considering the overhead of > maintenance, running tests and PR failures, it's better to maintain them out > of Spark. In addition, these projects can have their different release cycles > and we can release them faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables
[ https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13923: Assignee: Andrew Or (was: Apache Spark) > Implement SessionCatalog to manage temp functions and tables > > > Key: SPARK-13923 > URL: https://issues.apache.org/jira/browse/SPARK-13923 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Today, we have ExternalCatalog, which is dead code. As part of the effort of > merging SQLContext/HiveContext we'll parse Hive commands and call the > corresponding methods in ExternalCatalog. > However, this handles only persisted things. We need something in addition to > that to handle temporary things. The new catalog is called SessionCatalog and > will internally call ExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables
[ https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196664#comment-15196664 ] Apache Spark commented on SPARK-13923: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11750 > Implement SessionCatalog to manage temp functions and tables > > > Key: SPARK-13923 > URL: https://issues.apache.org/jira/browse/SPARK-13923 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Today, we have ExternalCatalog, which is dead code. As part of the effort of > merging SQLContext/HiveContext we'll parse Hive commands and call the > corresponding methods in ExternalCatalog. > However, this handles only persisted things. We need something in addition to > that to handle temporary things. The new catalog is called SessionCatalog and > will internally call ExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables
[ https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13923: Assignee: Apache Spark (was: Andrew Or) > Implement SessionCatalog to manage temp functions and tables > > > Key: SPARK-13923 > URL: https://issues.apache.org/jira/browse/SPARK-13923 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > > Today, we have ExternalCatalog, which is dead code. As part of the effort of > merging SQLContext/HiveContext we'll parse Hive commands and call the > corresponding methods in ExternalCatalog. > However, this handles only persisted things. We need something in addition to > that to handle temporary things. The new catalog is called SessionCatalog and > will internally call ExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13843) Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
[ https://issues.apache.org/jira/browse/SPARK-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196661#comment-15196661 ] Liwei Lin commented on SPARK-13843: --- hi [~zsxwing], we didn't move streaming-kinesis (which is also under external) out -- why is that please? > Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, > streaming-twitter to Spark packages > --- > > Key: SPARK-13843 > URL: https://issues.apache.org/jira/browse/SPARK-13843 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Currently there are a few sub-projects, each for integrating with different > external sources for Streaming. Now that we have better ability to include > external libraries (Spark packages) and with Spark 2.0 coming up, we can move > the following projects out of Spark to https://github.com/spark-packages > - streaming-flume > - streaming-akka > - streaming-mqtt > - streaming-zeromq > - streaming-twitter > They are just some ancillary packages and considering the overhead of > maintenance, running tests and PR failures, it's better to maintain them out > of Spark. In addition, these projects can have their different release cycles > and we can release them faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables
Andrew Or created SPARK-13923: - Summary: Implement SessionCatalog to manage temp functions and tables Key: SPARK-13923 URL: https://issues.apache.org/jira/browse/SPARK-13923 Project: Spark Issue Type: Bug Components: SQL Reporter: Andrew Or Assignee: Andrew Or Today, we have ExternalCatalog, which is dead code. As part of the effort of merging SQLContext/HiveContext we'll parse Hive commands and call the corresponding methods in ExternalCatalog. However, this handles only persisted things. We need something in addition to that to handle temporary things. The new catalog is called SessionCatalog and will internally call ExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
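To sketch the split being described (hypothetical, simplified interfaces, not the actual Catalyst code): SessionCatalog keeps session-scoped temporary tables in memory, lets them shadow persistent tables on lookup, and forwards everything persistent to the underlying ExternalCatalog.

{code}
// Hypothetical sketch; names and signatures are simplified for illustration.
trait ExternalCatalog {
  def createTable(db: String, table: String): Unit
  def tableExists(db: String, table: String): Boolean
}

class InMemoryExternalCatalog extends ExternalCatalog {
  private val tables = scala.collection.mutable.Set[(String, String)]()
  def createTable(db: String, table: String): Unit = tables += (db -> table)
  def tableExists(db: String, table: String): Boolean = tables.contains(db -> table)
}

class SessionCatalog(external: ExternalCatalog) {
  // Temporary tables are session-local and never hit the external catalog.
  private val tempTables = scala.collection.mutable.Map[String, String]() // name -> (stubbed) plan

  def createTempTable(name: String, plan: String): Unit = tempTables(name) = plan

  // Temporary tables shadow persistent ones; everything else is delegated.
  def lookupTable(db: String, name: String): Option[String] =
    tempTables.get(name).orElse(
      if (external.tableExists(db, name)) Some(s"$db.$name") else None)
}
{code}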
[jira] [Comment Edited] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196626#comment-15196626 ] Dongjoon Hyun edited comment on SPARK-13920 at 3/16/16 2:16 AM: Oh, [~joshrosen]. I totally agree with you on the purpose of this issue. I'll start to work this issue. was (Author: dongjoon): Oh, [~joshrosen]. I totally agree with the purpose of this issue. I'll start to work this issue. > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196626#comment-15196626 ] Dongjoon Hyun commented on SPARK-13920: --- Oh, [~joshrosen]. I totally agree with the purpose of this issue. I'll start to work this issue. > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13915) Allow bin/spark-submit to be called via symbolic link
[ https://issues.apache.org/jira/browse/SPARK-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196590#comment-15196590 ] Saisai Shao commented on SPARK-13915: - I suppose it should be worked if you set {{SPARK_HOME}} in your environment. For the vendor release version, I think you need to check whether script is changed or not, that may be slightly different from community version. > Allow bin/spark-submit to be called via symbolic link > - > > Key: SPARK-13915 > URL: https://issues.apache.org/jira/browse/SPARK-13915 > Project: Spark > Issue Type: Improvement > Components: Spark Submit > Environment: CentOS 6.6 > Tarbal spark distribution and CDH-5.x.x Spark version (both) >Reporter: Rafael Pecin Ferreira >Priority: Minor > > We have a CDH-5 cluster that comes with spark-1.5.0 and we needed to use > spark-1.5.1 for bug fix issues. > When I set up the spark (out of the CDH box) to the system alternatives, it > created a sequence of symbolic links to the target spark installation. > When I tried to run spark-submit, the bash process call the target with "$0" > as /usr/bin/spark-submit, but this script use the "$0" variable to locate its > deps and I was facing this messages: > [hdfs@server01 ~]$ env spark-submit > ls: cannot access /usr/assembly/target/scala-2.10: No such file or directory > Failed to find Spark assembly in /usr/assembly/target/scala-2.10. > You need to build Spark before running this program. > I fixed the spark-submit script adding this lines: > if [ -h "$0" ] ; then > checklink="$0"; > while [ -h "$checklink" ] ; do > checklink=`readlink $checklink` > done > SPARK_HOME="$(cd "`dirname "$checklink"`"/..; pwd)"; > else > SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"; > fi > It would be very nice if this piece of code be put into the spark-submit > script to allow us to have multiple spark alternatives on the system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13872) Memory leak in SortMergeOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196585#comment-15196585 ] Yin Huai commented on SPARK-13872: -- Also cc [~davies] > Memory leak in SortMergeOuterJoin > - > > Key: SPARK-13872 > URL: https://issues.apache.org/jira/browse/SPARK-13872 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ian > Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png > > > SortMergeJoin composes its partition/iterator from > org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to > UnsafeExternalRowSorter. > UnsafeExternalRowSorter's implementation cleans up the resources when: > 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully > iterated. > 2. the task has finished execution. > In the outer join case of SortMergeJoin, when the left or right iterator is not > fully iterated, the only chance for the resources to be cleaned up is at the > end of the spark task run. > This is probably OK most of the time; however, when a SortMergeOuterJoin is > nested within a CartesianProduct, the "deferred" resource cleanup allows a > non-negligible memory leak, amplified and accumulated by the CartesianRdd's looping iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
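The behaviour described in SPARK-13872 can be modelled with a toy sketch (hedged: {{ToySorter}} below is made up and is not {{UnsafeExternalRowSorter}}; the sizes are arbitrary). Resources that are released only on full iteration or at task end pile up when each pass of an enclosing loop, such as a cartesian product, stops early:
{code}
// Toy stand-in for a sorter whose buffers are freed only when its output
// iterator is fully consumed (illustrative sketch, not Spark internals).
class ToySorter(rows: Int) {
  var bufferHeld = true // stands in for the sorter's in-memory/spill resources
  def iterator: Iterator[Int] = new Iterator[Int] {
    private val underlying = Iterator.range(0, rows)
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more) bufferHeld = false // cleanup happens only on full iteration
      more
    }
    def next(): Int = underlying.next()
  }
}

// One partially consumed pass per "outer" row, as a cartesian loop would do:
val sorters = (1 to 1000).map(_ => new ToySorter(1000000))
sorters.foreach(_.iterator.take(10).foreach(_ => ())) // each pass stops early
println(sorters.count(_.bufferHeld)) // 1000: everything is still held until task end
{code}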
[jira] [Commented] (SPARK-13914) Add functionality to back up spark event logs
[ https://issues.apache.org/jira/browse/SPARK-13914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196580#comment-15196580 ] Saisai Shao commented on SPARK-13914: - Agree with [~srowen]: this feature is quite user-specific, and there are many other ways to address your problem. Putting it into the Spark code would increase the maintenance overhead, and I believe most users will not use this functionality. > Add functionality to back up spark event logs > - > > Key: SPARK-13914 > URL: https://issues.apache.org/jira/browse/SPARK-13914 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0, 1.6.2, 2.0.0 >Reporter: Parag Chaudhari > > Spark event logs are usually stored in HDFS when running Spark on YARN. In a > cloud environment, these HDFS files are often stored on the disks of > ephemeral instances that could go away once the instances are terminated. > Users may want to persist the event logs as the events happen, for issue > investigation and performance analysis before and after the cluster is > terminated. The backup path can be managed by the spark users based on their > needs. For example, some users may copy the event logs to a cloud storage > service directly and keep them there forever, while some other users may want > to store the event logs on local disks and back them up to a cloud storage > service from time to time. Other users will not want to use the feature, so > it should be off by default; users enable it when and only > when they need it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13922: Assignee: (was: Apache Spark) > Filter rows with null attributes in parquet vectorized reader > - > > Key: SPARK-13922 > URL: https://issues.apache.org/jira/browse/SPARK-13922 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sameer Agarwal > > It's common for many SQL operators to not care about reading null values for > correctness. Currently, this is achieved by performing `isNotNull` checks > (for all relevant columns) on a per-row basis. Pushing these null filters in > parquet vectorized reader should bring considerable benefits (especially for > cases when the underlying data doesn't contain any nulls or contains all > nulls). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196577#comment-15196577 ] Apache Spark commented on SPARK-13922: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/11749 > Filter rows with null attributes in parquet vectorized reader > - > > Key: SPARK-13922 > URL: https://issues.apache.org/jira/browse/SPARK-13922 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sameer Agarwal > > It's common for many SQL operators to not care about reading null values for > correctness. Currently, this is achieved by performing `isNotNull` checks > (for all relevant columns) on a per-row basis. Pushing these null filters in > parquet vectorized reader should bring considerable benefits (especially for > cases when the underlying data doesn't contain any nulls or contains all > nulls). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13922: Assignee: Apache Spark > Filter rows with null attributes in parquet vectorized reader > - > > Key: SPARK-13922 > URL: https://issues.apache.org/jira/browse/SPARK-13922 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > > It's common for many SQL operators to not care about reading null values for > correctness. Currently, this is achieved by performing `isNotNull` checks > (for all relevant columns) on a per-row basis. Pushing these null filters in > parquet vectorized reader should bring considerable benefits (especially for > cases when the underlying data doesn't contain any nulls or contains all > nulls). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader
Sameer Agarwal created SPARK-13922: -- Summary: Filter rows with null attributes in parquet vectorized reader Key: SPARK-13922 URL: https://issues.apache.org/jira/browse/SPARK-13922 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sameer Agarwal It's common for many SQL operators to not care about reading null values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in parquet vectorized reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
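For reference, a minimal sketch of the per-row {{isNotNull}} pattern this issue wants to push down into the vectorized Parquet reader (hedged: assumes a {{SQLContext}} named {{sqlContext}}; the path and column names are illustrative only):
{code}
// Today the isNotNull checks are evaluated row by row on top of the scan;
// SPARK-13922 proposes evaluating them inside the vectorized Parquet reader.
val df = sqlContext.read.parquet("/tmp/events.parquet")
val nonNull = df.filter(df("userId").isNotNull && df("ts").isNotNull)
nonNull.count()
{code}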
[jira] [Assigned] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore
[ https://issues.apache.org/jira/browse/SPARK-13921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13921: Assignee: Josh Rosen (was: Apache Spark) > Store serialized blocks as multiple chunks in MemoryStore > - > > Key: SPARK-13921 > URL: https://issues.apache.org/jira/browse/SPARK-13921 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > Instead of storing serialized blocks in individual ByteBuffers, the > BlockManager should be capable of storing a serialized block in multiple > chunks, each occupying a separate ByteBuffer. > This change will help to improve the efficiency of memory allocation and the > accuracy of memory accounting when serializing blocks. Our current > serialization code uses a {{ByteBufferOutputStream}}, which doubles and > re-allocates its backing byte array; this increases the peak memory > requirements during serialization (since we need to hold extra memory while > expanding the array). In addition, we currently don't account for the extra > wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte > serialized block may actually consume 256 megabytes of memory. After > switching to storing blocks in multiple chunks, we'll be able to efficiently > trim the backing buffers so that no space is wasted. > This change is also a prerequisite to being able to cache blocks which are > larger than 2GB (although full support for that depends on several other > changes which have not been implemented yet). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore
[ https://issues.apache.org/jira/browse/SPARK-13921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196567#comment-15196567 ] Apache Spark commented on SPARK-13921: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11748 > Store serialized blocks as multiple chunks in MemoryStore > - > > Key: SPARK-13921 > URL: https://issues.apache.org/jira/browse/SPARK-13921 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > Instead of storing serialized blocks in individual ByteBuffers, the > BlockManager should be capable of storing a serialized block in multiple > chunks, each occupying a separate ByteBuffer. > This change will help to improve the efficiency of memory allocation and the > accuracy of memory accounting when serializing blocks. Our current > serialization code uses a {{ByteBufferOutputStream}}, which doubles and > re-allocates its backing byte array; this increases the peak memory > requirements during serialization (since we need to hold extra memory while > expanding the array). In addition, we currently don't account for the extra > wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte > serialized block may actually consume 256 megabytes of memory. After > switching to storing blocks in multiple chunks, we'll be able to efficiently > trim the backing buffers so that no space is wasted. > This change is also a prerequisite to being able to cache blocks which are > larger than 2GB (although full support for that depends on several other > changes which have not been implemented yet). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore
[ https://issues.apache.org/jira/browse/SPARK-13921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13921: Assignee: Apache Spark (was: Josh Rosen) > Store serialized blocks as multiple chunks in MemoryStore > - > > Key: SPARK-13921 > URL: https://issues.apache.org/jira/browse/SPARK-13921 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Apache Spark > > Instead of storing serialized blocks in individual ByteBuffers, the > BlockManager should be capable of storing a serialized block in multiple > chunks, each occupying a separate ByteBuffer. > This change will help to improve the efficiency of memory allocation and the > accuracy of memory accounting when serializing blocks. Our current > serialization code uses a {{ByteBufferOutputStream}}, which doubles and > re-allocates its backing byte array; this increases the peak memory > requirements during serialization (since we need to hold extra memory while > expanding the array). In addition, we currently don't account for the extra > wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte > serialized block may actually consume 256 megabytes of memory. After > switching to storing blocks in multiple chunks, we'll be able to efficiently > trim the backing buffers so that no space is wasted. > This change is also a prerequisite to being able to cache blocks which are > larger than 2GB (although full support for that depends on several other > changes which have not been implemented yet). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore
Josh Rosen created SPARK-13921: -- Summary: Store serialized blocks as multiple chunks in MemoryStore Key: SPARK-13921 URL: https://issues.apache.org/jira/browse/SPARK-13921 Project: Spark Issue Type: Improvement Components: Block Manager Reporter: Josh Rosen Assignee: Josh Rosen Instead of storing serialized blocks in individual ByteBuffers, the BlockManager should be capable of storing a serialized block in multiple chunks, each occupying a separate ByteBuffer. This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a {{ByteBufferOutputStream}}, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted. This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
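The doubling behaviour described in SPARK-13921 is easy to reproduce with a plain JDK growable stream (hedged illustration only; the JIRA says Spark's {{ByteBufferOutputStream}} doubles and re-allocates its backing array in the same way, but this is not Spark code):
{code}
import java.io.ByteArrayOutputStream

// Start at 64 MB and write 129 MB: the backing array doubles 64 -> 128 -> 256 MB,
// so roughly 127 MB is wasted until the contents are copied into a trimmed array.
val out = new ByteArrayOutputStream(64 * 1024 * 1024)
val oneMb = new Array[Byte](1024 * 1024)
for (_ <- 0 until 129) out.write(oneMb)
println(s"bytes written: ${out.size()}")      // 129 MB of data
val trimmed = out.toByteArray                 // right-sized, but at the cost of another copy
println(s"trimmed length: ${trimmed.length}")
{code}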
[jira] [Commented] (SPARK-13872) Memory leak in SortMergeOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196561#comment-15196561 ] Mark Hamstra commented on SPARK-13872: -- [~joshrosen] > Memory leak in SortMergeOuterJoin > - > > Key: SPARK-13872 > URL: https://issues.apache.org/jira/browse/SPARK-13872 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ian > Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png > > > SortMergeJoin composes its partition/iterator from > org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to > UnsafeExternalRowSorter. > UnsafeExternalRowSorter's implementation cleans up the resources when: > 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully > iterated. > 2. the task has finished execution. > In the outer join case of SortMergeJoin, when the left or right iterator is not > fully iterated, the only chance for the resources to be cleaned up is at the > end of the spark task run. > This is probably OK most of the time; however, when a SortMergeOuterJoin is > nested within a CartesianProduct, the "deferred" resource cleanup allows a > non-negligible memory leak, amplified and accumulated by the CartesianRdd's looping iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/16/16 12:39 AM: -- I tried this query in Spark 2.0 and needed to change grouping__id to grouping_id() to get past the parser. The reported error is not reproducible in Spark 2.0, except that I saw an execution error related to com.esotericsoftware.kryo.KryoException was (Author: xwu0226): Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in spark 2.0.. Except that I saw execution error maybe related to spark-13862. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
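A hedged sketch of the rewrite described in the comment above (assumes a {{SQLContext}} named {{sqlContext}} with the TPC-DS tables registered; only the relevant expressions are shown, not the full query 36):
{code}
// The Hive-style grouping__id column did not pass the Spark 2.0 parser for this
// query; the grouping_id() function did, per the comment above.
sqlContext.sql("""
  SELECT i_category,
         i_class,
         grouping_id() AS lochierarchy,
         sum(ss_net_profit) / sum(ss_ext_sales_price) AS gross_margin
  FROM store_sales, date_dim d1, item, store
  WHERE d1.d_year = 2001
    AND d1.d_date_sk = ss_sold_date_sk
    AND i_item_sk = ss_item_sk
    AND s_store_sk = ss_store_sk
  GROUP BY i_category, i_class WITH ROLLUP
""")
{code}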
[jira] [Assigned] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2208: --- Assignee: Apache Spark > local metrics tests can fail on fast machines > - > > Key: SPARK-2208 > URL: https://issues.apache.org/jira/browse/SPARK-2208 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Patrick Wendell >Assignee: Apache Spark >Priority: Minor > Labels: starter > > I'm temporarily disabling this check. I think the issue is that on fast > machines the fetch wait time can actually be zero, even across all tasks. > We should see if we can write this in a different way to make sure there is a > delay. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196556#comment-15196556 ] Apache Spark commented on SPARK-2208: - User 'joan38' has created a pull request for this issue: https://github.com/apache/spark/pull/11747 > local metrics tests can fail on fast machines > - > > Key: SPARK-2208 > URL: https://issues.apache.org/jira/browse/SPARK-2208 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Patrick Wendell >Priority: Minor > Labels: starter > > I'm temporarily disabling this check. I think the issue is that on fast > machines the fetch wait time can actually be zero, even across all tasks. > We should see if we can write this in a different way to make sure there is a > delay. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2208: --- Assignee: (was: Apache Spark) > local metrics tests can fail on fast machines > - > > Key: SPARK-2208 > URL: https://issues.apache.org/jira/browse/SPARK-2208 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Patrick Wendell >Priority: Minor > Labels: starter > > I'm temporarily disabling this check. I think the issue is that on fast > machines the fetch wait time can actually be zero, even across all tasks. > We should see if we can write this in a different way to make sure there is a > delay. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
[ https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196548#comment-15196548 ] Josh Rosen commented on SPARK-13920: I am not planning to work on this, so this task is up for grabs. > MIMA checks should apply to @Experimental and @DeveloperAPI APIs > > > Key: SPARK-13920 > URL: https://issues.apache.org/jira/browse/SPARK-13920 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen > > Our MIMA binary compatibility checks currently ignore APIs which are marked > as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes > sense. Even if those annotations _reserve_ the right to break binary > compatibility, we should still avoid compatibility breaks whenever possible > and should be informed explicitly when compatibility breaks. > As a result, we should update GenerateMIMAIgnore to stop ignoring classes and > methods which have those annotations. To remove the ignores, remove the > checks from > https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 > After removing the ignores, update {{project/MimaExcludes.scala}} to add > exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs
Josh Rosen created SPARK-13920: -- Summary: MIMA checks should apply to @Experimental and @DeveloperAPI APIs Key: SPARK-13920 URL: https://issues.apache.org/jira/browse/SPARK-13920 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Josh Rosen Our MIMA binary compatibility checks currently ignore APIs which are marked as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes sense. Even if those annotations _reserve_ the right to break binary compatibility, we should still avoid compatibility breaks whenever possible and should be informed explicitly when compatibility breaks. As a result, we should update GenerateMIMAIgnore to stop ignoring classes and methods which have those annotations. To remove the ignores, remove the checks from https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43 After removing the ignores, update {{project/MimaExcludes.scala}} to add exclusions for binary compatibility breaks introduced in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
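For whoever picks this up, the follow-up in {{project/MimaExcludes.scala}} would consist of entries like the following (hedged sketch for the sbt build; the excluded members below are purely illustrative, not real compatibility breaks):
{code}
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

// Added to the 2.0 exclusion list once @Experimental/@DeveloperApi members start
// being checked; each entry silences one known, intentional break.
Seq(
  exclude[MissingMethodProblem]("org.apache.spark.SomeDeveloperApiClass.removedMethod"),
  exclude[MissingClassProblem]("org.apache.spark.SomeExperimentalRemovedClass")
)
{code}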
[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196537#comment-15196537 ] Apache Spark commented on SPARK-13602: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/11746 > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
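A hedged sketch of the kind of hook the SPARK-13602 description asks for, mirroring what ExecutorRunner already does (this is not the actual change in the linked pull request; {{process}} stands in for DriverRunner's handle on the launched driver process, and ShutdownHookManager is Spark-internal, which is fine since DriverRunner is too):
{code}
import org.apache.spark.util.ShutdownHookManager

// Inside DriverRunner: keep a handle on the launched driver process...
@volatile var process: Option[Process] = None

// ...and register a JVM shutdown hook, as ExecutorRunner does, so that even if the
// Worker calls System.exit the child driver process is destroyed rather than leaked.
val driverShutdownHook = ShutdownHookManager.addShutdownHook { () =>
  process.foreach(_.destroy())
}
{code}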
[jira] [Assigned] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13602: Assignee: (was: Apache Spark) > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13602: Assignee: Apache Spark > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Apache Spark > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject
[ https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13919: Assignee: Apache Spark > Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject > - > > Key: SPARK-13919 > URL: https://issues.apache.org/jira/browse/SPARK-13919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Now, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each > other's effect. Although it will not cause the max iteration now, some > queries are not optimized to the best. > For example, in the following query, > {code} > val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int) > val originalQuery = > input.select('a, 'b, 'c, 'd, > WindowExpression( > AggregateExpression(Count('b), Complete, isDistinct = false), > WindowSpecDefinition( 'a :: Nil, > SortOrder('b, Ascending) :: Nil, > UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c) > {code} > After multiple iteration of two rules of {{ColumnPruning}} and > {{PushPredicateThroughProject}}, the optimized plan we generated is like: > {code} > Project [a#0,c#0] > > > +- Filter (window#0L > cast(1 as bigint)) > > >+- Project [a#0,c#0,window#0L] > > > +- Window [(count(b#0),mode=Complete,isDistinct=false) > windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND > CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] > +- LocalRelation [a#0,b#0,c#0,d#0] > > > {code} > However, the expected optimized plan should be like: > {code} > Project [a#0,c#0] > +- Filter (window#0L > cast(1 as bigint)) >+- Project [a#0,c#0,window#0L] > +- Window [(count(b#0),mode=Complete,isDistinct=false) > windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND > CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] > +- Project [a#0,b#0,c#0] > +- LocalRelation [a#0,b#0,c#0,d#0] > > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject
[ https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196535#comment-15196535 ] Apache Spark commented on SPARK-13919: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11745 > Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject > - > > Key: SPARK-13919 > URL: https://issues.apache.org/jira/browse/SPARK-13919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Now, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each > other's effect. Although it will not cause the max iteration now, some > queries are not optimized to the best. > For example, in the following query, > {code} > val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int) > val originalQuery = > input.select('a, 'b, 'c, 'd, > WindowExpression( > AggregateExpression(Count('b), Complete, isDistinct = false), > WindowSpecDefinition( 'a :: Nil, > SortOrder('b, Ascending) :: Nil, > UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c) > {code} > After multiple iteration of two rules of {{ColumnPruning}} and > {{PushPredicateThroughProject}}, the optimized plan we generated is like: > {code} > Project [a#0,c#0] > > > +- Filter (window#0L > cast(1 as bigint)) > > >+- Project [a#0,c#0,window#0L] > > > +- Window [(count(b#0),mode=Complete,isDistinct=false) > windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND > CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] > +- LocalRelation [a#0,b#0,c#0,d#0] > > > {code} > However, the expected optimized plan should be like: > {code} > Project [a#0,c#0] > +- Filter (window#0L > cast(1 as bigint)) >+- Project [a#0,c#0,window#0L] > +- Window [(count(b#0),mode=Complete,isDistinct=false) > windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND > CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] > +- Project [a#0,b#0,c#0] > +- LocalRelation [a#0,b#0,c#0,d#0] > > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject
[ https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13919: Assignee: (was: Apache Spark) > Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject > - > > Key: SPARK-13919 > URL: https://issues.apache.org/jira/browse/SPARK-13919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Now, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each > other's effect. Although it will not cause the max iteration now, some > queries are not optimized to the best. > For example, in the following query, > {code} > val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int) > val originalQuery = > input.select('a, 'b, 'c, 'd, > WindowExpression( > AggregateExpression(Count('b), Complete, isDistinct = false), > WindowSpecDefinition( 'a :: Nil, > SortOrder('b, Ascending) :: Nil, > UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c) > {code} > After multiple iteration of two rules of {{ColumnPruning}} and > {{PushPredicateThroughProject}}, the optimized plan we generated is like: > {code} > Project [a#0,c#0] > > > +- Filter (window#0L > cast(1 as bigint)) > > >+- Project [a#0,c#0,window#0L] > > > +- Window [(count(b#0),mode=Complete,isDistinct=false) > windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND > CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] > +- LocalRelation [a#0,b#0,c#0,d#0] > > > {code} > However, the expected optimized plan should be like: > {code} > Project [a#0,c#0] > +- Filter (window#0L > cast(1 as bigint)) >+- Project [a#0,c#0,window#0L] > +- Window [(count(b#0),mode=Complete,isDistinct=false) > windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND > CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] > +- Project [a#0,b#0,c#0] > +- LocalRelation [a#0,b#0,c#0,d#0] > > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13804) Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4million rows to 16+ million rows
[ https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Nguyen reopened SPARK-13804: I posted this issue to user@ at http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-count-Major-Divergent-Non-Linear-Performance-Slowdown-when-data-set-increases-from-4-millis-td26493.html However, it has not been accepted by the mailing list yet. What needs to be done for it to be accepted? And what is the typical turn-around time for postings to be accepted? > Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance > Slowdown going from 4million rows to 16+ million rows > - > > Key: SPARK-13804 > URL: https://issues.apache.org/jira/browse/SPARK-13804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: - 3-node Spark cluster: 1 master node and 2 slave nodes > - Each node is an EC2 c3.4xlarge instance > - Each node has 16 cores and 30GB of RAM >Reporter: Michael Nguyen > > Spark SQL is used to load csv files via com.databricks.spark.csv and then run > dataFrame.count() > In the same environment with plenty of CPU and RAM, Spark SQL takes > - 18.25 seconds to load a table with 4 million rows vs > - 346.624 seconds (5.77 minutes) to load a table with 16 million rows. > Even though the number of rows increases by 4 times, the time it takes Spark > SQL to run dataframe.count() increases by 19.22 times, so the performance of > dataframe.count() degrades drastically rather than scaling linearly. > 1. Why is Spark SQL's performance not proportional to the number of rows > while there is plenty of CPU and RAM (it uses only 10GB out of 30GB of RAM)? > 2. What can be done to fix this performance issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
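For context, the reported workload is essentially the following (a hedged sketch: assumes the spark-csv package is on the classpath and a {{SQLContext}} named {{sqlContext}}; the path and options are illustrative, not taken from the report):
{code}
// Load a CSV file through com.databricks.spark.csv and count it, as described above.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true") // note: schema inference adds a full extra pass over the file
  .load("/data/events_16m_rows.csv")

df.count() // 18.25 s at 4M rows vs. 346.6 s at 16M rows, per the report
{code}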
[jira] [Updated] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject
[ https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13919: Description: Now, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each other's effect. Although it will not cause the max iteration now, some queries are not optimized to the best. For example, in the following query, {code} val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int) val originalQuery = input.select('a, 'b, 'c, 'd, WindowExpression( AggregateExpression(Count('b), Complete, isDistinct = false), WindowSpecDefinition( 'a :: Nil, SortOrder('b, Ascending) :: Nil, UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c) {code} After multiple iteration of two rules of {{ColumnPruning}} and {{PushPredicateThroughProject}}, the optimized plan we generated is like: {code} Project [a#0,c#0] +- Filter (window#0L > cast(1 as bigint)) +- Project [a#0,c#0,window#0L] +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] +- LocalRelation [a#0,b#0,c#0,d#0] {code} However, the expected optimized plan should be like: {code} Project [a#0,c#0] +- Filter (window#0L > cast(1 as bigint)) +- Project [a#0,c#0,window#0L] +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] +- Project [a#0,b#0,c#0] +- LocalRelation [a#0,b#0,c#0,d#0] {code} was: Now, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each other's effect. Although it will not cause the max iteration now, some queries are not optimized to the best. For example, in the following query, {code} val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int) val originalQuery = input.select('a, 'b, 'c, 'd, WindowExpression( AggregateExpression(Count('b), Complete, isDistinct = false), WindowSpecDefinition( 'a :: Nil, SortOrder('b, Ascending) :: Nil, UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c) {code} {code} Project [a#0,c#0] +- Filter (window#0L > cast(1 as bigint)) +- Project [a#0,c#0,window#0L] +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] +- Project [a#0,b#0,c#0] +- LocalRelation [a#0,b#0,c#0,d#0] {code} {code} Project [a#0,c#0] +- Filter (window#0L > cast(1 as bigint)) +- Project [a#0,c#0,window#0L] +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] +- LocalRelation [a#0,b#0,c#0,d#0] {code} > Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject > - > > Key: SPARK-13919 > URL: https://issues.apache.org/jira/browse/SPARK-13919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Now, {{ColumnPruning}} a
[jira] [Created] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject
Xiao Li created SPARK-13919: --- Summary: Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject Key: SPARK-13919 URL: https://issues.apache.org/jira/browse/SPARK-13919 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each other's effect. Although this no longer causes the optimizer to hit the maximum number of iterations, some queries are not optimized as well as they could be. For example, in the following query, {code} val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int) val originalQuery = input.select('a, 'b, 'c, 'd, WindowExpression( AggregateExpression(Count('b), Complete, isDistinct = false), WindowSpecDefinition( 'a :: Nil, SortOrder('b, Ascending) :: Nil, UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c) {code} {code} Project [a#0,c#0] +- Filter (window#0L > cast(1 as bigint)) +- Project [a#0,c#0,window#0L] +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] +- Project [a#0,b#0,c#0] +- LocalRelation [a#0,b#0,c#0,d#0] {code} {code} Project [a#0,c#0] +- Filter (window#0L > cast(1 as bigint)) +- Project [a#0,c#0,window#0L] +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC] +- LocalRelation [a#0,b#0,c#0,d#0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
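The same query expressed through the public DataFrame API, for readers less familiar with the Catalyst test DSL used above (hedged sketch; assumes a DataFrame {{input}} with columns a, b, c, d):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("a").orderBy("b")
val result = input
  .select(col("a"), col("b"), col("c"), col("d"), count("b").over(w).as("window"))
  .where(col("window") > 1)
  .select("a", "c")
// Per the expected plan above, ColumnPruning should leave only columns a, b, c
// to be read from the relation; column d should be pruned away.
{code}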
[jira] [Commented] (SPARK-8489) Add regression tests for SPARK-8470
[ https://issues.apache.org/jira/browse/SPARK-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196517#comment-15196517 ] Apache Spark commented on SPARK-8489: - User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11744 > Add regression tests for SPARK-8470 > --- > > Key: SPARK-8489 > URL: https://issues.apache.org/jira/browse/SPARK-8489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > Fix For: 1.4.1, 1.5.0 > > > See SPARK-8470 for more detail. Basically the Spark Hive code silently > overwrites the context class loader populated in SparkSubmit, resulting in > certain classes missing when we do reflection in `SQLContext#createDataFrame`. > That issue is already resolved in https://github.com/apache/spark/pull/6891, > but we should add a regression test for the specific manifestation of the bug > in SPARK-8470. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"
[ https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196516#comment-15196516 ] Apache Spark commented on SPARK-12653: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11744 > Re-enable test "SPARK-8489: MissingRequirementError during reflection" > -- > > Key: SPARK-12653 > URL: https://issues.apache.org/jira/browse/SPARK-12653 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > > This test case was disabled in > https://github.com/apache/spark/pull/10569#discussion-diff-48813840 > I think we need to rebuild the jar because it was compiled against an old > version of Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13118) Support for classes defined in package objects
[ https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196487#comment-15196487 ] Reynold Xin commented on SPARK-13118: - The JIRA is the same. > Support for classes defined in package objects > -- > > Key: SPARK-13118 > URL: https://issues.apache.org/jira/browse/SPARK-13118 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > When you define a class inside of a package object, the name ends up being > something like {{org.mycompany.project.package$MyClass}}. However, when > reflect on this we try and load {{org.mycompany.project.MyClass}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13118) Support for classes defined in package objects
[ https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196463#comment-15196463 ] Jakob Odersky commented on SPARK-13118: --- Should I remove the JIRA ID from my existing PR? > Support for classes defined in package objects > -- > > Key: SPARK-13118 > URL: https://issues.apache.org/jira/browse/SPARK-13118 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > When you define a class inside of a package object, the name ends up being > something like {{org.mycompany.project.package$MyClass}}. However, when > reflect on this we try and load {{org.mycompany.project.MyClass}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464 ] Dilip Biswal commented on SPARK-13821: -- [~roycecil] I just tried the original query no. 20, posted at https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1, against Spark 2.0. I could see the same error that is reported in the JIRA. It seems there is an extra comma in the projection list between two columns, like the following. {code} select i_item_id, ,i_item_desc {code} Please note that we ran against 2.0 and not 1.6. Can you please re-run to make sure? > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 fails to compile with the following error message > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
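A hedged illustration of the implied fix for the stray comma (assumes the TPC-DS {{item}} table is registered with a {{SQLContext}} named {{sqlContext}}; this is only the projection-list fragment, not all of query 20):
{code}
// "select i_item_id, ,i_item_desc ..." fails the parser; dropping the extra comma
// gives a projection list that compiles.
sqlContext.sql("SELECT i_item_id, i_item_desc FROM item LIMIT 10").show()
{code}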
[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196430#comment-15196430 ] Davies Liu commented on SPARK-13820: Unfortunately, subquery is not target for 2.0. > TPC-DS Query 10 fails to compile > > > Key: SPARK-13820 > URL: https://issues.apache.org/jira/browse/SPARK-13820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 10 fails to compile with the following error. > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Query is pasted here for easy reproduction > select > cd_gender, > cd_marital_status, > cd_education_status, > count(*) cnt1, > cd_purchase_estimate, > count(*) cnt2, > cd_credit_rating, > count(*) cnt3, > cd_dep_count, > count(*) cnt4, > cd_dep_employed_count, > count(*) cnt5, > cd_dep_college_count, > count(*) cnt6 > from > customer c > JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk > JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk > LEFT SEMI JOIN (select ss_customer_sk > from store_sales >JOIN date_dim ON ss_sold_date_sk = d_date_sk > where > d_year = 2002 and > d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = > ss_wh1.ss_customer_sk > where > ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana > County','La Porte County') and >exists ( > select tmp.customer_sk from ( > select ws_bill_customer_sk as customer_sk > from web_sales,date_dim > where > web_sales.ws_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > UNION ALL > select cs_ship_customer_sk as customer_sk > from catalog_sales,date_dim > where > catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > ) tmp where c.c_customer_sk = tmp.customer_sk > ) > group by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > order by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13118) Support for classes defined in package objects
[ https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196427#comment-15196427 ] Reynold Xin commented on SPARK-13118: - I'm moving this out to its own ticket. I think there is still a problem, and it lies in the way we get the full name of a class in ScalaReflection.scala: {code} /** Returns the full class name for a type. */ def getClassNameFromType(tpe: `Type`): String = { tpe.erasure.typeSymbol.asClass.fullName } {code} According to the Scala doc here: http://www.scala-lang.org/api/2.11.7/scala-reflect/index.html#scala.reflect.api.Symbols$ClassSymbol {noformat} abstract def fullName: String The encoded full path name of this symbol, where outer names and inner names are separated by periods. {noformat} This causes problems with inner classes. For example: {code} scala> Class.forName("org.apache.spark.mllib.tree.model.DecisionTreeModel.SaveLoadV1_0.SplitData") java.lang.ClassNotFoundException: org.apache.spark.mllib.tree.model.DecisionTreeModel.SaveLoadV1_0.SplitData at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) ... 49 elided scala> Class.forName("org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData") res6: Class[_] = class org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData {code} > Support for classes defined in package objects > -- > > Key: SPARK-13118 > URL: https://issues.apache.org/jira/browse/SPARK-13118 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > When you define a class inside of a package object, the name ends up being > something like {{org.mycompany.project.package$MyClass}}. However, when > reflect on this we try and load {{org.mycompany.project.MyClass}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
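A hedged, stand-alone illustration of the package-object half of this ticket (made-up package and class names, not Spark code); it contrasts the runtime name with the name a {{fullName}}-based lookup like the one quoted above produces:
{code}
// Illustrative only; outputs below are what the issue description reports.
package org.mycompany

package object project {
  case class MyClass(x: Int)
}

object NameDemo extends App {
  import org.mycompany.project.MyClass
  import scala.reflect.runtime.universe._

  // The binary name, which Class.forName needs:
  println(classOf[MyClass].getName)
  // org.mycompany.project.package$MyClass

  // The name derived via Scala reflection, as in getClassNameFromType above,
  // which Class.forName cannot resolve per the issue description:
  println(typeOf[MyClass].erasure.typeSymbol.asClass.fullName)
  // org.mycompany.project.MyClass
}
{code}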
[jira] [Updated] (SPARK-13118) Support for classes defined in package objects
[ https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13118: Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-) > Support for classes defined in package objects > -- > > Key: SPARK-13118 > URL: https://issues.apache.org/jira/browse/SPARK-13118 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > When you define a class inside of a package object, the name ends up being > something like {{org.mycompany.project.package$MyClass}}. However, when > reflect on this we try and load {{org.mycompany.project.MyClass}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9999) Dataset API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-. Resolution: Fixed Fix Version/s: 2.0.0 > Dataset API on top of Catalyst/DataFrame > > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > Fix For: 2.0.0 > > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] > The initial version of the Dataset API has been merged in Spark 1.6. However, > it will take a few more future releases to flush everything out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196419#comment-15196419 ] Suresh Thalamati commented on SPARK-13820: -- This query contains a correlated subquery, which is not yet supported in Spark SQL. [~davies] I saw your PR https://github.com/apache/spark/pull/10706 for this kind of query; are you planning to merge it for 2.0? > TPC-DS Query 10 fails to compile > > > Key: SPARK-13820 > URL: https://issues.apache.org/jira/browse/SPARK-13820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 10 fails to compile with the following error. > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Query is pasted here for easy reproduction > select > cd_gender, > cd_marital_status, > cd_education_status, > count(*) cnt1, > cd_purchase_estimate, > count(*) cnt2, > cd_credit_rating, > count(*) cnt3, > cd_dep_count, > count(*) cnt4, > cd_dep_employed_count, > count(*) cnt5, > cd_dep_college_count, > count(*) cnt6 > from > customer c > JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk > JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk > LEFT SEMI JOIN (select ss_customer_sk > from store_sales >JOIN date_dim ON ss_sold_date_sk = d_date_sk > where > d_year = 2002 and > d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = > ss_wh1.ss_customer_sk > where > ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana > County','La Porte County') and >exists ( > select tmp.customer_sk from ( > select ws_bill_customer_sk as customer_sk > from web_sales,date_dim > where > web_sales.ws_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > UNION ALL > select cs_ship_customer_sk as customer_sk > from catalog_sales,date_dim > where > catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > ) tmp where c.c_customer_sk = tmp.customer_sk > ) > group by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > order by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > limit 100; 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
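A hedged sketch of one common workaround for the unsupported correlated EXISTS in the query above: pre-joining the UNION ALL as a LEFT SEMI JOIN. The table and column names follow the TPC-DS query in the report; the {{SparkSession}} value {{spark}} and the trimmed select list are assumptions for illustration, not a tested rewrite of the full query:

{code:scala}
// Replace the correlated EXISTS (...) predicate with a LEFT SEMI JOIN on the
// same UNION ALL subquery, keyed on c_customer_sk.
val customersWithSales = spark.sql("""
  SELECT c.c_customer_sk
  FROM customer c
  LEFT SEMI JOIN (
    SELECT ws_bill_customer_sk AS customer_sk
    FROM web_sales JOIN date_dim ON ws_sold_date_sk = d_date_sk
    WHERE d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3
    UNION ALL
    SELECT cs_ship_customer_sk AS customer_sk
    FROM catalog_sales JOIN date_dim ON cs_sold_date_sk = d_date_sk
    WHERE d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3
  ) tmp ON c.c_customer_sk = tmp.customer_sk
""")
{code}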
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/15/16 10:43 PM: -- Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in Spark 2.0. However, I saw an execution error related to SPARK-13862. was (Author: xwu0226): Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is gone. However, I saw an execution error related to kryo.serializers; that should be a different issue and maybe related to my setup. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
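As a rough illustration of the change described in the comment, this hypothetical, simplified fragment of query 36 replaces the Hive {{grouping__id}} virtual column with the {{grouping_id()}} function for Spark 2.0; the {{SparkSession}} value {{spark}} is assumed, and the query is trimmed to the rollup portion only:

{code:scala}
// grouping__id (double underscore, Hive virtual column) is what the original
// query uses; Spark 2.0's parser accepts the grouping_id() function instead.
val rolledUp = spark.sql("""
  SELECT i_category,
         i_class,
         grouping_id() AS lochierarchy,
         SUM(ss_net_profit) / SUM(ss_ext_sales_price) AS gross_margin
  FROM store_sales
  JOIN item  ON i_item_sk  = ss_item_sk
  JOIN store ON s_store_sk = ss_store_sk
  GROUP BY i_category, i_class WITH ROLLUP
""")
{code}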
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/15/16 10:44 PM: -- Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in Spark 2.0. However, I saw an execution error that may be related to SPARK-13862. was (Author: xwu0226): Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in Spark 2.0. However, I saw an execution error related to SPARK-13862. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/15/16 10:41 PM: -- Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is gone. However, I saw an execution error related to kryo.serializers; that should be a different issue and maybe related to my setup. was (Author: xwu0226): Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not gone. However, I saw an execution error related to kryo.serializers; that should be a different issue and maybe related to my setup. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196376#comment-15196376 ] Apache Spark commented on SPARK-13918: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11743 > merge SortMergeJoin and SortMergeOuterJoin > -- > > Key: SPARK-13918 > URL: https://issues.apache.org/jira/browse/SPARK-13918 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > There is a lot of duplicated code in SortMergeJoin and SortMergeOuterJoin; > we should merge them and reduce the duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13918: Assignee: Davies Liu (was: Apache Spark) > merge SortMergeJoin and SortMergeOuterJoin > -- > > Key: SPARK-13918 > URL: https://issues.apache.org/jira/browse/SPARK-13918 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > There is a lot of duplicated code in SortMergeJoin and SortMergeOuterJoin; > we should merge them and reduce the duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13918: Assignee: Apache Spark (was: Davies Liu) > merge SortMergeJoin and SortMergeOuterJoin > -- > > Key: SPARK-13918 > URL: https://issues.apache.org/jira/browse/SPARK-13918 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > There is a lot of duplicated code in SortMergeJoin and SortMergeOuterJoin; > we should merge them and reduce the duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377 ] Xin Wu commented on SPARK-13832: Trying this query in Spark 2.0, I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not gone. However, I saw an execution error related to kryo.serializers; that should be a different issue and maybe related to my setup. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org