[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-13855:
---

Assignee: Patrick Wendell  (was: Michael Armbrust)

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
>
> Getting the error below while deploying Spark 1.6.1 on EC2:
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked the S3 bucket spark-related-packages and noticed that no Spark 1.6.1 
> package is present.
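A quick way to confirm the missing artifact independently of the spark-ec2 script is to issue a HEAD request against the expected URL. A minimal sketch (not part of spark-ec2):

{code}
// Check the expected artifact URL directly; a 404 here reproduces the failure above.
import java.net.{HttpURLConnection, URL}

val url = new URL("http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("HEAD")
println(s"HTTP ${conn.getResponseCode}")
conn.disconnect()
{code}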



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13928:
---

 Summary: Move org.apache.spark.Logging into 
org.apache.spark.internal.Logging
 Key: SPARK-13928
 URL: https://issues.apache.org/jira/browse/SPARK-13928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin


Logging was made private in Spark 2.0. If we move it into an internal package, then 
users would be able to create a Logging trait themselves to avoid changing their 
own code. Alternatively, we can also provide a compatibility package that adds 
logging.
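For illustration, a user-side shim could look roughly like the sketch below; this is an assumption about what such a trait might contain, not the actual Spark code, and it only needs SLF4J on the classpath:

{code}
// Sketch of a user-defined Logging trait under the old package name, so that existing
// "with Logging" code keeps compiling if Spark's trait moves to an internal package.
package org.apache.spark

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  @transient private var log_ : Logger = null

  protected def log: Logger = {
    if (log_ == null) log_ = LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))
    log_
  }

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
}
{code}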




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13927) Add row/column iterator to local matrix

2016-03-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13927:
--
Summary: Add row/column iterator to local matrix  (was: Add row/column 
iterator to matrix)

> Add row/column iterator to local matrix
> ---
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13927) Add row/column iterator to local matrices

2016-03-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13927:
--
Summary: Add row/column iterator to local matrices  (was: Add row/column 
iterator to local matrix)

> Add row/column iterator to local matrices
> -
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13927) Add row/column iterator to matrix

2016-03-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-13927:
-

 Summary: Add row/column iterator to matrix
 Key: SPARK-13927
 URL: https://issues.apache.org/jira/browse/SPARK-13927
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


Add row/column iterator to local matrices to simplify tasks like BlockMatrix => 
RowMatrix conversion.
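For context, the kind of iteration this would replace can already be written by hand against the local {{Matrix}} API. The sketch below is only an illustration; the eventual method names (e.g. {{rowIter}}/{{colIter}}) are not decided here:

{code}
// Hand-rolled column iterator over a local mllib matrix, roughly what the new API would wrap.
import org.apache.spark.mllib.linalg.{DenseMatrix, Vector, Vectors}

val m = new DenseMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))  // column-major values

val colIter: Iterator[Vector] = Iterator.tabulate(m.numCols) { j =>
  Vectors.dense(Array.tabulate(m.numRows)(i => m(i, j)))
}

colIter.foreach(println)  // [1.0,2.0], [3.0,4.0], [5.0,6.0]
{code}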



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13764) Parse modes in JSON data source

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13764:


Assignee: (was: Apache Spark)

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the JSON data source simply fails to read if some JSON documents are 
> malformed.
> For example, given the two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> Reading these will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data themselves.
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway.
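For illustration, usage could mirror spark-csv's option. The {{mode}} option name below is an assumption about the eventual API, not something decided in this issue:

{code}
// Hypothetical usage sketch; the "mode" option name mirrors spark-csv and is an assumption here.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sc = new SparkContext(new SparkConf().setAppName("json-parse-modes").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Custom schema matching the first document above; the second document conflicts with it.
val userType = StructType(Seq(StructField("id", LongType)))
val requestType = StructType(Seq(StructField("user", userType)))
val customSchema = StructType(Seq(StructField("request", requestType)))

val df = sqlContext.read
  .schema(customSchema)
  .option("mode", "DROPMALFORMED")  // or "PERMISSIVE" / "FAILFAST", as in spark-csv
  .json("/path/to/documents.json")
{code}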



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13764) Parse modes in JSON data source

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196887#comment-15196887
 ] 

Apache Spark commented on SPARK-13764:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11756

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the JSON data source simply fails to read if some JSON documents are 
> malformed.
> For example, given the two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> Reading these will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data themselves.
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13764) Parse modes in JSON data source

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13764:


Assignee: Apache Spark

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the JSON data source simply fails to read if some JSON documents are 
> malformed.
> For example, given the two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> Reading these will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data themselves.
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:45 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2 - requires to (re)specify the user / item input col in the input DF
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).
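For reference, the existing mllib calls mentioned above look roughly like this in use (a minimal sketch with toy data and arbitrary rank/iterations, not taken from the issue):

{code}
// Existing mllib API that the ML side would need to match.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

val sc = new SparkContext(new SparkConf().setAppName("als-parity").setMaster("local[*]"))
val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(1, 20, 3.0), Rating(2, 10, 5.0)))

val model: MatrixFactorizationModel = ALS.train(ratings, 8, 10)  // rank = 8, iterations = 10

val top5ItemsForUser1 = model.recommendProducts(1, 5)   // top 5 items for user 1
val top5UsersForItem10 = model.recommendUsers(10, 5)    // top 5 users for item 10
val top5PerUser = model.recommendProductsForUsers(5)    // RDD[(userId, Array[Rating])]
val top5PerItem = model.recommendUsersForProducts(5)    // RDD[(itemId, Array[Rating])]
{code}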



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195702#comment-15195702
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM:
-

Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setUserTopKCol("userTopK").transform(users)
{code}


was (Author: mlnick):
Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setTopKCol("userId").transform(users)
{code}

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2 - requires to (re)specify the user / item input col in the input DF
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:41 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{predictTopK}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and save the resulting DF - so 
perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1
val topKItemsForUsers = model.setK(10).setTopKCol("userId").transform(df)

// Option 2
val topKItemsForUsers = model.predictTopK("userId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit more neatly 
into the {{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13899) Produce InternalRow instead of external Row

2016-03-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13899.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> Produce InternalRow instead of external Row
> ---
>
> Key: SPARK-13899
> URL: https://issues.apache.org/jira/browse/SPARK-13899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> Currently CSVRelation.parseCsv produces external {{Row}}s.
> As described in a TODO, to avoid the extra encoding step it would be great if 
> this produced {{InternalRow}} instead of external {{Row}}.
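For context, the per-record conversion being avoided looks roughly like the sketch below ({{RowEncoder}} is one way to encode an external {{Row}} to an {{InternalRow}}; the schema is made up):

{code}
// Illustration of the external Row -> InternalRow encoding step that producing
// InternalRow directly would skip.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val encoder = RowEncoder(schema)
val internal = encoder.toRow(Row("alice", 30))  // this conversion happens per record today
{code}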



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13926:


Assignee: Apache Spark  (was: Josh Rosen)

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Because ClassTags are available when constructing ShuffledRDD, we can use 
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13926:


Assignee: Josh Rosen  (was: Apache Spark)

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Because ClassTags are available when constructing ShuffledRDD, we can use 
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196857#comment-15196857
 ] 

Apache Spark commented on SPARK-13926:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11755

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Because ClassTags are available when constructing ShuffledRDD, we can use 
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13920.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility does break.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.
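For illustration, exclusions in {{project/MimaExcludes.scala}} take roughly this form; the class and method names below are placeholders, not actual 2.0 breaks:

{code}
// Placeholder MimaExcludes entries; the real list depends on which breaks MIMA reports.
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

val excludesFor20 = Seq(
  exclude[MissingMethodProblem]("org.apache.spark.SomeExperimentalClass.someRemovedMethod"),
  exclude[MissingClassProblem]("org.apache.spark.SomeDeveloperApiClass")
)
{code}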



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13926) Automatically use Kryo serializer when it is known to be safe

2016-03-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13926:
--

 Summary: Automatically use Kryo serializer when it is known to be 
safe
 Key: SPARK-13926
 URL: https://issues.apache.org/jira/browse/SPARK-13926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Josh Rosen
Assignee: Josh Rosen


Because ClassTags are available when constructing ShuffledRDD, we can use them 
to automatically use Kryo for shuffle serialization when the RDD's types are 
guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
combiner types are primitives, arrays of primitives, or strings). This is 
likely to result in a large performance gain for many RDD API workloads.
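A rough sketch of the idea (an assumption about shape, not the actual patch): inspect the ClassTags and only pick Kryo when the types are known to be safe.

{code}
// Decide whether a ClassTag describes a type that is always safe to serialize with Kryo.
import scala.reflect.ClassTag

def kryoSafe(ct: ClassTag[_]): Boolean = {
  val cls = ct.runtimeClass
  cls.isPrimitive ||
    (cls.isArray && cls.getComponentType.isPrimitive) ||
    cls == classOf[String]
}

// e.g. when a ShuffledRDD is constructed with key/value/combiner ClassTags kt, vt, ct:
// val serializer = if (kryoSafe(kt) && kryoSafe(vt) && kryoSafe(ct)) kryoSerializer else defaultSerializer
{code}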



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13926:
---
Summary: Automatically use Kryo serializer when shuffling RDDs with simple 
types  (was: Automatically use Kryo serializer when it is known to be safe)

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Because ClassTags are available when constructing ShuffledRDD, we can use 
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-15 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196805#comment-15196805
 ] 

Hari Shreedharan edited comment on SPARK-13877 at 3/16/16 5:49 AM:
---

[~c...@koeninger.org] - Sure. I agree with having one or more repos - each 
building against a set of compatible APIs. My point is that, whatever the case 
may be, it is more flexible to do that outside Spark than to have multiple 
codebases/modules inside.


was (Author: hshreedharan):
[~c...@koeninger.org] - Sure. I agree with having one or more repos - each 
building against a set of compatible APIs. My point is that, whatever the case 
may be, it is more flexible to do that outside Spark than to have multiple 
codebases inside.

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion on the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-15 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196805#comment-15196805
 ] 

Hari Shreedharan commented on SPARK-13877:
--

[~c...@koeninger.org] - Sure. I agree with having one or more repos - each 
building against a set of compatible APIs. My point is that, whatever the case 
may be, it is more flexible to do that outside Spark than to have multiple 
codebases inside.

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion on the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13925) Expose R-like summary statistics in SparkR::glm for more family and link functions

2016-03-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-13925:
-

 Summary: Expose R-like summary statistics in SparkR::glm for more 
family and link functions
 Key: SPARK-13925
 URL: https://issues.apache.org/jira/browse/SPARK-13925
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Reporter: Xiangrui Meng


This continues the work of SPARK-11494, SPARK-9837, and SPARK-12566 to expose 
R-like model summary in more family and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13925) Expose R-like summary statistics in SparkR::glm for more family and link functions

2016-03-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13925:
--
Priority: Critical  (was: Major)

> Expose R-like summary statistics in SparkR::glm for more family and link 
> functions
> --
>
> Key: SPARK-13925
> URL: https://issues.apache.org/jira/browse/SPARK-13925
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This continues the work of SPARK-11494, SPARK-9837, and SPARK-12566 to expose 
> R-like model summary in more family and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9837) Provide R-like summary statistics for GLMs via iteratively reweighted least squares

2016-03-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9837.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11694
[https://github.com/apache/spark/pull/11694]

> Provide R-like summary statistics for GLMs via iteratively reweighted least 
> squares
> ---
>
> Key: SPARK-9837
> URL: https://issues.apache.org/jira/browse/SPARK-9837
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 2.0.0
>
>
> This is similar to SPARK-9836 but for GLMs other than ordinary least squares.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13917) Generate code for broadcast left semi join

2016-03-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13917.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11742
[https://github.com/apache/spark/pull/11742]

> Generate code for broadcast left semi join
> --
>
> Key: SPARK-13917
> URL: https://issues.apache.org/jira/browse/SPARK-13917
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13903) Modify output nullability with constraints for Join and Filter operators

2016-03-15 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-13903:

Summary: Modify output nullability with constraints for Join and Filter 
operators  (was: Modify output nullability with constraints for Join operator)

> Modify output nullability with constraints for Join and Filter operators
> 
>
> Key: SPARK-13903
> URL: https://issues.apache.org/jira/browse/SPARK-13903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> With constraints and optimization, we can make sure some outputs of a Join 
> operator are not null. We should modify output nullability accordingly. We 
> can use this information in later execution to avoid null checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13903) Modify output nullability with constraints for Join and Filter operators

2016-03-15 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-13903:

Description: 
With constraints and optimization, we can make sure some outputs of a Join (or 
Filter) operator are not null. We should modify output nullability 
accordingly. We can use this information in later execution to avoid null 
checking.

Another reason to modify plan output is that we will use the output to 
determine schema. We should keep correct nullability in the schema.

  was:With constraints and optimization, we can make sure some outputs of a 
Join operator are not null. We should modify output nullability accordingly. 
We can use this information in later execution to avoid null checking.


> Modify output nullability with constraints for Join and Filter operators
> 
>
> Key: SPARK-13903
> URL: https://issues.apache.org/jira/browse/SPARK-13903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> With constraints and optimization, we can make sure some outputs of a Join 
> (or Filter) operator are never null. We should modify the output nullability 
> accordingly. We can use this information in later execution to avoid null 
> checking.
> Another reason to modify the plan output is that we will use the output to 
> determine the schema. We should keep the correct nullability in the schema.
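
As a rough illustration of the behavior being improved (not taken from the issue
itself), assume a hypothetical DataFrame {{df}} with a nullable column {{k}}:

{code}
// Sketch only; df and the column name "k" are assumptions.
import org.apache.spark.sql.functions.col

val filtered = df.filter(col("k").isNotNull)
// Without propagating the IsNotNull constraint, the schema still reports the
// column as nullable even though no null can survive the filter:
filtered.schema("k").nullable
// Tightening Filter/Join output nullability lets later execution (and the
// derived schema) skip the null check for k.
{code}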



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196765#comment-15196765
 ] 

Xusen Yin commented on SPARK-13641:
---

[~muralidh] I am going to close this JIRA since I found that this is intended behavior: 
the one-hot encoder indexes discrete features this way.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the feature names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attributes to differ from the original ones.
> E.g., we want to get the HouseVotes84 feature names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  appends value suffixes to the column names.
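
As a rough illustration, assuming a DataFrame {{df}} loaded from HouseVotes84 with
columns Class and V1..V16, the following sketch reproduces the renamed attributes
using the same lookup as the snippet above:

{code}
// Sketch only; df, the formula string, and how the dataset is loaded are assumptions.
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.attribute.AttributeGroup

val rf = new RFormula()
  .setFormula("Class ~ .")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = rf.fit(df).transform(df)

val attrs = AttributeGroup.fromStructField(output.schema("features"))
// Yields names like "V1_n", "V2_y", ..., not the original "V1", "V2", ...
attrs.attributes.get.map(_.name.get)
{code}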



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196765#comment-15196765
 ] 

Xusen Yin edited comment on SPARK-13641 at 3/16/16 5:00 AM:


I am going to close this JIRA since I found that this is intended behavior: the one-hot 
encoder indexes discrete features this way.


was (Author: yinxusen):
[~muralidh] I am going to close this JIRA since I found that this is intended behavior: 
the one-hot encoder indexes discrete features this way.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the feature names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attributes to differ from the original ones.
> E.g., we want to get the HouseVotes84 feature names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  appends value suffixes to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13924) officially support multi-insert

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13924:


Assignee: Apache Spark

> officially support multi-insert
> ---
>
> Key: SPARK-13924
> URL: https://issues.apache.org/jira/browse/SPARK-13924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13924) officially support multi-insert

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13924:


Assignee: (was: Apache Spark)

> officially support multi-insert
> ---
>
> Key: SPARK-13924
> URL: https://issues.apache.org/jira/browse/SPARK-13924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13924) officially support multi-insert

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196760#comment-15196760
 ] 

Apache Spark commented on SPARK-13924:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11754

> officially support multi-insert
> ---
>
> Key: SPARK-13924
> URL: https://issues.apache.org/jira/browse/SPARK-13924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196758#comment-15196758
 ] 

Apache Spark commented on SPARK-13316:
--

User 'mwws' has created a pull request for this issue:
https://github.com/apache/spark/pull/11753

> "SparkException: DStream has not been initialized" when restoring 
> StreamingContext from checkpoint and the dstream is created afterwards
> 
>
> Key: SPARK-13316
> URL: https://issues.apache.org/jira/browse/SPARK-13316
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Jacek Laskowski
>Priority: Minor
>
> I faced the issue today but [it was already reported on 
> SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the 
> reason is that a dstream is registered after a StreamingContext has been 
> recreated from checkpoint.
> It _appears_ that no dstreams may be registered after a StreamingContext 
> has been recreated from a checkpoint. This is *not* obvious at first.
> The code:
> {code}
> def createStreamingContext(): StreamingContext = {
>   val ssc = new StreamingContext(sparkConf, Duration(1000))
>   ssc.checkpoint(checkpointDir)
>   ssc
> }
> val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)
> val socketStream = ssc.socketTextStream(...)
> socketStream.checkpoint(Seconds(1))
> socketStream.foreachRDD(...)
> {code}
> It should be described in docs at the very least and/or checked in the code 
> when the streaming computation starts.
> The exception is as follows:
> {code}
> org.apache.spark.SparkException: 
> org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been 
> initialized
>   at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579)
>   ... 43 elided
> {code}
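
A sketch of the pattern that avoids the exception: all dstream setup has to happen
inside the factory function passed to {{getOrCreate}}, so that it also runs when the
context is recovered from the checkpoint. The app name, checkpoint path, host, port,
and the foreachRDD body below are placeholders.

{code}
// Sketch only; configuration values are assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("checkpoint-example")
val checkpointDir = "/tmp/checkpoint"   // hypothetical path

def createStreamingContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Duration(1000))
  ssc.checkpoint(checkpointDir)
  // Register the dstreams here, inside the factory, not after getOrCreate:
  val socketStream = ssc.socketTextStream("localhost", 9999)
  socketStream.checkpoint(Seconds(1))
  socketStream.foreachRDD(rdd => println(rdd.count()))
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)
ssc.start()
ssc.awaitTermination()
{code}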



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13316:


Assignee: Apache Spark

> "SparkException: DStream has not been initialized" when restoring 
> StreamingContext from checkpoint and the dstream is created afterwards
> 
>
> Key: SPARK-13316
> URL: https://issues.apache.org/jira/browse/SPARK-13316
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> I faced the issue today but [it was already reported on 
> SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the 
> reason is that a dstream is registered after a StreamingContext has been 
> recreated from checkpoint.
> It _appears_ that no dstreams may be registered after a StreamingContext 
> has been recreated from a checkpoint. This is *not* obvious at first.
> The code:
> {code}
> def createStreamingContext(): StreamingContext = {
>   val ssc = new StreamingContext(sparkConf, Duration(1000))
>   ssc.checkpoint(checkpointDir)
>   ssc
> }
> val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)
> val socketStream = ssc.socketTextStream(...)
> socketStream.checkpoint(Seconds(1))
> socketStream.foreachRDD(...)
> {code}
> It should be described in docs at the very least and/or checked in the code 
> when the streaming computation starts.
> The exception is as follows:
> {code}
> org.apache.spark.SparkException: 
> org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been 
> initialized
>   at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579)
>   ... 43 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13316) "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13316:


Assignee: (was: Apache Spark)

> "SparkException: DStream has not been initialized" when restoring 
> StreamingContext from checkpoint and the dstream is created afterwards
> 
>
> Key: SPARK-13316
> URL: https://issues.apache.org/jira/browse/SPARK-13316
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Jacek Laskowski
>Priority: Minor
>
> I faced the issue today but [it was already reported on 
> SO|http://stackoverflow.com/q/35090180/1305344] a couple of days ago and the 
> reason is that a dstream is registered after a StreamingContext has been 
> recreated from checkpoint.
> It _appears_ that no dstreams may be registered after a StreamingContext 
> has been recreated from a checkpoint. This is *not* obvious at first.
> The code:
> {code}
> def createStreamingContext(): StreamingContext = {
>   val ssc = new StreamingContext(sparkConf, Duration(1000))
>   ssc.checkpoint(checkpointDir)
>   ssc
> }
> val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)
> val socketStream = ssc.socketTextStream(...)
> socketStream.checkpoint(Seconds(1))
> socketStream.foreachRDD(...)
> {code}
> It should be described in docs at the very least and/or checked in the code 
> when the streaming computation starts.
> The exception is as follows:
> {code}
> org.apache.spark.SparkException: 
> org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been 
> initialized
>   at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:589)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:585)
>   at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:585)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:579)
>   ... 43 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

2016-03-15 Thread Ashok kumar Rajendran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196756#comment-15196756
 ] 

Ashok kumar Rajendran commented on SPARK-13900:
---

Along with the dimension conditions, the above queries have some timestamp-based 
filters as well, but that does not seem to change the query plan much between these 
two types of query.

I would highly appreciate any help on optimizing this query, as it is a 
critical query in our application.

> Spark SQL queries with OR condition is not optimized properly
> -
>
> Key: SPARK-13900
> URL: https://issues.apache.org/jira/browse/SPARK-13900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ashok kumar Rajendran
>
> I have a large table with a few billion rows and a very small table 
> with four dimension values. All the data is stored in Parquet format. I would 
> like to get rows that match any of these dimensions. For example,
> Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR 
> A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = 
> B.dimension4.
> The query plan treats this as a BroadcastNestedLoopJoin and executes for a very 
> long time.
> If I execute this as Union queries, it takes around 1.5mins for each 
> dimension. Each query internally does BroadcastHashJoin.
> Select field1, field2 from A, B where A.dimension1 = B.dimension1
> UNION ALL
> Select field1, field2 from A, B where A.dimension2 = B.dimension2
> UNION ALL
> Select field1, field2 from A, B where  A.dimension3 = B.dimension3
> UNION ALL
> Select field1, field2 from A, B where  A.dimension4 = B.dimension4.
> This is obviously not an optimal solution, as it scans the same table multiple 
> times, but it gives much better results than the OR condition. 
> It seems the SQL optimizer is not handling this properly, which causes a huge 
> performance impact on this type of OR query.
> Please correct me if I missed anything here. 
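
For illustration, the UNION ALL workaround can also be expressed with the
DataFrame API. In this sketch, {{a}} and {{b}} are assumed DataFrames for tables A
and B, and {{field1}}/{{field2}} are assumed to come from the big table A; the
helper {{legFor}} is hypothetical.

{code}
// Sketch of the UNION ALL workaround; a, b and the column ownership are assumptions.
def legFor(dim: String) =
  a.join(b, a(dim) === b(dim)).select(a("field1"), a("field2"))

val result = legFor("dimension1")
  .unionAll(legFor("dimension2"))
  .unionAll(legFor("dimension3"))
  .unionAll(legFor("dimension4"))   // each leg can use a BroadcastHashJoin

result.explain(true)
{code}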



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3308) Ability to read JSON Arrays as tables

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196754#comment-15196754
 ] 

Apache Spark commented on SPARK-3308:
-

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11752

> Ability to read JSON Arrays as tables
> -
>
> Key: SPARK-3308
> URL: https://issues.apache.org/jira/browse/SPARK-3308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now we can only read json where each object is on its own line.  It 
> would be nice to be able to read top level json arrays where each element 
> maps to a row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

2016-03-15 Thread Ashok kumar Rajendran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196753#comment-15196753
 ] 

Ashok kumar Rajendran commented on SPARK-13900:
---

The explain plan for the query with the OR condition is below.

Explain execution 
16/03/15 21:00:22 INFO datasources.DataSourceStrategy: Selected 24 partitions 
out of 24, pruned 0.0% partitions.
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_8 stored as values 
in memory (estimated size 73.3 KB, free 511.5 KB)
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 516.5 KB)
16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 8 from explain at 
JavaSparkSQL.java:682
16/03/15 21:00:23 INFO storage.MemoryStore: Block broadcast_9 stored as values 
in memory (estimated size 73.3 KB, free 589.8 KB)
16/03/15 21:00:23 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 594.7 KB)
16/03/15 21:00:23 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:23 INFO spark.SparkContext: Created broadcast 9 from explain at 
JavaSparkSQL.java:682
== Parsed Logical Plan ==
'Project 
[unresolvedalias('TableA_Dimension1),unresolvedalias('TableA_Dimension2),unresolvedalias('TableA_Dimension3),unresolvedalias('TableA_timestamp_millis),unresolvedalias('TableA_field40),unresolvedalias('TableB_dimension1
 AS TableB_dimension1#248),unresolvedalias('TableB_dimension2 AS 
inv_ua#249),unresolvedalias('TableB_dimension3 AS 
TableB_dimension3#250),unresolvedalias('TableB_timestamp_mills AS 
TableB_timestamp_mills#251)]
+- 'Filter (('TableA_Dimension1 = 'TableB_dimension1) || 
('TableA_Dimension3 = 'TableB_dimension3)) || ('TableA_Dimension2 = 
'TableB_dimension2)) && ('TableA_timestamp_millis >= 'TableB_timestamp_mills)) 
&& ('TableA_timestamp_millis <= ('TableB_timestamp_mills + 360))) && 
('TableA_partition_hour_bucket >= 'TableB_partition_hour_bucket))
   +- 'Join Inner, None
  :- 'UnresolvedRelation `TableA`, None
  +- 'UnresolvedRelation `TableB`, None

== Analyzed Logical Plan ==
TableA_Dimension1: string, TableA_Dimension2: string, TableA_Dimension3: 
string, TableA_timestamp_millis: string, TableA_field40: string, 
TableB_dimension1: string, inv_ua: string, TableB_dimension3: string, 
TableB_timestamp_mills: string
Project 
[TableA_Dimension1#74,TableA_Dimension2#94,TableA_Dimension3#68,TableA_timestamp_millis#38,TableA_field40#40,TableB_dimension1#162
 AS TableB_dimension1#248,TableB_dimension2#155 AS 
inv_ua#249,TableB_dimension3#156 AS 
TableB_dimension3#250,TableB_timestamp_mills#157 AS TableB_timestamp_mills#251]
+- Filter ((TableA_Dimension1#74 = TableB_dimension1#162) || 
(TableA_Dimension3#68 = TableB_dimension3#156)) || (TableA_Dimension2#94 = 
TableB_dimension2#155)) && (TableA_timestamp_millis#38 >= 
TableB_timestamp_mills#157)) && (cast(TableA_timestamp_millis#38 as double) <= 
(cast(TableB_timestamp_mills#157 as double) + cast(360 as double && 
(TableA_partition_hour_bucket#153 >= TableB_partition_hour_bucket#159))
   +- Join Inner, None
  :- Subquery TableA
  :  +- 
Relation[TableA_field0#0,TableA_field1#1,TableA_field2#2,TableA_field3#3,TableA_field4#4,TableA_field5#5,TableA_field6#6,TableA_field7#7,TableA_field8#8,TableA_field9#9,TableA_field10#10,TableA_field11#11,TableA_field12#12,TableA_field13#13,TableA_field14#14,TableA_field15#15,,TableA_field150#150]
 ParquetRelation
  +- Subquery TableB
 +- 
Relation[timeBucket#154,TableB_dimension2#155,TableB_dimension3#156,TableB_timestamp_mills#157,endTime#158,TableB_partition_hour_bucket#159,endTimeBucket#160,count#161L,TableB_dimension1#162]
 ParquetRelation

== Optimized Logical Plan ==
Project 
[TableA_Dimension1#74,TableA_Dimension2#94,TableA_Dimension3#68,TableA_timestamp_millis#38,TableA_field40#40,TableB_dimension1#162
 AS TableB_dimension1#248,TableB_dimension2#155 AS 
inv_ua#249,TableB_dimension3#156 AS 
TableB_dimension3#250,TableB_timestamp_mills#157 AS TableB_timestamp_mills#251]
+- Join Inner, Some(((TableA_Dimension1#74 = TableB_dimension1#162) || 
(TableA_Dimension3#68 = TableB_dimension3#156)) || (TableA_Dimension2#94 = 
TableB_dimension2#155)) && (TableA_timestamp_millis#38 >= 
TableB_timestamp_mills#157)) && (cast(TableA_timestamp_millis#38 as double) <= 
(cast(TableB_timestamp_mills#157 as double) + 360.0))) && 
(TableA_partition_hour_bucket#153 >= TableB_partition_hour_bucket#159)))
   :- Project 
[TableA_timestamp_millis#38,TableA_Dimension1#74,TableA_field40#40,TableA_Dimension3#68,TableA_partition_hour_bucket#153,TableA_Dimension2#94]
   :  +- 
Relation[Ta

[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

2016-03-15 Thread Ashok kumar Rajendran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196752#comment-15196752
 ] 

Ashok kumar Rajendran commented on SPARK-13900:
---

Hi Xiao Li,

Thanks for looking into this. Here are the explain plans for these two queries. 
(TableA is a big table with 150 fields; I shortened it here to reduce the text 
size.)

Execution plan for Union Query.
Explain execution 
16/03/15 21:00:21 INFO datasources.DataSourceStrategy: Selected 24 partitions 
out of 24, pruned 0.0% partitions.
16/03/15 21:00:21 INFO storage.MemoryStore: Block broadcast_2 stored as values 
in memory (estimated size 24.1 KB, free 41.9 KB)
16/03/15 21:00:21 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 46.9 KB)
16/03/15 21:00:21 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:21 INFO spark.SparkContext: Created broadcast 2 from explain at 
JavaSparkSQL.java:676
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_3 stored as values 
in memory (estimated size 73.3 KB, free 120.2 KB)
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 125.1 KB)
16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 3 from explain at 
JavaSparkSQL.java:676
16/03/15 21:00:22 INFO datasources.DataSourceStrategy: Selected 24 partitions 
out of 24, pruned 0.0% partitions.
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_4 stored as values 
in memory (estimated size 73.3 KB, free 198.5 KB)
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 203.4 KB)
16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 4 from explain at 
JavaSparkSQL.java:676
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_5 stored as values 
in memory (estimated size 73.3 KB, free 276.7 KB)
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 281.7 KB)
16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 5 from explain at 
JavaSparkSQL.java:676
16/03/15 21:00:22 INFO datasources.DataSourceStrategy: Selected 24 partitions 
out of 24, pruned 0.0% partitions.
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_6 stored as values 
in memory (estimated size 73.3 KB, free 355.0 KB)
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 359.9 KB)
16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 6 from explain at 
JavaSparkSQL.java:676
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_7 stored as values 
in memory (estimated size 73.3 KB, free 433.3 KB)
16/03/15 21:00:22 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as 
bytes in memory (estimated size 4.9 KB, free 438.2 KB)
16/03/15 21:00:22 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in 
memory on 10.88.12.80:50492 (size: 4.9 KB, free: 7.0 GB)
16/03/15 21:00:22 INFO spark.SparkContext: Created broadcast 7 from explain at 
JavaSparkSQL.java:676
== Parsed Logical Plan ==
Union
:- Union
:  :- Project 
[TableA_Dimension1#74,TableA_Dimension2#94,TableA_Dimension3#68,TableA_timestamp_millis#38,TableA_field40#40,TableB_dimension1#162
 AS TableB_dimension1#163,TableB_dimension2#155 AS 
inv_ua#164,TableB_dimension3#156 AS 
TableB_dimension3#165,TableB_timestamp_mills#157 AS TableB_timestamp_mills#166]
:  :  +- Filter (((TableA_timestamp_millis#38 >= TableB_timestamp_mills#157) && 
(cast(TableA_timestamp_millis#38 as double) <= (cast(TableB_timestamp_mills#157 
as double) + cast(360 as double && (TableA_partition_hour_bucket#153 >= 
TableB_partition_hour_bucket#159))
:  : +- Join Inner, Some((TableA_Dimension1#74 = TableB_dimension1#162))
:  ::- Subquery TableA
:  ::  +- 
Relation[TableA_field0#0,TableA_field1#1,TableA_field2#2,TableA_field3#3,TableA_field4#4,TableA_field5#5,TableA_field6#6,TableA_field7#7,TableA_field8#8,TableA_field9#9,TableA_field10#10,TableA_field11#11,TableA_field12#12,TableA_field13#13,TableA_field14#14,TableA_field15#15,,TableA_field150#150]
 ParquetRelation
:  :+- Subquery TableB
:  :   +-

[jira] [Comment Edited] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-15 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464
 ] 

Dilip Biswal edited comment on SPARK-13821 at 3/16/16 4:41 AM:
---

[~roycecil] I just tried the original query no. 20, posted at 
https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 , 
against Spark 2.0.

I could see the same error that is reported in the JIRA. It seems that there is 
an extra comma in the projection list between two columns, like the following:

{code}
select  i_item_id,
   ,i_item_desc
{code}

Please note that we ran against 2.0, not 1.6. Can you please re-run to make 
sure?


was (Author: dkbiswal):
[~roycecil] Just tried the original query no. 20  against spark 2.0 posted at 
https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 .

I could see the same error that is reported in the JIRA. It seems the there is 
an extra comma
in the projection list between two columns like following.

{code}
select  i_item_id,
   ,i_item_desc
{code}

Please note that we ran against 2.0 and not 1.6. Can you please re-run to make 
sure ?
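
For reference, the fragment parses once one of the two commas is dropped, e.g.:

{code}
select  i_item_id
       ,i_item_desc
{code}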

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13924) officially support multi-insert

2016-03-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13924:
---

 Summary: officially support multi-insert
 Key: SPARK-13924
 URL: https://issues.apache.org/jira/browse/SPARK-13924
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13920:


Assignee: Apache Spark

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility breaks.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.
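
For reference, a sketch of what such an exclusion entry in
{{project/MimaExcludes.scala}} typically looks like; the fully qualified member
name below is a placeholder, not an actual break:

{code}
// Hypothetical MimaExcludes entry; the member name is a placeholder.
import com.typesafe.tools.mima.core._

Seq(
  ProblemFilters.exclude[MissingMethodProblem](
    "org.apache.spark.SomeExperimentalClass.someChangedMethod")
)
{code}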



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196743#comment-15196743
 ] 

Apache Spark commented on SPARK-13920:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11751

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility breaks.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13920:


Assignee: (was: Apache Spark)

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility breaks.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin

2016-03-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13918.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> merge SortMergeJoin and SortMergeOuterJoin
> --
>
> Key: SPARK-13918
> URL: https://issues.apache.org/jira/browse/SPARK-13918
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> There is a lot of duplicated code in SortMergeJoin and SortMergeOuterJoin; we 
> should merge them and reduce the duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13764) Parse modes in JSON data source

2016-03-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196665#comment-15196665
 ] 

Hyukjin Kwon commented on SPARK-13764:
--

The issue SPARK-3308 is about supporting rows wrapped in a top-level array.

For other types, this works like the PERMISSIVE mode in the CSV data source; the 
exception above is emitted only when the actual data is an array and the given 
data type is {{StructType}}.

This is because the JSON data source converts each value to the desired type based 
on the combination of the Jackson parser's token and the given data type, and the 
combination of {{START_OBJECT}} and {{ArrayType}} exists to support the issue above.

So, if the given schema is {{StructType}} and the actual data is an array, the 
combination of {{START_OBJECT}} and {{ArrayType}} is applied to this case, which 
causes the exception above.
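
A minimal way to hit that combination, assuming an existing {{sqlContext}} and a
file {{documents.json}} containing the two documents from the description (both
names are assumptions):

{code}
// Sketch only: a user-specified StructType schema applied to a document whose
// "user" field is actually an array, as in the second document of the description.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("request", StructType(Seq(
    StructField("user", StructType(Seq(
      StructField("id", LongType)
    )))
  )))
))

val df = sqlContext.read.schema(schema).json("documents.json")
df.collect()   // fails with the ClassCastException quoted in the description
{code}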


> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the JSON data source just fails to read the input if some JSON 
> documents are malformed.
> For example, if there are the two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> This will fail emitting the exception below :
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data themselves.
> This happens only when a custom schema is set. When the schema is inferred, the 
> type is inferred as {{StringType}}, which reads the data successfully anyway. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (SPARK-13843) Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages

2016-03-15 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196661#comment-15196661
 ] 

Liwei Lin edited comment on SPARK-13843 at 3/16/16 2:50 AM:


hi [~zsxwing], we didn't move streaming-kinesis (which is also under external) 
out -- is this left out on purpose or should we also move that out? Thanks!


was (Author: proflin):
hi [~zsxwing], we didn't move streaming-kinesis (which is also under external) 
out -- why is that please?

> Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, 
> streaming-twitter to Spark packages
> ---
>
> Key: SPARK-13843
> URL: https://issues.apache.org/jira/browse/SPARK-13843
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Currently there are a few sub-projects, each for integrating with different 
> external sources for Streaming.  Now that we have better ability to include 
> external libraries (Spark packages) and with Spark 2.0 coming up, we can move 
> the following projects out of Spark to https://github.com/spark-packages
> - streaming-flume
> - streaming-akka
> - streaming-mqtt
> - streaming-zeromq
> - streaming-twitter
> They are just ancillary packages and, considering the overhead of 
> maintenance, running tests, and PR failures, it is better to maintain them outside 
> of Spark. In addition, these projects can then have their own release cycles 
> and we can release them faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13923:


Assignee: Andrew Or  (was: Apache Spark)

> Implement SessionCatalog to manage temp functions and tables
> 
>
> Key: SPARK-13923
> URL: https://issues.apache.org/jira/browse/SPARK-13923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Today, we have ExternalCatalog, which is dead code. As part of the effort of 
> merging SQLContext/HiveContext we'll parse Hive commands and call the 
> corresponding methods in ExternalCatalog.
> However, this handles only persisted things. We need something in addition to 
> that to handle temporary things. The new catalog is called SessionCatalog and 
> will internally call ExternalCatalog.
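
A purely conceptual sketch of that layering (the type and method names below are
placeholders, not Spark's actual API): temporary objects are kept in the
session-level catalog, and lookups fall back to the persistent catalog.

{code}
// Conceptual sketch only; none of these types mirror the real implementation.
trait Plan
trait PersistentCatalog { def getTable(db: String, table: String): Plan }

class SessionCatalog(external: PersistentCatalog) {
  private val tempTables = scala.collection.mutable.HashMap.empty[String, Plan]

  def createTempTable(name: String, plan: Plan): Unit = tempTables(name) = plan

  // Temporary tables shadow persisted ones; everything else is delegated
  // to the persistent (external) catalog.
  def lookupRelation(db: String, table: String): Plan =
    tempTables.getOrElse(table, external.getTable(db, table))
}
{code}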



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196664#comment-15196664
 ] 

Apache Spark commented on SPARK-13923:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11750

> Implement SessionCatalog to manage temp functions and tables
> 
>
> Key: SPARK-13923
> URL: https://issues.apache.org/jira/browse/SPARK-13923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Today, we have ExternalCatalog, which is dead code. As part of the effort of 
> merging SQLContext/HiveContext we'll parse Hive commands and call the 
> corresponding methods in ExternalCatalog.
> However, this handles only persisted things. We need something in addition to 
> that to handle temporary things. The new catalog is called SessionCatalog and 
> will internally call ExternalCatalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13923:


Assignee: Apache Spark  (was: Andrew Or)

> Implement SessionCatalog to manage temp functions and tables
> 
>
> Key: SPARK-13923
> URL: https://issues.apache.org/jira/browse/SPARK-13923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> Today, we have ExternalCatalog, which is dead code. As part of the effort of 
> merging SQLContext/HiveContext we'll parse Hive commands and call the 
> corresponding methods in ExternalCatalog.
> However, this handles only persisted things. We need something in addition to 
> that to handle temporary things. The new catalog is called SessionCatalog and 
> will internally call ExternalCatalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13843) Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages

2016-03-15 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196661#comment-15196661
 ] 

Liwei Lin commented on SPARK-13843:
---

hi [~zsxwing], we didn't move streaming-kinesis (which is also under external) 
out -- why is that please?

> Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, 
> streaming-twitter to Spark packages
> ---
>
> Key: SPARK-13843
> URL: https://issues.apache.org/jira/browse/SPARK-13843
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Currently there are a few sub-projects, each for integrating with different 
> external sources for Streaming.  Now that we have better ability to include 
> external libraries (Spark packages) and with Spark 2.0 coming up, we can move 
> the following projects out of Spark to https://github.com/spark-packages
> - streaming-flume
> - streaming-akka
> - streaming-mqtt
> - streaming-zeromq
> - streaming-twitter
> They are just ancillary packages and, considering the overhead of 
> maintenance, running tests, and PR failures, it is better to maintain them outside 
> of Spark. In addition, these projects can then have their own release cycles 
> and we can release them faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables

2016-03-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13923:
-

 Summary: Implement SessionCatalog to manage temp functions and 
tables
 Key: SPARK-13923
 URL: https://issues.apache.org/jira/browse/SPARK-13923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or


Today, we have ExternalCatalog, which is dead code. As part of the effort of 
merging SQLContext/HiveContext we'll parse Hive commands and call the 
corresponding methods in ExternalCatalog.

However, this handles only persisted things. We need something in addition to 
that to handle temporary things. The new catalog is called SessionCatalog and 
will internally call ExternalCatalog.
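
For illustration, a minimal sketch of the intended layering (the class shape, 
method names, and the string stand-ins for plans below are assumptions for this 
example, not the actual SessionCatalog API):

{code}
// Hypothetical sketch only: temporary objects live in session-local maps, while
// anything persistent is delegated to the underlying external catalog.
import scala.collection.mutable

class ExternalCatalogSketch {
  // Stand-in for the persistent catalog (backed by Hive or an in-memory impl).
  def getTable(db: String, name: String): Option[String] = None
}

class SessionCatalogSketch(external: ExternalCatalogSketch) {
  private val tempTables = mutable.HashMap[String, String]()    // name -> plan (simplified)
  private val tempFunctions = mutable.HashMap[String, String]() // name -> definition

  def registerTempTable(name: String, plan: String): Unit = tempTables(name) = plan
  def registerTempFunction(name: String, definition: String): Unit =
    tempFunctions(name) = definition

  // Temporary tables shadow persistent tables of the same name; everything else
  // falls through to the external catalog.
  def lookupTable(db: String, name: String): Option[String] =
    tempTables.get(name).orElse(external.getTable(db, name))
}
{code}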



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196626#comment-15196626
 ] 

Dongjoon Hyun edited comment on SPARK-13920 at 3/16/16 2:16 AM:


Oh, [~joshrosen]. 
I totally agree with you on the purpose of this issue. 
I'll start working on this issue.


was (Author: dongjoon):
Oh, [~joshrosen]. 
I totally agree with the purpose of this issue. 
I'll start working on this issue.

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility breaks.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196626#comment-15196626
 ] 

Dongjoon Hyun commented on SPARK-13920:
---

Oh, [~joshrosen]. 
I totally agree with the purpose of this issue. 
I'll start working on this issue.

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility breaks.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13915) Allow bin/spark-submit to be called via symbolic link

2016-03-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196590#comment-15196590
 ] 

Saisai Shao commented on SPARK-13915:
-

I suppose it should work if you set {{SPARK_HOME}} in your environment. 
For the vendor release version, I think you need to check whether the script has 
been changed, as it may be slightly different from the community version.

> Allow bin/spark-submit to be called via symbolic link
> -
>
> Key: SPARK-13915
> URL: https://issues.apache.org/jira/browse/SPARK-13915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
> Environment: CentOS 6.6
> Tarball spark distribution and CDH-5.x.x Spark version (both)
>Reporter: Rafael Pecin Ferreira
>Priority: Minor
>
> We have a CDH-5 cluster that comes with spark-1.5.0 and we needed to use 
> spark-1.5.1 for bug fix issues.
> When I registered the spark installation (outside the CDH box) with the system 
> alternatives, it created a sequence of symbolic links to the target spark 
> installation.
> When I tried to run spark-submit, the bash process called the target with "$0" 
> set to /usr/bin/spark-submit, but this script uses the "$0" variable to locate 
> its dependencies, and I was seeing these messages:
> [hdfs@server01 ~]$ env spark-submit
> ls: cannot access /usr/assembly/target/scala-2.10: No such file or directory
> Failed to find Spark assembly in /usr/assembly/target/scala-2.10.
> You need to build Spark before running this program.
> I fixed the spark-submit script by adding these lines:
> if [ -h "$0" ] ; then
> checklink="$0";
> while [ -h "$checklink" ] ; do
> checklink=`readlink $checklink`
> done
> SPARK_HOME="$(cd "`dirname "$checklink"`"/..; pwd)";
> else
> SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)";
> fi
> It would be very nice if this piece of code could be put into the spark-submit 
> script to allow us to have multiple spark alternatives on the system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13872) Memory leak in SortMergeOuterJoin

2016-03-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196585#comment-15196585
 ] 

Yin Huai commented on SPARK-13872:
--

Also cc [~davies]

> Memory leak in SortMergeOuterJoin
> -
>
> Key: SPARK-13872
> URL: https://issues.apache.org/jira/browse/SPARK-13872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ian
> Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png
>
>
> SortMergeJoin composes its partition/iterator from 
> org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to 
> UnsafeExternalRowSorter.
> UnsafeExternalRowSorter's implementation cleans up its resources when:
> 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully 
> iterated, or
> 2. the task finishes execution.
> In the outer-join case of SortMergeJoin, when the left or right iterator is not 
> fully iterated, the only chance for the resources to be cleaned up is at the 
> end of the Spark task run.
> This is probably fine most of the time; however, when a SortMergeOuterJoin is 
> nested within a CartesianProduct, the "deferred" resource cleanup produces a 
> non-negligible memory leak that is amplified and accumulated by the 
> CartesianRdd's looping iteration.
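
To make the cleanup semantics concrete, here is a minimal sketch (not the actual 
UnsafeExternalRowSorter code; the class name and cleanup callback are made up, 
and the task-completion listener API is the one from Spark 1.6):

{code}
// Resources are released either when the iterator is exhausted or, as a
// fallback, when the task completes. If a consumer (e.g. an outer join) stops
// pulling early, only the task-completion path runs, so the memory is held for
// the remainder of the task.
import org.apache.spark.TaskContext

class CleanupOnExhaustionIterator[T](underlying: Iterator[T], cleanup: () => Unit)
  extends Iterator[T] {

  private var cleaned = false
  private def doCleanup(): Unit = if (!cleaned) { cleaned = true; cleanup() }

  // Fallback path: release resources when the whole task finishes.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener { (_: TaskContext) => doCleanup() }
  }

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) doCleanup() // eager path: iterator fully consumed
    more
  }

  override def next(): T = underlying.next()
}
{code}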



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13914) Add functionality to back up spark event logs

2016-03-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196580#comment-15196580
 ] 

Saisai Shao commented on SPARK-13914:
-

Agree with [~srowen]: this feature is quite user-specific, and you likely have 
many other ways to address your problem. Putting it into the Spark code would 
increase the maintenance overhead, and I believe most users will not use this 
functionality.

> Add functionality to back up spark event logs
> -
>
> Key: SPARK-13914
> URL: https://issues.apache.org/jira/browse/SPARK-13914
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0, 1.6.2, 2.0.0
>Reporter: Parag Chaudhari
>
> Spark event logs are usually stored in HDFS when running Spark on YARN. In a 
> cloud environment, these HDFS files are often stored on the disks of 
> ephemeral instances that could go away once the instances are terminated. 
> Users may want to persist the event logs as events happen, for issue 
> investigation and performance analysis before and after the cluster is 
> terminated. The backup path can be managed by the spark users based on their 
> needs. For example, some users may copy the event logs to a cloud storage 
> service directly and keep them there forever. While some other users may want 
> to store the event logs on local disks and back them up to a cloud storage 
> service from time to time. Other users will not want to use the feature, so 
> this feature should be off by default; users enable the feature when and only 
> when they need it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13922:


Assignee: (was: Apache Spark)

> Filter rows with null attributes in parquet vectorized reader
> -
>
> Key: SPARK-13922
> URL: https://issues.apache.org/jira/browse/SPARK-13922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>
> It's common for many SQL operators to not care about reading null values for 
> correctness. Currently, this is achieved by performing `isNotNull` checks 
> (for  all relevant columns) on a per-row basis. Pushing these null filters in 
> parquet vectorized reader should bring considerable benefits (especially for 
> cases when the underlying data doesn't contain any nulls or contains all 
> nulls).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196577#comment-15196577
 ] 

Apache Spark commented on SPARK-13922:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11749

> Filter rows with null attributes in parquet vectorized reader
> -
>
> Key: SPARK-13922
> URL: https://issues.apache.org/jira/browse/SPARK-13922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>
> It's common for many SQL operators to not care about reading null values for 
> correctness. Currently, this is achieved by performing `isNotNull` checks 
> (for  all relevant columns) on a per-row basis. Pushing these null filters in 
> parquet vectorized reader should bring considerable benefits (especially for 
> cases when the underlying data doesn't contain any nulls or contains all 
> nulls).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13922:


Assignee: Apache Spark

> Filter rows with null attributes in parquet vectorized reader
> -
>
> Key: SPARK-13922
> URL: https://issues.apache.org/jira/browse/SPARK-13922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>
> It's common for many SQL operators to not care about reading null values for 
> correctness. Currently, this is achieved by performing `isNotNull` checks 
> (for  all relevant columns) on a per-row basis. Pushing these null filters in 
> parquet vectorized reader should bring considerable benefits (especially for 
> cases when the underlying data doesn't contain any nulls or contains all 
> nulls).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13922) Filter rows with null attributes in parquet vectorized reader

2016-03-15 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13922:
--

 Summary: Filter rows with null attributes in parquet vectorized 
reader
 Key: SPARK-13922
 URL: https://issues.apache.org/jira/browse/SPARK-13922
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sameer Agarwal


It's common for many SQL operators to not care about reading null values for 
correctness. Currently, this is achieved by performing `isNotNull` checks (for  
all relevant columns) on a per-row basis. Pushing these null filters in parquet 
vectorized reader should bring considerable benefits (especially for cases when 
the underlying data doesn't contain any nulls or contains all nulls).
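
For illustration, a query of the shape that would benefit (run in spark-shell, 
which predefines sqlContext; the path and column names are made up):

{code}
import sqlContext.implicits._

// Hypothetical parquet dataset.
val events = sqlContext.read.parquet("/tmp/events")

// Today these IsNotNull predicates are evaluated row by row after the scan; the
// proposal is to let the vectorized parquet reader drop null entries (and skip
// all-null or no-null pages cheaply) before rows are materialized.
val nonNull = events.filter($"user_id".isNotNull && $"ts".isNotNull)
nonNull.groupBy($"user_id").count().show()
{code}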



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13921:


Assignee: Josh Rosen  (was: Apache Spark)

> Store serialized blocks as multiple chunks in MemoryStore
> -
>
> Key: SPARK-13921
> URL: https://issues.apache.org/jira/browse/SPARK-13921
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Instead of storing serialized blocks in individual ByteBuffers, the 
> BlockManager should be capable of storing a serialized block in multiple 
> chunks, each occupying a separate ByteBuffer.
> This change will help to improve the efficiency of memory allocation and the 
> accuracy of memory accounting when serializing blocks. Our current 
> serialization code uses a {{ByteBufferOutputStream}}, which doubles and 
> re-allocates its backing byte array; this increases the peak memory 
> requirements during serialization (since we need to hold extra memory while 
> expanding the array). In addition, we currently don't account for the extra 
> wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte 
> serialized block may actually consume 256 megabytes of memory. After 
> switching to storing blocks in multiple chunks, we'll be able to efficiently 
> trim the backing buffers so that no space is wasted.
> This change is also a prerequisite to being able to cache blocks which are 
> larger than 2GB (although full support for that depends on several other 
> changes which have not been implemented yet).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196567#comment-15196567
 ] 

Apache Spark commented on SPARK-13921:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11748

> Store serialized blocks as multiple chunks in MemoryStore
> -
>
> Key: SPARK-13921
> URL: https://issues.apache.org/jira/browse/SPARK-13921
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Instead of storing serialized blocks in individual ByteBuffers, the 
> BlockManager should be capable of storing a serialized block in multiple 
> chunks, each occupying a separate ByteBuffer.
> This change will help to improve the efficiency of memory allocation and the 
> accuracy of memory accounting when serializing blocks. Our current 
> serialization code uses a {{ByteBufferOutputStream}}, which doubles and 
> re-allocates its backing byte array; this increases the peak memory 
> requirements during serialization (since we need to hold extra memory while 
> expanding the array). In addition, we currently don't account for the extra 
> wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte 
> serialized block may actually consume 256 megabytes of memory. After 
> switching to storing blocks in multiple chunks, we'll be able to efficiently 
> trim the backing buffers so that no space is wasted.
> This change is also a prerequisite to being able to cache blocks which are 
> larger than 2GB (although full support for that depends on several other 
> changes which have not been implemented yet).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13921:


Assignee: Apache Spark  (was: Josh Rosen)

> Store serialized blocks as multiple chunks in MemoryStore
> -
>
> Key: SPARK-13921
> URL: https://issues.apache.org/jira/browse/SPARK-13921
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Instead of storing serialized blocks in individual ByteBuffers, the 
> BlockManager should be capable of storing a serialized block in multiple 
> chunks, each occupying a separate ByteBuffer.
> This change will help to improve the efficiency of memory allocation and the 
> accuracy of memory accounting when serializing blocks. Our current 
> serialization code uses a {{ByteBufferOutputStream}}, which doubles and 
> re-allocates its backing byte array; this increases the peak memory 
> requirements during serialization (since we need to hold extra memory while 
> expanding the array). In addition, we currently don't account for the extra 
> wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte 
> serialized block may actually consume 256 megabytes of memory. After 
> switching to storing blocks in multiple chunks, we'll be able to efficiently 
> trim the backing buffers so that no space is wasted.
> This change is also a prerequisite to being able to cache blocks which are 
> larger than 2GB (although full support for that depends on several other 
> changes which have not been implemented yet).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13921) Store serialized blocks as multiple chunks in MemoryStore

2016-03-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13921:
--

 Summary: Store serialized blocks as multiple chunks in MemoryStore
 Key: SPARK-13921
 URL: https://issues.apache.org/jira/browse/SPARK-13921
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen


Instead of storing serialized blocks in individual ByteBuffers, the 
BlockManager should be capable of storing a serialized block in multiple 
chunks, each occupying a separate ByteBuffer.

This change will help to improve the efficiency of memory allocation and the 
accuracy of memory accounting when serializing blocks. Our current 
serialization code uses a {{ByteBufferOutputStream}}, which doubles and 
re-allocates its backing byte array; this increases the peak memory 
requirements during serialization (since we need to hold extra memory while 
expanding the array). In addition, we currently don't account for the extra 
wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte 
serialized block may actually consume 256 megabytes of memory. After switching 
to storing blocks in multiple chunks, we'll be able to efficiently trim the 
backing buffers so that no space is wasted.

This change is also a prerequisite to being able to cache blocks which are 
larger than 2GB (although full support for that depends on several other 
changes which have not been implemented yet).
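
A back-of-the-envelope sketch of the waste described above (plain Scala, not 
Spark code): a doubling byte-array buffer rounds a 129 megabyte payload up to a 
256 megabyte backing array, so nearly half of the allocation is unused slack.

{code}
val payload = 129L * 1024 * 1024 // 129 MB serialized block

// Capacity reached by repeatedly doubling from a small initial buffer.
def doubledCapacity(needed: Long, initial: Long = 64): Long = {
  var cap = initial
  while (cap < needed) cap *= 2
  cap
}

val cap = doubledCapacity(payload)
println(s"payload=${payload / (1 << 20)} MB, " +
  s"backing array=${cap / (1 << 20)} MB, " +
  s"wasted=${(cap - payload) / (1 << 20)} MB")
// payload=129 MB, backing array=256 MB, wasted=127 MB
{code}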



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13872) Memory leak in SortMergeOuterJoin

2016-03-15 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196561#comment-15196561
 ] 

Mark Hamstra commented on SPARK-13872:
--

[~joshrosen]

> Memory leak in SortMergeOuterJoin
> -
>
> Key: SPARK-13872
> URL: https://issues.apache.org/jira/browse/SPARK-13872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ian
> Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png
>
>
> SortMergeJoin composes its partition/iterator from 
> org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to 
> UnsafeExternalRowSorter.
> UnsafeExternalRowSorter's implementation cleans up its resources when:
> 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully 
> iterated, or
> 2. the task finishes execution.
> In the outer-join case of SortMergeJoin, when the left or right iterator is not 
> fully iterated, the only chance for the resources to be cleaned up is at the 
> end of the Spark task run.
> This is probably fine most of the time; however, when a SortMergeOuterJoin is 
> nested within a CartesianProduct, the "deferred" resource cleanup produces a 
> non-negligible memory leak that is amplified and accumulated by the 
> CartesianRdd's looping iteration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/16/16 12:39 AM:
--

I tried this query in Spark 2.0 and needed to change grouping__id to 
grouping_id() to get past the parser. The reported error is not reproducible in 
Spark 2.0, except that I saw an execution error related to 
com.esotericsoftware.kryo.KryoException.
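
For reference, a minimal form of the rewrite (run against a Spark 2.0 build in 
spark-shell, which predefines sqlContext; the TPC-DS tables are assumed to be 
registered already, setup not shown):

{code}
// grouping_id() replaces Hive's grouping__id virtual column in the query.
sqlContext.sql("""
  select i_category, i_class, grouping_id() as lochierarchy,
         sum(ss_net_profit) / sum(ss_ext_sales_price) as gross_margin
  from store_sales join item on i_item_sk = ss_item_sk
  group by i_category, i_class with rollup
""")
{code}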


was (Author: xwu0226):
I tried this query in Spark 2.0 and needed to change grouping__id to 
grouping_id() to get past the parser. The reported error is not reproducible in 
Spark 2.0, except that I saw an execution error possibly related to SPARK-13862.

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2208) local metrics tests can fail on fast machines

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2208:
---

Assignee: Apache Spark

> local metrics tests can fail on fast machines
> -
>
> Key: SPARK-2208
> URL: https://issues.apache.org/jira/browse/SPARK-2208
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Patrick Wendell
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> I'm temporarily disabling this check. I think the issue is that on fast 
> machines the fetch wait time can actually be zero, even across all tasks.
> We should see if we can write this in a different way to make sure there is a 
> delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196556#comment-15196556
 ] 

Apache Spark commented on SPARK-2208:
-

User 'joan38' has created a pull request for this issue:
https://github.com/apache/spark/pull/11747

> local metrics tests can fail on fast machines
> -
>
> Key: SPARK-2208
> URL: https://issues.apache.org/jira/browse/SPARK-2208
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Patrick Wendell
>Priority: Minor
>  Labels: starter
>
> I'm temporarily disabling this check. I think the issue is that on fast 
> machines the fetch wait time can actually be zero, even across all tasks.
> We should see if we can write this in a different way to make sure there is a 
> delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2208) local metrics tests can fail on fast machines

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2208:
---

Assignee: (was: Apache Spark)

> local metrics tests can fail on fast machines
> -
>
> Key: SPARK-2208
> URL: https://issues.apache.org/jira/browse/SPARK-2208
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Patrick Wendell
>Priority: Minor
>  Labels: starter
>
> I'm temporarily disabling this check. I think the issue is that on fast 
> machines the fetch wait time can actually be zero, even across all tasks.
> We should see if we can write this in a different way to make sure there is a 
> delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196548#comment-15196548
 ] 

Josh Rosen commented on SPARK-13920:


I am not planning to work on this, so this task is up for grabs.

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility breaks.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13920:
--

 Summary: MIMA checks should apply to @Experimental and 
@DeveloperAPI APIs
 Key: SPARK-13920
 URL: https://issues.apache.org/jira/browse/SPARK-13920
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Josh Rosen


Our MIMA binary compatibility checks currently ignore APIs which are marked as 
{{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes sense. 
Even if those annotations _reserve_ the right to break binary compatibility, we 
should still avoid compatibility breaks whenever possible and should be 
informed explicitly when compatibility breaks.

As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
methods which have those annotations. To remove the ignores, remove the checks 
from 
https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43

After removing the ignores, update {{project/MimaExcludes.scala}} to add 
exclusions for binary compatibility breaks introduced in 2.0.
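
For illustration, the kind of entry that would then go into 
{{project/MimaExcludes.scala}} (the excluded class and method names below are 
hypothetical, not actual 2.0 breaks):

{code}
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  lazy val excludes = Seq(
    // a method removed from a @DeveloperApi class in 2.0 (hypothetical)
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.example.SomeDeveloperApi.removedMethod"),
    // an @Experimental class dropped entirely (hypothetical)
    ProblemFilters.exclude[MissingClassProblem](
      "org.apache.spark.example.SomeExperimentalClass")
  )
}
{code}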



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196537#comment-15196537
 ] 

Apache Spark commented on SPARK-13602:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11746

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner like 
> o.a.s.deploy.worker.ExecutorRunner 
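
A minimal sketch of the proposed fix (illustrative only, not the actual 
DriverRunner internals; Spark has its own shutdown-hook utilities, a plain JVM 
hook is used here just to show the idea):

{code}
// Register a JVM shutdown hook so that if the Worker JVM exits (e.g. via
// System.exit), the launched driver process is still killed rather than leaked.
class DriverRunnerSketch {
  @volatile private var process: Option[Process] = None

  private val shutdownHook = new Thread("driver-runner-shutdown-hook") {
    override def run(): Unit = process.foreach(_.destroy())
  }
  Runtime.getRuntime.addShutdownHook(shutdownHook)

  def launchDriver(command: Seq[String]): Unit = {
    process = Some(new ProcessBuilder(command: _*).start())
  }

  def onDriverFinished(): Unit = {
    // Drop the hook once the driver has exited normally.
    Runtime.getRuntime.removeShutdownHook(shutdownHook)
  }
}
{code}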



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13602:


Assignee: (was: Apache Spark)

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner like 
> o.a.s.deploy.worker.ExecutorRunner 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13602:


Assignee: Apache Spark

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner like 
> o.a.s.deploy.worker.ExecutorRunner 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13919:


Assignee: Apache Spark

> Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject 
> -
>
> Key: SPARK-13919
> URL: https://issues.apache.org/jira/browse/SPARK-13919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each 
> other's effect. Although this does not hit the max-iteration limit today, some 
> queries are not optimized to the best possible plan. 
> For example, in the following query, 
> {code}
> val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
> val originalQuery =
>   input.select('a, 'b, 'c, 'd,
> WindowExpression(
>   AggregateExpression(Count('b), Complete, isDistinct = false),
>   WindowSpecDefinition( 'a :: Nil,
> SortOrder('b, Ascending) :: Nil,
> UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
> {code}
> After multiple iterations of the two rules {{ColumnPruning}} and 
> {{PushPredicateThroughProject}}, the optimized plan we generate looks like:
> {code}
> Project [a#0,c#0] 
>   
>  
> +- Filter (window#0L > cast(1 as bigint)) 
>   
>  
>+- Project [a#0,c#0,window#0L] 
>   
>  
>   +- Window [(count(b#0),mode=Complete,isDistinct=false) 
> windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
> CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]   
>  +- LocalRelation [a#0,b#0,c#0,d#0]   
>   
>  
> {code}
> However, the expected optimized plan should be like:
> {code}
> Project [a#0,c#0]
> +- Filter (window#0L > cast(1 as bigint))
>+- Project [a#0,c#0,window#0L]
>   +- Window [(count(b#0),mode=Complete,isDistinct=false) 
> windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
> CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
>  +- Project [a#0,b#0,c#0]
> +- LocalRelation [a#0,b#0,c#0,d#0]
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196535#comment-15196535
 ] 

Apache Spark commented on SPARK-13919:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11745

> Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject 
> -
>
> Key: SPARK-13919
> URL: https://issues.apache.org/jira/browse/SPARK-13919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each 
> other's effect. Although this does not hit the max-iteration limit today, some 
> queries are not optimized to the best possible plan. 
> For example, in the following query, 
> {code}
> val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
> val originalQuery =
>   input.select('a, 'b, 'c, 'd,
> WindowExpression(
>   AggregateExpression(Count('b), Complete, isDistinct = false),
>   WindowSpecDefinition( 'a :: Nil,
> SortOrder('b, Ascending) :: Nil,
> UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
> {code}
> After multiple iterations of the two rules {{ColumnPruning}} and 
> {{PushPredicateThroughProject}}, the optimized plan we generate looks like:
> {code}
> Project [a#0,c#0] 
>   
>  
> +- Filter (window#0L > cast(1 as bigint)) 
>   
>  
>+- Project [a#0,c#0,window#0L] 
>   
>  
>   +- Window [(count(b#0),mode=Complete,isDistinct=false) 
> windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
> CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]   
>  +- LocalRelation [a#0,b#0,c#0,d#0]   
>   
>  
> {code}
> However, the expected optimized plan should be like:
> {code}
> Project [a#0,c#0]
> +- Filter (window#0L > cast(1 as bigint))
>+- Project [a#0,c#0,window#0L]
>   +- Window [(count(b#0),mode=Complete,isDistinct=false) 
> windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
> CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
>  +- Project [a#0,b#0,c#0]
> +- LocalRelation [a#0,b#0,c#0,d#0]
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13919:


Assignee: (was: Apache Spark)

> Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject 
> -
>
> Key: SPARK-13919
> URL: https://issues.apache.org/jira/browse/SPARK-13919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each 
> other's effect. Although this does not hit the max-iteration limit today, some 
> queries are not optimized to the best possible plan. 
> For example, in the following query, 
> {code}
> val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
> val originalQuery =
>   input.select('a, 'b, 'c, 'd,
> WindowExpression(
>   AggregateExpression(Count('b), Complete, isDistinct = false),
>   WindowSpecDefinition( 'a :: Nil,
> SortOrder('b, Ascending) :: Nil,
> UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
> {code}
> After multiple iterations of the two rules {{ColumnPruning}} and 
> {{PushPredicateThroughProject}}, the optimized plan we generate looks like:
> {code}
> Project [a#0,c#0] 
>   
>  
> +- Filter (window#0L > cast(1 as bigint)) 
>   
>  
>+- Project [a#0,c#0,window#0L] 
>   
>  
>   +- Window [(count(b#0),mode=Complete,isDistinct=false) 
> windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
> CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]   
>  +- LocalRelation [a#0,b#0,c#0,d#0]   
>   
>  
> {code}
> However, the expected optimized plan should be like:
> {code}
> Project [a#0,c#0]
> +- Filter (window#0L > cast(1 as bigint))
>+- Project [a#0,c#0,window#0L]
>   +- Window [(count(b#0),mode=Complete,isDistinct=false) 
> windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
> CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
>  +- Project [a#0,b#0,c#0]
> +- LocalRelation [a#0,b#0,c#0,d#0]
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13804) Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4million rows to 16+ million rows

2016-03-15 Thread Michael Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Nguyen reopened SPARK-13804:


I posted this issue to user@ at

http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-count-Major-Divergent-Non-Linear-Performance-Slowdown-when-data-set-increases-from-4-millis-td26493.html

However, it has not been accepted by the mailing list yet. What needs to be 
done for it to be accepted? And what is the typical turnaround for postings 
to be accepted?


> Spark SQL's DataFrame.count()  Major Divergent (Non-Linear) Performance 
> Slowdown going from 4million rows to 16+ million rows
> -
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: - 3 nodes Spark cluster: 1 master node and 2 slave nodes
> - Each node is an EC2 with c3.4xlarge
> - Each node has 16 cores and 30GB of RAM
>Reporter: Michael Nguyen
>
> Spark SQL is used to load csv files via com.databricks.spark.csv and then run 
> dataFrame.count().
> In the same environment with plenty of CPU and RAM, Spark SQL takes
> - 18.25 seconds to load a table with 4 million rows vs
> - 346.624 seconds (5.77 minutes) to load a table with 16 million rows.
> Even though the number of rows increases by 4 times, the time it takes Spark 
> SQL to run dataframe.count() increases by 19.22 times, so the performance of 
> dataframe.count() degrades drastically rather than scaling linearly.
> 1. Why is Spark SQL's performance not proportional to the number of rows 
> while there is plenty of CPU and RAM (it uses only 10GB out of 30GB RAM)?
> 2. What can be done to fix this performance issue?
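
For reference, a sketch of the measurement being described (the path and options 
are made up; run in spark-shell, which predefines sqlContext):

{code}
// Load a CSV table via spark-csv and time a count(), as in the report above.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/table_16m_rows.csv")

val start = System.nanoTime()
val n = df.count()
println(s"count of $n rows took ${(System.nanoTime() - start) / 1e9} s")
{code}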



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

2016-03-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13919:

Description: 
Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each 
other's effect. Although this does not hit the max-iteration limit today, some 
queries are not optimized to the best possible plan. 

For example, in the following query, 
{code}
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
WindowExpression(
  AggregateExpression(Count('b), Complete, isDistinct = false),
  WindowSpecDefinition( 'a :: Nil,
SortOrder('b, Ascending) :: Nil,
UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
{code}

After multiple iterations of the two rules {{ColumnPruning}} and 
{{PushPredicateThroughProject}}, the optimized plan we generate looks like:
{code}
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
{code}

However, the expected optimized plan should be like:
{code}
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
{code}



  was:
Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each 
other's effect. Although this does not hit the max-iteration limit today, some 
queries are not optimized to the best possible plan. 

For example, in the following query, 
{code}
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
WindowExpression(
  AggregateExpression(Count('b), Complete, isDistinct = false),
  WindowSpecDefinition( 'a :: Nil,
SortOrder('b, Ascending) :: Nil,
UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
{code}

{code}
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
  +- Window [(count(b#0),mode=Complete,isDistinct=false) 
windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
 +- Project [a#0,b#0,c#0]
+- LocalRelation [a#0,b#0,c#0,d#0]  


{code}

{code}
Project [a#0,c#0]   

 
+- Filter (window#0L > cast(1 as bigint))   

 
   +- Project [a#0,c#0,window#0L]   

 
  +- Window [(count(b#0),mode=Complete,isDistinct=false) 
windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND 
CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]   
 +- LocalRelation [a#0,b#0,c#0,d#0] 

 
{code}


> Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject 
> -
>
> Key: SPARK-13919
> URL: https://issues.apache.org/jira/browse/SPARK-13919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Now, {{ColumnPruning}} a

[jira] [Created] (SPARK-13919) Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

2016-03-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-13919:
---

 Summary: Resolving the Conflicts of ColumnPruning and 
PushPredicateThroughProject 
 Key: SPARK-13919
 URL: https://issues.apache.org/jira/browse/SPARK-13919
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Currently, {{ColumnPruning}} and {{PushPredicateThroughProject}} reverse each 
other's effect. Although this does not hit the max-iteration limit today, some 
queries are not optimized to the best possible plan. 

For example, in the following query, 
{code}
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
WindowExpression(
  AggregateExpression(Count('b), Complete, isDistinct = false),
  WindowSpecDefinition( 'a :: Nil,
SortOrder('b, Ascending) :: Nil,
UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
{code}

{code}
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
{code}

{code}
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8489) Add regression tests for SPARK-8470

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196517#comment-15196517
 ] 

Apache Spark commented on SPARK-8489:
-

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11744

> Add regression tests for SPARK-8470
> ---
>
> Key: SPARK-8489
> URL: https://issues.apache.org/jira/browse/SPARK-8489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> See SPARK-8470 for more detail. Basically the Spark Hive code silently 
> overwrites the context class loader populated in SparkSubmit, resulting in 
> certain classes missing when we do reflection in `SQLContext#createDataFrame`.
> That issue is already resolved in https://github.com/apache/spark/pull/6891, 
> but we should add a regression test for the specific manifestation of the bug 
> in SPARK-8470.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196516#comment-15196516
 ] 

Apache Spark commented on SPARK-12653:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11744

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196487#comment-15196487
 ] 

Reynold Xin commented on SPARK-13118:
-

The JIRA is the same.


> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}.  However, when 
> reflect on this we try and load {{org.mycompany.project.MyClass}}.
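
A minimal reproduction sketch (the package and class names are made up to match 
the pattern in the description):

{code}
// Compiling this and inspecting the runtime class name shows the "package$"
// segment that reflection on the source-level name misses:
//   classOf[org.mycompany.project.MyClass].getName
//   // => "org.mycompany.project.package$MyClass"
// while Class.forName("org.mycompany.project.MyClass") fails.
package org.mycompany

package object project {
  case class MyClass(id: Int)
}
{code}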



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-15 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196463#comment-15196463
 ] 

Jakob Odersky commented on SPARK-13118:
---

Should I remove the JIRA ID from my existing PR?

> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}.  However, when 
> reflect on this we try and load {{org.mycompany.project.MyClass}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-15 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464
 ] 

Dilip Biswal commented on SPARK-13821:
--

[~roycecil] I just tried the original query no. 20 posted at 
https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 
against Spark 2.0.

I could see the same error that is reported in the JIRA. It seems there is an 
extra comma in the projection list between two columns, like the following.

{code}
select  i_item_id,
   ,i_item_desc
{code}

Please note that we ran against 2.0 and not 1.6. Can you please re-run to make 
sure?

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196430#comment-15196430
 ] 

Davies Liu commented on SPARK-13820:


Unfortunately, subqueries are not targeted for 2.0.

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196427#comment-15196427
 ] 

Reynold Xin commented on SPARK-13118:
-

I'm moving this out to its own ticket. 

I think there is still a problem, and it lies in the way we get the full name 
of a class in ScalaReflection.scala:
{code}
  /** Returns the full class name for a type. */
  def getClassNameFromType(tpe: `Type`): String = {
tpe.erasure.typeSymbol.asClass.fullName
  }
{code}

According to the Scala doc here: 
http://www.scala-lang.org/api/2.11.7/scala-reflect/index.html#scala.reflect.api.Symbols$ClassSymbol
{noformat}
abstract def fullName: String
The encoded full path name of this symbol, where outer names and inner names 
are separated by periods.
{noformat}

This causes problems with inner classes. For example:
{code}
scala> 
Class.forName("org.apache.spark.mllib.tree.model.DecisionTreeModel.SaveLoadV1_0.SplitData")
java.lang.ClassNotFoundException: 
org.apache.spark.mllib.tree.model.DecisionTreeModel.SaveLoadV1_0.SplitData
  at 
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:264)
  ... 49 elided

scala> 
Class.forName("org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData")
res6: Class[_] = class 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData
{code}
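
For what it's worth, here is a minimal sketch (an assumed alternative, not necessarily the fix Spark will take) of resolving the JVM binary name through a runtime mirror instead of {{fullName}}; the object name {{BinaryClassName}} is hypothetical:

{code}
import scala.reflect.runtime.universe._

object BinaryClassName {
  private val mirror = runtimeMirror(getClass.getClassLoader)

  // Resolves the JVM binary name, e.g.
  // "org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData",
  // rather than the period-separated fullName that Class.forName cannot load.
  def fromType(tpe: Type): String =
    mirror.runtimeClass(tpe.erasure.typeSymbol.asClass).getName
}
{code}

Whether ScalaReflection can rely on a runtime mirror here is a separate question; the sketch only shows that the mirror yields the {{$}}-separated name that the working {{Class.forName}} call above expects.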

> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}. However, when we 
> reflect on this, we try to load {{org.mycompany.project.MyClass}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13118) Support for classes defined in package objects

2016-03-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13118:

Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-)

> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}. However, when we 
> reflect on this, we try to load {{org.mycompany.project.MyClass}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2016-03-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> The RDD API is very flexible, but as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. UDFs are 
> harder to use, and there are no strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API was merged in Spark 1.6; however, it 
> will take a few more releases to flesh everything out.
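
A minimal, self-contained sketch (not part of the JIRA text above) of the Dataset/DataFrame interop described in the requirements, against the Spark 2.0 API; the {{Person}} case class and the toy data are made up for illustration:

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical domain object; its field names line up with the DataFrame schema.
case class Person(name: String, age: Long)

object DatasetInteropSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    // Typed Dataset built from domain objects; transformations are checked at compile time.
    val people = Seq(Person("alice", 29), Person("bob", 31)).toDS()
    val adults = people.filter(_.age >= 30)

    // Seamless transition to an untyped DataFrame and back, with no mapping boilerplate.
    val df = adults.toDF()
    val backToTyped = df.as[Person]
    backToTyped.show()

    spark.stop()
  }
}
{code}

Because the field names of {{Person}} match the DataFrame columns, {{df.as[Person]}} needs no extra mapping, which is exactly the interop requirement stated above.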



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-15 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196419#comment-15196419
 ] 

Suresh Thalamati commented on SPARK-13820:
--

This query contains a correlated subquery, which is not yet supported in Spark SQL.

[~davies] I saw your PR https://github.com/apache/spark/pull/10706 on this kind 
of query; are you planning to merge it for 2.0?
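
Until then, a correlated {{EXISTS}} with an equality predicate can usually be expressed as a left semi join. A minimal, hedged sketch with toy data (the column names echo the query below; everything else, including the object name, is made up):

{code}
import org.apache.spark.sql.SparkSession

object ExistsAsSemiJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
    import spark.implicits._

    val customer = Seq((1, "a"), (2, "b"), (3, "c")).toDF("c_customer_sk", "c_name")
    val webSales = Seq(1, 1, 3).toDF("ws_bill_customer_sk")

    // Returns the same customer rows as:
    //   select * from customer c where exists
    //     (select 1 from web_sales w where w.ws_bill_customer_sk = c.c_customer_sk)
    val withExists = customer.join(
      webSales,
      customer("c_customer_sk") === webSales("ws_bill_customer_sk"),
      "leftsemi")

    withExists.show()
    spark.stop()
  }
}
{code}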

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/15/16 10:43 PM:
--

Trying this query in Spark 2.0, I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
Spark 2.0, except that I saw an execution error related to SPARK-13862.


was (Author: xwu0226):
Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is gone.. Except that I 
saw execution error related to kryo.serializers.. that should be a different 
issue and maybe related to my setup. 

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/15/16 10:44 PM:
--

Trying this query in Spark 2.0, I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
Spark 2.0, except that I saw an execution error that may be related to SPARK-13862.


was (Author: xwu0226):
Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
spark 2.0.. Except that I saw execution error related to spark-13862.

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/15/16 10:41 PM:
--

Trying this query in Spark 2.0, I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is gone, except that I 
saw an execution error related to kryo.serializers; that should be a different 
issue and may be related to my setup.


was (Author: xwu0226):
Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not gone.. Except that 
I saw execution error related to kryo.serializers.. that should be a different 
issue and maybe related to my setup. 

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin

2016-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196376#comment-15196376
 ] 

Apache Spark commented on SPARK-13918:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11743

> merge SortMergeJoin and SortMergeOuterJoin
> --
>
> Key: SPARK-13918
> URL: https://issues.apache.org/jira/browse/SPARK-13918
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There are lots of duplicated code in SortMergeJoin and SortMergeOuterJoin, 
> should merge them and reduce the duplicated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13918:


Assignee: Davies Liu  (was: Apache Spark)

> merge SortMergeJoin and SortMergeOuterJoin
> --
>
> Key: SPARK-13918
> URL: https://issues.apache.org/jira/browse/SPARK-13918
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There are lots of duplicated code in SortMergeJoin and SortMergeOuterJoin, 
> should merge them and reduce the duplicated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13918) merge SortMergeJoin and SortMergeOuterJoin

2016-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13918:


Assignee: Apache Spark  (was: Davies Liu)

> merge SortMergeJoin and SortMergeOuterJoin
> --
>
> Key: SPARK-13918
> URL: https://issues.apache.org/jira/browse/SPARK-13918
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> There are lots of duplicated code in SortMergeJoin and SortMergeOuterJoin, 
> should merge them and reduce the duplicated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196377#comment-15196377
 ] 

Xin Wu commented on SPARK-13832:


Trying this query in Spark 2.0, I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not gone, except that 
I saw an execution error related to kryo.serializers; that should be a 
different issue and may be related to my setup.
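
For reference, a minimal sketch of the {{grouping_id()}} replacement in Spark 2.0 (toy data, only a subset of the columns from the query below; the object name is hypothetical):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupingIdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("grouping-id-sketch").getOrCreate()
    import spark.implicits._

    // Toy stand-in for store_sales joined with item.
    val sales = Seq(
      ("Books", "fiction", 10.0),
      ("Books", "travel", 5.0),
      ("Music", "pop", 7.0)
    ).toDF("i_category", "i_class", "ss_net_profit")

    // grouping_id() plays the role of the Hive-style grouping__id virtual column.
    sales.rollup($"i_category", $"i_class")
      .agg(sum($"ss_net_profit").as("gross_profit"), grouping_id().as("lochierarchy"))
      .orderBy($"lochierarchy".desc)
      .show()

    spark.stop()
  }
}
{code}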

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


