[jira] [Assigned] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18106:


Assignee: Apache Spark

> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Assignee: Apache Spark
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false
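
As an editorial illustration of the intended check (a hypothetical helper, not the actual Spark parser code), the trailing identifier should only be accepted when it is exactly "noscan"; anything else should raise a parse error instead of silently producing noscan=false:

{code}
// Hypothetical sketch: accept only an optional trailing NOSCAN identifier and
// reject anything else, instead of mapping unknown identifiers to noscan = false.
def parseNoscan(trailing: Option[String]): Boolean = trailing match {
  case None => false                                   // plain COMPUTE STATISTICS
  case Some(id) if id.equalsIgnoreCase("noscan") => true
  case Some(other) =>
    throw new IllegalArgumentException(
      s"Expected `NOSCAN` instead of `$other` in ANALYZE TABLE statement")
}

// With this check, parseNoscan(Some("blah")) fails instead of returning false.
{code}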






[jira] [Assigned] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18106:


Assignee: (was: Apache Spark)

> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Commented] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607496#comment-15607496
 ] 

Apache Spark commented on SPARK-18106:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15640

> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Assigned] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18110:


Assignee: Felix Cheung  (was: Apache Spark)

> Missing parameter in Python for RandomForest regression and classification
> --
>
> Key: SPARK-18110
> URL: https://issues.apache.org/jira/browse/SPARK-18110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>







[jira] [Commented] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607444#comment-15607444
 ] 

Apache Spark commented on SPARK-18110:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/15638

> Missing parameter in Python for RandomForest regression and classification
> --
>
> Key: SPARK-18110
> URL: https://issues.apache.org/jira/browse/SPARK-18110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>







[jira] [Assigned] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18110:


Assignee: Apache Spark  (was: Felix Cheung)

> Missing parameter in Python for RandomForest regression and classification
> --
>
> Key: SPARK-18110
> URL: https://issues.apache.org/jira/browse/SPARK-18110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification

2016-10-25 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18110:


 Summary: Missing parameter in Python for RandomForest regression 
and classification
 Key: SPARK-18110
 URL: https://issues.apache.org/jira/browse/SPARK-18110
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
Reporter: Felix Cheung
Assignee: Felix Cheung









[jira] [Resolved] (SPARK-18007) update SparkR MLP - add initialWeights parameter

2016-10-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18007.
--
   Resolution: Fixed
 Assignee: Weichen Xu
Fix Version/s: 2.1.0

> update SparkR MLP - add initialWeights parameter
> ---
>
> Key: SPARK-18007
> URL: https://issues.apache.org/jira/browse/SPARK-18007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> update SparkR MLP, add initialWeights parameter
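
For context, a minimal Scala sketch of the underlying ML parameter that the SparkR wrapper exposes (the layer sizes and weight values below are illustrative, and the exact SparkR signature is defined by the PR):

{code}
// Illustrative only: the Scala MultilayerPerceptronClassifier already supports
// initialWeights; this JIRA is about exposing the same parameter via spark.mlp in SparkR.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors

val layers = Array(4, 5, 3)
val numWeights = (4 + 1) * 5 + (5 + 1) * 3   // weights plus biases between layers

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setInitialWeights(Vectors.dense(Array.fill(numWeights)(0.1)))
  .setMaxIter(100)
// val model = mlp.fit(trainingData)   // trainingData: DataFrame with label/features
{code}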






[jira] [Comment Edited] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607374#comment-15607374
 ] 

Dongjoon Hyun edited comment on SPARK-18106 at 10/26/16 4:31 AM:
-

Thank you for reporting this bug.
I'll make a PR to fix this.



was (Author: dongjoon):
Thank you for reporting this bug, [~srinathc]
I'll make a PR to fix this.


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Comment Edited] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607374#comment-15607374
 ] 

Dongjoon Hyun edited comment on SPARK-18106 at 10/26/16 4:30 AM:
-

Thank you for reporting this bug, [~srinathc]
I'll make a PR to fix this.



was (Author: dongjoon):
Thank you for reporting this bug, [~skomatir].
I'll make a PR to fix this.


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Commented] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607374#comment-15607374
 ] 

Dongjoon Hyun commented on SPARK-18106:
---

Thank you for reporting this bug, [~skomatir].
I'll make a PR to fix this.


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Commented] (SPARK-18036) Decision Trees do not handle edge cases

2016-10-25 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607357#comment-15607357
 ] 

Weichen Xu commented on SPARK-18036:


I am working on this.

> Decision Trees do not handle edge cases
> ---
>
> Key: SPARK-18036
> URL: https://issues.apache.org/jira/browse/SPARK-18036
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Decision trees/GBT/RF do not handle edge cases such as constant features or 
> empty features. For example:
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
>   at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
>   at 
> org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
>   at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   ... 52 elided
> {code}
> as well as 
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.maxBy
> at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
> at 
> scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
> at 
> org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
> {code}
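
Until the trees themselves handle these cases, a caller-side guard can at least turn the opaque empty.max/empty.maxBy failures into explicit errors; a rough sketch under those assumptions (the helper name is illustrative, not the eventual fix inside Spark):

{code}
// Rough pre-flight validation sketch: reject empty feature vectors and
// all-constant features before fitting, so the failure is explicit rather
// than "empty.max" / "empty.maxBy".
import org.apache.spark.ml.feature.LabeledPoint

def validateFeatures(points: Seq[LabeledPoint]): Unit = {
  require(points.nonEmpty, "Training data is empty")
  val numFeatures = points.head.features.size
  require(numFeatures > 0, "Feature vectors are empty")
  val allConstant = (0 until numFeatures).forall { i =>
    val first = points.head.features(i)
    points.forall(_.features(i) == first)
  }
  require(!allConstant, "All features are constant; no split is possible")
}
{code}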






[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607343#comment-15607343
 ] 

Dilip Biswal commented on SPARK-18009:
--

[~smilegator][~jerryjung] [~martha.solarte] Thanks. I am testing a fix and 
should submit a PR for this soon.

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift Server on YARN, I tried to execute the 
> following command from beeline.
> > show databases;
> I've got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>  

[jira] [Closed] (SPARK-17881) Aggregation function for generating string histograms

2016-10-25 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang closed SPARK-17881.

Resolution: Duplicate

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.
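
The intended semantics can be sketched with a plain DataFrame aggregation (this is only an illustration of the behavior described above, not the proposed aggregate function):

{code}
// Semantics sketch: an equi-width histogram for a string column is just
// (value, frequency) pairs, and the result is empty once the column's ndv
// exceeds the maximum number of bins.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, lit}

def stringHistogram(df: DataFrame, col: String, numBins: Int): Map[String, Long] = {
  val counts = df.groupBy(col).agg(count(lit(1)).as("cnt"))
  if (counts.count() > numBins) Map.empty                // ndv too large: give up
  else counts.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
}
{code}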






[jira] [Commented] (SPARK-17881) Aggregation function for generating string histograms

2016-10-25 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607308#comment-15607308
 ] 

Zhenhua Wang commented on SPARK-17881:
--

This issue is included in another issue SPARK-18000, so I'll close this one.

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.






[jira] [Issue Comment Deleted] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-18000:
-
Comment: was deleted

(was: This issue is included in another issue SPARK-17881, so I'll close this 
one.)

> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its ndv is larger than the maximum number of bins allowed 
> in one histogram (denoted as numBins).
> The agg function for a column returns the bins - (distinct value, frequency) 
> pairs - of an equi-width histogram when the number of distinct values is less 
> than or equal to numBins. Otherwise, 1) for a column of string type, it returns 
> an empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram - 
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.
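
For the equi-height case, the endpoints described above are simply approximate percentiles at evenly spaced percentages; a small sketch using the existing public approxQuantile API (illustrative, not the proposed aggregate function):

{code}
// Equi-height endpoints as described above: approximate percentiles at
// 0.0, 1/numBins, 2/numBins, ..., 1.0 over a numeric column.
import org.apache.spark.sql.DataFrame

def equiHeightEndpoints(df: DataFrame, col: String, numBins: Int): Array[Double] = {
  val percentages = (0 to numBins).map(_.toDouble / numBins).toArray
  df.stat.approxQuantile(col, percentages, 0.001)   // 0.1% relative error
}
{code}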






[jira] [Commented] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607306#comment-15607306
 ] 

Zhenhua Wang commented on SPARK-18000:
--

This issue is included in another issue SPARK-17881, so I'll close this one.

> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its ndv is larger than the maximum number of bins allowed 
> in one histogram (denoted as numBins).
> The agg function for a column returns the bins - (distinct value, frequency) 
> pairs - of an equi-width histogram when the number of distinct values is less 
> than or equal to numBins. Otherwise, 1) for a column of string type, it returns 
> an empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram - 
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.






[jira] [Updated] (SPARK-17074) generate histogram information for column

2016-10-25 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17074:
-
Description: 
We support two kinds of histograms: 
-   Equi-width histogram: We have a fixed width for each column interval in 
the histogram.  The height of a histogram represents the frequency for those 
column values in a specific interval.  For this kind of histogram, its height 
varies for different column intervals. We use the equi-width histogram when the 
number of distinct values is less than 254.
-   Equi-height histogram: For this histogram, the width of column interval 
varies.  The heights of all column intervals are the same.  The equi-height 
histogram is effective in handling skewed data distribution. We use the 
equi-height histogram when the number of distinct values is equal to or greater than 
254.  

We first use [SPARK-18000] to compute equi-width histograms (for both numeric 
and string types) or endpoints of equi-height histograms (for numeric type 
only). Then, if we get endpoints of an equi-height histogram, we need to compute 
ndv's between those endpoints by [SPARK-17997] to form the equi-height 
histogram.

This Jira incorporates three Jiras mentioned above to support needed 
aggregation functions. We need to resolve them before this one.

  was:
We support two kinds of histograms: 
-   Equi-width histogram: We have a fixed width for each column interval in 
the histogram.  The height of a histogram represents the frequency for those 
column values in a specific interval.  For this kind of histogram, its height 
varies for different column intervals. We use the equi-width histogram when the 
number of distinct values is less than 254.
-   Equi-height histogram: For this histogram, the width of column interval 
varies.  The heights of all column intervals are the same.  The equi-height 
histogram is effective in handling skewed data distribution. We use the 
equi-height histogram when the number of distinct values is equal to or greater than 
254.  

We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms 
(for both numeric and string types) or endpoints of equi-height histograms (for 
numeric type only). Then, if we get endpoints of an equi-height histogram, we 
need to compute ndv's between those endpoints by [SPARK-17997] to form the 
equi-height histogram.

This Jira incorporates three Jiras mentioned above to support needed 
aggregation functions. We need to resolve them before this one.


> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the 
> equi-height histogram when the number of distinct values is equal to or greater 
> than 254.  
> We first use [SPARK-18000] to compute equi-width histograms (for both numeric 
> and string types) or endpoints of equi-height histograms (for numeric type 
> only). Then, if we get endpoints of an equi-height histogram, we need to 
> compute ndv's between those endpoints by [SPARK-17997] to form the 
> equi-height histogram.
> This Jira incorporates three Jiras mentioned above to support needed 
> aggregation functions. We need to resolve them before this one.
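
The selection rule described above reduces to a simple threshold on the number of distinct values; a trivial sketch of that decision (names are illustrative):

{code}
// Sketch of the selection rule described above (threshold of 254 distinct values).
sealed trait HistogramKind
case object EquiWidth extends HistogramKind
case object EquiHeight extends HistogramKind

def chooseHistogram(ndv: Long, maxBins: Int = 254): HistogramKind =
  if (ndv < maxBins) EquiWidth else EquiHeight
{code}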






[jira] [Assigned] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18000:


Assignee: Apache Spark

> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its ndv is larger than the maximum number of bins allowed 
> in one histogram (denoted as numBins).
> The agg function for a column returns the bins - (distinct value, frequency) 
> pairs - of an equi-width histogram when the number of distinct values is less 
> than or equal to numBins. Otherwise, 1) for a column of string type, it returns 
> an empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram - 
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.






[jira] [Commented] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607297#comment-15607297
 ] 

Apache Spark commented on SPARK-18000:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15637

> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its ndv is larger than the maximum number of bins allowed 
> in one histogram (denoted as numBins).
> The agg function for a column returns the bins - (distinct value, frequency) 
> pairs - of an equi-width histogram when the number of distinct values is less 
> than or equal to numBins. Otherwise, 1) for a column of string type, it returns 
> an empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram - 
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.






[jira] [Assigned] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18000:


Assignee: (was: Apache Spark)

> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its ndv is larger than the maximum number of bins allowed 
> in one histogram (denoted as numBins).
> The agg function for a column returns the bins - (distinct value, frequency) 
> pairs - of an equi-width histogram when the number of distinct values is less 
> than or equal to numBins. Otherwise, 1) for a column of string type, it returns 
> an empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram - 
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.






[jira] [Commented] (SPARK-18109) Log instrumentation in GMM

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607289#comment-15607289
 ] 

Apache Spark commented on SPARK-18109:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15636

> Log instrumentation in GMM
> --
>
> Key: SPARK-18109
> URL: https://issues.apache.org/jira/browse/SPARK-18109
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>
> Add log instrumentation in GMM






[jira] [Assigned] (SPARK-18109) Log instrumentation in GMM

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18109:


Assignee: (was: Apache Spark)

> Log instrumentation in GMM
> --
>
> Key: SPARK-18109
> URL: https://issues.apache.org/jira/browse/SPARK-18109
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>
> Add log instrumentation in GMM






[jira] [Assigned] (SPARK-18109) Log instrumentation in GMM

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18109:


Assignee: Apache Spark

> Log instrumentation in GMM
> --
>
> Key: SPARK-18109
> URL: https://issues.apache.org/jira/browse/SPARK-18109
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> Add log instrumentation in GMM






[jira] [Created] (SPARK-18109) Log instrumentation in GMM

2016-10-25 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18109:


 Summary: Log instrumentation in GMM
 Key: SPARK-18109
 URL: https://issues.apache.org/jira/browse/SPARK-18109
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: zhengruifeng


Add log instrumentation in GMM






[jira] [Updated] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-18000:
-
Description: 
For a column, we will generate an equi-width or equi-height histogram, depending 
on whether its ndv is larger than the maximum number of bins allowed in one 
histogram (denoted as numBins).
The agg function for a column returns the bins - (distinct value, frequency) 
pairs - of an equi-width histogram when the number of distinct values is less 
than or equal to numBins. Otherwise, 1) for a column of string type, it returns an 
empty map; 2) for a column of numeric type (including DateType and TimestampType), 
it returns the endpoints of an equi-height histogram - approximate percentiles at 
percentages 0.0, 1/numBins, 2/numBins, ..., (numBins-1)/numBins, 1.0.

  was:
For a column of numeric type (including date and timestamp), we will generate an 
equi-width or equi-height histogram, depending on whether its ndv is larger than 
the maximum number of bins allowed in one histogram (denoted as numBins).
This agg function computes values and their frequencies using a small hashmap, 
whose size is less than or equal to "numBins", and returns an equi-width 
histogram. 
When the size of the hashmap exceeds "numBins", it cleans the hashmap and uses 
ApproximatePercentile to return the endpoints of an equi-height histogram.


> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> For a column, we will generate an equi-width or equi-height histogram, 
> depending on whether its ndv is larger than the maximum number of bins allowed 
> in one histogram (denoted as numBins).
> The agg function for a column returns the bins - (distinct value, frequency) 
> pairs - of an equi-width histogram when the number of distinct values is less 
> than or equal to numBins. Otherwise, 1) for a column of string type, it returns 
> an empty map; 2) for a column of numeric type (including DateType and 
> TimestampType), it returns the endpoints of an equi-height histogram - 
> approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., 
> (numBins-1)/numBins, 1.0.






[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-25 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/26/16 3:26 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming 
module.

h4. Implement
Four classes are implemented to output data to a Kafka cluster in the structured 
streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides the functions *shortName* and *createSink*. In *createSink*, a 
*KafkaSink* is created.
* *KafkaSink*
*KafkaSink* extends *Sink* and overrides the function *addBatch*. A *KafkaSinkRDD* 
is created in *addBatch*.
* *KafkaSinkRDD*
*KafkaSinkRDD* is designed to send results to Kafka clusters in a distributed 
way. It extends *RDD*. In *compute*, *CachedKafkaProducer* is called to get or 
create a producer to send the data.
* *CachedKafkaProducer*
*CachedKafkaProducer* stores producers on the executors so that they can be 
reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, which all start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* 
can be set with *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
don't start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming 
module.

h4. Implement
Four classes are implemented to output data to a Kafka cluster in the structured 
streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides the functions *shortName* and *createSink*. In *createSink*, a 
*KafkaSink* is created.
* *KafkaSink*
*KafkaSink* extends *Sink* and overrides the function *addBatch*. A *KafkaSinkRDD* 
is created in *addBatch*.
* *KafkaSinkRDD*
*KafkaSinkRDD* is designed to send results to Kafka clusters in a distributed 
way. It extends *RDD*. In *compute*, *CachedKafkaProducer* is called to get or 
create a producer to send the data.
* *CachedKafkaProducer*
*CachedKafkaProducer* stores producers on the executors so that they can be 
reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, which all start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* 
can be set with *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
don't start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafkaSink")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()
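
A rough sketch of the *CachedKafkaProducer* idea described in the design doc above, keeping one producer per distinct Kafka configuration on each executor (names and structure are illustrative, not the actual implementation proposed in the PR):

{code}
// Illustrative producer cache: one KafkaProducer per distinct config, shared by
// all tasks on an executor so batches can reuse it.
import scala.collection.mutable
import org.apache.kafka.clients.producer.KafkaProducer

object CachedKafkaProducerSketch {
  private val cache =
    mutable.Map.empty[Map[String, Object], KafkaProducer[Array[Byte], Array[Byte]]]

  def getOrCreate(params: Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] =
    cache.synchronized {
      cache.getOrElseUpdate(params, {
        val javaParams = new java.util.HashMap[String, Object]()
        params.foreach { case (k, v) => javaParams.put(k, v) }
        // params must include bootstrap servers and key/value serializers.
        new KafkaProducer[Array[Byte], Array[Byte]](javaParams)
      })
    }
}
{code}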



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.






[jira] [Updated] (SPARK-18000) Aggregation function for computing endpoints for histograms

2016-10-25 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-18000:
-
Summary: Aggregation function for computing endpoints for histograms  (was: 
Aggregation function for computing endpoints for numeric histograms)

> Aggregation function for computing endpoints for histograms
> ---
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> For a column of numeric type (including date and timestamp), we will generate 
> an equi-width or equi-height histogram, depending on whether its ndv is larger 
> than the maximum number of bins allowed in one histogram (denoted as numBins).
> This agg function computes values and their frequencies using a small 
> hashmap, whose size is less than or equal to "numBins", and returns an 
> equi-width histogram. 
> When the size of the hashmap exceeds "numBins", it cleans the hashmap and 
> uses ApproximatePercentile to return the endpoints of an equi-height histogram.






[jira] [Commented] (SPARK-18100) Improve the performance of get_json_object using Gson

2016-10-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607204#comment-15607204
 ] 

Liang-Chi Hsieh commented on SPARK-18100:
-

Looks like Gson has no native support for json path?

> Improve the performance of get_json_object using Gson
> -
>
> Key: SPARK-18100
> URL: https://issues.apache.org/jira/browse/SPARK-18100
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> Based on some benchmark here: 
> http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/,
>  which said Gson could be much faster than Jackson, maybe it could be used to 
> improve the performance of get_json_object






[jira] [Updated] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-10-25 Thread Richard Moorhead (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Moorhead updated SPARK-18108:
-
Attachment: stacktrace.out

> Partition discovery fails with explicitly written long partitions
> -
>
> Key: SPARK-18108
> URL: https://issues.apache.org/jira/browse/SPARK-18108
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Richard Moorhead
>Priority: Minor
> Attachments: stacktrace.out
>
>
> We have parquet data written from Spark 1.6 that, when read from 2.0.1, 
> produces errors.
> {code}
> case class A(a: Long, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The above code fails; stack trace attached. 
> If an integer is used, explicit partition discovery succeeds.
> {code}
> case class A(a: Int, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The action succeeds. Additionally, if 'partitionBy' is used instead of 
> explicit writes, partition discovery succeeds. 
> Question: Is the first example a reasonable use case? 
> [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
>  seems to default to Integer types unless the partition value exceeds the 
> integer type's length.
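
As a possible workaround (untested here, so treat it as a sketch), supplying an explicit schema when reading should keep the partition column as LongType instead of letting partition discovery infer IntegerType:

{code}
// Possible workaround sketch (not verified against 2.0.1): read with an explicit
// schema so the partition column `a` is typed as Long rather than the Integer
// that partition discovery infers from the directory value "1".
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

val schema = StructType(Seq(
  StructField("a", LongType),      // partition column written as Long by Spark 1.6
  StructField("b", IntegerType)))

val df = spark.read.schema(schema).parquet("/data/")
{code}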






[jira] [Created] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-10-25 Thread Richard Moorhead (JIRA)
Richard Moorhead created SPARK-18108:


 Summary: Partition discovery fails with explicitly written long 
partitions
 Key: SPARK-18108
 URL: https://issues.apache.org/jira/browse/SPARK-18108
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Richard Moorhead
Priority: Minor
 Attachments: stacktrace.out

We have parquet data written from Spark 1.6 that, when read from 2.0.1, produces 
errors.
{code}
case class A(a: Long, b: Int)
val as = Seq(A(1,2))
//partition explicitly written
spark.createDataFrame(as).write.parquet("/data/a=1/")
spark.read.parquet("/data/").collect
{code}
The above code fails; stack trace attached. 

If an integer is used, explicit partition discovery succeeds.
{code}
case class A(a: Int, b: Int)
val as = Seq(A(1,2))
//partition explicitly written
spark.createDataFrame(as).write.parquet("/data/a=1/")
spark.read.parquet("/data/").collect
{code}
The action succeeds. Additionally, if 'partitionBy' is used instead of explicit 
writes, partition discovery succeeds. 

Question: Is the first example a reasonable use case? 
[PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
 seems to default to Integer types unless the partition value exceeds the 
integer type's length.






[jira] [Created] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-25 Thread J.P Feng (JIRA)
J.P Feng created SPARK-18107:


 Summary: Insert overwrite statement runs much slower in spark-sql 
than it does in hive-client
 Key: SPARK-18107
 URL: https://issues.apache.org/jira/browse/SPARK-18107
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: spark 2.0.0
hive 2.0.1
Reporter: J.P Feng


I find that an insert overwrite statement run in spark-sql or spark-shell takes 
much more time than it does in the hive-client (I start it from 
apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes, while the 
hive-client takes less than 20 seconds.

These are the steps I took.

The test SQL is:

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  
dt='2016-10-21' ;

There are 257128 rows of data in tbllog_login with 
partition(pt='mix_en',dt='2016-10-21').


PS:

I'm sure it must be the "insert overwrite" that costs a lot of time in Spark; maybe 
when doing the overwrite it needs to spend a lot of time on I/O or on something else.

I also compared the execution time of the insert overwrite statement and the 
insert into statement.

1. insert overwrite statement and insert into statement in Spark:

the insert overwrite statement takes about 10 minutes
the insert into statement takes about 30 seconds


2. insert into statement in Spark and insert into statement in the hive-client:

Spark takes about 30 seconds
the hive-client takes about 20 seconds
the difference is small enough to ignore





 







[jira] [Comment Edited] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Jerryjung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607052#comment-15607052
 ] 

Jerryjung edited comment on SPARK-18009 at 10/26/16 1:44 AM:
-

Yes!
In my case, it's a necessary option for integration with BI tools.


was (Author: jerryjung):
Yes!
But in my case, it's a necessary option for integration with BI tools.

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift Server on YARN, I tried to execute the 
> following command from beeline.
> > show databases;
> I've got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at 

[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Jerryjung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607052#comment-15607052
 ] 

Jerryjung commented on SPARK-18009:
---

Yes!
But in my case, it's a necessary option for integration with BI tools.

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift Server on YARN, I tried to execute the 
> following command from beeline.
> > show databases;
> I've got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>   at 
> 

[jira] [Assigned] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18103:


Assignee: (was: Apache Spark)

> Rename *FileCatalog to *FileProvider
> 
>
> Key: SPARK-18103
> URL: https://issues.apache.org/jira/browse/SPARK-18103
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Priority: Minor
>
> In the SQL component there are too many different components called some 
> variant of *Catalog, which is quite confusing. We should rename the 
> subclasses of FileCatalog to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18103:


Assignee: Apache Spark

> Rename *FileCatalog to *FileProvider
> 
>
> Key: SPARK-18103
> URL: https://issues.apache.org/jira/browse/SPARK-18103
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> In the SQL component there are too many different components called some 
> variant of *Catalog, which is quite confusing. We should rename the 
> subclasses of FileCatalog to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607020#comment-15607020
 ] 

Apache Spark commented on SPARK-18103:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15634

> Rename *FileCatalog to *FileProvider
> 
>
> Key: SPARK-18103
> URL: https://issues.apache.org/jira/browse/SPARK-18103
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Priority: Minor
>
> In the SQL component there are too many different components called some 
> variant of *Catalog, which is quite confusing. We should rename the 
> subclasses of FileCatalog to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow

2016-10-25 Thread J.P Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.P Feng closed SPARK-18077.

Resolution: Won't Fix

I will try to open another one, since there are some mistakes in this issue.

> Run insert overwrite statements in spark  to overwrite a partitioned table is 
> very slow
> ---
>
> Key: SPARK-18077
> URL: https://issues.apache.org/jira/browse/SPARK-18077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0
> hive 2.0.1
> driver memory: 4g
> total executors: 4
> executor memory: 10g
> total cores: 13
>Reporter: J.P Feng
>  Labels: hive, insert, sparkSQL
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hello, all. I am facing a strange issue in my project.
> there is a table:
> CREATE TABLE `login4game`(`account_name` string, `role_id` string, 
> `server_id` string, `recdate` string)
> PARTITIONED BY (`pt` string, `dt` string) stored as orc;
> another table:
> CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` 
> string, `happened_time` int)
> PARTITIONED BY (`pt` string, `dt` string)
> --
> Test-1:
> I executed this SQL in spark-shell or spark-sql (before I ran it, there was 
> already a lot of data in partition (pt='mix_en', dt='2016-10-21') of table 
> login4game):
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate from 
> tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
> It takes a lot of time; below is part of the logs:
> /
> [Stage 5:===>   (144 + 8) / 
> 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 
> 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, 
> real=0.08 secs] 
> [Stage 5:=> (152 + 8) / 
> 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 
> 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, 
> real=0.08 secs] 
> [Stage 5:>  (160 + 8) / 
> 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 
> 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, 
> real=0.07 secs] 
> [Stage 5:>  (176 + 8) / 
> 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 
> 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, 
> real=0.06 secs] 
> [Stage 5:>  (177 + 8) / 
> 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 
> 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, 
> real=0.06 secs] 
> [Stage 5:>  (192 + 8) / 
> 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 
> 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, 
> real=0.06 secs] 
> Moved: 
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0'
>  to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1'
>  to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2'
>  to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-3'
>  to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-4'
>  to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> ...
> Moved: 
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199'
>  to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /
> I can see the original data is moved to .Trash,
> and then no logs are printed; after about 10 minutes, the log prints 
> again:
> /
> 16/10/24 17:24:15 INFO Hive: Replacing 
> 

[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18087:


Assignee: Apache Spark

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18087:


Assignee: (was: Apache Spark)

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606990#comment-15606990
 ] 

Apache Spark commented on SPARK-18087:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15633

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18087:


Assignee: (was: Apache Spark)

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18087:


Assignee: Apache Spark

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-18106:

Description: 
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not "noscan" produces an AnalyzeTableCommand with 
noscan=false

  was:
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not noscan produces an AnalyzeTableCommand with 
noscan=false


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-18106:

Description: 
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not noscan produces an AnalyzeTableCommand with 
noscan=false

  was:
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not {noformat}noscan{noformat} produces an 
AnalyzeTableCommand with {code}noscan=false{code}


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not noscan produces an AnalyzeTableCommand with 
> noscan=false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Srinath (JIRA)
Srinath created SPARK-18106:
---

 Summary: Analyze Table accepts a garbage identifier at the end
 Key: SPARK-18106
 URL: https://issues.apache.org/jira/browse/SPARK-18106
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Srinath
Priority: Minor


{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not {noformat}noscan{noformat} produces an 
AnalyzeTableCommand with {code}noscan=false{code}
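
For illustration, a minimal REPL-style Scala sketch of the intended behavior: only a 
trailing NOSCAN should be accepted, and anything else should be rejected. The 
AnalyzeTableCommand case class below is a simplified placeholder for the real command 
class, and buildAnalyzeTable is a hypothetical helper, not Spark's actual parser code.

{code}
// Simplified placeholder for the real command class, to keep the sketch self-contained.
case class AnalyzeTableCommand(table: String, noscan: Boolean)

// Accept only NOSCAN (case-insensitive) as the trailing identifier; reject anything else.
def buildAnalyzeTable(table: String, trailing: Option[String]): AnalyzeTableCommand = trailing match {
  case None                                      => AnalyzeTableCommand(table, noscan = false)
  case Some(id) if id.equalsIgnoreCase("noscan") => AnalyzeTableCommand(table, noscan = true)
  case Some(other) =>
    throw new IllegalArgumentException(s"Expected `NOSCAN` instead of `$other`")
}
{code}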



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606896#comment-15606896
 ] 

Tathagata Das commented on SPARK-17829:
---

Based on [~tcondie]'s PR above, I think it's better that we also change the main 
common log class HDFSMetadataLog to use JSON serialization rather than Java 
serialization. 

But this also means that we have to modify FileStreamSourceLog (a subclass of 
HDFSMetadataLog[FileEntry]) to use JSON serialization as well. That is good to 
fix too, as the file stream source log should also have a stable on-disk 
format and not depend on Java serialization.
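
For illustration only, a minimal sketch of what a JSON-based on-disk format could look 
like using json4s; the OffsetEntry case class and its fields are hypothetical 
placeholders, not the actual offset or FileEntry classes.

{code}
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

// Hypothetical log entry; the real offset/FileEntry classes have different fields.
case class OffsetEntry(batchId: Long, offsets: Map[String, Long])

object OffsetJsonLog {
  implicit val formats: org.json4s.Formats = Serialization.formats(NoTypeHints)

  // Write a stable, human-readable JSON string instead of Java serialization.
  def serialize(entry: OffsetEntry): String = Serialization.write(entry)

  // Read the entry back from the JSON produced above.
  def deserialize(json: String): OffsetEntry = Serialization.read[OffsetEntry](json)
}
{code}

An entry written this way stays human-readable and does not break when Spark's 
internal classes change between releases.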

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606849#comment-15606849
 ] 

Apache Spark commented on SPARK-18105:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/15632

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When LZ4 is used to compress the shuffle files, decompression may fail with 
> "stream is corrupt":
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18105:


Assignee: Apache Spark  (was: Davies Liu)

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> When LZ4 is used to compress the shuffle files, decompression may fail with 
> "stream is corrupt":
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18105:


Assignee: Davies Liu  (was: Apache Spark)

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When LZ4 is used to compress the shuffle files, decompression may fail with 
> "stream is corrupt":
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2016-10-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18105:
--

 Summary: LZ4 failed to decompress a stream of shuffled data
 Key: SPARK-18105
 URL: https://issues.apache.org/jira/browse/SPARK-18105
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu
Assignee: Davies Liu


When LZ4 is used to compress the shuffle files, decompression may fail with 
"stream is corrupt":

https://github.com/jpountz/lz4-java/issues/89
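
For reference, a minimal sketch of the configuration under which this can show up, 
i.e. selecting LZ4 as the compression codec for shuffle files (the application name 
and master below are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// LZ4 is the codec whose stream decompression can report "stream is corrupt".
val conf = new SparkConf()
  .setAppName("lz4-shuffle-example")   // placeholder application name
  .setMaster("local[2]")               // placeholder master
  .set("spark.io.compression.codec", "lz4")

val sc = new SparkContext(conf)
// Any shuffle now reads and writes LZ4-compressed shuffle files.
val counts = sc.parallelize(1 to 1000).map(i => (i % 10, 1)).reduceByKey(_ + _).collect()
sc.stop()
{code}

The same spark.io.compression.codec setting selects a different codec (for example 
snappy, the default) if LZ4 needs to be avoided until the lz4-java issue is fixed.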



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18104) Don't build KafkaSource doc

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18104:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Don't build KafkaSource doc
> ---
>
> Key: SPARK-18104
> URL: https://issues.apache.org/jira/browse/SPARK-18104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> We don't need to build docs for KafkaSource because users should use the data 
> source APIs to access KafkaSource. All KafkaSource APIs are internal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18104) Don't build KafkaSource doc

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606795#comment-15606795
 ] 

Apache Spark commented on SPARK-18104:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15630

> Don't build KafkaSource doc
> ---
>
> Key: SPARK-18104
> URL: https://issues.apache.org/jira/browse/SPARK-18104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> We don't need to build docs for KafkaSource because users should use the data 
> source APIs to access KafkaSource. All KafkaSource APIs are internal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18104) Don't build KafkaSource doc

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18104:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Don't build KafkaSource doc
> ---
>
> Key: SPARK-18104
> URL: https://issues.apache.org/jira/browse/SPARK-18104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> We don't need to build docs for KafkaSource because users should use the data 
> source APIs to access KafkaSource. All KafkaSource APIs are internal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18104) Don't build KafkaSource doc

2016-10-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18104:


 Summary: Don't build KafkaSource doc
 Key: SPARK-18104
 URL: https://issues.apache.org/jira/browse/SPARK-18104
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


We don't need to build docs for KafkaSource because users should use the data 
source APIs to access KafkaSource. All KafkaSource APIs are internal.
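
For context, a minimal sketch of the data source API path users are expected to take 
instead of referencing KafkaSource itself; the bootstrap servers and topic name are 
placeholders, and it assumes the spark-sql-kafka-0-10 package is on the classpath.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-source-example").getOrCreate()

// Read from Kafka through the data source API rather than touching KafkaSource directly.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // placeholder brokers
  .option("subscribe", "events")                    // placeholder topic
  .load()

// Kafka records arrive with binary key/value columns; cast them as needed.
val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
{code}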



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606693#comment-15606693
 ] 

Xiao Li commented on SPARK-18009:
-

[~dkbiswal] Please fix it tonight. Thanks!

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift server on YARN, I tried to execute the 
> following command from beeline:
> > show databases;
> I got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>   at 
> 

[jira] [Updated] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18009:

Labels: thrift  (was: sql thrift)

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift server on YARN, I tried to execute the 
> following command from beeline:
> > show databases;
> I got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210)
>   ... 15 more
> 

[jira] [Updated] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18009:

Target Version/s: 2.0.1, 2.1.0

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift server on YARN, I tried to execute the 
> following command from beeline:
> > show databases;
> I got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210)
>   ... 15 more
> Error: 

[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Component/s: (was: Spark Core)
 Web UI

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}
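
For context, a minimal sketch of the kind of spark-defaults.conf settings under which 
this redirect happens (the keystore path and passwords are placeholders):

{code}
# Placeholders below: substitute a real keystore path and real passwords.
spark.ssl.enabled            true
spark.ssl.keyStore           /path/to/keystore.jks
spark.ssl.keyStorePassword   changeit
spark.ssl.keyPassword        changeit
{code}

With SSL enabled this way, the UI is served on the shifted https port while the log 
line quoted above still prints an http URL.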



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Component/s: (was: Spark Shell)

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Component/s: Spark Core

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-16988.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Assignee: chie hayashida

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15482) ClassCast exception when join two tables.

2016-10-25 Thread roberto sancho rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606601#comment-15606601
 ] 

roberto sancho rojas commented on SPARK-15482:
--

I have the same problem when I execute this code from Spark 1.6 with HDP 
2.4.0.0-169 and Phoenix 2.4.0:
df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "TABLA") \
  .option("zkUrl", "XXX:/hbase-unsecure") \
  .load()
df.show()
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row
at 
org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)

Here is my classpath:
/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-client.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-common.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-core-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar

> ClassCast exception when join two tables.
> -
>
> Key: SPARK-15482
> URL: https://issues.apache.org/jira/browse/SPARK-15482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Phoenix: 1.2
> Spark: 1.5.0-cdh5.5.1
>Reporter: jingtao
>
> I have two tables A and B in Phoenix.
> I load table 'A' as DataFrame 'ADF' using Spark, and register DataFrame 
> 'ADF' as temp table 'ATEMPTABLE'.
> B is the same as A.
> A --> ADF ---> ATEMPTABLE
> B --> BDF ---> BTEMPTABLE
> Then I join the two temp tables 'ATEMPTABLE' and 'BTEMPTABLE' using Spark 
> SQL, for example 'select count(*) from ATEMPTABLE join BTEMPTABLE on ...'.
> It fails with the following message: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 6, hadoop05): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
> org.apache.spark.sql.Row
> at 
> org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at 

[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606523#comment-15606523
 ] 

Alex Bozarth commented on SPARK-18085:
--

I am *very* interested in working with you on this project and (post-Spark 
Summit) would love to discuss some of the UI ideas my team has been tossing 
around (a few covered in your non-goals).

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to solving them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Environment: Pyspark 2.0.0, Ipython 4.2  (was: Pyspark 2.0.1, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.0, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
> return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
> ^s)
>+- LogicalRDD [input#0L]
> {code}
> Executing test_df.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
> Executing test_df.show() in pyspark 1.6.2 yields:
> {code}
> +-+--+
> |input|output|
> +-+--+
> |2|second|
> +-+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606484#comment-15606484
 ] 

Nicholas Chammas commented on SPARK-18084:
--

cc [~marmbrus] - Dunno if this is actually a bug or just an unsupported or 
inappropriate use case.

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> 
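A possible workaround sketch while partitionBy does not resolve nested columns: materialize the nested field as a top-level column and partition on that instead. The column name a_b and the parquet output path are illustrative choices, not from the report.
{code}
from pyspark.sql import functions as F

# Copy the nested field out of the struct so partitionBy sees a top-level column.
flat = df.withColumn('a_b', F.col('a.b'))

# Parquet is used here because it accepts the remaining struct column;
# the text writer in the original repro requires a single string column.
flat.write.partitionBy('a_b').parquet('/tmp/test_flat')
{code}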

[jira] [Created] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-25 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18103:
--

 Summary: Rename *FileCatalog to *FileProvider
 Key: SPARK-18103
 URL: https://issues.apache.org/jira/browse/SPARK-18103
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang
Priority: Minor


In the SQL component there are too many different components called some 
variant of *Catalog, which is quite confusing. We should rename the subclasses 
of FileCatalog to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18102) Failed to deserialize the result of task

2016-10-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18102:
--

 Summary: Failed to deserialize the result of task
 Key: SPARK-18102
 URL: https://issues.apache.org/jira/browse/SPARK-18102
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
16/10/25 15:17:04 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() for one-way message.
java.lang.ClassNotFoundException: org.apache.spark.util.SerializableBuffer not 
found in 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@3d98d138
at 
com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:578)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 

[jira] [Updated] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-10-25 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18101:
---
Issue Type: Sub-task  (was: Test)
Parent: SPARK-17861

> ExternalCatalogSuite should test with mixed case fields
> ---
>
> Key: SPARK-18101
> URL: https://issues.apache.org/jira/browse/SPARK-18101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> Currently, it uses field names such as "a" and "b" which are not useful for 
> testing case preservation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-10-25 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18101:
--

 Summary: ExternalCatalogSuite should test with mixed case fields
 Key: SPARK-18101
 URL: https://issues.apache.org/jira/browse/SPARK-18101
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Eric Liang


Currently, it uses field names such as "a" and "b" which are not useful for 
testing case preservation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17471) Add compressed method for Matrix class

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17471:


Assignee: Apache Spark

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).
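A rough sketch of the kind of selection rule being proposed, written in PySpark purely for illustration: the real method would live on the Scala Matrix class, and Spark's exact storage estimates may differ from the rough byte counts assumed here.
{code}
from pyspark.ml.linalg import DenseMatrix

def pick_representation(m):
    # Count non-zeros and compare rough storage estimates:
    # dense = 8 bytes per entry; CSC sparse = 8-byte value + 4-byte row
    # index per non-zero, plus (numCols + 1) 4-byte column pointers.
    nnz = int((m.toArray() != 0.0).sum())
    dense_bytes = 8 * m.numRows * m.numCols
    sparse_bytes = 12 * nnz + 4 * (m.numCols + 1)
    return 'sparse' if sparse_bytes < dense_bytes else 'dense'

m = DenseMatrix(3, 2, [0.0, 5.0, 0.0, 0.0, 0.0, 1.0])
print(pick_representation(m))  # 'sparse' for this mostly-zero matrix
{code}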



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606347#comment-15606347
 ] 

Apache Spark commented on SPARK-17471:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15628

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17471) Add compressed method for Matrix class

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17471:


Assignee: (was: Apache Spark)

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18019) Log instrumentation in GBTs

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18019.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15574
[https://github.com/apache/spark/pull/15574]

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.1.0
>
>
> Sub-task for adding instrumentation to GBTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18100) Improve the performance of get_json_object using Gson

2016-10-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18100:
---
Issue Type: Improvement  (was: Bug)

> Improve the performance of get_json_object using Gson
> -
>
> Key: SPARK-18100
> URL: https://issues.apache.org/jira/browse/SPARK-18100
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> Based on the benchmark at 
> http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/,
>  which suggests Gson can be much faster than Jackson, it may be worth using 
> Gson to improve the performance of get_json_object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18100) Improve the performance of get_json_object using Gson

2016-10-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18100:
--

 Summary: Improve the performance of get_json_object using Gson
 Key: SPARK-18100
 URL: https://issues.apache.org/jira/browse/SPARK-18100
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


Based on the benchmark at 
http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/,
 which suggests Gson can be much faster than Jackson, it may be worth using 
Gson to improve the performance of get_json_object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Comment: was deleted

(was: Calling this a bug since FPR is not implemented correctly.)

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18070) binary operator should not consider nullability when comparing input types

2016-10-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18070.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Issue resolved by pull request 15606
[https://github.com/apache/spark/pull/15606]

> binary operator should not consider nullability when comparing input types
> --
>
> Key: SPARK-18070
> URL: https://issues.apache.org/jira/browse/SPARK-18070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-10-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606178#comment-15606178
 ] 

Joseph K. Bradley commented on SPARK-17692:
---

[SPARK-17870] changes the output of ChiSqSelector.  It is a bug fix, so it is 
an acceptable change of behavior.

> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  
> * SPARK-17870: ChiSquareSelector use pValue rather than raw statistic for 
> SelectKBest features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606146#comment-15606146
 ] 

Joseph K. Bradley commented on SPARK-18088:
---

How do you feel about renaming the selectorType values to match the parameters? 
 I'd like to call them "numTopFeatures", "percentile" and "fpr".

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Priority: Minor  (was: Major)

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Issue Type: Improvement  (was: Bug)

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606142#comment-15606142
 ] 

Joseph K. Bradley commented on SPARK-18088:
---

Ahh, you're right, sorry, I see that now that I'm looking at master.  I'll link 
the follow-up JIRA to the original JIRA.

And I agree my assertion about p-value wasn't correct.  Will fix.  Thanks!

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Description: 
There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups

  was:
There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups

One major item: FPR is not implemented correctly.  Testing against only the 
p-value and not the test statistic does not really tell you anything.  We 
should follow sklearn, which allows a p-value threshold for any selection 
method: 
[http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
* In this PR, I'm just going to remove FPR completely.  We can add it back in a 
follow-up PR.


> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18099:


Assignee: Apache Spark

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>Assignee: Apache Spark
>
> With the recent changes for [SPARK-14423] (Handle jar conflict issue when 
> uploading to distributed cache), yarn#client by default uploads all the 
> --files and --archives in the assembly to the HDFS staging folder. It should 
> throw an exception if the same file appears in both --files and --archives, 
> so that it is clear whether the file should be uncompressed or left 
> compressed.
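An illustrative sketch of the proposed validation, not the actual yarn#client code; the helper name and the basename-based comparison are assumptions made for the example.
{code}
import os

def check_no_duplicate_uploads(files, archives):
    # Fail fast if the same resource is passed through both --files and
    # --archives, since it is ambiguous whether it should be uncompressed.
    dupes = {os.path.basename(f) for f in files} & \
            {os.path.basename(a) for a in archives}
    if dupes:
        raise ValueError("Found in both --files and --archives: %s"
                         % ", ".join(sorted(dupes)))

check_no_duplicate_uploads(['/tmp/app.py'], ['/tmp/deps.zip'])      # ok
# check_no_duplicate_uploads(['/tmp/deps.zip'], ['/tmp/deps.zip'])  # raises
{code}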



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18099:


Assignee: (was: Apache Spark)

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>
> With the recent changes for [SPARK-14423] (Handle jar conflict issue when 
> uploading to distributed cache), yarn#client by default uploads all the 
> --files and --archives in the assembly to the HDFS staging folder. It should 
> throw an exception if the same file appears in both --files and --archives, 
> so that it is clear whether the file should be uncompressed or left 
> compressed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606068#comment-15606068
 ] 

Apache Spark commented on SPARK-18099:
--

User 'kishorvpatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15627

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>
> With the recent changes for [SPARK-14423] (Handle jar conflict issue when 
> uploading to distributed cache), yarn#client by default uploads all the 
> --files and --archives in the assembly to the HDFS staging folder. It should 
> throw an exception if the same file appears in both --files and --archives, 
> so that it is clear whether the file should be uncompressed or left 
> compressed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Kishor Patil (JIRA)
Kishor Patil created SPARK-18099:


 Summary: Spark distributed cache should throw exception if same 
file is specified to dropped in --files --archives
 Key: SPARK-18099
 URL: https://issues.apache.org/jira/browse/SPARK-18099
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.1, 2.0.0
Reporter: Kishor Patil


With the recent changes for [SPARK-14423] (Handle jar conflict issue when 
uploading to distributed cache), yarn#client by default uploads all the --files 
and --archives in the assembly to the HDFS staging folder. It should throw an 
exception if the same file appears in both --files and --archives, so that it 
is clear whether the file should be uncompressed or left compressed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2016-10-25 Thread Matthew Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Porter updated SPARK-16183:
---
Affects Version/s: 2.0.0

> Large Spark SQL commands cause StackOverflowError in parser when using 
> sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Running on AWS EMR
>Reporter: Matthew Porter
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL 
> command to be run via sqlContext.sql(cmd) based on a large number of 
> parameters. As the number of input files to be filtered and joined in this 
> query grows, so does the length of the SQL query. The tool runs fine up until 
> about 200+ files are included in the join, at which point the SQL command 
> becomes very long (~100K characters). It is only on these longer queries that 
> Spark fails, throwing an exception due to what seems to be too much recursion 
> occurring within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
> merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
> 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> 
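Two workarounds that may help while the parser's recursion depth stays as it is, sketched below with assumed names (paths, key): build the join incrementally with the DataFrame API so the SQL parser never sees one ~100K-character statement, and, if needed, raise the driver thread stack size via spark.driver.extraJavaOptions (e.g. -Xss16m).
{code}
from functools import reduce

# Assumed: 'paths' is the list of input files and 'key' the join column.
dfs = [sqlsc.read.parquet(p) for p in paths]

# Fold the DataFrames into one join chain without generating a giant SQL string.
merged_df = reduce(lambda left, right: left.join(right, on='key', how='inner'),
                   dfs)
{code}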

[jira] [Resolved] (SPARK-18010) Remove unneeded heavy work performed by FsHistoryProvider for building up the application listing UI page

2016-10-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18010.

   Resolution: Fixed
 Assignee: Vinayak Joshi
Fix Version/s: 2.1.0

> Remove unneeded heavy work performed by FsHistoryProvider for building up the 
> application listing UI page
> -
>
> Key: SPARK-18010
> URL: https://issues.apache.org/jira/browse/SPARK-18010
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.6.2, 2.0.1, 2.1.0
>Reporter: Vinayak Joshi
>Assignee: Vinayak Joshi
> Fix For: 2.1.0
>
>
> There are known complaints/cribs about History Server's Application List not 
> updating quickly enough when the event log files that need replay are huge. 
> Currently, the FsHistoryProvider design causes the entire event log file to 
> be replayed when building the initial application listing (refer the method 
> mergeApplicationListing(fileStatus: FileStatus) ). The process of replay 
> involves:
>  - each line in the event log being read as a string,
>  - parsing the string to a Json structure
>  - converting the Json to the corresponding Scala classes with nested 
> structures
> Particularly the part involving parsing string to Json and then to Scala 
> classes is expensive. Tests show that majority of time spent in replay is in 
> doing this work. 
> When the replay is performed for building the application listing, the only 
> two events that the code really cares for are "SparkListenerApplicationStart" 
> and "SparkListenerApplicationEnd" - since the only listener attached to the 
> ReplayListenerBus at that point is the ApplicationEventListener. This means 
> that when processing an event log file with a huge number (hundreds of 
> thousands, can be more) of events, the work done to deserialize all of these 
> events and then replay them is not needed. Only two events are what we're 
> interested in, and this can be used to ensure that when replay is performed 
> for the purpose of building the application list, we only make the effort to 
> replay these two events and not others. 
> My tests show that this drastically improves application list load time. For 
> a 150MB event log from a user, with over 100,000 events, the load time (local 
> on my mac) comes down from about 16 secs to under 1 second using this 
> approach. For customers that typically execute applications with large event 
> logs, and thus have multiple large event logs present, this can speed up how 
> soon the history server UI lists the apps considerably.
> I will be updating a pull request with my take at fixing this.
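A sketch of the idea in plain Python (the real change belongs in FsHistoryProvider / ReplayListenerBus): do a cheap substring check per line and only pay for JSON parsing on the two application events. The event log file name is hypothetical.
{code}
import json

APP_EVENTS = ('SparkListenerApplicationStart', 'SparkListenerApplicationEnd')

def application_events(lines):
    # Cheap substring filter first; full JSON parse only for the two
    # events needed to build the application listing.
    for line in lines:
        if any(evt in line for evt in APP_EVENTS):
            yield json.loads(line)

with open('eventlog.txt') as f:   # hypothetical local copy of an event log
    for event in application_events(f):
        print(event['Event'])
{code}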



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18098) Broadcast creates 1 instance / core, not 1 instance / executor

2016-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605898#comment-15605898
 ] 

Sean Owen commented on SPARK-18098:
---

It shouldn't work that way. The value is loaded in a lazy val, at least. I 
think I can imagine cases where you would end up with several per executor but 
they're not the normal use cases. Can you say more about what you're executing 
or what you're seeing?

> Broadcast creates 1 instance / core, not 1 instance / executor
> --
>
> Key: SPARK-18098
> URL: https://issues.apache.org/jira/browse/SPARK-18098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Anthony Sciola
>
> I've created my spark executors with $SPARK_HOME/sbin/start-slave.sh -c 7 -m 
> 55g
> When I run a job which broadcasts data, it appears each *thread* requests and 
> receives a copy of the broadcast object, not each *executor*. This means I 
> need 7x as much memory for the broadcasted item because I have 7 cores.
> The problem appears to be due to a lack of synchronization around requesting 
> broadcast items.
> The only workaround I've come up with is writing the data out to HDFS, 
> broadcasting the paths, and doing a synchronized load from HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605879#comment-15605879
 ] 

Apache Spark commented on SPARK-17829:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/15626

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.
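A minimal sketch of what a user-readable offset record could look like, assuming a JSON layout chosen just for this example (not necessarily the format the eventual PR adopts):
{code}
import json

def serialize_offsets(batch_id, offsets):
    # One human-readable record per batch; keys sorted for stable diffs.
    return json.dumps({'batchId': batch_id, 'offsets': offsets}, sort_keys=True)

record = serialize_offsets(42, {'topic-0': 1050, 'topic-1': 987})
assert json.loads(record)['offsets']['topic-0'] == 1050
{code}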



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17829:


Assignee: Tyson Condie  (was: Apache Spark)

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17829:


Assignee: Apache Spark  (was: Tyson Condie)

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18098) Broadcast creates 1 instance / core, not 1 instance / executor

2016-10-25 Thread Anthony Sciola (JIRA)
Anthony Sciola created SPARK-18098:
--

 Summary: Broadcast creates 1 instance / core, not 1 instance / 
executor
 Key: SPARK-18098
 URL: https://issues.apache.org/jira/browse/SPARK-18098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Anthony Sciola


I've created my spark executors with $SPARK_HOME/sbin/start-slave.sh -c 7 -m 55g

When I run a job which broadcasts data, it appears each *thread* requests and 
receives a copy of the broadcast object, not each *executor*. This means I need 
7x as much memory for the broadcasted item because I have 7 cores.

The problem appears to be due to a lack of synchronization around requesting 
broadcast items.

The only workaround I've come up with is writing the data out to HDFS, 
broadcasting the paths, and doing a synchronized load from HDFS.
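A sketch of the synchronized load the reporter describes, in plain Python for illustration only: inside a JVM executor the task threads share one process, so a single guarded load can serve all cores (in PySpark, workers are separate processes, so this only helps within one worker). The loader function and path are assumptions.
{code}
import threading

_cache = {}
_lock = threading.Lock()

def load_once(path, loader):
    # All task threads in this process share one copy keyed by path.
    with _lock:
        if path not in _cache:
            _cache[path] = loader(path)  # e.g. read the lookup table from HDFS
        return _cache[path]
{code}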



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2016-10-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605820#comment-15605820
 ] 

Marcelo Vanzin commented on SPARK-6951:
---

I reopened this after discussion in the bug; the other change (SPARK-18010) 
makes startup a little faster, but not necessarily fast, for large directories 
/ log files.

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6951) History server slow startup if the event log directory is large

2016-10-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-6951:
---

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605815#comment-15605815
 ] 

Marcelo Vanzin commented on SPARK-18085:


It's mildly related. The changes here don't do anything to help with the first 
SHS startup, which will still be slow. Subsequent startups would be fast, 
though, since the data would already be available locally.

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt

2016-10-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18097:
---
Description: 
When the schema of Hive table is broken, we can't drop the table using Spark 
SQL, for example
{code}
Error in SQL statement: QueryExecutionException: FAILED: 
IllegalArgumentException Error: > expected at the position 10 of 
'ss:string:struct<>' but ':' is found.
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:336)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:480)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:104)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:353)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:351)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:351)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:228)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:227)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:255)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:126)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:267)
at 
org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:753)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at 
