[jira] [Updated] (SPARK-18111) Wrong ApproximatePercentile answer when multiple records have the minimum value
[ https://issues.apache.org/jira/browse/SPARK-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-18111: - Description: When multiple records have the minimum value, the answer of ApproximatePercentile is wrong. For example, the following query returns 2.0 for percentile 0.5, but the correct answer should be 1.0 0: jdbc:hive2://localhost:1> select key from src2; +--+--+ | key | +--+--+ | 1| | 1| | 2| | 2| +--+--+ 4 rows selected (0.185 seconds) 0: jdbc:hive2://localhost:1> select percentile_approx(key, array(0.5)) from src2; ++--+ | percentile_approx(CAST(key AS DOUBLE), array(0.5), 1) | ++--+ | [2.0] | ++--+ 1 row selected (0.292 seconds) was: When multiple records have the minimum value, the answer of ApproximatePercentile is wrong. e.g: 0: jdbc:hive2://localhost:1> select key from src2; +--+--+ | key | +--+--+ | 1| | 1| | 2| | 2| +--+--+ 4 rows selected (0.185 seconds) 0: jdbc:hive2://localhost:1> select percentile_approx(key, array(0.5)) from src2; ++--+ | percentile_approx(CAST(key AS DOUBLE), array(0.5), 1) | ++--+ | [2.0] | ++--+ 1 row selected (0.292 seconds) > Wrong ApproximatePercentile answer when multiple records have the minimum > value > --- > > Key: SPARK-18111 > URL: https://issues.apache.org/jira/browse/SPARK-18111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Zhenhua Wang > > When multiple records have the minimum value, the answer of > ApproximatePercentile is wrong. > For example, the following query returns 2.0 for percentile 0.5, but the > correct answer should be 1.0 > 0: jdbc:hive2://localhost:1> select key from src2; > +--+--+ > | key | > +--+--+ > | 1| > | 1| > | 2| > | 2| > +--+--+ > 4 rows selected (0.185 seconds) > 0: jdbc:hive2://localhost:1> select percentile_approx(key, array(0.5)) > from src2; > ++--+ > | percentile_approx(CAST(key AS DOUBLE), array(0.5), 1) | > ++--+ > | [2.0] | > ++--+ > 1 row selected (0.292 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18111) Wrong ApproximatePercentile answer when multiple records have the minimum value
Zhenhua Wang created SPARK-18111: Summary: Wrong ApproximatePercentile answer when multiple records have the minimum value Key: SPARK-18111 URL: https://issues.apache.org/jira/browse/SPARK-18111 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Zhenhua Wang When multiple records have the minimum value, the answer of ApproximatePercentile is wrong. e.g: 0: jdbc:hive2://localhost:1> select key from src2; +--+--+ | key | +--+--+ | 1| | 1| | 2| | 2| +--+--+ 4 rows selected (0.185 seconds) 0: jdbc:hive2://localhost:1> select percentile_approx(key, array(0.5)) from src2; ++--+ | percentile_approx(CAST(key AS DOUBLE), array(0.5), 1) | ++--+ | [2.0] | ++--+ 1 row selected (0.292 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
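For context, here is a minimal sketch that reproduces the reported query through a SparkSession rather than beeline; the session setup and view name are assumptions added for illustration, not part of the original report.

{code}
// Hypothetical reproduction of the SPARK-18111 report (sketch, not from the ticket).
// The median of (1, 1, 2, 2) should be 1.0; the report says 2.0 is returned in 2.0.1.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-18111-repro").master("local[*]").getOrCreate()
import spark.implicits._

Seq(1, 1, 2, 2).toDF("key").createOrReplaceTempView("src2")

spark.sql("SELECT percentile_approx(key, array(0.5)) FROM src2").show(false)
// Reported (wrong) result: [2.0]; expected result: [1.0].
{code}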
[jira] [Assigned] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18106: Assignee: Apache Spark > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Assignee: Apache Spark >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18106: Assignee: (was: Apache Spark) > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607496#comment-15607496 ] Apache Spark commented on SPARK-18106: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15640 > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
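The description and the pull request above concern the trailing identifier after COMPUTE STATISTICS. As a rough illustration only (the actual patch may differ), the kind of check that rejects a garbage identifier looks like the sketch below; the method name is hypothetical.

{code}
// Illustrative sketch, not the merged fix: only NOSCAN (case-insensitive) should be
// accepted after ANALYZE TABLE ... COMPUTE STATISTICS; anything else should fail parsing.
def validateAnalyzeOption(trailingIdentifier: Option[String]): Boolean = trailingIdentifier match {
  case None => true                                        // ANALYZE TABLE t COMPUTE STATISTICS
  case Some(id) if id.equalsIgnoreCase("noscan") => true   // ... COMPUTE STATISTICS NOSCAN
  case Some(other) =>
    throw new IllegalArgumentException(s"Expected `NOSCAN` instead of `$other`")
}
{code}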
[jira] [Assigned] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification
[ https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18110: Assignee: Felix Cheung (was: Apache Spark) > Missing parameter in Python for RandomForest regression and classification > -- > > Key: SPARK-18110 > URL: https://issues.apache.org/jira/browse/SPARK-18110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 >Reporter: Felix Cheung >Assignee: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification
[ https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18110: Assignee: Apache Spark (was: Felix Cheung) > Missing parameter in Python for RandomForest regression and classification > -- > > Key: SPARK-18110 > URL: https://issues.apache.org/jira/browse/SPARK-18110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 >Reporter: Felix Cheung >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification
[ https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607444#comment-15607444 ] Apache Spark commented on SPARK-18110: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/15638 > Missing parameter in Python for RandomForest regression and classification > -- > > Key: SPARK-18110 > URL: https://issues.apache.org/jira/browse/SPARK-18110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 >Reporter: Felix Cheung >Assignee: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification
Felix Cheung created SPARK-18110: Summary: Missing parameter in Python for RandomForest regression and classification Key: SPARK-18110 URL: https://issues.apache.org/jira/browse/SPARK-18110 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.1 Reporter: Felix Cheung Assignee: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18007) update SparkR MLP - add initialWeights parameter
[ https://issues.apache.org/jira/browse/SPARK-18007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-18007. -- Resolution: Fixed Assignee: Weichen Xu Fix Version/s: 2.1.0 > update SparkR MLP - add initialWeights parameter > --- > > Key: SPARK-18007 > URL: https://issues.apache.org/jira/browse/SPARK-18007 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > update SparkR MLP, add initialWeights parameter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607374#comment-15607374 ] Dongjoon Hyun edited comment on SPARK-18106 at 10/26/16 4:31 AM: - Thank you for reporting this bug. I'll make a PR to fix this. was (Author: dongjoon): Thank you for reporting this bug, [~srinathc] I'll make a PR to fix this. > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607374#comment-15607374 ] Dongjoon Hyun edited comment on SPARK-18106 at 10/26/16 4:30 AM: - Thank you for reporting this bug, [~srinathc] I'll make a PR to fix this. was (Author: dongjoon): Thank you for reporting this bug, [~skomatir]. I'll make a PR to fix this. > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607374#comment-15607374 ] Dongjoon Hyun commented on SPARK-18106: --- Thank you for reporting this bug, [~skomatir]. I'll make a PR to fix this. > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18036) Decision Trees do not handle edge cases
[ https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607357#comment-15607357 ] Weichen Xu commented on SPARK-18036: i am working on this... > Decision Trees do not handle edge cases > --- > > Key: SPARK-18036 > URL: https://issues.apache.org/jira/browse/SPARK-18036 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Seth Hendrickson >Priority: Minor > > Decision trees/GBT/RF do not handle edge cases such as constant features or > empty features. For example: > {code} > val dt = new DecisionTreeRegressor() > val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF() > dt.fit(data) > java.lang.UnsupportedOperationException: empty.max > at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229) > at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234) > at > org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207) > at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105) > at > org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93) > at > org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > ... 52 elided > {code} > as well as > {code} > val dt = new DecisionTreeRegressor() > val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF() > dt.fit(data) > java.lang.UnsupportedOperationException: empty.maxBy > at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236) > at > scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37) > at > org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
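Both stack traces above come from reducing over an empty collection (no features, or no valid splits). A sketch of the kind of up-front validation that would turn these into readable errors is below; it is an assumption about a possible fix, not the fix itself.

{code}
// Sketch only: validate tree inputs before training instead of failing inside empty.max/maxBy.
def validateTreeInput(numFeatures: Int, numExamples: Long): Unit = {
  require(numFeatures > 0,
    s"DecisionTree requires number of features > 0, but was given numFeatures = $numFeatures.")
  require(numExamples > 0,
    s"DecisionTree requires a non-empty training dataset, but was given $numExamples examples.")
}
{code}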
[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607343#comment-15607343 ] Dilip Biswal commented on SPARK-18009: -- [~smilegator][~jerryjung] [~martha.solarte] Thanks. I am testing a fix and should submit a PR for this soon. > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLExcept
[jira] [Closed] (SPARK-17881) Aggregation function for generating string histograms
[ https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang closed SPARK-17881. Resolution: Duplicate > Aggregation function for generating string histograms > - > > Key: SPARK-17881 > URL: https://issues.apache.org/jira/browse/SPARK-17881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > This agg function generates equi-width histograms for string type columns, > with a maximum number of histogram bins. It returns an empty result if the > ndv (number of distinct values) of the column exceeds the maximum number > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17881) Aggregation function for generating string histograms
[ https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607308#comment-15607308 ] Zhenhua Wang commented on SPARK-17881: -- This issue is included in another issue, SPARK-18000, so I'll close this one. > Aggregation function for generating string histograms > - > > Key: SPARK-17881 > URL: https://issues.apache.org/jira/browse/SPARK-17881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > This agg function generates equi-width histograms for string type columns, > with a maximum number of histogram bins. It returns an empty result if the > ndv (number of distinct values) of the column exceeds the maximum number > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-18000: - Comment: was deleted (was: This issue is included in another issue SPARK-17881, so I'll close this one.) > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > For a column, we will generate a equi-width or equi-height histogram, > depending on if its ndv is large than the maximum number of bins allowed in > one histogram (denoted as numBins). > The agg function for a column returns bins - (distinct value, frequency) > pairs of equi-width histogram when the number of distinct values is less than > or equal to numBins. Otherwise, 1) for column of string type, it returns an > empty map; 2) for column of numeric type (including DateType and > TimestampType), it returns endpoints of equi-height histogram - approximate > percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., > (numBins-1)/numBins, 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607306#comment-15607306 ] Zhenhua Wang commented on SPARK-18000: -- This issue is included in another issue SPARK-17881, so I'll close this one. > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > For a column, we will generate a equi-width or equi-height histogram, > depending on if its ndv is large than the maximum number of bins allowed in > one histogram (denoted as numBins). > The agg function for a column returns bins - (distinct value, frequency) > pairs of equi-width histogram when the number of distinct values is less than > or equal to numBins. Otherwise, 1) for column of string type, it returns an > empty map; 2) for column of numeric type (including DateType and > TimestampType), it returns endpoints of equi-height histogram - approximate > percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., > (numBins-1)/numBins, 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17074) generate histogram information for column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-17074: - Description: We support two kinds of histograms: - Equi-width histogram: We have a fixed width for each column interval in the histogram. The height of a histogram represents the frequency for those column values in a specific interval. For this kind of histogram, its height varies for different column intervals. We use the equi-width histogram when the number of distinct values is less than 254. - Equi-height histogram: For this histogram, the width of column interval varies. The heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distribution. We use the equi- height histogram when the number of distinct values is equal to or greater than 254. We first use [SPARK-18000] to compute equi-width histograms (for both numeric and string types) or endpoints of equi-height histograms (for numeric type only). Then, if we get endpoints of a equi-height histogram, we need to compute ndv's between those endpoints by [SPARK-17997] to form the equi-height histogram. This Jira incorporates three Jiras mentioned above to support needed aggregation functions. We need to resolve them before this one. was: We support two kinds of histograms: - Equi-width histogram: We have a fixed width for each column interval in the histogram. The height of a histogram represents the frequency for those column values in a specific interval. For this kind of histogram, its height varies for different column intervals. We use the equi-width histogram when the number of distinct values is less than 254. - Equi-height histogram: For this histogram, the width of column interval varies. The heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distribution. We use the equi- height histogram when the number of distinct values is equal to or greater than 254. We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms (for both numeric and string types) or endpoints of equi-height histograms (for numeric type only). Then, if we get endpoints of a equi-height histogram, we need to compute ndv's between those endpoints by [SPARK-17997] to form the equi-height histogram. This Jira incorporates three Jiras mentioned above to support needed aggregation functions. We need to resolve them before this one. > generate histogram information for column > - > > Key: SPARK-17074 > URL: https://issues.apache.org/jira/browse/SPARK-17074 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > We support two kinds of histograms: > - Equi-width histogram: We have a fixed width for each column interval in > the histogram. The height of a histogram represents the frequency for those > column values in a specific interval. For this kind of histogram, its height > varies for different column intervals. We use the equi-width histogram when > the number of distinct values is less than 254. > - Equi-height histogram: For this histogram, the width of column interval > varies. The heights of all column intervals are the same. The equi-height > histogram is effective in handling skewed data distribution. We use the equi- > height histogram when the number of distinct values is equal to or greater > than 254. 
> We first use [SPARK-18000] to compute equi-width histograms (for both numeric > and string types) or endpoints of equi-height histograms (for numeric type > only). Then, if we get endpoints of an equi-height histogram, we need to > compute ndv's between those endpoints by [SPARK-17997] to form the > equi-height histogram. > This Jira incorporates the Jiras mentioned above to support the needed > aggregation functions. We need to resolve them before this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
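As a small aid to the description above, the choice between the two histogram kinds reduces to comparing the column's ndv against the 254-bin threshold; the sketch below uses illustrative names, not Spark APIs.

{code}
// Sketch of the equi-width vs. equi-height decision described above.
sealed trait HistogramKind
case object EquiWidth extends HistogramKind   // one bin per distinct value, heights = frequencies
case object EquiHeight extends HistogramKind  // equal heights, boundaries from approximate percentiles

def chooseHistogramKind(ndv: Long, maxBins: Int = 254): HistogramKind =
  if (ndv < maxBins) EquiWidth else EquiHeight
{code}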
[jira] [Assigned] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18000: Assignee: Apache Spark > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > For a column, we will generate a equi-width or equi-height histogram, > depending on if its ndv is large than the maximum number of bins allowed in > one histogram (denoted as numBins). > The agg function for a column returns bins - (distinct value, frequency) > pairs of equi-width histogram when the number of distinct values is less than > or equal to numBins. Otherwise, 1) for column of string type, it returns an > empty map; 2) for column of numeric type (including DateType and > TimestampType), it returns endpoints of equi-height histogram - approximate > percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., > (numBins-1)/numBins, 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607297#comment-15607297 ] Apache Spark commented on SPARK-18000: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/15637 > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > For a column, we will generate a equi-width or equi-height histogram, > depending on if its ndv is large than the maximum number of bins allowed in > one histogram (denoted as numBins). > The agg function for a column returns bins - (distinct value, frequency) > pairs of equi-width histogram when the number of distinct values is less than > or equal to numBins. Otherwise, 1) for column of string type, it returns an > empty map; 2) for column of numeric type (including DateType and > TimestampType), it returns endpoints of equi-height histogram - approximate > percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., > (numBins-1)/numBins, 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18000: Assignee: (was: Apache Spark) > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > For a column, we will generate a equi-width or equi-height histogram, > depending on if its ndv is large than the maximum number of bins allowed in > one histogram (denoted as numBins). > The agg function for a column returns bins - (distinct value, frequency) > pairs of equi-width histogram when the number of distinct values is less than > or equal to numBins. Otherwise, 1) for column of string type, it returns an > empty map; 2) for column of numeric type (including DateType and > TimestampType), it returns endpoints of equi-height histogram - approximate > percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., > (numBins-1)/numBins, 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18109) Log instrumentation in GMM
[ https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607289#comment-15607289 ] Apache Spark commented on SPARK-18109: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/15636 > Log instrumentation in GMM > -- > > Key: SPARK-18109 > URL: https://issues.apache.org/jira/browse/SPARK-18109 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: zhengruifeng > > Add log instrumentation in GMM -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18109) Log instrumentation in GMM
[ https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18109: Assignee: (was: Apache Spark) > Log instrumentation in GMM > -- > > Key: SPARK-18109 > URL: https://issues.apache.org/jira/browse/SPARK-18109 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: zhengruifeng > > Add log instrumentation in GMM -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18109) Log instrumentation in GMM
[ https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18109: Assignee: Apache Spark > Log instrumentation in GMM > -- > > Key: SPARK-18109 > URL: https://issues.apache.org/jira/browse/SPARK-18109 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: zhengruifeng >Assignee: Apache Spark > > Add log instrumentation in GMM -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18109) Log instrumentation in GMM
zhengruifeng created SPARK-18109: Summary: Log instrumentation in GMM Key: SPARK-18109 URL: https://issues.apache.org/jira/browse/SPARK-18109 Project: Spark Issue Type: Sub-task Components: ML Reporter: zhengruifeng Add log instrumentation in GMM -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-18000: - Description: For a column, we will generate an equi-width or equi-height histogram, depending on whether its ndv is larger than the maximum number of bins allowed in one histogram (denoted as numBins). The agg function for a column returns bins - (distinct value, frequency) pairs of an equi-width histogram when the number of distinct values is less than or equal to numBins. Otherwise, 1) for a column of string type, it returns an empty map; 2) for a column of numeric type (including DateType and TimestampType), it returns endpoints of an equi-height histogram - approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., (numBins-1)/numBins, 1.0. was: For a column of numeric type (including date and timestamp), we will generate an equi-width or equi-height histogram, depending on whether its ndv is larger than the maximum number of bins allowed in one histogram (denoted as numBins). This agg function computes values and their frequencies using a small hashmap, whose size is less than or equal to "numBins", and returns an equi-width histogram. When the size of the hashmap exceeds "numBins", it cleans the hashmap and utilizes ApproximatePercentile to return endpoints of an equi-height histogram. > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > For a column, we will generate an equi-width or equi-height histogram, > depending on whether its ndv is larger than the maximum number of bins allowed in > one histogram (denoted as numBins). > The agg function for a column returns bins - (distinct value, frequency) > pairs of an equi-width histogram when the number of distinct values is less than > or equal to numBins. Otherwise, 1) for a column of string type, it returns an > empty map; 2) for a column of numeric type (including DateType and > TimestampType), it returns endpoints of an equi-height histogram - approximate > percentiles at percentages 0.0, 1/numBins, 2/numBins, ..., > (numBins-1)/numBins, 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
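The percentages listed in the description above are simply numBins + 1 evenly spaced probabilities; a small sketch (with an illustrative helper name) makes that concrete.

{code}
// Sketch: the percentage points 0.0, 1/numBins, 2/numBins, ..., (numBins-1)/numBins, 1.0
// at which approximate percentiles become the equi-height bin endpoints.
def endpointPercentages(numBins: Int): Array[Double] = {
  require(numBins > 0, s"numBins must be positive, got $numBins")
  (0 to numBins).map(_.toDouble / numBins).toArray
}

// endpointPercentages(4) == Array(0.0, 0.25, 0.5, 0.75, 1.0)
{code}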
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/26/16 3:26 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to kafka cluster(version 0.10.0.0) in structured streaming module. h4. Implement Four classes are implemented to output data to kafka cluster in structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides function *shortName* and *createSink*. In function *createSink*, *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. *KafkaSinkRDD* will be created in function *addBatch*. * *KafkaSinkRDD* *KafkaSinkRDD* is designed to distributedly send results to kafka clusters. It extends *RDD*. In function *compute*, *CachedKafkaProducer* will be called to get or create producer to send data * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to configure kafka producer configurations which are all starting with "*kafka.*". For example, producer configuration *bootstrap.servers* can be configured by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is these configurations don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option(“topic”, topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to kafka cluster(version 0.10.0.0) in structured streaming module. h4. Implement Four classes are implemented to output data to kafka cluster in structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides function *shortName* and *createSink*. In function *createSink*, *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. *KafkaSinkRDD* will be created in function *addBatch*. * *KafkaSinkRDD* *KafkaSinkRDD* is designed to distributedly send results to kafka clusters. It extends *RDD*. In function *compute*, *CachedKafkaProducer* will be called to get or create producer to send data * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to configure kafka producer configurations which are all starting with "*kafka.*". For example, producer configuration *bootstrap.servers* can be configured by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is these configurations don't start with "kafka.". h4. 
Usage val query = input.writeStream .format("kafkaSink") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option(“topic”, topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
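The usage snippet in the design comment above is flattened by the mail formatting; reformatted as a block it reads as follows. Note that the "kafka-sink-10" format name and its options are the proposal's own (not an existing Spark API), and `input` is assumed to be a streaming Dataset defined elsewhere.

{code}
// Proposed usage from the KafkaSink design comment, reformatted for readability.
val kafkaServers = "broker1:9092,broker2:9092"  // placeholder values
val topic = "results"

val query = input.writeStream        // `input` is a streaming Dataset/DataFrame (assumed)
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaServers)
  .option("topic", topic)
  .start()
{code}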
[jira] [Updated] (SPARK-18000) Aggregation function for computing endpoints for histograms
[ https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-18000: - Summary: Aggregation function for computing endpoints for histograms (was: Aggregation function for computing endpoints for numeric histograms) > Aggregation function for computing endpoints for histograms > --- > > Key: SPARK-18000 > URL: https://issues.apache.org/jira/browse/SPARK-18000 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > For a column of numeric type (including date and timestamp), we will generate > a equi-width or equi-height histogram, depending on if its ndv is large than > the maximum number of bins allowed in one histogram (denoted as numBins). > This agg function computes values and their frequencies using a small > hashmap, whose size is less than or equal to "numBins", and returns an > equi-width histogram. > When the size of hashmap exceeds "numBins", it cleans the hashmap and > utilizes ApproximatePercentile to return endpoints of equi-height histogram. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18100) Improve the performance of get_json_object using Gson
[ https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607204#comment-15607204 ] Liang-Chi Hsieh commented on SPARK-18100: - Looks like Gson has no native support for json path? > Improve the performance of get_json_object using Gson > - > > Key: SPARK-18100 > URL: https://issues.apache.org/jira/browse/SPARK-18100 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Based on some benchmark here: > http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, > which said Gson could be much faster than Jackson, maybe it could be used to > improve the performance of get_json_object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
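For readers unfamiliar with the function, get_json_object evaluates a JSON path expression against a JSON string per row, which is why any replacement parser needs JSON-path support; the example below is illustrative, not from the ticket.

{code}
// Minimal example of what get_json_object does: evaluate a JSON path per row.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().appName("get-json-object-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("""{"user": {"name": "alice", "age": 30}}""").toDF("json")
df.select(get_json_object($"json", "$.user.name").alias("name")).show()
// Returns "alice"; a faster JSON backend would still need this path-based access pattern.
{code}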
[jira] [Updated] (SPARK-18108) Partition discovery fails with explicitly written long partitions
[ https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Moorhead updated SPARK-18108: - Attachment: stacktrace.out > Partition discovery fails with explicitly written long partitions > - > > Key: SPARK-18108 > URL: https://issues.apache.org/jira/browse/SPARK-18108 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Richard Moorhead >Priority: Minor > Attachments: stacktrace.out > > > We have parquet data written from Spark 1.6 that, when read from 2.0.1, > produces errors. > {code} > case class A(a: Long, b: Int) > val as = Seq(A(1,2)) > // partition explicitly written > spark.createDataFrame(as).write.parquet("/data/a=1/") > spark.read.parquet("/data/").collect > {code} > The above code fails; stack trace attached. > If an integer is used, explicit partition discovery succeeds. > {code} > case class A(a: Int, b: Int) > val as = Seq(A(1,2)) > // partition explicitly written > spark.createDataFrame(as).write.parquet("/data/a=1/") > spark.read.parquet("/data/").collect > {code} > The action succeeds. Additionally, if 'partitionBy' is used instead of > explicit writes, partition discovery succeeds. > Question: Is the first example a reasonable use case? > [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319] > seems to default to Integer types unless the partition value exceeds the > integer type's range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18108) Partition discovery fails with explicitly written long partitions
Richard Moorhead created SPARK-18108: Summary: Partition discovery fails with explicitly written long partitions Key: SPARK-18108 URL: https://issues.apache.org/jira/browse/SPARK-18108 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.0.1 Reporter: Richard Moorhead Priority: Minor Attachments: stacktrace.out We have parquet data written from Spark 1.6 that, when read from 2.0.1, produces errors. {code} case class A(a: Long, b: Int) val as = Seq(A(1,2)) // partition explicitly written spark.createDataFrame(as).write.parquet("/data/a=1/") spark.read.parquet("/data/").collect {code} The above code fails; stack trace attached. If an integer is used, explicit partition discovery succeeds. {code} case class A(a: Int, b: Int) val as = Seq(A(1,2)) // partition explicitly written spark.createDataFrame(as).write.parquet("/data/a=1/") spark.read.parquet("/data/").collect {code} The action succeeds. Additionally, if 'partitionBy' is used instead of explicit writes, partition discovery succeeds. Question: Is the first example a reasonable use case? [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319] seems to default to Integer types unless the partition value exceeds the integer type's range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
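To make the inference behaviour referenced at the end of the report concrete, here is a loose sketch of the order in which a partition value string is typed during discovery; it is a simplification of the linked PartitioningUtils code, with illustrative names.

{code}
// Loose sketch: integral types are tried smallest-first, so "1" infers as IntegerType
// even though the writer declared the column as Long - hence the read-time mismatch.
import scala.util.Try

def inferPartitionValueType(raw: String): String =
  if (Try(raw.toInt).isSuccess) "IntegerType"
  else if (Try(raw.toLong).isSuccess) "LongType"
  else if (Try(raw.toDouble).isSuccess) "DoubleType"
  else "StringType"

// inferPartitionValueType("1")          == "IntegerType"
// inferPartitionValueType("3000000000") == "LongType"
{code}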
[jira] [Created] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client
J.P Feng created SPARK-18107: Summary: Insert overwrite statement runs much slower in spark-sql than it does in hive-client Key: SPARK-18107 URL: https://issues.apache.org/jira/browse/SPARK-18107 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Environment: spark 2.0.0 hive 2.0.1 Reporter: J.P Feng An insert overwrite statement run in spark-sql or spark-shell takes much more time than it does in the hive client (started from apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while the hive client takes less than 20 seconds. These are the steps I took. The test SQL is: insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' as pid, 'mix' as dev from tbllog_login where pt='mix_en' and dt='2016-10-21' ; There are 257128 rows of data in tbllog_login with partition(pt='mix_en',dt='2016-10-21'). Note: I'm sure it is the "insert overwrite" that costs most of the time in Spark; perhaps the overwrite spends a lot of time in IO or something else. I also compared the execution time of the insert overwrite statement and the insert into statement. 1. insert overwrite statement versus insert into statement in Spark: the insert overwrite statement takes about 10 minutes, the insert into statement about 30 seconds. 2. insert into statement in Spark versus insert into statement in the hive client: Spark takes about 30 seconds, the hive client about 20 seconds; the difference is small enough to ignore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607052#comment-15607052 ] Jerryjung edited comment on SPARK-18009 at 10/26/16 1:44 AM: - Yes! In my case, it's necessary option for integration with BI tools. was (Author: jerryjung): Yes! But In my case, it's necessary option for integration with BI tools. > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(
[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607052#comment-15607052 ] Jerryjung commented on SPARK-18009: --- Yes! But In my case, it's necessary option for integration with BI tools. > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244) > at > org.apache.hiv
[jira] [Assigned] (SPARK-18103) Rename *FileCatalog to *FileProvider
[ https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18103: Assignee: (was: Apache Spark) > Rename *FileCatalog to *FileProvider > > > Key: SPARK-18103 > URL: https://issues.apache.org/jira/browse/SPARK-18103 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Priority: Minor > > In the SQL component there are too many different components called some > variant of *Catalog, which is quite confusing. We should rename the > subclasses of FileCatalog to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18103) Rename *FileCatalog to *FileProvider
[ https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18103: Assignee: Apache Spark > Rename *FileCatalog to *FileProvider > > > Key: SPARK-18103 > URL: https://issues.apache.org/jira/browse/SPARK-18103 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > In the SQL component there are too many different components called some > variant of *Catalog, which is quite confusing. We should rename the > subclasses of FileCatalog to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18103) Rename *FileCatalog to *FileProvider
[ https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607020#comment-15607020 ] Apache Spark commented on SPARK-18103: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15634 > Rename *FileCatalog to *FileProvider > > > Key: SPARK-18103 > URL: https://issues.apache.org/jira/browse/SPARK-18103 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Priority: Minor > > In the SQL component there are too many different components called some > variant of *Catalog, which is quite confusing. We should rename the > subclasses of FileCatalog to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow
[ https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] J.P Feng closed SPARK-18077. Resolution: Won't Fix i would try to open another one, for there are some mistakes in this issue. > Run insert overwrite statements in spark to overwrite a partitioned table is > very slow > --- > > Key: SPARK-18077 > URL: https://issues.apache.org/jira/browse/SPARK-18077 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark 2.0 > hive 2.0.1 > driver memory: 4g > total executors: 4 > executor memory: 10g > total cores: 13 >Reporter: J.P Feng > Labels: hive, insert, sparkSQL > Original Estimate: 120h > Remaining Estimate: 120h > > Hello,all. I face a strange thing in my project. > there is a table: > CREATE TABLE `login4game`(`account_name` string, `role_id` string, > `server_id` string, `recdate` string) > PARTITIONED BY (`pt` string, `dt` string) stored as orc; > another table: > CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` > string, `happened_time` int) > PARTITIONED BY (`pt` string, `dt` string) > -- > Test-1: > executed sql in spark-shell or spark-sql( before i run this sql, there is > much data in partition(pt='mix_en', dt='2016-10-21') of table login4game ): > insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') > select distinct account_name,role_id,server,'1476979200' as recdate from > tbllog_login where pt='mix_en' and dt='2016-10-21' > it will cost a lot of time, below is a part of the logs: > / > [Stage 5:===> (144 + 8) / > 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] > 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, > real=0.08 secs] > [Stage 5:=> (152 + 8) / > 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] > 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, > real=0.08 secs] > [Stage 5:> (160 + 8) / > 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] > 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, > real=0.07 secs] > [Stage 5:> (176 + 8) / > 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] > 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, > real=0.06 secs] > [Stage 5:> (177 + 8) / > 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] > 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, > real=0.06 secs] > [Stage 5:> (192 + 8) / > 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] > 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, > real=0.06 secs] > Moved: > 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0' > to trash at: hdfs://master.com/user/hadoop/.Trash/Current > Moved: > 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1' > to trash at: hdfs://master.com/user/hadoop/.Trash/Current > Moved: > 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2' > to trash at: hdfs://master.com/user/hadoop/.Trash/Current > Moved: > 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-3' > to trash at: hdfs://master.com/user/hadoop/.Trash/Current > Moved: > 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-4' > to trash at: hdfs://master.com/user/hadoop/.Trash/Current > ... 
> Moved: > 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' > to trash at: hdfs://master.com/user/hadoop/.Trash/Current > / > I can see the original data is moved to .Trash, > and then there is no log printing, and after about 10 min, the logs print > again: > / > 16/10/24 17:24:15 INFO Hive: Replacing > src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713
[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18087: Assignee: Apache Spark > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18087: Assignee: (was: Apache Spark) > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606990#comment-15606990 ] Apache Spark commented on SPARK-18087: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15633 > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18087: Assignee: (was: Apache Spark) > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18087: Assignee: Apache Spark > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-18106: Description: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not "noscan" produces an AnalyzeTableCommand with noscan=false was: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not noscan produces an AnalyzeTableCommand with noscan=false > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-18106: Description: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not noscan produces an AnalyzeTableCommand with noscan=false was: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not {noformat}noscan{noformat} produces an AnalyzeTableCommand with {code}noscan=false{code} > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not noscan produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
Srinath created SPARK-18106: --- Summary: Analyze Table accepts a garbage identifier at the end Key: SPARK-18106 URL: https://issues.apache.org/jira/browse/SPARK-18106 Project: Spark Issue Type: Bug Components: SQL Reporter: Srinath Priority: Minor {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not {noformat}noscan{noformat} produces an AnalyzeTableCommand with {code}noscan=false{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
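The check implied by the report can be sketched as a small helper. This is a hypothetical illustration (the function name and the exception are mine, not the actual parser code): the absence of a trailing identifier should mean noscan=false, the literal NOSCAN should mean noscan=true, and anything else should be rejected rather than silently treated as noscan=false.
{code}
// Hypothetical sketch only; not the real AstBuilder/AnalyzeTableCommand code.
def parseNoscan(trailingIdentifier: Option[String]): Boolean =
  trailingIdentifier match {
    case None => false                                      // ANALYZE TABLE t COMPUTE STATISTICS
    case Some(id) if id.equalsIgnoreCase("noscan") => true  // ... COMPUTE STATISTICS NOSCAN
    case Some(other) =>
      throw new IllegalArgumentException(s"Expected `NOSCAN` instead of `$other`")
  }
{code}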
[jira] [Commented] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606896#comment-15606896 ] Tathagata Das commented on SPARK-17829: --- Based on [~tcondie]'s PR above, I think it's better we also change the main common log class HDFSMetadataLog to use JSON serialization rather than Java serialization. But this also means that we have to modify FileStreamSourceLog (a subclass of HDFSMetadataLog[FileEntry]) to also use JSON serialization, which is good to fix as well, as the file stream source log should also have a stable on-disk format and not depend on Java serialization. > Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Tyson Condie > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
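For illustration, a minimal sketch of what a JSON round trip for a log entry could look like, using json4s (which Spark already depends on); the FileEntry fields here are placeholders, not the real FileStreamSourceLog schema:
{code}
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

// Illustrative stand-in for a log entry; field names are assumptions.
case class FileEntry(path: String, timestamp: Long, batchId: Long)

implicit val formats = DefaultFormats

val entry = FileEntry("hdfs://nn/streaming/source-0/part-00000", 1477411200000L, 42L)
val json  = Serialization.write(entry)          // human-readable, stable across releases
val back  = Serialization.read[FileEntry](json) // round-trips without Java serialization
{code}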
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606849#comment-15606849 ] Apache Spark commented on SPARK-18105: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/15632 > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18105: Assignee: Apache Spark (was: Davies Liu) > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Apache Spark > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18105: Assignee: Davies Liu (was: Apache Spark) > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
Davies Liu created SPARK-18105: -- Summary: LZ4 failed to decompress a stream of shuffled data Key: SPARK-18105 URL: https://issues.apache.org/jira/browse/SPARK-18105 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu When lz4 is used to compress the shuffle files, it may fail to decompress it as "stream is corrupt" https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
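Until the underlying lz4-java bug is fixed, one possible mitigation (my suggestion, not something stated in this ticket) is to switch the I/O compression codec used for shuffle files to one of the other built-in codecs:
{code}
import org.apache.spark.SparkConf

// Sketch: pick a non-lz4 codec for shuffle/spill/broadcast compression.
// "snappy" and "lzf" are the other codecs shipped with Spark; lz4 is the default.
val conf = new SparkConf()
  .setAppName("lz4-workaround-example")
  .set("spark.io.compression.codec", "snappy")
{code}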
[jira] [Assigned] (SPARK-18104) Don't build KafkaSource doc
[ https://issues.apache.org/jira/browse/SPARK-18104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18104: Assignee: Shixiong Zhu (was: Apache Spark) > Don't build KafkaSource doc > --- > > Key: SPARK-18104 > URL: https://issues.apache.org/jira/browse/SPARK-18104 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Don't need to build doc for KafkaSource because the user should use the data > source APIs to use KafkaSource. All KafkaSource APIs are internal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18104) Don't build KafkaSource doc
[ https://issues.apache.org/jira/browse/SPARK-18104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606795#comment-15606795 ] Apache Spark commented on SPARK-18104: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15630 > Don't build KafkaSource doc > --- > > Key: SPARK-18104 > URL: https://issues.apache.org/jira/browse/SPARK-18104 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Don't need to build doc for KafkaSource because the user should use the data > source APIs to use KafkaSource. All KafkaSource APIs are internal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18104) Don't build KafkaSource doc
[ https://issues.apache.org/jira/browse/SPARK-18104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18104: Assignee: Apache Spark (was: Shixiong Zhu) > Don't build KafkaSource doc > --- > > Key: SPARK-18104 > URL: https://issues.apache.org/jira/browse/SPARK-18104 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Don't need to build doc for KafkaSource because the user should use the data > source APIs to use KafkaSource. All KafkaSource APIs are internal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18104) Don't build KafkaSource doc
Shixiong Zhu created SPARK-18104: Summary: Don't build KafkaSource doc Key: SPARK-18104 URL: https://issues.apache.org/jira/browse/SPARK-18104 Project: Spark Issue Type: Documentation Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Don't need to build doc for KafkaSource because the user should use the data source APIs to use KafkaSource. All KafkaSource APIs are internal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
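For reference, the public path the description points to is the data source API; a minimal structured streaming read looks roughly like this (assumes the spark-sql-kafka-0-10 artifact is on the classpath; broker addresses and topic name are placeholders):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-source-example").getOrCreate()

// Read Kafka through the "kafka" data source instead of the internal KafkaSource class.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "events")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
{code}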
[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606693#comment-15606693 ] Xiao Li commented on SPARK-18009: - [~dkbiswal] Please fix it tonight. Thanks! > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244) > at > org.apache.hive.service.cli.HiveSQLException.toSt
[jira] [Updated] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18009: Labels: thrift (was: sql thrift) > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210) > ... 15 more > Error
[jira] [Updated] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18009: Target Version/s: 2.0.1, 2.1.0 > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210) > ... 15 more > Error: E
[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-16988: --- Component/s: (was: Spark Core) Web UI > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Assignee: chie hayashida >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print https url instead > http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-16988: --- Component/s: (was: Spark Shell) > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Assignee: chie hayashida >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print https url instead > http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-16988: --- Component/s: Spark Core > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Assignee: chie hayashida >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print https url instead > http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-16988. Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Assignee: chie hayashida >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print https url instead > http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-16988: --- Assignee: chie hayashida > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Assignee: chie hayashida >Priority: Minor > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print https url instead > http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15482) ClassCast exception when join two tables.
[ https://issues.apache.org/jira/browse/SPARK-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606601#comment-15606601 ] roberto sancho rojas commented on SPARK-15482: -- I have the same problem, whe i execute this code from spark 1.6 and HDP 2.4.0.0-169 and PHOENIX 2.4.0 df = sqlContext.read \ .format("org.apache.phoenix.spark") \ .option("table", "TABLA") \ .option("zkUrl", "XXX:/hbase-unsecure") \ .load() df.show() Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492) here my claspathh: /usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-client.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-common.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-core-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar > ClassCast exception when join two tables. > - > > Key: SPARK-15482 > URL: https://issues.apache.org/jira/browse/SPARK-15482 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Phoenix: 1.2 > Spark: 1.5.0-cdh5.5.1 >Reporter: jingtao > > I have two tables A and B in Phoenix. > I load table 'A' as dataFrame 'ADF' using spark , and register dataFrame > ''ADF'' as temp table 'ATEMPTABLE'. > B is the same as A. > A --> ADF ---> ATEMPTABLE > B --> BDF ---> BTEMPTABLE > Then, i joins the two temp table 'ATEMPTABLE' and 'BTEMPTABLE' using spark > sql. > Such as 'select count(*) from ATEMPTABLE join BTEMPTABLE on ...' > It errors with the following message: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 6, hadoop05): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to > org.apache.spark.sql.Row > at > org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) > at org.apache.spark.util.EventLoop$$anon$1.run(E
[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606523#comment-15606523 ] Alex Bozarth commented on SPARK-18085: -- I am *very* interested in working with you on this project and (post-Spark Summit) would love to discuss some of the UI ideas my team has been tossing around (a few covered in your non-goals). > Scalability enhancements for the History Server > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
[ https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Patterson updated SPARK-18014: -- Environment: Pyspark 2.0.0, Ipython 4.2 (was: Pyspark 2.0.1, Ipython 4.2) > Filters are incorrectly being grouped together when there is processing in > between > -- > > Key: SPARK-18014 > URL: https://issues.apache.org/jira/browse/SPARK-18014 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1 > Environment: Pyspark 2.0.0, Ipython 4.2 >Reporter: Michael Patterson >Priority: Minor > > I created a dataframe that needed to filter the data on columnA, create a new > columnB by applying a user defined function to columnA, and then filter on > columnB. However, the two filters were being grouped together in the > execution plan after the withColumn statement, which was causing errors due > to unexpected input to the withColumn statement. > Example code to reproduce: > {code} > import pyspark.sql.functions as F > import pyspark.sql.types as T > from functools import partial > data = [{'input':0}, {'input':1}, {'input':2}] > input_df = sc.parallelize(data).toDF() > my_dict = {1:'first', 2:'second'} > def apply_dict( input_dict, value): > return input_dict[value] > test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() ) > test_df = input_df.filter('input > 0').withColumn('output', > test_udf('input')).filter(F.col('output').rlike('^s')) > test_df.explain(True) > {code} > Execution plan: > {code} > == Analyzed Logical Plan == > input: bigint, output: string > Filter output#4 RLIKE ^s > +- Project [input#0L, partial(input#0L) AS output#4] >+- Filter (input#0L > cast(0 as bigint)) > +- LogicalRDD [input#0L] > == Optimized Logical Plan == > Project [input#0L, partial(input#0L) AS output#4] > +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE > ^s) >+- LogicalRDD [input#0L] > {code} > Executing test_def.show() after the above code in pyspark 2.0.1 yields: > KeyError: 0 > Executing test_def.show() in pyspark 1.6.2 yields: > {code} > +-+--+ > |input|output| > +-+--+ > |2|second| > +-+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
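Until the optimizer stops pushing the second filter past the UDF, one defensive pattern (my suggestion, not part of the report; written as a Scala analogue of the PySpark repro above) is to make the UDF total over its input, so evaluating it on rows the first filter was meant to exclude no longer throws:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("udf-filter-example").getOrCreate()
import spark.implicits._

val inputDf = Seq(0L, 1L, 2L).toDF("input")
val myDict  = Map(1L -> "first", 2L -> "second")

// Returning null for unknown keys instead of throwing keeps the UDF safe even
// if the planner evaluates it on rows the `input > 0` filter should have removed.
val applyDict = udf((value: Long) => myDict.getOrElse(value, null: String))

val testDf = inputDf
  .filter(col("input") > 0)
  .withColumn("output", applyDict(col("input")))
  .filter(col("output").rlike("^s"))

testDf.show() // expected: the single row (2, second)
{code}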
[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606484#comment-15606484 ] Nicholas Chammas commented on SPARK-18084: -- cc [~marmbrus] - Dunno if this is actually bug or just an unsupported or inappropriate use case. > write.partitionBy() does not recognize nested columns that select() can access > -- > > Key: SPARK-18084 > URL: https://issues.apache.org/jira/browse/SPARK-18084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a simple repro in the PySpark shell: > {code} > from pyspark.sql import Row > rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > df = spark.createDataFrame(rdd) > df.printSchema() > df.select('a.b').show() # works > df.write.partitionBy('a.b').text('/tmp/test') # doesn't work > {code} > Here's what I see when I run this: > {code} > >>> from pyspark.sql import Row > >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > >>> df = spark.createDataFrame(rdd) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: long (nullable = true) > >>> df.show() > +---+ > | a| > +---+ > |[5]| > +---+ > >>> df.select('a.b').show() > +---+ > | b| > +---+ > | 5| > +---+ > >>> df.write.partitionBy('a.b').text('/tmp/test') > Traceback (most recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. > : org.apache.spark.sql.AnalysisException: Partition column a.b not found in > schema > StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/Cellar/apache-spark/2.0
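A workaround that tends to be used for this limitation (my suggestion, not something stated in the ticket) is to promote the nested field to a top-level column first and partition on that. A Scala sketch with the same shape as the PySpark repro, where the output path and the flattened column name are placeholders:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().appName("partitionBy-nested-workaround").getOrCreate()
import spark.implicits._

// Same shape as the repro: a struct column `a` with a single field `b`.
val df = Seq(5L).toDF("b").select(struct(col("b")).as("a"))

df.select("a.b").show()                        // works, as in the report

// partitionBy("a.b") is rejected, so flatten the nested field first.
val flattened = df.withColumn("a_b", col("a.b"))
flattened.write.partitionBy("a_b").json("/tmp/spark-18084-workaround")
{code}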
[jira] [Created] (SPARK-18103) Rename *FileCatalog to *FileProvider
Eric Liang created SPARK-18103: -- Summary: Rename *FileCatalog to *FileProvider Key: SPARK-18103 URL: https://issues.apache.org/jira/browse/SPARK-18103 Project: Spark Issue Type: Improvement Reporter: Eric Liang Priority: Minor In the SQL component there are too many different components called some variant of *Catalog, which is quite confusing. We should rename the subclasses of FileCatalog to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18102) Failed to deserialize the result of task
Davies Liu created SPARK-18102: -- Summary: Failed to deserialize the result of task Key: SPARK-18102 URL: https://issues.apache.org/jira/browse/SPARK-18102 Project: Spark Issue Type: Bug Reporter: Davies Liu {code} 16/10/25 15:17:04 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. java.lang.ClassNotFoundException: org.apache.spark.util*SerializableBuffer not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@3d98d138 at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108) at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308) at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257) at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:578) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel
[jira] [Updated] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields
[ https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18101: --- Issue Type: Sub-task (was: Test) Parent: SPARK-17861 > ExternalCatalogSuite should test with mixed case fields > --- > > Key: SPARK-18101 > URL: https://issues.apache.org/jira/browse/SPARK-18101 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > Currently, it uses field names such as "a" and "b" which are not useful for > testing case preservation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields
Eric Liang created SPARK-18101: -- Summary: ExternalCatalogSuite should test with mixed case fields Key: SPARK-18101 URL: https://issues.apache.org/jira/browse/SPARK-18101 Project: Spark Issue Type: Test Components: SQL Reporter: Eric Liang Currently, it uses field names such as "a" and "b" which are not useful for testing case preservation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17471: Assignee: Apache Spark > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
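To make the SPARK-17471 proposal above concrete, here is a rough sketch of choosing the cheaper of the dense and CSC sparse layouts by estimated storage size. This is an assumption about the intended behavior, not the eventual Spark implementation; the byte estimates and the column-major-only handling are simplifications.
{code}
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix, SparseMatrix}

object CompressedMatrixSketch {

  // Pick the representation with the smaller estimated footprint, roughly mirroring what
  // Vectors.compressed does: 8 bytes per stored double, 4 bytes per int index/pointer.
  def compressed(m: Matrix): Matrix = {
    val nnz = m.toArray.count(_ != 0.0)
    val denseBytes = 8L * m.numRows * m.numCols
    val sparseBytes = 12L * nnz + 4L * (m.numCols + 1) // values + row indices + column pointers

    m match {
      case dm: DenseMatrix if sparseBytes < denseBytes => dm.toSparse
      case sm: SparseMatrix if denseBytes <= sparseBytes => sm.toDense
      case other => other // already the cheaper form
    }
  }
}
{code}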
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606347#comment-15606347 ] Apache Spark commented on SPARK-17471: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/15628 > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17471: Assignee: (was: Apache Spark) > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18019) Log instrumentation in GBTs
[ https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18019. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15574 [https://github.com/apache/spark/pull/15574] > Log instrumentation in GBTs > --- > > Key: SPARK-18019 > URL: https://issues.apache.org/jira/browse/SPARK-18019 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > Fix For: 2.1.0 > > > Sub-task for adding instrumentation to GBTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18100) Improve the performance of get_json_object using Gson
[ https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18100: --- Issue Type: Improvement (was: Bug) > Improve the performance of get_json_object using Gson > - > > Key: SPARK-18100 > URL: https://issues.apache.org/jira/browse/SPARK-18100 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Based on the benchmark here: > http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, > which suggests Gson could be much faster than Jackson, Gson might be used to > improve the performance of get_json_object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18100) Improve the performance of get_json_object using Gson
Davies Liu created SPARK-18100: -- Summary: Improve the performance of get_json_object using Gson Key: SPARK-18100 URL: https://issues.apache.org/jira/browse/SPARK-18100 Project: Spark Issue Type: Bug Reporter: Davies Liu Based on the benchmark here: http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, which suggests Gson could be much faster than Jackson, Gson might be used to improve the performance of get_json_object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
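A hedged illustration of the Gson-based extraction being floated in SPARK-18100. Gson would have to be added as a dependency; this is not how get_json_object is implemented today, and the helper below is an assumption for illustration only.
{code}
import com.google.gson.JsonParser

object GsonExtractSketch {
  // Extract a single top-level string field from a JSON document using Gson.
  // A real replacement for get_json_object would also need JsonPath-style traversal.
  def extractField(json: String, field: String): Option[String] = {
    val root = new JsonParser().parse(json).getAsJsonObject
    Option(root.get(field)).map(_.getAsString)
  }
}
{code}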
[jira] [Issue Comment Deleted] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18088: -- Comment: was deleted (was: Calling this a bug since FPR is not implemented correctly.) > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18070) binary operator should not consider nullability when comparing input types
[ https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18070. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 Issue resolved by pull request 15606 [https://github.com/apache/spark/pull/15606] > binary operator should not consider nullability when comparing input types > -- > > Key: SPARK-18070 > URL: https://issues.apache.org/jira/browse/SPARK-18070 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.2, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606178#comment-15606178 ] Joseph K. Bradley commented on SPARK-17692: --- [SPARK-17870] changes the output of ChiSqSelector. It is a bug fix, so it is an acceptable change of behavior. > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you found one, please comment below and link the corresponding JIRA here. > * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5. > * SPARK-17870: ChiSquareSelector use pValue rather than raw statistic for > SelectKBest features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606146#comment-15606146 ] Joseph K. Bradley commented on SPARK-18088: --- How do you feel about renaming the selectorType values to match the parameters? I'd like to call them "numTopFeatures", "percentile" and "fpr". > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18088: -- Priority: Minor (was: Major) > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18088: -- Issue Type: Improvement (was: Bug) > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606142#comment-15606142 ] Joseph K. Bradley commented on SPARK-18088: --- Ahh, you're right, sorry, I see that now that I'm looking at master. I'll link the follow-up JIRA to the original JIRA. And I agree my assertion about p-value wasn't correct. Will fix. Thanks! > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups > One major item: FPR is not implemented correctly. Testing against only the > p-value and not the test statistic does not really tell you anything. We > should follow sklearn, which allows a p-value threshold for any selection > method: > [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html] > * In this PR, I'm just going to remove FPR completely. We can add it back in > a follow-up PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
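For reference, a minimal sketch of the SelectFpr-style semantics discussed in the SPARK-18088 thread above: keep every feature whose chi-squared p-value falls below the chosen alpha threshold. The case class and names are illustrative only, not ChiSqSelector internals.
{code}
// Illustrative only: selecting feature indices by a p-value threshold (sklearn SelectFpr style).
case class FeatureTestResult(featureIndex: Int, statistic: Double, pValue: Double)

object FprSelectionSketch {
  // Keep every feature whose chi-squared test p-value is below alpha.
  def selectByFpr(results: Seq[FeatureTestResult], alpha: Double): Seq[Int] =
    results.filter(_.pValue < alpha).map(_.featureIndex)
}
{code}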
[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18088: -- Description: There are several cleanups I'd like to make as a follow-up to the PRs from [SPARK-17017]: * Rename selectorType values to match corresponding Params * Add Since tags where missing * a few minor cleanups was: There are several cleanups I'd like to make as a follow-up to the PRs from [SPARK-17017]: * Rename selectorType values to match corresponding Params * Add Since tags where missing * a few minor cleanups One major item: FPR is not implemented correctly. Testing against only the p-value and not the test statistic does not really tell you anything. We should follow sklearn, which allows a p-value threshold for any selection method: [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html] * In this PR, I'm just going to remove FPR completely. We can add it back in a follow-up PR. > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives
[ https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18099: Assignee: Apache Spark > Spark distributed cache should throw exception if same file is specified to > dropped in --files --archives > - > > Key: SPARK-18099 > URL: https://issues.apache.org/jira/browse/SPARK-18099 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kishor Patil >Assignee: Apache Spark > > Following the changes for [SPARK-14423] (handle jar conflict issue when > uploading to distributed cache), yarn#client by default uploads all the --files and > --archives in assembly to the HDFS staging folder. It should throw an exception if the > same file appears in both --files and --archives, since otherwise it is unclear whether > to uncompress the file or leave it compressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives
[ https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18099: Assignee: (was: Apache Spark) > Spark distributed cache should throw exception if same file is specified to > dropped in --files --archives > - > > Key: SPARK-18099 > URL: https://issues.apache.org/jira/browse/SPARK-18099 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kishor Patil > > Following the changes for [SPARK-14423] (handle jar conflict issue when > uploading to distributed cache), yarn#client by default uploads all the --files and > --archives in assembly to the HDFS staging folder. It should throw an exception if the > same file appears in both --files and --archives, since otherwise it is unclear whether > to uncompress the file or leave it compressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives
[ https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606068#comment-15606068 ] Apache Spark commented on SPARK-18099: -- User 'kishorvpatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15627 > Spark distributed cache should throw exception if same file is specified to > dropped in --files --archives > - > > Key: SPARK-18099 > URL: https://issues.apache.org/jira/browse/SPARK-18099 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kishor Patil > > Following the changes for [SPARK-14423] (handle jar conflict issue when > uploading to distributed cache), yarn#client by default uploads all the --files and > --archives in assembly to the HDFS staging folder. It should throw an exception if the > same file appears in both --files and --archives, since otherwise it is unclear whether > to uncompress the file or leave it compressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives
Kishor Patil created SPARK-18099: Summary: Spark distributed cache should throw exception if same file is specified to dropped in --files --archives Key: SPARK-18099 URL: https://issues.apache.org/jira/browse/SPARK-18099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.1, 2.0.0 Reporter: Kishor Patil Following the changes for [SPARK-14423] (handle jar conflict issue when uploading to distributed cache), yarn#client by default uploads all the --files and --archives in assembly to the HDFS staging folder. It should throw an exception if the same file appears in both --files and --archives, since otherwise it is unclear whether to uncompress the file or leave it compressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
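A hypothetical sketch of the kind of validation SPARK-18099 asks for (not the actual yarn#Client code): fail fast when the same resource name shows up under both --files and --archives. The object and helper names are illustrative.
{code}
import java.net.URI

object DistCacheConflictCheckSketch {
  // Compare resources by their final path segment, which is how they would collide in the
  // staging directory.
  private def baseNames(paths: Seq[String]): Set[String] =
    paths.map(p => new URI(p).getPath.split('/').last).toSet

  // Throw early instead of silently staging an ambiguous resource.
  def requireNoOverlap(files: Seq[String], archives: Seq[String]): Unit = {
    val overlap = baseNames(files).intersect(baseNames(archives))
    require(overlap.isEmpty,
      s"Resources listed in both --files and --archives: ${overlap.mkString(", ")}")
  }
}
{code}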
[jira] [Updated] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql
[ https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Porter updated SPARK-16183: --- Affects Version/s: 2.0.0 > Large Spark SQL commands cause StackOverflowError in parser when using > sqlContext.sql > - > > Key: SPARK-16183 > URL: https://issues.apache.org/jira/browse/SPARK-16183 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Running on AWS EMR >Reporter: Matthew Porter > > Hi, > I have created a PySpark SQL-based tool which auto-generates a complex SQL > command to be run via sqlContext.sql(cmd) based on a large number of > parameters. As the number of input files to be filtered and joined in this > query grows, so does the length of the SQL query. The tool runs fine up until > about 200+ files are included in the join, at which point the SQL command > becomes very long (~100K characters). It is only on these longer queries that > Spark fails, throwing an exception due to what seems to be too much recursion > occurring within the SparkSQL parser: > {code} > Traceback (most recent call last): > ... > merged_df = sqlsc.sql(cmd) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line > 580, in sql > File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, > in deco > File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line > 308, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql. > : java.lang.StackOverflowError > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > sca
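A hedged workaround for reports like SPARK-16183 above, assuming the generated query is essentially a chain of joins: build the plan incrementally with the DataFrame API instead of handing one ~100K-character SQL string to the recursive-descent parser. The input paths, the shared join key "key", and the output path below are assumptions for illustration, shown in Scala rather than the reporter's PySpark.
{code}
import org.apache.spark.sql.{DataFrame, SparkSession}

object IncrementalJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-join-sketch").getOrCreate()

    // Hypothetical inputs standing in for the 200+ files mentioned in the report.
    val paths: Seq[String] = (1 to 200).map(i => s"/data/input_$i.parquet")

    // Joining DataFrame by DataFrame keeps each parse/analysis step small instead of
    // building one enormous SQL text for sqlContext.sql.
    val joined: DataFrame = paths
      .map(p => spark.read.parquet(p))
      .reduce((left, right) => left.join(right, Seq("key")))

    joined.write.parquet("/data/merged.parquet") // hypothetical output path
    spark.stop()
  }
}
{code}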
[jira] [Resolved] (SPARK-18010) Remove unneeded heavy work performed by FsHistoryProvider for building up the application listing UI page
[ https://issues.apache.org/jira/browse/SPARK-18010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18010. Resolution: Fixed Assignee: Vinayak Joshi Fix Version/s: 2.1.0 > Remove unneeded heavy work performed by FsHistoryProvider for building up the > application listing UI page > - > > Key: SPARK-18010 > URL: https://issues.apache.org/jira/browse/SPARK-18010 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 1.6.2, 2.0.1, 2.1.0 >Reporter: Vinayak Joshi >Assignee: Vinayak Joshi > Fix For: 2.1.0 > > > There are known complaints/cribs about History Server's Application List not > updating quickly enough when the event log files that need replay are huge. > Currently, the FsHistoryProvider design causes the entire event log file to > be replayed when building the initial application listing (refer the method > mergeApplicationListing(fileStatus: FileStatus) ). The process of replay > involves: > - each line in the event log being read as a string, > - parsing the string to a Json structure > - converting the Json to the corresponding Scala classes with nested > structures > Particularly the part involving parsing string to Json and then to Scala > classes is expensive. Tests show that majority of time spent in replay is in > doing this work. > When the replay is performed for building the application listing, the only > two events that the code really cares for are "SparkListenerApplicationStart" > and "SparkListenerApplicationEnd" - since the only listener attached to the > ReplayListenerBus at that point is the ApplicationEventListener. This means > that when processing an event log file with a huge number (hundreds of > thousands, can be more) of events, the work done to deserialize all of these > event, and then replay them is not needed. Only two events are what we're > interested in, and this can be used to ensure that when replay is performed > for the purpose of building the application list, we only make the effort to > replay these two events and not others. > My tests show that this drastically improves application list load time. For > a 150MB event log from a user, with over 100,000 events, the load time (local > on my mac) comes down from about 16 secs to under 1 second using this > approach. For customers that typically execute applications with large event > logs, and thus have multiple large event logs present, this can speed up how > soon the history server UI lists the apps considerably. > I will be updating a pull request with take at fixing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
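A minimal sketch of the idea described in SPARK-18010 above (not the actual FsHistoryProvider change): when only the application start and end events matter, a cheap substring check on each raw event-log line avoids JSON parsing and case-class conversion for everything else.
{code}
import scala.io.Source

object AppListingPrefilterSketch {
  private val interestingEvents = Seq(
    "SparkListenerApplicationStart",
    "SparkListenerApplicationEnd")

  // Return only the raw JSON lines for the two events needed to build the application list;
  // callers would still deserialize these few lines properly afterwards.
  def relevantLines(eventLogPath: String): Seq[String] = {
    val source = Source.fromFile(eventLogPath)
    try {
      source.getLines().filter(line => interestingEvents.exists(ev => line.contains(ev))).toList
    } finally {
      source.close()
    }
  }
}
{code}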
[jira] [Commented] (SPARK-18098) Broadcast creates 1 instance / core, not 1 instance / executor
[ https://issues.apache.org/jira/browse/SPARK-18098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605898#comment-15605898 ] Sean Owen commented on SPARK-18098: --- It shouldn't work that way. The value is loaded in a lazy val, at least. I think I can imagine cases where you would end up with several per executor but they're not the normal use cases. Can you say more about what you're executing or what you're seeing? > Broadcast creates 1 instance / core, not 1 instance / executor > -- > > Key: SPARK-18098 > URL: https://issues.apache.org/jira/browse/SPARK-18098 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Anthony Sciola > > I've created my spark executors with $SPARK_HOME/sbin/start-slave.sh -c 7 -m > 55g > When I run a job which broadcasts data, it appears each *thread* requests and > receives a copy of the broadcast object, not each *executor*. This means I > need 7x as much memory for the broadcasted item because I have 7 cores. > The problem appears to be due to a lack of synchronization around requesting > broadcast items. > The only workaround I've come up with is writing the data out to HDFS, > broadcasting the paths, and doing a synchronized load from HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
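For context on the behaviour requested in SPARK-18098, a hedged sketch of sharing one deserialized copy per executor JVM across task threads. This illustrates the reporter's "load once, reuse everywhere" workaround pattern; it is not TorrentBroadcast internals, and the payload type and loader are assumptions.
{code}
object SharedPayloadSketch {
  @volatile private var cached: Array[Byte] = _

  // Classic double-checked locking: the supplied loader (e.g. a read from HDFS) runs at most
  // once per JVM, and every task thread on the executor reuses the same instance.
  def getOrLoad(load: () => Array[Byte]): Array[Byte] = {
    if (cached == null) {
      synchronized {
        if (cached == null) {
          cached = load()
        }
      }
    }
    cached
  }
}
{code}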
[jira] [Commented] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605879#comment-15605879 ] Apache Spark commented on SPARK-17829: -- User 'tcondie' has created a pull request for this issue: https://github.com/apache/spark/pull/15626 > Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Tyson Condie > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
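To make the SPARK-17829 proposal concrete, a hedged sketch of a JSON round-trip for an offset using json4s (already on Spark's classpath). The KafkaLikeOffset case class and its fields are illustrative only, not the actual Offset API.
{code}
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

// Illustrative offset shape; real sources define their own offset contents.
case class KafkaLikeOffset(topic: String, partition: Int, offset: Long)

object OffsetJsonSketch {
  private implicit val formats: DefaultFormats.type = DefaultFormats

  // Human-readable text that can be written to the offset WAL, inspected by users,
  // and re-parsed across Spark versions.
  def toJson(o: KafkaLikeOffset): String = Serialization.write(o)
  def fromJson(s: String): KafkaLikeOffset = Serialization.read[KafkaLikeOffset](s)
}
{code}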
[jira] [Assigned] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17829: Assignee: Tyson Condie (was: Apache Spark) > Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Tyson Condie > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17829: Assignee: Apache Spark (was: Tyson Condie) > Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18098) Broadcast creates 1 instance / core, not 1 instance / executor
Anthony Sciola created SPARK-18098: -- Summary: Broadcast creates 1 instance / core, not 1 instance / executor Key: SPARK-18098 URL: https://issues.apache.org/jira/browse/SPARK-18098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1 Reporter: Anthony Sciola I've created my spark executors with $SPARK_HOME/sbin/start-slave.sh -c 7 -m 55g When I run a job which broadcasts data, it appears each *thread* requests and receives a copy of the broadcast object, not each *executor*. This means I need 7x as much memory for the broadcasted item because I have 7 cores. The problem appears to be due to a lack of synchronization around requesting broadcast items. The only workaround I've come up with is writing the data out to HDFS, broadcasting the paths, and doing a synchronized load from HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605820#comment-15605820 ] Marcelo Vanzin commented on SPARK-6951: --- I reopened this after discussion in the bug; the other change (SPARK-18010) makes startup a little faster, but not necessarily fast, for large directories / log files. > History server slow startup if the event log directory is large > --- > > Key: SPARK-6951 > URL: https://issues.apache.org/jira/browse/SPARK-6951 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Matt Cheah > > I started my history server, then navigated to the web UI where I expected to > be able to view some completed applications, but the webpage was not > available. It turned out that the History Server was not finished parsing all > of the event logs in the event log directory that I had specified. I had > accumulated a lot of event logs from months of running Spark, so it would > have taken a very long time for the History Server to crunch through them > all. I purged the event log directory and started from scratch, and the UI > loaded immediately. > We should have a pagination strategy or parse the directory lazily to avoid > needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-6951: --- > History server slow startup if the event log directory is large > --- > > Key: SPARK-6951 > URL: https://issues.apache.org/jira/browse/SPARK-6951 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Matt Cheah > > I started my history server, then navigated to the web UI where I expected to > be able to view some completed applications, but the webpage was not > available. It turned out that the History Server was not finished parsing all > of the event logs in the event log directory that I had specified. I had > accumulated a lot of event logs from months of running Spark, so it would > have taken a very long time for the History Server to crunch through them > all. I purged the event log directory and started from scratch, and the UI > loaded immediately. > We should have a pagination strategy or parse the directory lazily to avoid > needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org