[jira] [Commented] (SPARK-18739) Models in pyspark.classification support setXXXCol methods
[ https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725334#comment-15725334 ]

Apache Spark commented on SPARK-18739:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16171

> Models in pyspark.classification support setXXXCol methods
> --
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Reporter: zhengruifeng
>
> Currently, models in PySpark do not support {{setXXXCol}} methods at all.
> I updated the models in {{classification.py}} according to the hierarchy on the Scala side:
> 1. Add {{setFeaturesCol}} and {{setPredictionCol}} to class {{JavaPredictionModel}}.
> 2. Add {{setRawPredictionCol}} to class {{JavaClassificationModel}}.
> 3. Create class {{JavaProbabilisticClassificationModel}}, which inherits from {{JavaClassificationModel}}, and add {{setProbabilityCol}} to it.
> 4. {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, {{RandomForestClassificationModel}}, and {{NaiveBayesModel}} inherit from {{JavaProbabilisticClassificationModel}}.
> 5. {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} inherit from {{JavaClassificationModel}}.
> 6. {{OneVsRestModel}} inherits from {{JavaModel}}, and adds {{setFeaturesCol}} and {{setPredictionCol}} methods.
> With regard to the models in clustering and features, I suggest that we first add some abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, and {{FeatureModel}} on the Scala side; otherwise we need to manually add setXXXCol methods one by one.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18739) Models in pyspark.classification support setXXXCol methods
[ https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-18739:
-

Description:
Currently, models in PySpark do not support {{setXXXCol}} methods at all.
I updated the models in {{classification.py}} according to the hierarchy on the Scala side:
1. Add {{setFeaturesCol}} and {{setPredictionCol}} to class {{JavaPredictionModel}}.
2. Add {{setRawPredictionCol}} to class {{JavaClassificationModel}}.
3. Create class {{JavaProbabilisticClassificationModel}}, which inherits from {{JavaClassificationModel}}, and add {{setProbabilityCol}} to it.
4. {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, {{RandomForestClassificationModel}}, and {{NaiveBayesModel}} inherit from {{JavaProbabilisticClassificationModel}}.
5. {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} inherit from {{JavaClassificationModel}}.
6. {{OneVsRestModel}} inherits from {{JavaModel}}, and adds {{setFeaturesCol}} and {{setPredictionCol}} methods.
With regard to the models in clustering and features, I suggest that we first add some abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, and {{FeatureModel}} on the Scala side; otherwise we need to manually add setXXXCol methods one by one.

was: (the same description, with "models in clustering and features" previously reading "model clustering and features")

> Models in pyspark.classification support setXXXCol methods
> --
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Reporter: zhengruifeng
[jira] [Created] (SPARK-18739) Models in pyspark.classification support setXXXCol methods
zhengruifeng created SPARK-18739:
---

Summary: Models in pyspark.classification support setXXXCol methods
Key: SPARK-18739
URL: https://issues.apache.org/jira/browse/SPARK-18739
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Reporter: zhengruifeng

Currently, models in PySpark do not support {{setXXXCol}} methods at all.
I updated the models in {{classification.py}} according to the hierarchy on the Scala side:
1. Add {{setFeaturesCol}} and {{setPredictionCol}} to class {{JavaPredictionModel}}.
2. Add {{setRawPredictionCol}} to class {{JavaClassificationModel}}.
3. Create class {{JavaProbabilisticClassificationModel}}, which inherits from {{JavaClassificationModel}}, and add {{setProbabilityCol}} to it.
4. {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, {{RandomForestClassificationModel}}, and {{NaiveBayesModel}} inherit from {{JavaProbabilisticClassificationModel}}.
5. {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} inherit from {{JavaClassificationModel}}.
6. {{OneVsRestModel}} inherits from {{JavaModel}}, and adds {{setFeaturesCol}} and {{setPredictionCol}} methods.
With regard to the models in clustering and features, I suggest that we first add some abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, and {{FeatureModel}} on the Scala side; otherwise we need to manually add setXXXCol methods one by one.
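The proposed setter hierarchy can be sketched in plain Python. Note this is an illustrative mock that only mirrors the class names and inheritance described in the issue, not Spark's actual implementation:

```python
# Illustrative mock of the proposed setter hierarchy (hypothetical classes,
# not Spark's real pyspark.ml wrappers).

class JavaPredictionModel:
    """Base: every prediction model gets features/prediction column setters."""
    def setFeaturesCol(self, value):
        self.featuresCol = value
        return self  # return self so setters can be chained, Spark-style

    def setPredictionCol(self, value):
        self.predictionCol = value
        return self


class JavaClassificationModel(JavaPredictionModel):
    """Classification adds the raw-prediction column setter."""
    def setRawPredictionCol(self, value):
        self.rawPredictionCol = value
        return self


class JavaProbabilisticClassificationModel(JavaClassificationModel):
    """Probabilistic classifiers additionally expose the probability column."""
    def setProbabilityCol(self, value):
        self.probabilityCol = value
        return self


# Per item 4 above: e.g. a NaiveBayes-like model sits under the probabilistic base
# and inherits all four setters without redefining any of them.
class NaiveBayesModel(JavaProbabilisticClassificationModel):
    pass


model = NaiveBayesModel()
model.setFeaturesCol("features").setRawPredictionCol("raw").setProbabilityCol("prob")
print(model.featuresCol, model.rawPredictionCol, model.probabilityCol)
```

The point of the abstract-class layering is exactly what the sketch shows: each concrete model picks up the right set of setXXXCol methods from its position in the hierarchy instead of defining them one by one.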
[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A
[ https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725242#comment-15725242 ]

Adrian Bridgett commented on SPARK-18512:
-

So currently, with Spark 2 there's no sensible way to write to S3? (Think of this as a question, not a rant!) That is, there's no way to avoid either the S3 rename latency problems or this issue, unless you use EMRFS or, e.g., write to HDFS first and distcp the files over? I wonder if a backport of MAPREDUCE-6478 to hadoop-2.7.x is on the cards (hadoop-2.8.x is presumably a while away from production readiness).

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A)
> Reporter: Giuseppe Bonaccorso
>
> After a few hours of streaming processing and saving data in Parquet format, I always get this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: s3a://xxx/_temporary/0/task_
> at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
> at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've also tried s3:// and s3n://, but it always happens after 3-5 hours.
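One commonly discussed partial mitigation (an assumption on my part that it applies to this exact setup, not something stated in the ticket) is the v2 commit algorithm, which merges each task's output directly into the destination and avoids the job-level merge of the {{_temporary}} directory that fails in the trace above. Note that S3A renames are still copy-then-delete and the commit is still not atomic, so this reduces, but does not eliminate, the risk:

```
# spark-defaults.conf — hedged sketch, not a guaranteed fix for this ticket
# Use FileOutputCommitter algorithm v2 (available in Hadoop 2.7+):
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2

# The other workaround mentioned in the comment: write to HDFS first, then copy:
#   df.write.parquet("hdfs:///tmp/out")
#   hadoop distcp /tmp/out s3a://bucket/out
```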
[jira] [Updated] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated SPARK-18736:
--

Description:
In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it is possible to create a map with two identical keys:
{noformat}
CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
{noformat}
This does not behave like standard maps in common programming languages. A proper behavior should be chosen:
# first key 'wins'
# last key 'wins'
# runtime error
{{GetMapValue}} currently implements option #1. Even if this is the desired behavior, {{CreateMap}} should still return a map with unique keys.

was: (the same description, without the {{...}} and {noformat} formatting)

> CreateMap allows non-unique keys
> --
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Eyal Farago
> Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it is possible to create a map with two identical keys:
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages. A proper behavior should be chosen:
> # first key 'wins'
> # last key 'wins'
> # runtime error
> {{GetMapValue}} currently implements option #1. Even if this is the desired behavior, {{CreateMap}} should still return a map with unique keys.
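The three duplicate-key policies can be illustrated in plain Python (illustrative only; this models the semantics, not Spark's actual map implementation):

```python
# Duplicate-key policies for map construction, sketched in plain Python.
# CreateMap's arguments alternate key, value, key, value, ...
pairs = [(1, 11), (1, 12)]  # the same key twice, as in the example above

# Option 1: first key 'wins' (the lookup behavior GetMapValue effectively gives).
first_wins = {}
for k, v in pairs:
    first_wins.setdefault(k, v)  # only inserts if the key is absent

# Option 2: last key 'wins' (Python's own dict-construction behavior).
last_wins = dict(pairs)

# Option 3: raise an error at construction time.
def strict_map(pairs):
    m = {}
    for k, v in pairs:
        if k in m:
            raise ValueError(f"duplicate map key: {k!r}")
        m[k] = v
    return m

print(first_wins[1])  # 11
print(last_wins[1])   # 12
```

Whichever policy is chosen, the issue's point stands: construction ({{CreateMap}}) and lookup ({{GetMapValue}}) should agree, and the constructed map should contain each key once.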
[jira] [Updated] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated SPARK-18736:
--

Summary: CreateMap allows non-unique keys (was: [SQL] CreateMap allow non-unique keys)

> CreateMap allows non-unique keys
> --
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Eyal Farago
> Labels: map, sql, types
>
> In Spark SQL, CreateMap does not enforce unique keys, i.e. it is possible to create a map with two identical keys:
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> This does not behave like standard maps in common programming languages. A proper behavior should be chosen:
> 1. first key 'wins'
> 2. last key 'wins'
> 3. runtime error
> Currently GetMapValue implements option #1. Even if this is the desired behavior, CreateMap should still return a map with unique keys.
[jira] [Commented] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725194#comment-15725194 ] Apache Spark commented on SPARK-18634: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/16170 > Corruption and Correctness issues with exploding Python UDFs > > > Key: SPARK-18634 > URL: https://issues.apache.org/jira/browse/SPARK-18634 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz >Assignee: Liang-Chi Hsieh > Fix For: 2.0.3, 2.1.0 > > > There are some weird issues with exploding Python UDFs in SparkSQL. > There are 2 cases where based on the DataType of the exploded column, the > result can be flat out wrong, or corrupt. Seems like something bad is > happening when telling Tungsten the schema of the rows during or after > applying the UDF. > Please check the code below for reproduction. > Notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725130#comment-15725130 ] Apache Spark commented on SPARK-18326: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/16169 > SparkR 2.1 QA: New R APIs and API docs > -- > > Key: SPARK-18326 > URL: https://issues.apache.org/jira/browse/SPARK-18326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18209) More robust view canonicalization without full SQL expansion
[ https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18209:

Assignee: (was: Apache Spark)

> More robust view canonicalization without full SQL expansion
> --
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then generating fully expanded SQL out of the analyzed logical plan. This is actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super careful because it might break SQL generation. This is the main reason the broadcast join hint has taken forever to be merged: it is very difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide the database context as well as star expansion, I think we can do this through a simpler approach: take the user-given SQL, analyze it, and just wrap the original SQL with a SELECT clause on the outside, storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes and the stored schema of the view differs from the actual SQL schema. In that case, I think we should throw an exception at runtime to warn users. This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view definition in a subquery. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary objects (views, tables, or functions).
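The wrapping step described above can be sketched as a trivial string transformation (illustrative only; `wrap_view_sql` is a hypothetical helper, and the real proposal operates on analyzed plans, not strings):

```python
def wrap_view_sql(view_sql: str, current_db: str, output_cols) -> str:
    """Sketch of storing a view: wrap the original SQL in an outer SELECT
    over its resolved output columns, recording the database context as a
    hint instead of fully expanding the plan back to SQL."""
    cols = ", ".join(output_cols)
    return (f"SELECT /*+ current_db: `{current_db}` */ {cols} "
            f"FROM ({view_sql})")

# The example from the issue: analysis resolves the view's columns to id, name.
stored = wrap_view_sql("SELECT * FROM my_table WHERE id > 10",
                       "my_db", ["id", "name"])
print(stored)
```

The original text is preserved verbatim inside the subquery, which is why this avoids the per-operator SQL-generation correctness problem.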
[jira] [Updated] (SPARK-18705) Docs for one-pass solver for linear regression with L1 and elastic-net penalties
[ https://issues.apache.org/jira/browse/SPARK-18705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-18705:

Shepherd: Yanbo Liang
Assignee: Seth Hendrickson

> Docs for one-pass solver for linear regression with L1 and elastic-net penalties
>
> Key: SPARK-18705
> URL: https://issues.apache.org/jira/browse/SPARK-18705
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML
> Reporter: Yanbo Liang
> Assignee: Seth Hendrickson
> Priority: Minor
>
> Add documentation for SPARK-17748 in the [{{Normal equation solver for weighted least squares}}|http://spark.apache.org/docs/latest/ml-advanced.html#normal-equation-solver-for-weighted-least-squares] section.
[jira] [Commented] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725135#comment-15725135 ] Yanbo Liang commented on SPARK-18326: - [~josephkb] I made a quick pass for new ML wrapper APIs which were added in the 2.1 release cycle and sent a PR. Thanks. > SparkR 2.1 QA: New R APIs and API docs > -- > > Key: SPARK-18326 > URL: https://issues.apache.org/jira/browse/SPARK-18326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18326: Assignee: Apache Spark > SparkR 2.1 QA: New R APIs and API docs > -- > > Key: SPARK-18326 > URL: https://issues.apache.org/jira/browse/SPARK-18326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18326: Assignee: (was: Apache Spark) > SparkR 2.1 QA: New R APIs and API docs > -- > > Key: SPARK-18326 > URL: https://issues.apache.org/jira/browse/SPARK-18326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-18326: --- Assignee: Yanbo Liang > SparkR 2.1 QA: New R APIs and API docs > -- > > Key: SPARK-18326 > URL: https://issues.apache.org/jira/browse/SPARK-18326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion
[ https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725038#comment-15725038 ] Apache Spark commented on SPARK-18209: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/16168 > More robust view canonicalization without full SQL expansion > > > Key: SPARK-18209 > URL: https://issues.apache.org/jira/browse/SPARK-18209 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Spark SQL currently stores views by analyzing the provided SQL and then > generating fully expanded SQL out of the analyzed logical plan. This is > actually a very error prone way of doing it, because: > 1. It is non-trivial to guarantee that the generated SQL is correct without > being extremely verbose, given the current set of operators. > 2. We need extensive testing for all combination of operators. > 3. Whenever we introduce a new logical plan operator, we need to be super > careful because it might break SQL generation. This is the main reason > broadcast join hint has taken forever to be merged because it is very > difficult to guarantee correctness. > Given the two primary reasons to do view canonicalization is to provide the > context for the database as well as star expansion, I think we can this > through a simpler approach, by taking the user given SQL, analyze it, and > just wrap the original SQL with a SELECT clause at the outer and store the > database as a hint. > For example, given the following view creation SQL: > {code} > USE DATABASE my_db; > CREATE TABLE my_table (id int, name string); > CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10; > {code} > We store the following SQL instead: > {code} > SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE > id > 10); > {code} > During parsing time, we expand the view along using the provided database > context. 
> (We don't need to follow exactly the same hint, as I'm merely illustrating > the high level approach here.) > Note that there is a chance that the underlying base table(s)' schema change > and the stored schema of the view might differ from the actual SQL schema. In > that case, I think we should throw an exception at runtime to warn users. > This exception can be controlled by a flag. > Update 1: based on the discussion below, we don't even need to put the view > definition in a sub query. We can just add it via a logical plan at the end. > Update 2: we should make sure permanent views do not depend on temporary > objects (views, tables, or functions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18209) More robust view canonicalization without full SQL expansion
[ https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18209: Assignee: Apache Spark > More robust view canonicalization without full SQL expansion > > > Key: SPARK-18209 > URL: https://issues.apache.org/jira/browse/SPARK-18209 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Critical > > Spark SQL currently stores views by analyzing the provided SQL and then > generating fully expanded SQL out of the analyzed logical plan. This is > actually a very error prone way of doing it, because: > 1. It is non-trivial to guarantee that the generated SQL is correct without > being extremely verbose, given the current set of operators. > 2. We need extensive testing for all combination of operators. > 3. Whenever we introduce a new logical plan operator, we need to be super > careful because it might break SQL generation. This is the main reason > broadcast join hint has taken forever to be merged because it is very > difficult to guarantee correctness. > Given the two primary reasons to do view canonicalization is to provide the > context for the database as well as star expansion, I think we can this > through a simpler approach, by taking the user given SQL, analyze it, and > just wrap the original SQL with a SELECT clause at the outer and store the > database as a hint. > For example, given the following view creation SQL: > {code} > USE DATABASE my_db; > CREATE TABLE my_table (id int, name string); > CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10; > {code} > We store the following SQL instead: > {code} > SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE > id > 10); > {code} > During parsing time, we expand the view along using the provided database > context. > (We don't need to follow exactly the same hint, as I'm merely illustrating > the high level approach here.) 
> Note that there is a chance that the underlying base table(s)' schema change > and the stored schema of the view might differ from the actual SQL schema. In > that case, I think we should throw an exception at runtime to warn users. > This exception can be controlled by a flag. > Update 1: based on the discussion below, we don't even need to put the view > definition in a sub query. We can just add it via a logical plan at the end. > Update 2: we should make sure permanent views do not depend on temporary > objects (views, tables, or functions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18326: Assignee: (was: Yanbo Liang) > SparkR 2.1 QA: New R APIs and API docs > -- > > Key: SPARK-18326 > URL: https://issues.apache.org/jira/browse/SPARK-18326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18738) Some Spark SQL queries have poor performance with the HDFS Erasure Coding feature when dynamic allocation is enabled.
Lifeng Wang created SPARK-18738: --- Summary: Some Spark SQL queries have poor performance with the HDFS Erasure Coding feature when dynamic allocation is enabled. Key: SPARK-18738 URL: https://issues.apache.org/jira/browse/SPARK-18738 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Lifeng Wang Fix For: 2.2.0 We run TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 trunk and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with the same data size on both Erasure Coding and 3-replication. The test results show that some queries have much worse performance on EC than on 3-replication. After initial investigation, we found Spark starts one third as many executors to execute queries on EC as on 3-replication. Take query 30 as an example: our cluster can launch 108 executors in total. When we run the query against the 3-replication database, Spark starts all 108 executors to execute the query. When we run the query against the Erasure Coding database, Spark launches 108 executors and then kills 72 of them because they are idle, so in the end only 36 executors execute the query, which leads to poor performance. This issue only happens when the dynamic allocation mechanism is enabled. When we disable dynamic allocation, Spark SQL queries on EC have similar performance to 3-replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
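For reference, the dynamic-allocation behavior described in SPARK-18738 is governed by standard Spark configuration properties; a spark-defaults.conf sketch follows (disabling the first property is the workaround the report says restores EC performance, the timeout shown is Spark's default, and the minExecutors floor is only a suggested mitigation, not something tested in the report):

{code}
# Workaround from the report: turn dynamic allocation off entirely
spark.dynamicAllocation.enabled false
# With dynamic allocation on, executors idle longer than this are removed,
# which is what shrank the EC run from 108 to 36 executors
spark.dynamicAllocation.executorIdleTimeout 60s
# A floor on the executor count could also mitigate the kill-off (untested here)
spark.dynamicAllocation.minExecutors 36
{code}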
[jira] [Updated] (SPARK-18737) Serialization setting "spark.serializer" ignored in Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-18737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dr. Michael Menzel updated SPARK-18737: --- Priority: Blocker (was: Major) > Serialization setting "spark.serializer" ignored in Spark 2.x > - > > Key: SPARK-18737 > URL: https://issues.apache.org/jira/browse/SPARK-18737 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Dr. Michael Menzel >Priority: Blocker > > The following exception occurs although the JavaSerializer has been activated: > 16/11/22 10:49:24 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID > 77, ip-10-121-14-147.eu-central-1.compute.internal, partition 1, RACK_LOCAL, > 5621 bytes) > 16/11/22 10:49:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching > task 77 on executor id: 2 hostname: > ip-10-121-14-147.eu-central-1.compute.internal. > 16/11/22 10:49:24 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory > on ip-10-121-14-147.eu-central-1.compute.internal:45059 (size: 879.0 B, free: > 410.4 MB) > 16/11/22 10:49:24 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 77, > ip-10-121-14-147.eu-central-1.compute.internal): > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 13994 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at org.apache.spark.util.NextIterator.to(NextIterator.scala:21) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21) > at > org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927) > at > org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > The code runs perfectly with Spark 1.6.0. Since we moved to 2.0.0 and now > 2.0.1, we see the Kryo deserialization exception, and over time the Spark > streaming job stops processing because too many tasks have failed. > Our action was to use conf.set("spark.serializer", > "org.apache.spark.serializer.JavaSerializer") and to disable Kryo class > registration with conf.set("spark.kryo.registrationRequired", false). We hope > to identify the root cause of the exception. 
> However, setting the serializer to JavaSerializer is obviously ignored by the > Spark internals. Despite the setting, we still see the exception printed in > the log and tasks fail. The occurrence seems to be non-deterministic, but to > become more frequent over time. > Several questions we could not answer during our troubleshooting: > 1. How can the debug log for Kryo be enabled? -- We tried following the > minilog documentation, but no output can be found. > 2. Is the serializer setting effective for Spark-internal serializations? How > can the JavaSerializer be forced on internal serializations for worker-to-driver
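The settings quoted in SPARK-18737 map onto a spark-defaults.conf fragment like the following sketch (equivalent to the reporter's conf.set calls; note that whether Spark-internal serialization paths actually honor spark.serializer is exactly the open question in the report):

{code}
# Force the Java serializer instead of Kryo for data serialized by Spark
spark.serializer org.apache.spark.serializer.JavaSerializer
# Do not require Kryo class registration (relevant only while Kryo is in use)
spark.kryo.registrationRequired false
{code}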
[jira] [Updated] (SPARK-18738) Some Spark SQL queries have poor performance with the HDFS Erasure Coding feature when dynamic allocation is enabled.
[ https://issues.apache.org/jira/browse/SPARK-18738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lifeng Wang updated SPARK-18738: Description: We run TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 trunk and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with the same data size on both Erasure Coding and 3-replication. The test results show that some queries have much worse performance on EC than on 3-replication. After initial investigation, we found Spark starts one third as many executors to execute queries on EC as on 3-replication. Take query 30 as an example: our cluster can launch 108 executors in total. When we run the query against the 3-replication database, Spark starts all 108 executors to execute the query. When we run the query against the Erasure Coding database, Spark launches 108 executors and then kills 72 of them because they are idle, so in the end only 36 executors execute the query, which leads to poor performance. This issue only happens when the dynamic allocation mechanism is enabled. When we disable dynamic allocation, Spark SQL queries on EC have similar performance to 3-replication. was: We run TPCx-BB with Spark SQL engine on local cluster using Spark 2.0.3 trunk and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with same data size on both Erasure Coding and 3-replication. The test results show that some queries has much worse performance on EC compared to 3-replication. After initial investigations, we found spark starts one third executors to execute queries on EC compared to 3-replication. We use query 30 as example, our cluster can totally launch 108 executors. When we run the query from 3-replication database, spark will start all 108 executors to execute the query. When we run the query from Erasure Coding database, spark will launch 108 executors and kill 72 executors due to they’re idle, at last there are only 36 executors to execute the query which leads to poor performance. 
This issues only happens when we enable dynamic allocations mechanism. When we disable the dynamic allocations, Spark SQL query on EC has the similar performance with on 3-replication. > Some Spark SQL queries have poor performance with the HDFS Erasure Coding > feature when dynamic allocation is enabled. > > > Key: SPARK-18738 > URL: https://issues.apache.org/jira/browse/SPARK-18738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Lifeng Wang > Fix For: 2.2.0 > > > We run TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 > trunk and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with the same > data size on both Erasure Coding and 3-replication. The test results show > that some queries have much worse performance on EC than on 3-replication. > After initial investigation, we found Spark starts one third as many > executors to execute queries on EC as on 3-replication. > Take query 30 as an example: our cluster can launch 108 executors in total. > When we run the query against the 3-replication database, Spark starts all > 108 executors to execute the query. When we run the query against the Erasure > Coding database, Spark launches 108 executors and then kills 72 of them > because they are idle, so in the end only 36 executors execute the query, > which leads to poor performance. > This issue only happens when the dynamic allocation mechanism is enabled. > When we disable dynamic allocation, Spark SQL queries on EC have similar > performance to 3-replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org