[jira] [Commented] (SPARK-18739) Models in pyspark.classification support setXXXCol methods

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725334#comment-15725334
 ] 

Apache Spark commented on SPARK-18739:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16171

> Models in pyspark.classification support setXXXCol methods
> --
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Currently, models in pyspark don't support {{setXXXCol}} methods at all.
> I updated the models in {{classification.py}} according to the hierarchy on 
> the Scala side:
> 1. add {{setFeaturesCol}} and {{setPredictionCol}} to class 
> {{JavaPredictionModel}}
> 2. add {{setRawPredictionCol}} to class {{JavaClassificationModel}}
> 3. create class {{JavaProbabilisticClassificationModel}}, which inherits from 
> {{JavaClassificationModel}}, and add {{setProbabilityCol}} to it
> 4. make {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
> {{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit from 
> {{JavaProbabilisticClassificationModel}}
> 5. make {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
> inherit from {{JavaClassificationModel}}
> 6. make {{OneVsRestModel}} inherit from {{JavaModel}}, and add the 
> {{setFeaturesCol}} and {{setPredictionCol}} methods.
> For the models in clustering and features, I suggest that we first add 
> abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, 
> and {{FeatureModel}} on the Scala side; otherwise we would need to add the 
> {{setXXXCol}} methods manually, one by one.






[jira] [Updated] (SPARK-18739) Models in pyspark.classification support setXXXCol methods

2016-12-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18739:
-
Description: 
Currently, models in pyspark don't support {{setXXXCol}} methods at all.

I updated the models in {{classification.py}} according to the hierarchy on 
the Scala side:
1. add {{setFeaturesCol}} and {{setPredictionCol}} to class 
{{JavaPredictionModel}}
2. add {{setRawPredictionCol}} to class {{JavaClassificationModel}}
3. create class {{JavaProbabilisticClassificationModel}}, which inherits from 
{{JavaClassificationModel}}, and add {{setProbabilityCol}} to it
4. make {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
{{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit from 
{{JavaProbabilisticClassificationModel}}
5. make {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
inherit from {{JavaClassificationModel}}
6. make {{OneVsRestModel}} inherit from {{JavaModel}}, and add the 
{{setFeaturesCol}} and {{setPredictionCol}} methods.

For the models in clustering and features, I suggest that we first add 
abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, 
and {{FeatureModel}} on the Scala side; otherwise we would need to add the 
{{setXXXCol}} methods manually, one by one.


  was:
Now, models in pyspark don't suport {{setXXCol}} methods at all.

I update models in {{classification.py}} according the hierarchy in the scala 
side:
1, add {{setFeaturesCol}} and {{setPredictionCol}} in class 
{{JavaPredictionModel}}
2, add {{setRawPredictionCol}} in class {{JavaClassificationModel}}
3, create class {{JavaProbabilisticClassificationModel}} inherit 
{{JavaClassificationModel}}, and add {{setProbabilityCol}} in it
4, {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
{{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit 
{{JavaProbabilisticClassificationModel}}
5, {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
inherit {{JavaClassificationModel}}
6, {{OneVsRestModel}} inherit {{JavaModel}}, and add {{setFeaturesCol}} and 
{{setPredictionCol}} method.

With regard to model clustering and features, I suggest that we first add some 
abstract classes like {{ClusteringModel}}, {{ProbabilisticClusteringModel}},  
{{FeatureModel}} in the scala side, otherwise we need to manually add setXXXCol 
methods one by one.



> Models in pyspark.classification support setXXXCol methods
> --
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Currently, models in pyspark don't support {{setXXXCol}} methods at all.
> I updated the models in {{classification.py}} according to the hierarchy on 
> the Scala side:
> 1. add {{setFeaturesCol}} and {{setPredictionCol}} to class 
> {{JavaPredictionModel}}
> 2. add {{setRawPredictionCol}} to class {{JavaClassificationModel}}
> 3. create class {{JavaProbabilisticClassificationModel}}, which inherits from 
> {{JavaClassificationModel}}, and add {{setProbabilityCol}} to it
> 4. make {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
> {{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit from 
> {{JavaProbabilisticClassificationModel}}
> 5. make {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
> inherit from {{JavaClassificationModel}}
> 6. make {{OneVsRestModel}} inherit from {{JavaModel}}, and add the 
> {{setFeaturesCol}} and {{setPredictionCol}} methods.
> For the models in clustering and features, I suggest that we first add 
> abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, 
> and {{FeatureModel}} on the Scala side; otherwise we would need to add the 
> {{setXXXCol}} methods manually, one by one.






[jira] [Created] (SPARK-18739) Models in pyspark.classification support setXXXCol methods

2016-12-06 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18739:


 Summary: Models in pyspark.classification support setXXXCol methods
 Key: SPARK-18739
 URL: https://issues.apache.org/jira/browse/SPARK-18739
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: zhengruifeng


Currently, models in pyspark don't support {{setXXXCol}} methods at all.

I updated the models in {{classification.py}} according to the hierarchy on 
the Scala side:
1. add {{setFeaturesCol}} and {{setPredictionCol}} to class 
{{JavaPredictionModel}}
2. add {{setRawPredictionCol}} to class {{JavaClassificationModel}}
3. create class {{JavaProbabilisticClassificationModel}}, which inherits from 
{{JavaClassificationModel}}, and add {{setProbabilityCol}} to it
4. make {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
{{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit from 
{{JavaProbabilisticClassificationModel}}
5. make {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
inherit from {{JavaClassificationModel}}
6. make {{OneVsRestModel}} inherit from {{JavaModel}}, and add the 
{{setFeaturesCol}} and {{setPredictionCol}} methods.

For the models in clustering and features, I suggest that we first add 
abstract classes such as {{ClusteringModel}}, {{ProbabilisticClusteringModel}}, 
and {{FeatureModel}} on the Scala side; otherwise we would need to add the 
{{setXXXCol}} methods manually, one by one.
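
Below is a rough, self-contained Python sketch of the mixin layout proposed 
above. The real classes would live in pyspark's {{classification.py}} and 
delegate to the existing {{Params}} machinery; the tiny {{_set}} helper here is 
a simplified stand-in for illustration only.

{code:python}
class JavaPredictionModel(object):
    """Simplified stand-in for the prediction-model base class."""

    def _set(self, **kwargs):
        # Stand-in for Params._set: store the column names and allow chaining.
        for name, value in kwargs.items():
            setattr(self, name, value)
        return self

    def setFeaturesCol(self, value):
        return self._set(featuresCol=value)

    def setPredictionCol(self, value):
        return self._set(predictionCol=value)


class JavaClassificationModel(JavaPredictionModel):
    def setRawPredictionCol(self, value):
        return self._set(rawPredictionCol=value)


class JavaProbabilisticClassificationModel(JavaClassificationModel):
    def setProbabilityCol(self, value):
        return self._set(probabilityCol=value)


# A concrete model such as NaiveBayesModel would then pick up all four setters:
class NaiveBayesModel(JavaProbabilisticClassificationModel):
    pass


model = NaiveBayesModel().setFeaturesCol("features").setProbabilityCol("prob")
print(model.featuresCol, model.probabilityCol)
{code}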







[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-12-06 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725242#comment-15725242
 ] 

Adrian Bridgett commented on SPARK-18512:
-

So currently, with Spark 2 there's no sensible way to write to S3? (Think of 
this as a question, not a rant!) That is, there is no way to avoid either the 
S3 rename-latency problems or this issue, unless you use EMRFS or, e.g., write 
to HDFS first and distcp the files over?

I wonder if a backport of MAPREDUCE-6478 to hadoop-2.7.x is on the cards 
(hadoop-2.8.x is presumably a while away from production readiness).
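
A minimal PySpark sketch of the "write to HDFS first, then copy over" 
workaround mentioned above; the paths, bucket name, and placeholder data are 
assumptions for illustration, not taken from the issue.

{code:python}
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-distcp").getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "event_id")  # placeholder data

# 1) Commit the output on HDFS, where the FileOutputCommitter's
#    _temporary/rename step is cheap and reliable.
hdfs_path = "hdfs:///staging/events/2016-12-06"  # hypothetical staging path
df.write.mode("overwrite").parquet(hdfs_path)

# 2) Copy the finished files to S3 out of band, e.g. with distcp.
subprocess.check_call(
    ["hadoop", "distcp", hdfs_path,
     "s3a://my-bucket/events/2016-12-06"]  # hypothetical bucket
)
{code}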

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of streaming processing and saving data in Parquet format, 
> I always get this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've also tried s3:// and s3n://, but it always happens after 3-5 hours.






[jira] [Updated] (SPARK-18736) CreateMap allows non-unique keys

2016-12-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18736:
--
Description: 
In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it is possible 
to create a map with two identical keys: 
{noformat}
CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
{noformat}

This does not behave like standard maps in common programming languages, so a 
proper behavior should be chosen:
# first 'wins'
# last 'wins'
# runtime error.

{{GetMapValue}} currently implements option #1. Even if this is the desired 
behavior, {{CreateMap}} should still return a map with unique keys.

  was:
Spark-Sql, CreateMap does not enforce unique keys, i.e. it's possible to create 
a map with two identical keys: 
CreateMap(Literal(1), Literal(11),
   Literal(1), Literal(12))

This does not behave like standard maps in common programming languages.
proper behavior should be chosen"
1. first 'wins'
2. last 'wins'
3. runtime error.

* currently GetMapValue implements option #1. even if this is the desired 
behavior CreateMap should return a unique map.


> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it is possible 
> to create a map with two identical keys: 
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages, so a 
> proper behavior should be chosen:
> # first 'wins'
> # last 'wins'
> # runtime error.
> {{GetMapValue}} currently implements option #1. Even if this is the desired 
> behavior, {{CreateMap}} should still return a map with unique keys.
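
The duplicate-key behavior can also be observed through the public Python API 
({{create_map}} builds a {{CreateMap}} expression); a small sketch, with the 
expected output hedged on the description above:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("dup-keys").getOrCreate()

# Build a map literal in which the key 1 appears twice.
df = spark.range(1).select(
    F.create_map(F.lit(1), F.lit(11), F.lit(1), F.lit(12)).alias("m")
)

# Per the description above, GetMapValue picks the first matching entry, so
# this lookup should show 11 even though the map still carries both entries.
df.select(F.col("m").getItem(1).alias("value")).show()
{code}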






[jira] [Updated] (SPARK-18736) CreateMap allows non-unique keys

2016-12-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18736:
--
Summary: CreateMap allows non-unique keys  (was: [SQL] CreateMap allow 
non-unique keys)

> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> Spark-Sql, CreateMap does not enforce unique keys, i.e. it's possible to 
> create a map with two identical keys: 
> CreateMap(Literal(1), Literal(11),
>Literal(1), Literal(12))
> This does not behave like standard maps in common programming languages.
> proper behavior should be chosen"
> 1. first 'wins'
> 2. last 'wins'
> 3. runtime error.
> * currently GetMapValue implements option #1. even if this is the desired 
> behavior CreateMap should return a unique map.






[jira] [Commented] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725194#comment-15725194
 ] 

Apache Spark commented on SPARK-18634:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16170

> Corruption and Correctness issues with exploding Python UDFs
> 
>
> Key: SPARK-18634
> URL: https://issues.apache.org/jira/browse/SPARK-18634
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.3, 2.1.0
>
>
> There are some weird issues with exploding Python UDFs in Spark SQL.
> There are two cases where, depending on the DataType of the exploded column, 
> the result can be flat-out wrong or corrupt. It seems like something bad is 
> happening when telling Tungsten the schema of the rows during or after 
> applying the UDF.
> Please check the code below for reproduction.
> Notebook: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html
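
For context, a minimal sketch of the construct under discussion (a Python UDF 
returning an array, then exploded). This only illustrates the pattern; it is 
not claimed to reproduce the corruption, which depends on the exploded 
column's DataType per the report above.

{code:python}
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.master("local[1]").appName("explode-udf").getOrCreate()

# Hypothetical UDF producing an array column; the issue reports that for some
# element DataTypes the exploded result can come back wrong or corrupt.
split_words = F.udf(lambda s: s.split(" "), T.ArrayType(T.StringType()))

df = spark.createDataFrame([("hello bad world",)], ["text"])
df.select(F.explode(split_words("text")).alias("word")).show()
{code}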






[jira] [Commented] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725130#comment-15725130
 ] 

Apache Spark commented on SPARK-18326:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16169

> SparkR 2.1 QA: New R APIs and API docs
> --
>
> Key: SPARK-18326
> URL: https://issues.apache.org/jira/browse/SPARK-18326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Assigned] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18209:


Assignee: (was: Apache Spark)

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can achieve this 
> through a simpler approach: take the user-given SQL, analyze it, and just 
> wrap the original SQL in an outer SELECT clause, storing the database as a 
> hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly this hint format; I'm merely illustrating 
> the high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a subquery. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).






[jira] [Updated] (SPARK-18705) Docs for one-pass solver for linear regression with L1 and elastic-net penalties

2016-12-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18705:

Shepherd: Yanbo Liang
Assignee: Seth Hendrickson

> Docs for one-pass solver for linear regression with L1 and elastic-net 
> penalties
> 
>
> Key: SPARK-18705
> URL: https://issues.apache.org/jira/browse/SPARK-18705
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Add documentation for SPARK-17748 in the [{{Normal equation solver for weighted 
> least squares}}|http://spark.apache.org/docs/latest/ml-advanced.html#normal-equation-solver-for-weighted-least-squares]
>  section.






[jira] [Commented] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs

2016-12-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725135#comment-15725135
 ] 

Yanbo Liang commented on SPARK-18326:
-

[~josephkb] I made a quick pass for new ML wrapper APIs which were added in the 
2.1 release cycle and sent a PR. Thanks.

> SparkR 2.1 QA: New R APIs and API docs
> --
>
> Key: SPARK-18326
> URL: https://issues.apache.org/jira/browse/SPARK-18326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Assigned] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18326:


Assignee: Apache Spark

> SparkR 2.1 QA: New R APIs and API docs
> --
>
> Key: SPARK-18326
> URL: https://issues.apache.org/jira/browse/SPARK-18326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Assigned] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18326:


Assignee: (was: Apache Spark)

> SparkR 2.1 QA: New R APIs and API docs
> --
>
> Key: SPARK-18326
> URL: https://issues.apache.org/jira/browse/SPARK-18326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Assigned] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs

2016-12-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-18326:
---

Assignee: Yanbo Liang

> SparkR 2.1 QA: New R APIs and API docs
> --
>
> Key: SPARK-18326
> URL: https://issues.apache.org/jira/browse/SPARK-18326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725038#comment-15725038
 ] 

Apache Spark commented on SPARK-18209:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/16168

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can achieve this 
> through a simpler approach: take the user-given SQL, analyze it, and just 
> wrap the original SQL in an outer SELECT clause, storing the database as a 
> hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly this hint format; I'm merely illustrating 
> the high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a subquery. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).






[jira] [Assigned] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18209:


Assignee: Apache Spark

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can achieve this 
> through a simpler approach: take the user-given SQL, analyze it, and just 
> wrap the original SQL in an outer SELECT clause, storing the database as a 
> hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly this hint format; I'm merely illustrating 
> the high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a subquery. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).






[jira] [Updated] (SPARK-18326) SparkR 2.1 QA: New R APIs and API docs

2016-12-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18326:

Assignee: (was: Yanbo Liang)

> SparkR 2.1 QA: New R APIs and API docs
> --
>
> Key: SPARK-18326
> URL: https://issues.apache.org/jira/browse/SPARK-18326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Created] (SPARK-18738) Some Spark SQL queries have poor performance on the HDFS Erasure Coding feature when enabling dynamic allocation.

2016-12-06 Thread Lifeng Wang (JIRA)
Lifeng Wang created SPARK-18738:
---

 Summary: Some Spark SQL queries have poor performance on the HDFS 
Erasure Coding feature when enabling dynamic allocation.
 Key: SPARK-18738
 URL: https://issues.apache.org/jira/browse/SPARK-18738
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Lifeng Wang
 Fix For: 2.2.0


We ran TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 
trunk and Hadoop 3.0 alpha 2 trunk, and ran Spark SQL queries with the same 
data size on both Erasure Coding and 3-replication. The test results show that 
some queries have much worse performance on EC than on 3-replication. After 
initial investigation, we found that Spark starts only one third as many 
executors to execute queries on EC as on 3-replication.

Take query 30 as an example: our cluster can launch 108 executors in total. 
When we run the query against the 3-replication database, Spark starts all 108 
executors to execute it. When we run the query against the Erasure Coding 
database, Spark launches 108 executors but kills 72 of them because they are 
idle, so in the end only 36 executors execute the query, which leads to poor 
performance.

This issue only happens when we enable the dynamic allocation mechanism. When 
we disable dynamic allocation, the Spark SQL query on EC has performance 
similar to 3-replication.
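
A hedged sketch of the configuration knobs involved: disabling dynamic 
allocation (or raising the idle timeout) corresponds to the workaround 
described above; the application name and timeout value are illustrative.

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tpcx-bb-q30")  # hypothetical app name
         # Workaround described above: turn dynamic allocation off so idle
         # executors are not reclaimed while EC reads leave them waiting.
         .config("spark.dynamicAllocation.enabled", "false")
         # Alternatively, keep dynamic allocation but be more patient:
         # .config("spark.dynamicAllocation.executorIdleTimeout", "300s")
         .getOrCreate())
{code}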







[jira] [Updated] (SPARK-18737) Serialization setting "spark.serializer" ignored in Spark 2.x

2016-12-06 Thread Dr. Michael Menzel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dr. Michael Menzel updated SPARK-18737:
---
Priority: Blocker  (was: Major)

> Serialization setting "spark.serializer" ignored in Spark 2.x
> -
>
> Key: SPARK-18737
> URL: https://issues.apache.org/jira/browse/SPARK-18737
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dr. Michael Menzel
>Priority: Blocker
>
> The following exception occurs although the JavaSerializer has been activated:
> 16/11/22 10:49:24 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 
> 77, ip-10-121-14-147.eu-central-1.compute.internal, partition 1, RACK_LOCAL, 
> 5621 bytes)
> 16/11/22 10:49:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 77 on executor id: 2 hostname: 
> ip-10-121-14-147.eu-central-1.compute.internal.
> 16/11/22 10:49:24 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory 
> on ip-10-121-14-147.eu-central-1.compute.internal:45059 (size: 879.0 B, free: 
> 410.4 MB)
> 16/11/22 10:49:24 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 77, 
> ip-10-121-14-147.eu-central-1.compute.internal): 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at org.apache.spark.util.NextIterator.to(NextIterator.scala:21)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21)
> at 
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
> at 
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The code runs perfectly with Spark 1.6.0. Since we moved to 2.0.0 and now 
> 2.0.1, we see the Kryo deserialization exception, and over time the Spark 
> streaming job stops processing because too many tasks have failed.
> Our action was to use conf.set("spark.serializer", 
> "org.apache.spark.serializer.JavaSerializer") and to disable Kryo class 
> registration with conf.set("spark.kryo.registrationRequired", false). We hope 
> to identify the root cause of the exception.
> However, setting the serializer to JavaSerializer is obviously ignored by the 
> Spark internals. Despite the setting, we still see the exception printed in 
> the log and tasks fail. The occurrence seems to be non-deterministic, but it 
> becomes more frequent over time.
> Several questions we could not answer during our troubleshooting:
> 1. How can the debug log for Kryo be enabled? We tried following the 
> MinLog documentation, but no output could be found.
> 2. Is the serializer setting effective for Spark-internal serialization? How 
> can the JavaSerializer be forced on internal serializations for worker to 
> driv
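
A short sketch of the configuration the reporter describes (forcing the 
JavaSerializer and relaxing Kryo registration); whether these settings reach 
every internal code path is exactly what the questions above ask, so treat 
this as illustrative only.

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Settings described in the report: use Java serialization and do not require
# Kryo class registration.
conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
        .set("spark.kryo.registrationRequired", "false"))

spark = (SparkSession.builder
         .appName("serializer-check")  # hypothetical app name
         .config(conf=conf)
         .getOrCreate())
{code}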

[jira] [Updated] (SPARK-18738) Some Spark SQL queries have poor performance on the HDFS Erasure Coding feature when enabling dynamic allocation.

2016-12-06 Thread Lifeng Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lifeng Wang updated SPARK-18738:

Description: 
We ran TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 
trunk and Hadoop 3.0 alpha 2 trunk, and ran Spark SQL queries with the same 
data size on both Erasure Coding and 3-replication. The test results show that 
some queries have much worse performance on EC than on 3-replication. After 
initial investigation, we found that Spark starts only one third as many 
executors to execute queries on EC as on 3-replication.

Take query 30 as an example: our cluster can launch 108 executors in total. 
When we run the query against the 3-replication database, Spark starts all 108 
executors to execute it. When we run the query against the Erasure Coding 
database, Spark launches 108 executors but kills 72 of them because they are 
idle, so in the end only 36 executors execute the query, which leads to poor 
performance.

This issue only happens when we enable the dynamic allocation mechanism. When 
we disable dynamic allocation, the Spark SQL query on EC has performance 
similar to 3-replication.


  was:
We run TPCx-BB with Spark SQL engine on local cluster using Spark 2.0.3 trunk 
and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with same data size on 
both Erasure Coding and 3-replication.  The test results show that some queries 
has much worse performance on EC compared to 3-replication. After initial 
investigations, we found spark starts one third executors to execute queries on 
EC compared to 3-replication. 

We use query 30 as example, our cluster can totally launch 108 executors. When 
we run the query from 3-replication database, spark will start all 108 
executors to execute the query.  When we run the query from Erasure Coding 
database, spark will launch 108 executors and kill 72 executors due to they’re 
idle, at last there are only 36 executors to execute the query which leads to 
poor performance.

This issues only happens when we enable dynamic allocations mechanism. When we 
disable the dynamic allocations, Spark SQL query on EC has the similar 
performance with on 3-replication.



> Some Spark SQL queries have poor performance on the HDFS Erasure Coding feature 
> when enabling dynamic allocation.
> 
>
> Key: SPARK-18738
> URL: https://issues.apache.org/jira/browse/SPARK-18738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Lifeng Wang
> Fix For: 2.2.0
>
>
> We ran TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 
> trunk and Hadoop 3.0 alpha 2 trunk, and ran Spark SQL queries with the same 
> data size on both Erasure Coding and 3-replication. The test results show 
> that some queries have much worse performance on EC than on 3-replication. 
> After initial investigation, we found that Spark starts only one third as 
> many executors to execute queries on EC as on 3-replication.
> Take query 30 as an example: our cluster can launch 108 executors in total. 
> When we run the query against the 3-replication database, Spark starts all 
> 108 executors to execute it. When we run the query against the Erasure Coding 
> database, Spark launches 108 executors but kills 72 of them because they are 
> idle, so in the end only 36 executors execute the query, which leads to poor 
> performance.
> This issue only happens when we enable the dynamic allocation mechanism. When 
> we disable dynamic allocation, the Spark SQL query on EC has performance 
> similar to 3-replication.





