[GitHub] spark issue #14676: [SPARK-16947][SQL] Support type coercion and foldable ex...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14676
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63974/
Test PASSed.





[GitHub] spark issue #14676: [SPARK-16947][SQL] Support type coercion and foldable ex...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14676
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14676: [SPARK-16947][SQL] Support type coercion and foldable ex...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14676
  
**[Test build #63974 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63974/consoleFull)**
 for PR 14676 at commit 
[`2e68438`](https://github.com/apache/spark/commit/2e6843844d126e2ba466fe6b34ea59b3b67942c7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14676: [SPARK-16947][SQL] Support type coercion and foldable ex...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14676
  
LGTM except 
https://github.com/apache/spark/pull/14676#discussion_r75248623; waiting 
for feedback from @hvanhovell 





[GitHub] spark issue #14697: [SPARK-17124][SQL] RelationalGroupedDataset.agg should b...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14697
  
**[Test build #63978 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63978/consoleFull)**
 for PR 14697 at commit 
[`bd64ade`](https://github.com/apache/spark/commit/bd64ade6e3a82e9da55163e96303509275c56678).





[GitHub] spark pull request #14697: [SPARK-17124][SQL] RelationalGroupedDataset.agg s...

2016-08-17 Thread petermaxlee
GitHub user petermaxlee opened a pull request:

https://github.com/apache/spark/pull/14697

[SPARK-17124][SQL] RelationalGroupedDataset.agg should be order preserving 
and allow multiple expressions per column

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)


## How was this patch tested?
Added a test case in DataFrameAggregateSuite.
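
The description section above was left as the template placeholder, so here 
is a hedged illustration of what the title implies (my own reading, not code 
from the patch): the `(String, String)*` variant of `agg` should keep the 
aggregates in the order given and allow the same input column to appear more 
than once, which a `Map`-backed implementation would lose.

```scala
import org.apache.spark.sql.SparkSession

object AggOrderDemo extends App {
  val spark = SparkSession.builder().master("local").appName("agg-order-demo").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("key", "value")

  // Two aggregates over the same column: with the fix, the result columns
  // should come back as max(value), min(value), in exactly the order given.
  df.groupBy("key").agg("value" -> "max", "value" -> "min").show()

  spark.stop()
}
```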

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petermaxlee/spark SPARK-17124

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14697.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14697


commit bd64ade6e3a82e9da55163e96303509275c56678
Author: petermaxlee 
Date:   2016-08-18T06:50:24Z

[SPARK-17124][SQL] RelationalGroupedDataset.agg should be order preserving 
and allow duplicate column names







[GitHub] spark issue #14683: [SPARK-16968]Add additional options in jdbc when creatin...

2016-08-17 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14683
  
Yes, we should just need the documentation change here. You can review 
`master` to see whether it has all the changes you expect from the last PR.





[GitHub] spark issue #14672: [SPARK-17034][SQL] Minor code cleanup for UnresolvedOrdi...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14672
  
LGTM, pending jenkins.





[GitHub] spark issue #14576: [SPARK-16391][SQL] Support partial aggregation for reduc...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14576
  
**[Test build #63977 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63977/consoleFull)**
 for PR 14576 at commit 
[`50ed0d8`](https://github.com/apache/spark/commit/50ed0d8b39c9305840b326aef034561be487e7c5).





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14690
  
> If a user queries such a table with predicates which prune that table's 
partitions, we would like to be able to answer that query without consulting 
partition metadata which are not involved in the query. 

When we read a partitioned Hive table, we retrieve all partition 
metadata from the Hive metastore and load it into driver memory. Yes, this is 
inefficient and may blow up the driver.

However, that only happens on the first read; after that, the metadata is 
cached. If you don't load all partition metadata on the first read, how are 
you going to deal with the cache?
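
To make the cache question concrete, a toy sketch (not Spark's code) of the 
alternative: fetch partition metadata per partition on demand, so the cache 
fills up only with partitions that queries actually touch. `PartitionSpec`, 
`PartitionMetadata`, and `LazyPartitionCache` are hypothetical names.

```scala
import scala.collection.concurrent.TrieMap

case class PartitionSpec(values: Map[String, String])
case class PartitionMetadata(location: String)

// fetch stands in for the metastore call; pruned partitions are never
// fetched, so their metadata never reaches driver memory.
class LazyPartitionCache(fetch: PartitionSpec => PartitionMetadata) {
  private val cache = TrieMap.empty[PartitionSpec, PartitionMetadata]

  def get(spec: PartitionSpec): PartitionMetadata =
    cache.getOrElseUpdate(spec, fetch(spec))
}
```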





[GitHub] spark issue #14683: [SPARK-16968]Add additional options in jdbc when creatin...

2016-08-17 Thread GraceH
Github user GraceH commented on the issue:

https://github.com/apache/spark/pull/14683
  
Sorry about my mistake. I will re-post one.





[GitHub] spark issue #14696: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14696
  
From the previous 
discussion (https://github.com/apache/spark/pull/14389#issuecomment-236591342):
> We also need to add some checks before applying the type-widening rules, to 
avoid conflicting with DecimalPrecision, which defines some special rules for 
binary arithmetic on decimal types.

Have you tried this?

Pushing type coercion into each expression is also in my plan, but it would 
be a very large change: we should write a design doc first and discuss it with 
some people. So I'd like to do this small refactor first. What do you think?
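
For illustration, the kind of guard being suggested might look like this (a 
sketch with a hypothetical helper name, not the PR's code): bail out when 
both sides are decimals and leave them to DecimalPrecision.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper; Spark's real widening rules live in TypeCoercion.
def findWiderTypeSkippingDecimals(t1: DataType, t2: DataType): Option[DataType] =
  (t1, t2) match {
    case (_: DecimalType, _: DecimalType) =>
      None // defer to DecimalPrecision's per-operator rules
    case (IntegerType, LongType) | (LongType, IntegerType) =>
      Some(LongType) // ordinary integral widening
    case _ =>
      None // ...remaining widening cases elided
  }
```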





[GitHub] spark issue #14683: [SPARK-16968]Add additional options in jdbc when creatin...

2016-08-17 Thread GraceH
Github user GraceH commented on the issue:

https://github.com/apache/spark/pull/14683
  
Oops. @srowen I thought the previous pull request had been closed without 
being merged; that is why I re-posted it here. 
So we just need the documentation change here, right?





[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75254588
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -144,16 +161,147 @@ private[spark] class HiveExternalCatalog(client: HiveClient, hadoopConf: Configu
     assert(tableDefinition.identifier.database.isDefined)
     val db = tableDefinition.identifier.database.get
     requireDbExists(db)
+    verifyTableProperties(tableDefinition)
+
+    if (tableDefinition.provider == Some("hive") || tableDefinition.tableType == VIEW) {
+      client.createTable(tableDefinition, ignoreIfExists)
+    } else {
+      val tableProperties = tableMetadataToProperties(tableDefinition)
+
+      def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+        tableDefinition.copy(
+          schema = new StructType,
+          partitionColumnNames = Nil,
+          bucketSpec = None,
+          properties = tableDefinition.properties ++ tableProperties)
+      }
+
+      def newHiveCompatibleMetastoreTable(serde: HiveSerDe, path: String): CatalogTable = {
+        tableDefinition.copy(
+          storage = tableDefinition.storage.copy(
+            locationUri = Some(new Path(path).toUri.toString),
+            inputFormat = serde.inputFormat,
+            outputFormat = serde.outputFormat,
+            serde = serde.serde
+          ),
+          properties = tableDefinition.properties ++ tableProperties)
+      }
+
+      val qualifiedTableName = tableDefinition.identifier.quotedString
+      val maybeSerde = HiveSerDe.sourceToSerDe(tableDefinition.provider.get)
+      val maybePath = new CaseInsensitiveMap(tableDefinition.storage.properties).get("path")
+      val skipHiveMetadata = tableDefinition.storage.properties
+        .getOrElse("skipHiveMetadata", "false").toBoolean
+
+      val (hiveCompatibleTable, logMessage) = (maybeSerde, maybePath) match {
--- End diff --

> Then, we used the generated BaseRelation to find whether this is a 
hiveCompatibleTable. If this is not a HadoopFsRelation, hiveCompatibleTable 
will be None.

No, previously we used both `maybeSerde` and the `BaseRelation` to decide, 
and the `HadoopFsRelation` check is already covered by `maybeSerde`.
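
Since the diff above is cut off at the match, here is a standalone sketch of 
the shape of that decision (illustrative types and messages, not the PR's 
verbatim code; `chooseFormat` is a hypothetical name):

```scala
// Hive-compatible storage is chosen only when the provider maps to a known
// Hive serde AND a path is available, and the user has not opted out.
def chooseFormat(
    maybeSerde: Option[String], // defined when the provider maps to a Hive serde
    maybePath: Option[String],  // defined when the options carry a "path"
    skipHiveMetadata: Boolean): (Option[String], String) =
  (maybeSerde, maybePath) match {
    case _ if skipHiveMetadata =>
      (None, "Persisting in Spark SQL specific format, skipping Hive metadata")
    case (Some(serde), Some(path)) =>
      (Some(s"Hive compatible: $serde at $path"), "Persisting in Hive compatible format")
    case _ =>
      (None, "Persisting in Spark SQL specific format only")
  }
```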





[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...

2016-08-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14576#discussion_r75254527
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+
+/**
+ * An aggregator that uses a single associative and commutative reduce function. This reduce
+ * function can be used to go through all input values and reduce them to a single value.
+ * If there is no input, a null value is returned.
+ *
+ * @since 2.1.0
+ */
+private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T)
+  extends Aggregator[T, (Boolean, T), T] {
+
+  private val encoder = implicitly[Encoder[T]]
+
+  override def zero: (Boolean, T) = (false, null.asInstanceOf[T])
+
+  override def bufferEncoder: Encoder[(Boolean, T)] =
+    ExpressionEncoder.tuple(
+      ExpressionEncoder[Boolean](),
+      encoder.asInstanceOf[ExpressionEncoder[T]])
+
+  override def outputEncoder: Encoder[T] = encoder
+
+  override def reduce(b: (Boolean, T), a: T): (Boolean, T) = {
+    if (b._1) {
+      (true, func(b._2, a))
+    } else {
+      (true, a)
+    }
+  }
+
+  override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = {
+    if (!b1._1) {
+      b2
+    } else if (!b2._1) {
+      b1
+    } else {
+      (true, func(b1._2, b2._2))
+    }
+  }
+
+  override def finish(reduction: (Boolean, T)): T = reduction._2
--- End diff --

Yup I will add it.






[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14576#discussion_r75254319
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+
+/**
+ * An aggregator that uses a single associative and commutative reduce function. This reduce
+ * function can be used to go through all input values and reduce them to a single value.
+ * If there is no input, a null value is returned.
+ *
+ * @since 2.1.0
+ */
+private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T)
+  extends Aggregator[T, (Boolean, T), T] {
+
+  private val encoder = implicitly[Encoder[T]]
+
+  override def zero: (Boolean, T) = (false, null.asInstanceOf[T])
+
+  override def bufferEncoder: Encoder[(Boolean, T)] =
+    ExpressionEncoder.tuple(
+      ExpressionEncoder[Boolean](),
+      encoder.asInstanceOf[ExpressionEncoder[T]])
+
+  override def outputEncoder: Encoder[T] = encoder
+
+  override def reduce(b: (Boolean, T), a: T): (Boolean, T) = {
+    if (b._1) {
+      (true, func(b._2, a))
+    } else {
+      (true, a)
+    }
+  }
+
+  override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = {
+    if (!b1._1) {
+      b2
+    } else if (!b2._1) {
+      b1
+    } else {
+      (true, func(b1._2, b2._2))
+    }
+  }
+
+  override def finish(reduction: (Boolean, T)): T = reduction._2
--- End diff --

Then shall we add an assert? Otherwise we may well forget about it and end up 
with `return null for empty relation without grouping key`, which is what the 
current code does.
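
For readers following along, a self-contained sketch of the behavior under 
discussion, using a toy stand-in for the aggregator (not the PR's class): the 
buffer's Boolean records whether anything was reduced, and `finish` returns 
the raw payload, so an empty input yields null.

```scala
class ReduceLike[T](func: (T, T) => T) {
  def zero: (Boolean, T) = (false, null.asInstanceOf[T])
  def reduce(b: (Boolean, T), a: T): (Boolean, T) =
    if (b._1) (true, func(b._2, a)) else (true, a)
  def finish(b: (Boolean, T)): T = b._2 // null when nothing was ever reduced

  def run(input: Seq[T]): T = finish(input.foldLeft(zero)(reduce))
}

object ReduceLikeDemo extends App {
  val concat = new ReduceLike[String](_ + _)
  println(concat.run(Seq("a", "b", "c"))) // abc
  println(concat.run(Seq.empty))          // null: the case an assert would catch
}
```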





[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...

2016-08-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14576#discussion_r75253981
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+
+/**
+ * An aggregator that uses a single associative and commutative reduce function. This reduce
+ * function can be used to go through all input values and reduce them to a single value.
+ * If there is no input, a null value is returned.
+ *
+ * @since 2.1.0
+ */
+private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T)
+  extends Aggregator[T, (Boolean, T), T] {
+
+  private val encoder = implicitly[Encoder[T]]
+
+  override def zero: (Boolean, T) = (false, null.asInstanceOf[T])
+
+  override def bufferEncoder: Encoder[(Boolean, T)] =
+    ExpressionEncoder.tuple(
+      ExpressionEncoder[Boolean](),
+      encoder.asInstanceOf[ExpressionEncoder[T]])
+
+  override def outputEncoder: Encoder[T] = encoder
+
+  override def reduce(b: (Boolean, T), a: T): (Boolean, T) = {
+    if (b._1) {
+      (true, func(b._2, a))
+    } else {
+      (true, a)
+    }
+  }
+
+  override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = {
+    if (!b1._1) {
+      b2
+    } else if (!b2._1) {
+      b1
+    } else {
+      (true, func(b1._2, b2._2))
+    }
+  }
+
+  override def finish(reduction: (Boolean, T)): T = reduction._2
--- End diff --

It's possible for us to support that in the future, but we can worry about 
it when we want to make this public?






[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75253703
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -144,16 +161,147 @@ private[spark] class HiveExternalCatalog(client: HiveClient, hadoopConf: Configu
     assert(tableDefinition.identifier.database.isDefined)
     val db = tableDefinition.identifier.database.get
     requireDbExists(db)
+    verifyTableProperties(tableDefinition)
+
+    if (tableDefinition.provider == Some("hive") || tableDefinition.tableType == VIEW) {
+      client.createTable(tableDefinition, ignoreIfExists)
+    } else {
+      val tableProperties = tableMetadataToProperties(tableDefinition)
+
+      def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+        tableDefinition.copy(
+          schema = new StructType,
+          partitionColumnNames = Nil,
+          bucketSpec = None,
+          properties = tableDefinition.properties ++ tableProperties)
+      }
+
+      def newHiveCompatibleMetastoreTable(serde: HiveSerDe, path: String): CatalogTable = {
+        tableDefinition.copy(
+          storage = tableDefinition.storage.copy(
+            locationUri = Some(new Path(path).toUri.toString),
+            inputFormat = serde.inputFormat,
+            outputFormat = serde.outputFormat,
+            serde = serde.serde
+          ),
+          properties = tableDefinition.properties ++ tableProperties)
+      }
+
+      val qualifiedTableName = tableDefinition.identifier.quotedString
+      val maybeSerde = HiveSerDe.sourceToSerDe(tableDefinition.provider.get)
+      val maybePath = new CaseInsensitiveMap(tableDefinition.storage.properties).get("path")
+      val skipHiveMetadata = tableDefinition.storage.properties
+        .getOrElse("skipHiveMetadata", "false").toBoolean
+
+      val (hiveCompatibleTable, logMessage) = (maybeSerde, maybePath) match {
--- End diff --

Previously, we created a DataSource and resolved it by calling 
`dataSource.resolveRelation`. (FYI, `resolveRelation` consumes the 
user-specified `options`.) Then we used the generated `BaseRelation` to decide 
whether this is a `hiveCompatibleTable`; if it is not a `HadoopFsRelation`, 
`hiveCompatibleTable` will be `None`.

Now the decision is based on whether the user-specified options have a 
`path` property. That is not always equivalent, right?





[GitHub] spark issue #14618: [SPARK-17030] [SQL] Remove/Cleanup HiveMetastoreCatalog....

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14618
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14618: [SPARK-17030] [SQL] Remove/Cleanup HiveMetastoreCatalog....

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14618
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63972/
Test PASSed.





[GitHub] spark issue #14618: [SPARK-17030] [SQL] Remove/Cleanup HiveMetastoreCatalog....

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14618
  
**[Test build #63972 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63972/consoleFull)**
 for PR 14618 at commit 
[`fc2dbd9`](https://github.com/apache/spark/commit/fc2dbd972acd4020ca848f1a6d727a511aa70a8f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14696: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/14696
  
@cloud-fan this actually broke decimal precision.

I'm starting to think it would be better to push type coercion into 
each expression, so that the arithmetic expressions can special-case decimal 
types before calling the functions provided here. It would be a much larger 
change, though.
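
For context, decimal arithmetic derives its result type per operation rather 
than by generic widening; under the usual rule, `decimal(p1,s1) + 
decimal(p2,s2)` yields `decimal(max(s1,s2) + max(p1-s1,p2-s2) + 1, 
max(s1,s2))`. A quick check (my example, not from the PR):

```scala
import org.apache.spark.sql.SparkSession

object DecimalWidenDemo extends App {
  val spark = SparkSession.builder().master("local").appName("decimal-demo").getOrCreate()

  // scale = max(2,4) = 4; integer digits = max(10-2, 5-4) = 8; plus 1 carry
  // digit => decimal(13,4), which no generic widening rule would produce.
  spark.sql("SELECT CAST(1 AS DECIMAL(10,2)) + CAST(1 AS DECIMAL(5,4)) AS s").printSchema()

  spark.stop()
}
```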





[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75252926
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,611 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic (softmax) regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic (softmax) regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+  with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 

[GitHub] spark issue #14672: [SPARK-17034][SQL] Minor code cleanup for UnresolvedOrdi...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14672
  
**[Test build #63976 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63976/consoleFull)**
 for PR 14672 at commit 
[`2eb02c1`](https://github.com/apache/spark/commit/2eb02c178fefe906e01ab4a98283de2c8a0fcc36).





[GitHub] spark pull request #14672: [SPARK-17034][SQL] Minor code cleanup for Unresol...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14672#discussion_r75252322
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinals.scala ---
@@ -27,22 +27,21 @@ import org.apache.spark.sql.catalyst.trees.CurrentOrigin.withOrigin
 /**
  * Replaces ordinal in 'order by' or 'group by' with UnresolvedOrdinal expression.
  */
-class UnresolvedOrdinalSubstitution(conf: CatalystConf) extends Rule[LogicalPlan] {
-  private def isIntegerLiteral(sorter: Expression) = IntegerIndex.unapply(sorter).nonEmpty
+class SubstituteUnresolvedOrdinals(conf: CatalystConf) extends Rule[LogicalPlan] {
+  private def isIntLiteral(sorter: Expression) = IntegerIndex.unapply(sorter).nonEmpty
--- End diff --

Good idea. I removed IntegerIndex.
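
For context, what the rule handles (my example, not from the PR): integer 
literals in GROUP BY and ORDER BY are treated as 1-based ordinals into the 
select list.

```scala
import org.apache.spark.sql.SparkSession

object OrdinalDemo extends App {
  val spark = SparkSession.builder().master("local").appName("ordinal-demo").getOrCreate()
  import spark.implicits._

  Seq(("a", 2), ("a", 5), ("b", 1)).toDF("k", "v").createOrReplaceTempView("t")

  // GROUP BY 1 refers to column k and ORDER BY 2 to max(v); the rule wraps
  // these integer literals in UnresolvedOrdinal for later resolution.
  spark.sql("SELECT k, max(v) FROM t GROUP BY 1 ORDER BY 2").show()

  spark.stop()
}
```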






[GitHub] spark issue #14696: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14696
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63973/
Test FAILed.





[GitHub] spark issue #14696: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14696
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14696: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14696
  
**[Test build #63973 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63973/consoleFull)**
 for PR 14696 at commit 
[`9df4551`](https://github.com/apache/spark/commit/9df455107cf89b590ca0bcac807ea8671ccab344).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14155
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63970/
Test PASSed.





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14155
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14676: [SPARK-16947][SQL] Support type coercion and foldable ex...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14676
  
**[Test build #63975 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63975/consoleFull)**
 for PR 14676 at commit 
[`fb9de34`](https://github.com/apache/spark/commit/fb9de341aa5c43907ab4a51a9187434f13defcd3).





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14155
  
**[Test build #63970 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63970/consoleFull)**
 for PR 14155 at commit 
[`96d57b6`](https://github.com/apache/spark/commit/96d57b665ac65750eb5c6f9757e5827ea9c14ca4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14155
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63969/
Test PASSed.





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14155
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75251858
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with [[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case table: UnresolvedInlineTable if table.expressionsResolved =>
+      validateInputDimension(table)
+      validateInputEvaluable(table)
+      convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: UnresolvedInlineTable): Unit = {
+    table.rows.foreach { row =>
+      row.foreach { e =>
+        if (!e.resolved || e.isInstanceOf[Unevaluable]) {
--- End diff --

I added some comments.
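
For context, the kind of query this rule resolves (my example, assuming the 
inline-table syntax this PR series targets):

```scala
import org.apache.spark.sql.SparkSession

object InlineTableDemo extends App {
  val spark = SparkSession.builder().master("local").appName("inline-tables").getOrCreate()

  // Every cell is a resolved, foldable expression (literals, 1 + 1), so the
  // rule can evaluate each row eagerly and emit a LocalRelation; a column
  // reference here would fail this validation.
  spark.sql("SELECT * FROM VALUES (1, 'one'), (1 + 1, 'two') AS data(num, name)").show()

  spark.stop()
}
```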






[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14155
  
**[Test build #63969 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63969/consoleFull)**
 for PR 14155 at commit 
[`502fd63`](https://github.com/apache/spark/commit/502fd6350edc55537ea99a374abc1aad130aceb1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14695: [SPARK-17117][SQL] 1 / NULL should not fail analysis

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14695
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14695: [SPARK-17117][SQL] 1 / NULL should not fail analysis

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14695
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63967/
Test PASSed.





[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13873
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75251612
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -97,16 +92,17 @@ case class CreateDataSourceTableCommand(
       }
     }
 
-    CreateDataSourceTableUtils.createDataSourceTable(
-      sparkSession = sparkSession,
-      tableIdent = tableIdent,
+    val table = CatalogTable(
+      identifier = tableIdent,
+      tableType = if (isExternal) CatalogTableType.EXTERNAL else CatalogTableType.MANAGED,
+      storage = CatalogStorageFormat.empty.copy(properties = optionsWithPath),
       schema = dataSource.schema,
-      partitionColumns = partitionColumns,
-      bucketSpec = bucketSpec,
-      provider = provider,
-      options = optionsWithPath,
--- End diff --

nvm, it sounds like the `write` API is only called by CTAS and the save API 
of DataFrameWriter, so it is OK. Let me read it again and check whether we 
might have an issue with `options` in the CREATE data source table command.





[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13873
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63968/
Test PASSed.





[GitHub] spark issue #14673: [SPARK-15083] [Web UI] History Server can OOM due to unl...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14673
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63971/
Test FAILed.





[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13873
  
**[Test build #63968 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63968/consoleFull)**
 for PR 13873 at commit 
[`f30d7d3`](https://github.com/apache/spark/commit/f30d7d32b084d5fa95e36be037899011e99b51a5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14695: [SPARK-17117][SQL] 1 / NULL should not fail analysis

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14695
  
**[Test build #63967 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63967/consoleFull)**
 for PR 14695 at commit 
[`a946269`](https://github.com/apache/spark/commit/a946269811540d6cdb2237c62f095f847b461cee).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14673: [SPARK-15083] [Web UI] History Server can OOM due to unl...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14673
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14673: [SPARK-15083] [Web UI] History Server can OOM due to unl...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14673
  
**[Test build #63971 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63971/consoleFull)**
 for PR 14673 at commit 
[`014db4c`](https://github.com/apache/spark/commit/014db4c88b6b8a56a8a7a8197c27ac0b6e02f1a9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14597: [WIP][SPARK-17017][MLLIB] add a chiSquare Selector based...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14597
  
Can one of the admins verify this patch?





[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75251290
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -97,16 +92,17 @@ case class CreateDataSourceTableCommand(
   }
 }
 
-CreateDataSourceTableUtils.createDataSourceTable(
-  sparkSession = sparkSession,
-  tableIdent = tableIdent,
+val table = CatalogTable(
+  identifier = tableIdent,
+  tableType = if (isExternal) CatalogTableType.EXTERNAL else 
CatalogTableType.MANAGED,
+  storage = CatalogStorageFormat.empty.copy(properties = 
optionsWithPath),
   schema = dataSource.schema,
-  partitionColumns = partitionColumns,
-  bucketSpec = bucketSpec,
-  provider = provider,
-  options = optionsWithPath,
--- End diff --

@cloud-fan How about the write path?





[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75251168
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -97,16 +92,17 @@ case class CreateDataSourceTableCommand(
   }
 }
 
-CreateDataSourceTableUtils.createDataSourceTable(
-  sparkSession = sparkSession,
-  tableIdent = tableIdent,
+val table = CatalogTable(
+  identifier = tableIdent,
+  tableType = if (isExternal) CatalogTableType.EXTERNAL else 
CatalogTableType.MANAGED,
+  storage = CatalogStorageFormat.empty.copy(properties = 
optionsWithPath),
   schema = dataSource.schema,
-  partitionColumns = partitionColumns,
-  bucketSpec = bucketSpec,
-  provider = provider,
-  options = optionsWithPath,
--- End diff --

That line is just putting the `options` in the storage properties. It works 
for `path`, but the external data source connectors might [pass some parameters 
into 
`createRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L443).
I think `options` is a critical parameter-passing channel for external data source connectors.
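
To make the concern concrete, here is a minimal sketch of such a connector (class and option names hypothetical): `endpoint` reaches `createRelation` only through the options map, so whatever the catalog stores must hand that map back intact.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical external connector: `endpoint` is connector-specific, not a path.
class EchoSourceProvider extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // The CREATE TABLE ... OPTIONS map arrives here verbatim as `parameters`.
    val endpoint = parameters.getOrElse("endpoint",
      sys.error("option 'endpoint' is required"))
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType =
        StructType(StructField("value", StringType) :: Nil)
    }
  }
}
```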





[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...

2016-08-17 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi @srowen, I have added a parameter to control the feature selection type.
The usage is like this:
**var selector = new ChiSqSelector()
var model = selector.fit(df) // by default, the selector selects numTopFeatures (50)
var newModel = selector.selectKBest(10), or var newModel = selector.selectPercentile(5), ...**
You can fit the DataFrame once and generate the model multiple times.

The indices are sorted internally in the model, as we have discussed.

As for passing the p-value to the model function, this update does not include it. For the KBest and Percentile selections, the fit function uses ChiSqTestResult.statistic to generate the model; for Fpr, it uses the ChiSqTestResult p-value. So it may be better to pass the whole ChiSqTestResult to the model and expose it to the caller. I think it is better to submit another PR for the "pass value to model and expose to the caller" problem, because much code would change: which data is passed to the model, how to save the model, and how to test the model.
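
A lightly cleaned-up Scala transcription of the snippet above (`selectKBest` and `selectPercentile` are this PR's proposed methods, not a released Spark API):

```scala
// Proposed usage: fit once, then derive selector models under different criteria.
val selector = new ChiSqSelector()          // defaults to numTopFeatures = 50
val model = selector.fit(df)                // runs the chi-squared tests once
val top10 = selector.selectKBest(10)        // reuse the cached test results...
val top5pct = selector.selectPercentile(5)  // ...under another selection criterion
```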







[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-17 Thread mpjlu
GitHub user mpjlu reopened a pull request:

https://github.com/apache/spark/pull/14597

[SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate 
(FPR) test

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular criterion for univariate feature selection. In this PR we add a chiSquare Selector based on the FPR test, as implemented in scikit-learn.

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
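
In a nutshell, FPR selection keeps every feature whose chi-squared p-value falls below a chosen significance level alpha. A minimal sketch of the criterion (plain Scala, not this PR's code; `pValues` stands in for the per-feature chi-squared p-values):

```scala
// Keep the indices of all features whose p-value is below alpha.
val alpha = 0.05
val pValues = Array(0.001, 0.20, 0.04, 0.90)
val selected = pValues.zipWithIndex.collect { case (p, i) if p < alpha => i }
// selected == Array(0, 2)
```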


## How was this patch tested?

Added Scala unit tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark fprChiSquare

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14597


commit 2adebe8de3881509e510fc518c562d1141ccd0ef
Author: Peng, Meng 
Date:   2016-08-10T05:40:18Z

add a chiSquare Selector based on False Positive Rate (FPR) test

commit 04053ca207ef4aa955eddc3e65d09a4e03db6292
Author: Peng, Meng 
Date:   2016-08-11T07:10:43Z

Merge remote-tracking branch 'origin/master' into fprChiSquare

commit 7623563884355a04867ce5271baa286f65180e62
Author: Peng, Meng 
Date:   2016-08-16T13:36:11Z

Configure the ChiSqSelector to reuse ChiSqTestResult by numTopFeatures, 
Percentile, and Fpr selector

commit 3d6aecb8441503c9c3d62a2d8a3d48824b9d6637
Author: Peng, Meng 
Date:   2016-08-17T02:34:59Z

Config the ChiSqSelector to reuse the ChiSqTestResult by KBest, Percentile 
and FPR selector

commit 026ac85dfa190707891b694f40e737f22f9b4bd5
Author: Peng, Meng 
Date:   2016-08-17T02:43:45Z

Merge branch 'master' into fprChiSquare2

commit 5305709c9d4029186318b99fa9c7c483897aa653
Author: Peng, Meng 
Date:   2016-08-17T09:59:16Z

add Since annotation







[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75250973
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala
 ---
@@ -0,0 +1,611 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => 
BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic (softmax) regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with 
HasElasticNetParam with HasMaxIter
+with HasFitIntercept with HasTol with HasStandardization with 
HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the 
probability of
+   * predicting each class. Array must have length equal to the number of 
classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the 
original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+$(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic (softmax) regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+@Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+with MultinomialLogisticRegressionParams with DefaultParamsWritable 
with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an 
L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, 
value)
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   * Default is 
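
For reference, the thresholds rule described in the scaladoc quoted above, as a standalone sketch: the predicted class is the one maximizing p/t.

```scala
// Predict the class k that maximizes probability(k) / threshold(k).
def predictWithThresholds(probs: Array[Double], thresholds: Array[Double]): Int =
  probs.zip(thresholds).map { case (p, t) => p / t }.zipWithIndex.maxBy(_._1)._2

predictWithThresholds(Array(0.2, 0.5, 0.3), Array(1.0, 1.0, 0.3))  // 2, since 0.3/0.3 = 1.0
```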

[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75250996
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -97,16 +92,17 @@ case class CreateDataSourceTableCommand(
   }
 }
 
-CreateDataSourceTableUtils.createDataSourceTable(
-  sparkSession = sparkSession,
-  tableIdent = tableIdent,
+val table = CatalogTable(
+  identifier = tableIdent,
+  tableType = if (isExternal) CatalogTableType.EXTERNAL else 
CatalogTableType.MANAGED,
+  storage = CatalogStorageFormat.empty.copy(properties = 
optionsWithPath),
   schema = dataSource.schema,
-  partitionColumns = partitionColumns,
-  bucketSpec = bucketSpec,
-  provider = provider,
-  options = optionsWithPath,
--- End diff --

I put the options in `CatalogStorageFormat.properties`, and when the table is read back, we get `storage.properties` as the data source options for creating the relation; see
https://github.com/apache/spark/pull/14155/files#diff-d99813bd5bbc18277e4090475e4944cfR214
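
In other words, the round trip looks roughly like this (simplified sketch; `CatalogStorageFormat` as used in the diff above):

```scala
// Write path: stash all data source options in the storage properties.
val options = Map("path" -> "/data/t", "endpoint" -> "https://example.com")
val storage = CatalogStorageFormat.empty.copy(properties = options)

// Read path: the stored properties become the data source options again.
val restoredOptions: Map[String, String] = storage.properties
```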





[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75250883
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -97,16 +92,17 @@ case class CreateDataSourceTableCommand(
   }
 }
 
-CreateDataSourceTableUtils.createDataSourceTable(
-  sparkSession = sparkSession,
-  tableIdent = tableIdent,
+val table = CatalogTable(
+  identifier = tableIdent,
+  tableType = if (isExternal) CatalogTableType.EXTERNAL else 
CatalogTableType.MANAGED,
+  storage = CatalogStorageFormat.empty.copy(properties = 
optionsWithPath),
   schema = dataSource.schema,
-  partitionColumns = partitionColumns,
-  bucketSpec = bucketSpec,
-  provider = provider,
-  options = optionsWithPath,
--- End diff --

Is it different from what we do at line 
https://github.com/apache/spark/pull/14155/files/96d57b665ac65750eb5c6f9757e5827ea9c14ca4#diff-945e51801b84b92da242fcb42f83f5f5R98?





[GitHub] spark issue #14648: [SPARK-16995][SQL] TreeNodeException when flat mapping R...

2016-08-17 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14648
  
Thanks for review.





[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

2016-08-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14155#discussion_r75250702
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -97,16 +92,17 @@ case class CreateDataSourceTableCommand(
   }
 }
 
-CreateDataSourceTableUtils.createDataSourceTable(
-  sparkSession = sparkSession,
-  tableIdent = tableIdent,
+val table = CatalogTable(
+  identifier = tableIdent,
+  tableType = if (isExternal) CatalogTableType.EXTERNAL else 
CatalogTableType.MANAGED,
+  storage = CatalogStorageFormat.empty.copy(properties = 
optionsWithPath),
   schema = dataSource.schema,
-  partitionColumns = partitionColumns,
-  bucketSpec = bucketSpec,
-  provider = provider,
-  options = optionsWithPath,
--- End diff --

It sounds like we are not following the previous behavior. `options` might be consumed by external data source implementors: it is not only used for specifying `path`, but also serves as a channel for passing extra parameters to the data source.

I checked the existing implementation of `createDataSourceTable`. We [pass the original `options` into the constructor of `DataSource`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L336). Then, [the `write` API passes the `options` to `createRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L443).
 

How about adding it as an independent field in `CatalogTable`?





[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75250554
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala
 ---
@@ -0,0 +1,611 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => 
BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic (softmax) regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with 
HasElasticNetParam with HasMaxIter
+with HasFitIntercept with HasTol with HasStandardization with 
HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the 
probability of
+   * predicting each class. Array must have length equal to the number of 
classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the 
original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+$(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic (softmax) regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+@Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+with MultinomialLogisticRegressionParams with DefaultParamsWritable 
with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an 
L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, 
value)
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   * Default is 

[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14576#discussion_r75250470
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala 
---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+
+/**
+ * An aggregator that uses a single associative and commutative reduce 
function. This reduce
+ * function can be used to go through all input values and reduces them to 
a single value.
+ * If there is no input, a null value is returned.
+ *
+ * @since 2.1.0
+ */
+private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T)
+  extends Aggregator[T, (Boolean, T), T] {
+
+  private val encoder = implicitly[Encoder[T]]
+
+  override def zero: (Boolean, T) = (false, null.asInstanceOf[T])
+
+  override def bufferEncoder: Encoder[(Boolean, T)] =
+ExpressionEncoder.tuple(
+  ExpressionEncoder[Boolean](),
+  encoder.asInstanceOf[ExpressionEncoder[T]])
+
+  override def outputEncoder: Encoder[T] = encoder
+
+  override def reduce(b: (Boolean, T), a: T): (Boolean, T) = {
+if (b._1) {
+  (true, func(b._2, a))
+} else {
+  (true, a)
+}
+  }
+
+  override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = {
+if (!b1._1) {
+  b2
+} else if (!b2._1) {
+  b1
+} else {
+  (true, func(b1._2, b2._2))
+}
+  }
+
+  override def finish(reduction: (Boolean, T)): T = reduction._2
--- End diff --

I think it makes sense to support `reduce group` without a grouping key, so it may happen in the future. Besides, it's not a lot of work; we just need to decide the expected behaviour.
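
For context, the `(seen, value)` buffer semantics from the diff above as a standalone sketch (plain Scala, outside Spark):

```scala
// The Boolean marks whether any input was folded into the buffer, so an
// untouched (empty-partition) buffer never feeds its null into the function.
def merge[T](b1: (Boolean, T), b2: (Boolean, T))(f: (T, T) => T): (Boolean, T) =
  if (!b1._1) b2
  else if (!b2._1) b1
  else (true, f(b1._2, b2._2))

val zero: (Boolean, String) = (false, null)
merge(zero, (true, "a"))(_ + _)  // (true, "a"): the empty buffer is discarded
merge(zero, zero)(_ + _)         // (false, null): finish would return null
```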





[GitHub] spark pull request #14648: [SPARK-16995][SQL] TreeNodeException when flat ma...

2016-08-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14648





[GitHub] spark issue #14648: [SPARK-16995][SQL] TreeNodeException when flat mapping R...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14648
  
thanks, merging to master and 2.0!





[GitHub] spark pull request #14672: [SPARK-17034][SQL] Minor code cleanup for Unresol...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14672#discussion_r75250012
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinals.scala
 ---
@@ -27,22 +27,21 @@ import 
org.apache.spark.sql.catalyst.trees.CurrentOrigin.withOrigin
 /**
  * Replaces ordinal in 'order by' or 'group by' with UnresolvedOrdinal 
expression.
  */
-class UnresolvedOrdinalSubstitution(conf: CatalystConf) extends 
Rule[LogicalPlan] {
-  private def isIntegerLiteral(sorter: Expression) = 
IntegerIndex.unapply(sorter).nonEmpty
+class SubstituteUnresolvedOrdinals(conf: CatalystConf) extends 
Rule[LogicalPlan] {
+  private def isIntLiteral(sorter: Expression) = 
IntegerIndex.unapply(sorter).nonEmpty
--- End diff --

As we are cleaning up the code, shall we also remove `IntegerIndex`? It became unnecessary after we made `-1` a literal instead of a `UnaryMinus`.
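
For readers outside the thread, what the rule does (assuming a SparkSession `spark` and a table `t` with columns `a`, `b`):

```scala
// The integer literal is rewritten into an ordinal reference to the select
// list, so this orders by column `a` (governed by spark.sql.orderByOrdinal).
spark.sql("SELECT a, b FROM t ORDER BY 1")
```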





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75249808
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
 ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, 
InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, 
LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with 
[[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+case table: UnresolvedInlineTable if table.expressionsResolved =>
+  validateInputDimension(table)
+  validateInputEvaluable(table)
+  convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: 
UnresolvedInlineTable): Unit = {
+table.rows.foreach { row =>
+  row.foreach { e =>
+if (!e.resolved || e.isInstanceOf[Unevaluable]) {
--- End diff --

This looks tricky to me, as ideally `foldable` matches the semantics better.

Actually we are making an assumption here: in the case of an inline table, evaluable always means foldable, because `UnresolvedInlineTable` can't resolve an `UnresolvedAttribute` to an `AttributeReference`, as it's a leaf node.

We should either document this, or not support rand. cc @hvanhovell
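
A quick sketch of the distinction under discussion, using the Catalyst expressions that appear in the test suite later in this thread:

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Literal, Rand}
import org.apache.spark.sql.catalyst.expressions.aggregate.Count

Literal(1).foldable                   // true: a constant
Add(Literal(1), Literal(2)).foldable  // true: folds to 3
Rand(1).foldable                      // false: nondeterministic, yet evaluable
Count(Literal(1))                     // an aggregate: not evaluable on its own
```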





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75249521
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTablesSuite.scala
 ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Literal, Rand}
+import org.apache.spark.sql.catalyst.expressions.aggregate.Count
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.types.LongType
+
+/**
+ * Unit tests for [[ResolveInlineTables]]. Note that there are also test 
cases defined in
+ * end-to-end tests (in sql/core module) for verifying the correct error 
messages are shown
+ * in negative cases.
+ */
+class ResolveInlineTablesSuite extends PlanTest with BeforeAndAfter {
+
+  private def lit(v: Any): Literal = Literal(v)
+
+  test("validate inputs are foldable") {
+ResolveInlineTables.validateInputEvaluable(
+  UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(lit(1)
+
+// nondeterministic (rand) should be fine
+ResolveInlineTables.validateInputEvaluable(
+  UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(Rand(1)
+
+// aggregate should not work
+intercept[AnalysisException] {
+  ResolveInlineTables.validateInputEvaluable(
+UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(Count(lit(1))
+}
+
+// unresolved attribute should not work
+intercept[AnalysisException] {
+  ResolveInlineTables.validateInputEvaluable(
+UnresolvedInlineTable(Seq("c1", "c2"), 
Seq(Seq(UnresolvedAttribute("A")
--- End diff --

But how would a user construct an AttributeReference?






[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75249457
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
 ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, 
InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, 
LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with 
[[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+case table: UnresolvedInlineTable if table.expressionsResolved =>
+  validateInputDimension(table)
+  validateInputEvaluable(table)
+  convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: 
UnresolvedInlineTable): Unit = {
+table.rows.foreach { row =>
+  row.foreach { e =>
+if (!e.resolved || e.isInstanceOf[Unevaluable]) {
+  e.failAnalysis(s"cannot evaluate expression ${e.sql} in inline 
table definition")
+}
+  }
+}
+  }
+
+  /**
+   * Validates the input data dimension:
+   * 1. All rows have the same cardinality.
+   * 2. The number of column aliases defined is consistent with the number 
of columns in data.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputDimension(table: 
UnresolvedInlineTable): Unit = {
+if (table.rows.nonEmpty) {
+  val numCols = table.rows.head.size
+  table.rows.zipWithIndex.foreach { case (row, ri) =>
+if (row.size != numCols) {
+  table.failAnalysis(s"expected $numCols columns but found 
${row.size} columns in row $ri")
+}
+  }
+
+  if (table.names.size != numCols) {
+table.failAnalysis(s"expected ${table.names.size} columns but 
found $numCols in first row")
+  }
+}
+  }
+
+  /**
+   * Convert a valid (with right shape and foldable inputs) 
[[UnresolvedInlineTable]]
+   * into a [[LocalRelation]].
+   *
+   * This function attempts to coerce inputs into consistent types.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def convert(table: UnresolvedInlineTable): 
LocalRelation = {
+val numCols = table.rows.head.size
+
+// For each column, traverse all the values and find a common data 
type.
+val targetTypes = table.rows.transpose.zip(table.names).map { case 
(column, name) =>
+  val inputTypes = column.map(_.dataType)
+  
TypeCoercion.findWiderTypeWithoutStringPromotion(inputTypes).getOrElse {
--- End diff --

I don't have a strong preference, cc @hvanhovell 
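
For reference, what the widening above means for the feature (assuming the `VALUES` syntax this PR supports):

```scala
// Each column is widened to a common type across rows; without string
// promotion, mixing a number and a string fails analysis instead.
spark.sql("SELECT * FROM VALUES (1), (2.5) AS t(c)")  // c widened to a numeric type
spark.sql("SELECT * FROM VALUES (1), ('a') AS t(c)")  // fails: no common type
```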





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75249365
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTablesSuite.scala
 ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Literal, Rand}
+import org.apache.spark.sql.catalyst.expressions.aggregate.Count
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.types.LongType
+
+/**
+ * Unit tests for [[ResolveInlineTables]]. Note that there are also test 
cases defined in
+ * end-to-end tests (in sql/core module) for verifying the correct error 
messages are shown
+ * in negative cases.
+ */
+class ResolveInlineTablesSuite extends PlanTest with BeforeAndAfter {
+
+  private def lit(v: Any): Literal = Literal(v)
+
+  test("validate inputs are foldable") {
+ResolveInlineTables.validateInputEvaluable(
+  UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(lit(1)
+
+// nondeterministic (rand) should be fine
+ResolveInlineTables.validateInputEvaluable(
+  UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(Rand(1)
+
+// aggregate should not work
+intercept[AnalysisException] {
+  ResolveInlineTables.validateInputEvaluable(
+UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(Count(lit(1))
+}
+
+// unresolved attribute should not work
+intercept[AnalysisException] {
+  ResolveInlineTables.validateInputEvaluable(
+UnresolvedInlineTable(Seq("c1", "c2"), 
Seq(Seq(UnresolvedAttribute("A")
--- End diff --

the `Add` will be resolved and evaluable, but not foldable.





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75249341
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTablesSuite.scala
 ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Literal, Rand}
+import org.apache.spark.sql.catalyst.expressions.aggregate.Count
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.types.LongType
+
+/**
+ * Unit tests for [[ResolveInlineTables]]. Note that there are also test 
cases defined in
+ * end-to-end tests (in sql/core module) for verifying the correct error 
messages are shown
+ * in negative cases.
+ */
+class ResolveInlineTablesSuite extends PlanTest with BeforeAndAfter {
+
+  private def lit(v: Any): Literal = Literal(v)
+
+  test("validate inputs are foldable") {
+ResolveInlineTables.validateInputEvaluable(
+  UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(lit(1)
+
+// nondeterministic (rand) should be fine
+ResolveInlineTables.validateInputEvaluable(
+  UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(Rand(1)
+
+// aggregate should not work
+intercept[AnalysisException] {
+  ResolveInlineTables.validateInputEvaluable(
+UnresolvedInlineTable(Seq("c1", "c2"), Seq(Seq(Count(lit(1))
+}
+
+// unresolved attribute should not work
+intercept[AnalysisException] {
+  ResolveInlineTables.validateInputEvaluable(
+UnresolvedInlineTable(Seq("c1", "c2"), 
Seq(Seq(UnresolvedAttribute("A")
--- End diff --

how about `UnresolvedInlineTable(Seq("c1", "c2"), 
Seq(Seq(AttributeReference("A") + 1)))`?





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14155
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63966/
Test PASSed.





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14155
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...

2016-08-17 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/13796
  
I went through the PR again, and it's in very good shape. Only a couple of minor issues need to be addressed. Thank you @sethah for the great work. This will be a big feature in Spark 2.1.





[GitHub] spark issue #14693: [SPARK-17113][Shuffle] Job failure due to Executor OOM i...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14693
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63964/
Test PASSed.





[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14155
  
**[Test build #63966 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63966/consoleFull)**
 for PR 14155 at commit 
[`263d7c3`](https://github.com/apache/spark/commit/263d7c38d60266db96d65032e53690a57f111a4f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14693: [SPARK-17113][Shuffle] Job failure due to Executor OOM i...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14693
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14693: [SPARK-17113][Shuffle] Job failure due to Executor OOM i...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14693
  
**[Test build #63964 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63964/consoleFull)**
 for PR 14693 at commit 
[`8581659`](https://github.com/apache/spark/commit/85816590f141d1785b2786610d29523ce249c59f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75249124
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala
 ---
@@ -0,0 +1,611 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic (softmax) regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. The array must have length equal to the number of classes, with values >= 0.
+   * The class with the largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class's threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic (softmax) regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+  with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0, which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * A smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 
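The threshold contract documented above (predict the class with the largest p/t) can be checked in isolation. A minimal standalone sketch, with illustrative probabilities and thresholds that are not taken from the PR:

```scala
// Standalone illustration of the quoted rule: the predicted class maximizes
// p/t, where p is the class probability and t is that class's threshold.
object ThresholdRuleSketch {
  def predict(probabilities: Array[Double], thresholds: Array[Double]): Int = {
    require(probabilities.length == thresholds.length, "one threshold per class")
    probabilities.zip(thresholds)
      .map { case (p, t) => p / t }   // rescale each probability by its threshold
      .zipWithIndex
      .maxBy(_._1)._2                 // index of the largest rescaled value
  }

  def main(args: Array[String]): Unit = {
    // Class 1 has the highest raw probability (0.5), but its threshold of 2.0
    // halves it, so class 2 wins: 0.3/1.0 > 0.5/2.0 > 0.2/1.0.
    println(predict(Array(0.2, 0.5, 0.3), Array(1.0, 2.0, 1.0))) // prints 2
  }
}
```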

[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75249109
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75249096
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75249042
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark issue #14687: [SPARK-17107][SQL] Remove redundant pushdown rule for Un...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14687
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63965/
Test PASSed.



[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75248950
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark issue #14687: [SPARK-17107][SQL] Remove redundant pushdown rule for Un...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14687
  
Merged build finished. Test PASSed.



[GitHub] spark issue #14676: [SPARK-16947][SQL] Support type coercion and foldable ex...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14676
  
**[Test build #63974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63974/consoleFull)** for PR 14676 at commit [`2e68438`](https://github.com/apache/spark/commit/2e6843844d126e2ba466fe6b34ea59b3b67942c7).



[GitHub] spark issue #14687: [SPARK-17107][SQL] Remove redundant pushdown rule for Un...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14687
  
**[Test build #63965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63965/consoleFull)** for PR 14687 at commit [`f840ccb`](https://github.com/apache/spark/commit/f840ccbb43a34aa0c7469027e969eb45b3ae7d33).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248833
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with [[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case table: UnresolvedInlineTable if table.expressionsResolved =>
+      validateInputDimension(table)
+      validateInputEvaluable(table)
+      convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: UnresolvedInlineTable): Unit = {
+    table.rows.foreach { row =>
+      row.foreach { e =>
+        if (!e.resolved || e.isInstanceOf[Unevaluable]) {
+          e.failAnalysis(s"cannot evaluate expression ${e.sql} in inline table definition")
+        }
+      }
+    }
+  }
+
+  /**
+   * Validates the input data dimension:
+   * 1. All rows have the same cardinality.
+   * 2. The number of column aliases defined is consistent with the number of columns in data.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputDimension(table: UnresolvedInlineTable): Unit = {
+    if (table.rows.nonEmpty) {
+      val numCols = table.rows.head.size
+      table.rows.zipWithIndex.foreach { case (row, ri) =>
+        if (row.size != numCols) {
+          table.failAnalysis(s"expected $numCols columns but found ${row.size} columns in row $ri")
+        }
+      }
+
+      if (table.names.size != numCols) {
+        table.failAnalysis(s"expected ${table.names.size} columns but found $numCols in first row")
+      }
+    }
+  }
+
+  /**
+   * Convert a valid (with right shape and foldable inputs) [[UnresolvedInlineTable]]
+   * into a [[LocalRelation]].
+   *
+   * This function attempts to coerce inputs into consistent types.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def convert(table: UnresolvedInlineTable): LocalRelation = {
+    val numCols = table.rows.head.size
+
+    // For each column, traverse all the values and find a common data type.
+    val targetTypes = table.rows.transpose.zip(table.names).map { case (column, name) =>
+      val inputTypes = column.map(_.dataType)
+      TypeCoercion.findWiderTypeWithoutStringPromotion(inputTypes).getOrElse {
--- End diff --

Postgres doesn't allow it. We can choose to be consistent with `Union`, though.
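For reference, the behavior under discussion can be sketched as follows (a hypothetical spark-shell session with this patch applied; the error text comes from the rule's failAnalysis message quoted above):

```scala
// Without string promotion, a column mixing an integer and a string literal
// has no common type, so resolving the inline table fails:
spark.sql("select * from values (1), ('x') as data(a)")
// org.apache.spark.sql.AnalysisException:
//   incompatible types found in column a for inline table
```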



[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75248788
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75248762
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75248754
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248623
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
+      TypeCoercion.findWiderTypeWithoutStringPromotion(inputTypes).getOrElse {
--- End diff --

Can you check with other databases? Should we do string promotion for inline tables? FYI, expressions in `Union` can promote to string.
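A sketch of the `Union` behavior referenced here (hypothetical spark-shell session; the printed schema is abbreviated and the nullability flag is illustrative):

```scala
// Union-style coercion does promote to string, so the mixed column resolves,
// with column a widened to StringType:
spark.sql("select 1 as a union all select 'x' as a").printSchema()
// root
//  |-- a: string (nullable = false)
```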



[GitHub] spark pull request #14389: [SPARK-16714][SQL] Refactor type widening for con...

2016-08-17 Thread petermaxlee
Github user petermaxlee closed the pull request at:

https://github.com/apache/spark/pull/14389



[GitHub] spark issue #14389: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/14389
  
Sorry that it has taken this long. I have submitted a work-in-progress pull request at https://github.com/apache/spark/pull/14696

Going to close this one and continue the work there, since it is a fairly different pull request.




[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248513
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
+      table.rows.zipWithIndex.foreach { case (row, ri) =>
--- End diff --

That's a good idea. Let me do that.
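The zipWithIndex bookkeeping quoted above feeds the row index straight into the error message; e.g. (hypothetical spark-shell session with the patch applied; indices are zero-based, so the second row is row 1):

```scala
// A ragged inline table fails the cardinality check:
spark.sql("select * from values (1, 2), (3) as data(a, b)")
// org.apache.spark.sql.AnalysisException:
//   expected 2 columns but found 1 columns in row 1
```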




[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248490
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/inline-table.sql ---
@@ -0,0 +1,48 @@
+
+-- single row, without table and column alias
+select * from values ("one", 1);
+
+-- single row, without column alias
+select * from values ("one", 1) as data;
+
+-- single row
+select * from values ("one", 1) as data(a, b);
+
+-- single column multiple rows
+select * from values 1, 2, 3 as data(a);
+
+-- two rows
--- End diff --

nit: 3 rows



[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...

2016-08-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13796#discussion_r75248492
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---

[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248427
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
+      TypeCoercion.findWiderTypeWithoutStringPromotion(inputTypes).getOrElse {
+        table.failAnalysis(s"incompatible types found in column $name for inline table")
+      }
+    }
+    assert(targetTypes.size == table.names.size)
--- End diff --

Asserts are not meant to be user-facing. They are meant to be defensive against programming errors (i.e., bugs in Spark).
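
To make the distinction concrete: user-facing validation raises a descriptive analysis error on bad input, while an assert guards an invariant that earlier validation has already established. Below is a minimal self-contained Scala sketch, with hypothetical names that only mimic `failAnalysis` and the rule above:

```scala
// Hypothetical stand-in for the user-facing analysis error.
final class AnalysisError(msg: String) extends RuntimeException(msg)

// User-facing check: malformed input yields a descriptive, catchable error.
def validateDimensions(rows: Seq[Seq[Any]], names: Seq[String]): Unit = {
  val numCols = rows.head.size
  rows.zipWithIndex.foreach { case (row, ri) =>
    if (row.size != numCols)
      throw new AnalysisError(s"expected $numCols columns but found ${row.size} columns in row $ri")
  }
  if (names.size != numCols)
    throw new AnalysisError(s"expected ${names.size} columns but found $numCols in first row")
}

// Internal invariant: validateDimensions has already run, so a failure here
// would be a bug in this code, not bad user input.
def convert(rows: Seq[Seq[Any]], names: Seq[String]): Seq[String] = {
  val targetTypes = rows.transpose.map(_ => "widened-type-placeholder")
  assert(targetTypes.size == names.size)  // defensive only, never user-facing
  targetTypes
}
```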



[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248346
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with [[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case table: UnresolvedInlineTable if table.expressionsResolved =>
+      validateInputDimension(table)
+      validateInputEvaluable(table)
+      convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: UnresolvedInlineTable): Unit = {
--- End diff --

This was suggested by @hvanhovell.

I think private functions are still meant to be private. This is only 
package visible for the purpose of testing. That is to say, I don't expect 
developers to be calling this function either.
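
For anyone unfamiliar with the pattern: `private[pkg]` members are callable from anywhere inside that package, including test suites compiled into it, but from nowhere else. A minimal sketch with hypothetical package and method names:

```scala
// Hypothetical illustration of private[pkg] visibility; not Spark code.
package com.example {
  package analysis {
    object MyRule {
      // Not public API; package-visible purely so unit tests can exercise it.
      private[analysis] def validate(n: Int): Unit =
        require(n >= 0, s"expected a non-negative value but got $n")
    }

    // A test placed in the same package compiles and can call the helper.
    object MyRuleSpec {
      def run(): Unit = MyRule.validate(1)
    }
  }

  package client {
    object Caller {
      // analysis.MyRule.validate(1)  // would not compile: validate is
      //                              // inaccessible outside package analysis
    }
  }
}
```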






[GitHub] spark issue #14696: [SPARK-16714][SQL] Refactor type widening for consistenc...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14696
  
**[Test build #63973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63973/consoleFull)** for PR 14696 at commit [`9df4551`](https://github.com/apache/spark/commit/9df455107cf89b590ca0bcac807ea8671ccab344).





[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14656
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...

2016-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14656
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63963/
Test PASSed.





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread petermaxlee
Github user petermaxlee commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248351
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with [[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case table: UnresolvedInlineTable if table.expressionsResolved =>
+      validateInputDimension(table)
+      validateInputEvaluable(table)
+      convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: UnresolvedInlineTable): Unit = {
+    table.rows.foreach { row =>
+      row.foreach { e =>
+        if (!e.resolved || e.isInstanceOf[Unevaluable]) {
--- End diff --

If we checked `foldable`, rand() wouldn't work.
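
The trade-off is easiest to see in a toy model of the three expression properties involved (a sketch, not Spark's real `Expression` hierarchy): rand() resolves and can be evaluated eagerly, but because it is non-deterministic it is not foldable, so gating on foldability would reject it while the `Unevaluable` check accepts it:

```scala
// Toy model; property names mirror the discussion, not Catalyst's API.
sealed trait Expr {
  def resolved: Boolean   // all names looked up successfully
  def foldable: Boolean   // can be replaced by a compile-time constant
  def evaluable: Boolean  // can be executed to produce a value
}
case class Literal(v: Double) extends Expr {
  val resolved = true; val foldable = true; val evaluable = true
}
case object Rand extends Expr {
  // Evaluates fine, but non-deterministic, hence not foldable.
  val resolved = true; val foldable = false; val evaluable = true
}
case class UnresolvedAttr(name: String) extends Expr {
  val resolved = false; val foldable = false; val evaluable = false
}

// The shape of the check in the patch: resolved and evaluable.
def okForInlineTable(e: Expr): Boolean = e.resolved && e.evaluable

assert(okForInlineTable(Rand))                  // accepted
assert(!okForInlineTable(UnresolvedAttr("x")))  // rejected
assert(!Rand.foldable)                          // a foldable-only gate would reject rand()
```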






[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...

2016-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14656
  
**[Test build #63963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63963/consoleFull)** for PR 14656 at commit [`7ebd563`](https://github.com/apache/spark/commit/7ebd563fca73ae4a4e05970709f334a4d09b5ff1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14676: [SPARK-16947][SQL] Support type coercion and fold...

2016-08-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14676#discussion_r75248289
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Cast, InterpretedProjection, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * An analyzer rule that replaces [[UnresolvedInlineTable]] with [[LocalRelation]].
+ */
+object ResolveInlineTables extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case table: UnresolvedInlineTable if table.expressionsResolved =>
+      validateInputDimension(table)
+      validateInputEvaluable(table)
+      convert(table)
+  }
+
+  /**
+   * Validates that all inline table data are foldable expressions.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputEvaluable(table: UnresolvedInlineTable): Unit = {
+    table.rows.foreach { row =>
+      row.foreach { e =>
+        if (!e.resolved || e.isInstanceOf[Unevaluable]) {
+          e.failAnalysis(s"cannot evaluate expression ${e.sql} in inline table definition")
+        }
+      }
+    }
+  }
+
+  /**
+   * Validates the input data dimension:
+   * 1. All rows have the same cardinality.
+   * 2. The number of column aliases defined is consistent with the number of columns in data.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def validateInputDimension(table: UnresolvedInlineTable): Unit = {
+    if (table.rows.nonEmpty) {
+      val numCols = table.rows.head.size
+      table.rows.zipWithIndex.foreach { case (row, ri) =>
+        if (row.size != numCols) {
+          table.failAnalysis(s"expected $numCols columns but found ${row.size} columns in row $ri")
+        }
+      }
+
+      if (table.names.size != numCols) {
+        table.failAnalysis(s"expected ${table.names.size} columns but found $numCols in first row")
+      }
+    }
+  }
+
+  /**
+   * Convert a valid (with right shape and foldable inputs) [[UnresolvedInlineTable]]
+   * into a [[LocalRelation]].
+   *
+   * This function attempts to coerce inputs into consistent types.
+   *
+   * This is package visible for unit testing.
+   */
+  private[analysis] def convert(table: UnresolvedInlineTable): LocalRelation = {
+    val numCols = table.rows.head.size
+
+    // For each column, traverse all the values and find a common data type.
+    val targetTypes = table.rows.transpose.zip(table.names).map { case (column, name) =>
+      val inputTypes = column.map(_.dataType)
+      TypeCoercion.findWiderTypeWithoutStringPromotion(inputTypes).getOrElse {
+        table.failAnalysis(s"incompatible types found in column $name for inline table")
+      }
+    }
+    assert(targetTypes.size == table.names.size)
--- End diff --

It's duplicated; `validateInputDimension` already guarantees that `table.names.size` equals the number of columns.
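
For context, the widening that `convert` performs per column can be sketched without any Spark dependency; the toy type lattice below stands in for `TypeCoercion.findWiderTypeWithoutStringPromotion`, and all names are hypothetical:

```scala
// Toy stand-ins for Catalyst data types; all names here are hypothetical.
sealed trait DType
case object IntT extends DType
case object LongT extends DType
case object DoubleT extends DType
case object BoolT extends DType  // deliberately outside the numeric lattice

// Widen two types to a common one, if any (simplified numeric lattice).
def wider(a: DType, b: DType): Option[DType] = (a, b) match {
  case _ if a == b                         => Some(a)
  case (IntT, LongT) | (LongT, IntT)       => Some(LongT)
  case (IntT, DoubleT) | (DoubleT, IntT)   => Some(DoubleT)
  case (LongT, DoubleT) | (DoubleT, LongT) => Some(DoubleT)
  case _                                   => None  // incompatible
}

// Analogue of findWiderTypeWithoutStringPromotion over one column's types;
// assumes a non-empty column, as the rule does via rows.head.
def commonType(column: Seq[DType]): Option[DType] =
  column.foldLeft(Option(column.head)) {
    case (Some(acc), t) => wider(acc, t)
    case (None, _)      => None
  }

// Two rows of two cells each; transpose yields the per-column type lists,
// mirroring table.rows.transpose in convert().
val rows = Seq(Seq(IntT, DoubleT), Seq(LongT, DoubleT))
val perColumn = rows.transpose.map(commonType)
// perColumn == Seq(Some(LongT), Some(DoubleT)).
// commonType(Seq(IntT, BoolT)) == None, the case the rule reports via failAnalysis.
```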


