[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17527


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110408142
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -910,7 +911,8 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 // When Hive rename partition for managed tables, it will create the partition location with
 // a default path generate by the new spec with lower cased partition column names. This is
 // unexpected and we need to rename them manually and alter the partition location.
-val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
+val hasUpperCasePartitionColumn =
+partitionColumnNames.exists(col => col.toLowerCase != col)
--- End diff --

@srowen, it seems a newline was mistakenly inserted here and in the one below.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110395985
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

I don't think there's a right answer... for consistency it seems like we should treat keys uniformly, and my gut says we want those to be locale-neutral, because they are generally never localized or locale-specific.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110394983
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

I am actually fine with leaving it if you prefer. I really think such cases are improbable and would hardly ever occur.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110392436
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

Really, it sounds like nitpicking though.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110392351
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

Yes, hardly possible really.. not a user but a developer could use that Turkish character for an external data source (an external data source could use this, to my knowledge; see [interfaces.scala#L73-L74](https://github.com/apache/spark/blob/04ee8cf633e17b6bf95225a8dd77bf2e06980eb3/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L73-L74) and [DataSource.scala#L306](https://github.com/apache/spark/blob/096df6d933c5326e5782aa8c5de842a0800eb369/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L306)).

For example, the XML data source extends this - [DefaultSource.scala#L30](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/DefaultSource.scala#L30). Though.. the one case I finally came up with is using the map for arbitrary keys and values.

Someone opened a PR to Spark XML, `https://github.com/databricks/spark-xml/pull/247`, and it supports using arbitrary keys and values for writing out attributes to the root element.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110382743
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

Yes, though would these keys ever involve a user-defined key string? I 
think they're all Spark, Hive properties, etc.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110338963
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ---
@@ -82,8 +84,8 @@ case class OptimizeMetadataOnlyQuery(
   private def getPartitionAttrs(
   partitionColumnNames: Seq[String],
   relation: LogicalPlan): Seq[Attribute] = {
-val partColumns = partitionColumnNames.map(_.toLowerCase).toSet
-relation.output.filter(a => partColumns.contains(a.name.toLowerCase))
+val partColumns = partitionColumnNames.map(_.toLowerCase(Locale.ROOT)).toSet
+relation.output.filter(a => partColumns.contains(a.name.toLowerCase(Locale.ROOT)))
--- End diff --

Yes, the question is whether case-insensitive operations on schema 
elements, not just data, should be locale-neutral. 

The flip side to this argument is that a case-insensitive operation on a 
schema with the character "I" or "i" suddenly doesn't work as expected on a 
machine with the Turkish locale.

I don't know which is rarer: a schema with these special characters, or a machine with this particular locale.

I think I'd probably err on the side of not changing the behavior. I'll 
further revert the changes that touch things like column and table names.
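
The trade-off above can be sketched with plain Java (a standalone illustration, not part of this patch; the column names and class name here are made up):

```java
import java.util.Locale;

public class CaseInsensitiveLookupDemo {
    public static void main(String[] args) {
        // Simulate running on a machine configured with the Turkish locale.
        Locale.setDefault(new Locale("tr"));

        String schemaColumn = "ID"; // hypothetical column declared in a schema
        String query = "id";        // the same column as referenced in a query

        // Default-locale lower-casing maps "ID" -> "ıd" (dotless i, U+0131),
        // so the case-insensitive match unexpectedly fails on this machine.
        System.out.println(schemaColumn.toLowerCase().equals(query)); // false

        // Locale.ROOT gives the locale-neutral mapping "ID" -> "id",
        // so the match works regardless of the machine's locale.
        System.out.println(schemaColumn.toLowerCase(Locale.ROOT).equals(query)); // true
    }
}
```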





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317549
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -328,7 +329,7 @@ object PartitioningUtils {
 } else {
   // TODO: Selective case sensitivity.
   val distinctPartColNames =
-pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase(Locale.ROOT))).distinct
--- End diff --

I think this might cause a similar problem with 
https://github.com/apache/spark/pull/17527/files#r110317272.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110298557
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala ---
@@ -396,7 +397,7 @@ object PartitioningAwareFileIndex extends Logging {
   sessionOpt: Option[SparkSession]): Seq[FileStatus] = {
 logTrace(s"Listing $path")
 val fs = path.getFileSystem(hadoopConf)
-val name = path.getName.toLowerCase
+val name = path.getName.toLowerCase(Locale.ROOT)
--- End diff --

(This variable seems not used.)





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317695
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---
@@ -222,7 +225,7 @@ case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[Logi
 val columnNames = if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
   schema.map(_.name)
 } else {
-  schema.map(_.name.toLowerCase)
+  schema.map(_.name.toLowerCase(Locale.ROOT))
--- End diff --

Maybe it is not good to point out all the similar instances, but let me just point this one out as the change looks big. This may be a similar instance to https://github.com/apache/spark/pull/17527/files#r110317272.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317441
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -128,7 +128,8 @@ object PartitioningUtils {
   //   "hdfs://host:9000/invalidPath"
   //   "hdfs://host:9000/path"
   // TODO: Selective case sensitivity.
-  val discoveredBasePaths = optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase())
+  val discoveredBasePaths =
+  optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase(Locale.ROOT))
--- End diff --

I am worried about this one too. It sounds like the path could contain Turkish characters, I guess..





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110314669
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringKeyHashMap.scala ---
@@ -25,7 +27,7 @@ object StringKeyHashMap {
   def apply[T](caseSensitive: Boolean): StringKeyHashMap[T] = if (caseSensitive) {
 new StringKeyHashMap[T](identity)
   } else {
-new StringKeyHashMap[T](_.toLowerCase)
+new StringKeyHashMap[T](_.toLowerCase(Locale.ROOT))
--- End diff --

This only seems to be used in `SimpleFunctionRegistry`. I don't think we have Turkish characters in function names, and I don't think users will use another language in function names. So probably it is fine.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110315394
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ---
@@ -82,8 +84,8 @@ case class OptimizeMetadataOnlyQuery(
   private def getPartitionAttrs(
   partitionColumnNames: Seq[String],
   relation: LogicalPlan): Seq[Attribute] = {
-val partColumns = partitionColumnNames.map(_.toLowerCase).toSet
-relation.output.filter(a => partColumns.contains(a.name.toLowerCase))
+val partColumns = partitionColumnNames.map(_.toLowerCase(Locale.ROOT)).toSet
+relation.output.filter(a => partColumns.contains(a.name.toLowerCase(Locale.ROOT)))
--- End diff --

I am a little bit worried about this change likewise. For example,

Before

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala>  val partColumns = Seq("I").map(_.toLowerCase).toSet
partColumns: scala.collection.immutable.Set[String] = Set(ı)

scala> Seq("a", "ı", "I").filter(a => partColumns.contains(a.toLowerCase))
res13: Seq[String] = List(ı, I)
```

After

```scala
scala> val partColumns = Seq("I").map(_.toLowerCase(java.util.Locale.ROOT)).toSet
partColumns: scala.collection.immutable.Set[String] = Set(i)

scala> Seq("a", "ı", "I").filter(a => partColumns.contains(a.toLowerCase(java.util.Locale.ROOT)))
res14: Seq[String] = List(I)
```





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110314541
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

Maybe nitpicking, and it is rarely possible, I guess. However, to my knowledge, this will affect the options users set via `spark.read.option(...)`. Namely, I think cases like the one below are possible:

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> val originalMap = Map("ı" -> 1, "I" -> 2)
originalMap: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, I -> 2)
```

Before

```scala
scala> originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
res6: scala.collection.immutable.Map[String,Int] = Map(ı -> 2)
```

After

```scala
scala> originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(java.util.Locale.ROOT)))
res7: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, i -> 2)
```





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317272
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
@@ -52,7 +54,11 @@ case class HadoopFsRelation(
 
   val schema: StructType = {
 val getColName: (StructField => String) =
-  if (sparkSession.sessionState.conf.caseSensitiveAnalysis) _.name else _.name.toLowerCase
+  if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
+_.name
+  } else {
+_.name.toLowerCase(Locale.ROOT)
+  }
--- End diff --

I think we should leave this one out. It seems `dataSchema` is the schema from the source, which is exposed to users, and I think this could cause a problem. For example, as below:

Before

```scala
import collection.mutable

import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) :: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: Nil)

val getColName: (StructField => String) = _.name.toLowerCase

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
    overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))

schema.fieldNames
```

prints

```scala
Array[String] = Array(I)
```

After

```scala
import collection.mutable

import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) :: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: Nil)

val getColName: (StructField => String) = _.name.toLowerCase(java.util.Locale.ROOT)

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
    overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))

schema.fieldNames
```

prints

```scala
Array[String] = Array(ı, I)
```





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110064613
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -407,7 +408,7 @@ public UTF8String toLowerCase() {
   }
 
   private UTF8String toLowerCaseSlow() {
-return fromString(toString().toLowerCase());
+return fromString(toString().toLowerCase(Locale.ROOT));
--- End diff --

It seems there are a few cases exposed to users. For example, this seems to be used in the `Lower` and `InitCap` expressions, where, to my knowledge, the lower-cased results are exposed to users.

In this case, IIRC the same argument (a consistent base plus options for variants) was applied in [SPARK-18076](https://issues.apache.org/jira/browse/SPARK-18076). So I think it might be fine.
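
For reference, the user-visible difference for these casing expressions can be shown with the underlying Java calls (a standalone sketch, not Spark code; the class name is made up):

```java
import java.util.Locale;

public class CasingExpressionDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");

        // Under the Turkish locale, 'i' upper-cases to the dotted
        // capital 'İ' (U+0130), not 'I'.
        System.out.println("i".toUpperCase(turkish).equals("I")); // false

        // With Locale.ROOT, as in the patched toLowerCaseSlow()/
        // toUpperCaseSlow(), results are stable across machines.
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I")); // true
    }
}
```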





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-04 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/17527

[SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems

## What changes were proposed in this pull request?

Add Locale.ROOT to all calls to String `toLowerCase`, `toUpperCase`, to 
avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish 
locale problem").

The change looks large but it is just adding `Locale.ROOT` (the locale with 
no country or language specified) to every call to these methods.

## How was this patch tested?

Existing tests.
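
As a minimal standalone reproduction of the "Turkish locale problem" the patch guards against (illustration only, not part of the diff; the class name is made up):

```java
import java.util.Locale;

public class TurkishLocaleDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");

        // Under the Turkish locale, 'I' lower-cases to the dotless
        // 'ı' (U+0131), so a naive string comparison breaks.
        System.out.println("I".toLowerCase(turkish).equals("i")); // false

        // Locale.ROOT pins the locale-neutral mapping, which is what
        // this patch adds to every toLowerCase/toUpperCase call.
        System.out.println("I".toLowerCase(Locale.ROOT).equals("i")); // true
    }
}
```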

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-20156

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17527


commit 59194388d8de07d258a1a952b2cc52c57962c77a
Author: Sean Owen 
Date:   2017-04-04T08:47:18Z

Add Locale.ROOT to all calls to String toLowerCase, toUpperCase, to avoid 
inadvertent locale-sensitive variation in behavior (aka the "Turkish locale 
problem")

commit ab5995c18595195df51b43c6358ed761de0181a4
Author: Sean Owen 
Date:   2017-04-04T08:51:30Z

Revert inadvertent change to R file



